Profitability of Renting AI Servers with Different Interconnects
December 6, 2024
Interconnect Technologies: An Overview
I'll start by laying out the interconnect technology used in AI, HPC and other server workloads to give you an understanding of how these technologies are used. Later in this blog I will break down how this tech affects your bottom line, explaining why old tech should not be used and why some new tech can be risky if you don't have enough knowledge to operate it.
1. PCIe (Peripheral Component Interconnect Express)
PCIe is the most widely used interconnect for CPUs, GPUs, and peripherals in servers. Generational improvements in PCIe versions 3.0, 4.0, and 5.0 have brought higher bandwidth and lower latency:
- PCIe 3.0: Suitable for undemanding workloads that do not require high performance. Released in 2010, so very dated.
- PCIe 4.0: Addresses mid-level performance requirements. Released in 2017; still in wide use but now becoming dated technology.
- PCIe 5.0: Enables better communication for high-performance systems like AI/ML training or HPC. Released in 2019 and now in widespread use in modern, up-to-date servers.
The PCIe 6.0 standard was finalised in 2022, but it is not in widespread use yet and is supported by very few devices - no CPUs currently sold by Intel or AMD to the open market support it. The same is also true of PCIe 7.0, which won't be finalised until 2025, and to date I have never seen or heard of a PCIe 7 device in the field. Most next-gen GPUs and network adapters will support PCIe 5.
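To put those generational jumps in perspective, here is a minimal Python sketch that estimates usable one-way bandwidth for a standard x16 slot from the published per-lane transfer rates. The efficiency factors only account for line encoding, so treat the results as rough approximations rather than benchmark numbers.

```python
# Approximate usable bandwidth of an x16 slot per PCIe generation.
# Per-lane rates come from the PCI-SIG specs; the efficiency factor
# covers line encoding only (no protocol overhead), so figures are rough.

GENERATIONS = {
    # name: (giga-transfers/s per lane, encoding efficiency)
    "PCIe 3.0": (8.0, 128 / 130),
    "PCIe 4.0": (16.0, 128 / 130),
    "PCIe 5.0": (32.0, 128 / 130),
    "PCIe 6.0": (64.0, 242 / 256),  # PAM4 + FLIT mode, approximate overhead
}
LANES = 16  # typical GPU slot width

for name, (gt_per_s, efficiency) in GENERATIONS.items():
    gb_per_s = gt_per_s * efficiency * LANES / 8  # bits -> bytes
    print(f"{name}: ~{gb_per_s:.1f} GB/s per direction (x{LANES})")
```

Each generation roughly doubles the bandwidth available to every GPU or NIC in the box, which is why PCIe 3 systems bottleneck modern accelerators.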
2. InfiniBand (NVIDIA/Mellanox Networks and Fabrics)
InfiniBand is a high-speed networking interconnect designed for low-latency, high-throughput communication across clustered systems. With speeds of up to 800 Gbps (XDR) and direct RDMA support, it is widely used in HPC and enterprise environments. Features include:
- Scalability: Connects thousands of nodes in a supercomputing environment.
- GPUDirect RDMA: Facilitates GPU-to-GPU communication across nodes without CPU intervention.
- Seamless CPU-GPU Communication: Enables large-scale hybrid compute clusters.
3. NVLink and NVSwitch
Developed by NVIDIA, NVLink is a GPU-to-GPU interconnect that provides ultra-fast communication with bandwidths of up to 1800 GB/s in its latest version. It creates unified memory pools for AI/ML workloads and is essential in systems like NVIDIA DGX for:
- Single-Node Performance: Enhances GPU communication within one server.
- Large GPU Clusters: Uses NVSwitch to connect multiple GPUs in a full-mesh topology for large-scale AI/ML or HPC tasks.
- https://www.nvidia.com/en-gb/data-center/nvlink/
4. XGMI (AMD Infinity Fabric)
XGMI is AMD’s proprietary interconnect for efficient CPU-to-GPU and GPU-to-GPU communication. Features include:
- Tight CPU-GPU Integration: Leverages AMD EPYC CPUs with Instinct GPUs for AI and HPC workloads.
- GPU-to-GPU communication: 896 GB/s bidirectional
- Scalable Clusters: Facilitates distributed workloads using Infinity Fabric.
- https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-platform-data-sheet.pdf
5. RDMA and RoCE
Remote Direct Memory Access (RDMA) allows memory access across nodes without CPU intervention, significantly reducing latency. RDMA can be implemented over InfiniBand or Ethernet-based technologies like RoCE (RDMA over Converged Ethernet):
- RoCE: Brings RDMA's low-latency benefits to Ethernet networks, enabling compatibility with existing data center infrastructure.
- InfiniBand RDMA: Offers the lowest latency and highest throughput, ideal for HPC.
- https://www.juniper.net/content/dam/www/assets/white-papers/us/en/2024/juniper-artificial-intelligence-data-center-comparison-of-infiniband-and-rdma-over-converged-ethernet.pdf
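Pulling the headline figures above together, the sketch below estimates how long a naive, bandwidth-bound transfer of a hypothetical 80 GB payload (roughly the FP16 weights of a ~40B-parameter model) would take over each interconnect. The bandwidth numbers are the vendor headline figures quoted in this post; some are aggregate bidirectional values and real workloads see lower effective throughput, so this is purely an order-of-magnitude illustration.

```python
# Naive "time = bytes / bandwidth" comparison using headline figures.
# Real throughput is lower and depends on topology, message size and
# software stack; numbers are for illustration only.

PAYLOAD_GB = 80  # e.g. FP16 weights of a ~40B-parameter model (illustrative)

bandwidth_gbps = {
    "PCIe 5.0 x16 (per direction)": 63 * 8,              # ~63 GB/s
    "InfiniBand XDR (per port)": 800,                     # 800 Gbps
    "xGMI / Infinity Fabric (bidirectional)": 896 * 8,    # 896 GB/s
    "NVLink, latest generation (per GPU)": 1800 * 8,      # 1800 GB/s
}

for name, gbps in bandwidth_gbps.items():
    seconds = PAYLOAD_GB * 8 / gbps  # GB -> Gbit, then divide by Gbit/s
    print(f"{name}: ~{seconds:.2f} s to move {PAYLOAD_GB} GB")
```

The gap between a PCIe-only data path and a dedicated GPU fabric is more than an order of magnitude, which is exactly what the tiers later in this post are priced around.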
Performance Benefits of Advanced Interconnects
1. Clustered GPU and CPU Systems For AI and HPC
In clustered systems, interconnect technologies like InfiniBand, NVLink and RDMA are critical for ensuring efficient communication between compute elements:
- GPU Clusters: NVLink and xGMI create unified memory pools, while InfiniBand or RoCE connects GPUs across nodes with minimal overhead, enabling fast AI/ML training or large-scale simulations.
- Hybrid Clusters: InfiniBand and xGMI/Infinity Fabric provide low-latency connections between CPUs and GPUs, optimising workloads that rely on both compute types.
2. AI/ML Workloads
AI and ML workloads benefit significantly from advanced interconnects:
- NVLink accelerates GPU-to-GPU communication, reducing training times for AI models.
- InfiniBand RDMA allows GPUs across nodes to exchange data efficiently, enabling distributed AI training.
- RoCE (RDMA over Converged Ethernet) serves as an Ethernet-based alternative to NVIDIA/Mellanox InfiniBand.
3. HPC Workloads
High-performance computing (HPC) demands high-speed, low-latency networking interconnects:
- InfiniBand: Essential for large-scale simulations, weather modelling and complex molecular dynamics.
- RDMA: Eliminates CPU bottlenecks in distributed systems, ensuring smooth data flow across nodes.
Profitability of AI Server Rentals: How Interconnects Make a Difference
1. Increased Performance and Utilisation
Faster interconnects reduce workload execution times, allowing servers to handle more tasks within a given period. Features like InfiniBand scalability and NVLink’s unified memory pools enable:
- Higher Throughput: More tasks completed per server.
- Improved Resource Utilisation: Minimising idle hardware increases revenue efficiency (see the rough example below).
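As a back-of-the-envelope illustration of the throughput point, the sketch below models a server that completes fixed-size jobs. Every figure (price per job, baseline runtime, utilisation) is a hypothetical placeholder rather than market data.

```python
# Hypothetical illustration: a server that finishes jobs faster completes
# more of them per month, or can justify a higher hourly price.
# All numbers are placeholders, not market data.

HOURS_PER_MONTH = 730
UTILISATION = 0.70        # fraction of hours the server is actually rented
PRICE_PER_JOB = 25.0      # USD a customer pays for one fixed-size job

def jobs_and_revenue(baseline_job_hours: float, speedup: float):
    """Jobs completed and revenue per month if each job runs `speedup`x faster."""
    job_hours = baseline_job_hours / speedup
    jobs = (HOURS_PER_MONTH * UTILISATION) / job_hours
    return jobs, jobs * PRICE_PER_JOB

for speedup in (1.0, 1.5, 2.0):
    jobs, revenue = jobs_and_revenue(baseline_job_hours=10.0, speedup=speedup)
    print(f"{speedup:.1f}x faster jobs: ~{jobs:.0f} jobs/month, ~${revenue:,.0f}/month")
```

In practice the benefit usually shows up as a higher achievable hourly rate or higher utilisation rather than literal per-job billing, but the direction is the same: shorter runtimes mean more billable work per month from the same hardware.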
2. Differentiated Offerings in Data Centre Tiers
Advanced interconnects, storage solutions and memory technologies allow service providers to create premium pricing tiers, ensuring they can cater to a wide range of customer needs while maximising revenue opportunities. Here’s a breakdown of the key tiers and their potential for profitability:
Lowest Tier: PCIe 3 Systems
Servers using PCIe 3 interconnects are rapidly becoming obsolete. While they can still handle basic workloads, their earnings potential is extremely limited. Achieving ROI is highly unlikely unless:
- The equipment is purchased at a very low price.
- Operating costs, particularly electricity, are minimal (less than $0.10 USD per kWh).
Most rental platforms avoid offering multi-GPU servers with PCIe 3, even if the GPUs themselves are decent, due to these constraints.
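To make the electricity constraint concrete, here is a hedged break-even sketch for an older PCIe 3 multi-GPU box. The hardware cost, achievable rental rate, utilisation and power draw are hypothetical placeholders, not quotes; plug in your own numbers.

```python
# Hypothetical break-even estimate for an older PCIe 3 multi-GPU server.
# All inputs are illustrative placeholders; idle power while unrented
# and hosting fees are ignored to keep the sketch short.

hardware_cost_usd = 4000.0      # price paid for the used server
rental_usd_per_hour = 0.60      # achievable rate for the whole box
utilisation = 0.50              # older kit tends to rent less often
power_draw_kw = 1.2             # at load, including fans and PSU losses
electricity_usd_per_kwh = 0.10  # the threshold mentioned above
HOURS_PER_MONTH = 730

revenue = HOURS_PER_MONTH * utilisation * rental_usd_per_hour
power_cost = HOURS_PER_MONTH * utilisation * power_draw_kw * electricity_usd_per_kwh
margin = revenue - power_cost

if margin <= 0:
    print("Never breaks even at these rates.")
else:
    print(f"Monthly margin ~${margin:,.0f}; ROI in ~{hardware_cost_usd / margin:.0f} months")
```

Even under these fairly generous assumptions the payback period is around two years, and it stretches quickly if electricity costs rise above $0.10 per kWh, which is why most operators pass on PCIe 3 hardware.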
Standard Tier: PCIe 4 Systems
PCIe 4-based servers are the current standard for general-purpose workloads and some AI tasks. However, they are gradually being replaced by PCIe 5 systems.
- Profitability: While still profitable in the short term, their earnings potential is diminishing over time.
- Use Cases: Best suited for rendering or AI inference workloads that run on single GPUs or multi-GPU setups without clustering.
- Performance Range: PCIe 4 servers have a broad range of configurations, from entry-level systems with 8-core CPUs to high-performance dual AMD EPYC setups with up to 128 cores.
Medium to High-Performance Tier: PCIe 5 Systems
PCIe 5-based servers represent the next step in performance for bare-metal machines. These systems:
- Deliver strong earnings when paired with powerful GPUs.
- Are ideal for connecting high-bandwidth devices such as GPUs, NICs and NVMe storage within a server node.
Storage Innovations: Advanced storage solutions, such as NVMe drives with PCIe 4 or 5 interfaces and storage-class memory (SCM), are often integrated at this tier to support demanding workloads. These provide:
- Low-latency performance, critical for real-time processing in AI, HPC or some high-frequency trading applications.
- High throughput, ensuring that even large datasets can be handled efficiently without bottlenecks.
OpenZFS File System:
- OpenZFS is increasingly adopted in high-performance environments at this tier for its advanced features, including data integrity checks, deduplication, and efficient snapshots.
- It allows seamless integration with NVMe and scalable storage pools, making it an attractive choice for maintaining data reliability and performance.
CXL (Compute Express Link) Interconnects:
- CXL 2.0 (built on PCIe 5.0) and CXL 3.x (built on PCIe 6.0) interconnects allow systems to share memory and expand their capacity seamlessly.
- These technologies enable the pooling of high-speed memory, which can be dynamically allocated across CPUs and GPUs.
- For memory-intensive workloads like AI training, CXL dramatically increases system flexibility and scalability, reducing bottlenecks.
This tier also serves as the foundation for servers in the higher-performance tiers, which incorporate additional advanced technologies.
High-Performance Tier: InfiniBand and RoCE Systems
Servers equipped with InfiniBand or RoCE (RDMA over Converged Ethernet) excel in distributed system setups. They offer:
- Earnings Potential: Good to very good returns when utilised effectively.
- Requirements: These systems demand specialised knowledge, infrastructure, and moderate to significant investment.
Storage Innovations: Distributed storage systems, such as Ceph or BeeGFS, are commonly deployed in these configurations. These enable:
- Scalable, high-speed storage, critical for workloads like machine learning and HPC.
- Data redundancy and fault tolerance, ensuring system reliability.
OpenZFS/ZFS File System:
- OpenZFS excels in this tier due to its scalability and advanced features such as multi-level caching (ARC and L2ARC) and compression, which enhance performance while optimising storage usage.
- Its ability to handle petabyte-scale datasets makes it ideal for AI, HPC, and large-scale data analytics.
CXL and Memory Expansion:
- Systems in this tier benefit significantly from CXL-enabled memory expansion, which allows for larger memory pools that can be shared between nodes.
- Memory pooling and tiering using CXL reduces overall system costs while enhancing performance for distributed AI and HPC tasks.
Premium GPU Clusters: NVLink and xGMI with InfiniBand or RoCE
The highest-performance tier involves NVLink or xGMI GPU clusters, paired with InfiniBand or RoCE, making them ideal for AI/ML or HPC applications. These systems:
- Require large-scale investment in premium GPUs, CPUs, and infrastructure.
- Demand skilled staff for setup, maintenance, and operation.
- Need a strong sales pipeline to secure renters or contracts to ensure high utilisation.
Storage Innovations: Premium GPU clusters incorporate cutting-edge storage solutions to complement their high-performance computing capabilities:
- All-flash arrays (AFAs) provide ultra-low latency and high IOPS, essential for the massive data throughput demands of HPC and AI training.
- Parallel file systems such as Lustre or IBM Spectrum Scale (GPFS) enable rapid data access and processing across distributed clusters.
- Tiered storage architectures, integrating SCM or NVMe alongside more cost-effective storage, optimise both performance and cost.
OpenZFS, ZFS or Other Distributed File Systems:
- OpenZFS is a popular choice for managing complex storage needs in premium clusters. Its ability to handle data compression and adaptive caching enhances both performance and efficiency.
- With support for replication and snapshots, OpenZFS ensures data security and facilitates system recovery in high-stakes environments.
CXL for High-Performance Memory:
- CXL enables the creation of memory fabrics, allowing GPUs, CPUs, and accelerators to access shared memory pools without traditional bottlenecks.
- This capability enhances scalability for AI/ML training workloads and complex simulations, making it a critical component in premium clusters.
Risk and Profitability:
- Highly profitable when fully utilised, but operational costs are significant.
- Risky for inexperienced investors unless a skilled team and customer contracts are in place before deployment.
Maximising Revenue Across Tiers
Each tier caters to specific customer needs, from budget-conscious workloads to cutting-edge applications requiring premium hardware, storage, and memory. The inclusion of advanced interconnects like CXL, alongside robust file systems like OpenZFS, further enhances performance and scalability, making these solutions even more attractive to diverse markets. By offering comprehensive server configurations, service providers can ensure increased revenue as they scale up the performance and complexity of their offerings.
3. Enhanced Customer Satisfaction and Retention
Customers benefit from faster workload completion and reliable performance:
- Reduced Latency: Technologies like NVLink/Infinity Fabric and RDMA improve user experience for latency-sensitive tasks.
- Seamless Scalability: InfiniBand and RoCE ensure smooth expansion for growing workloads.
Satisfied customers are more likely to renew contracts and recommend services, increasing lifetime value (LTV).
4. Operational Cost Savings
Efficient interconnects reduce costs associated with energy and infrastructure:
- Lower Power Consumption: Faster job execution reduces energy costs per workload (a quick illustration follows this list).
- Optimised Hardware Requirements: High-speed interconnects minimise the number of nodes required for large-scale jobs.
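The energy point is easy to see: the electricity attributable to a job is roughly average power draw times runtime, so a job that finishes faster costs less to run even though peak power is similar. Illustrative figures only:

```python
# Energy per job = average power draw x runtime (illustrative figures only).

power_kw = 3.0                  # multi-GPU node under load, placeholder
electricity_usd_per_kwh = 0.15  # placeholder tariff

for label, runtime_hours in (("slower interconnect", 10.0),
                             ("faster interconnect", 6.0)):
    energy_kwh = power_kw * runtime_hours
    print(f"{label}: {energy_kwh:.0f} kWh, ~${energy_kwh * electricity_usd_per_kwh:.2f} per job")
```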
5. Market Competitiveness
Providers offering advanced interconnect technologies stay ahead in the competitive server rental market:
- AI/ML and HPC Attractiveness: InfiniBand and NVLink cater to the growing demand for these workloads.
- Branding as a Premium Provider: High-performance interconnects elevate the perceived value of services.
Challenges and Considerations
- Higher Initial Costs: Advanced interconnects like InfiniBand and NVLink require significant investment.
- Customer Education: Providers must educate customers on the value of interconnect technologies to justify premium pricing.
Conclusion
Interconnect technologies like PCIe 5, InfiniBand, RoCE, NVLink, xGMI/Infinity Fabric, CXL, and RDMA are transforming server infrastructure, particularly in clustered CPU-GPU systems. Their ability to enhance performance, optimise resource usage, and reduce costs makes them indispensable for modern server rentals. By offering advanced interconnect-equipped servers, rental providers can:
- Accelerate workload execution for AI, ML, and HPC tasks.
- Maximise resource utilisation and server profitability.
- Differentiate their offerings and charge premium prices.
- Build customer loyalty and strengthen market positioning.
Incorporating these interconnect technologies into server rental strategies ensures long-term profitability and competitiveness in a rapidly evolving market.
This blog is for anyone interested in earning passive income by renting out servers they've purchased as part of a new and exciting asset class. If you're considering entering this market but need guidance, feel free to reach out to me, James Walsh (Sytronix) or Alex Moody (Sytronix) via LinkedIn. I’ll also share this blog there for easy access.
Don’t worry if you're unfamiliar with the technologies discussed here, just ask!
Whether you connect with me, one of my team members, or our trusted partners Panchaea in Europe and Hydra Host in the USA, we’ll be happy to explain the tech and its potential in simple, straightforward terms.
For those looking to explore the investment opportunities in this space further, our data center partner, Daylight Compute (based in the UK), can provide additional insights. Let’s get started!