One of our strengths as mankind is the ability to focus. We can isolate our thoughts, efforts and even our physical selves in the pursuit of a single task. With this singular focus we can accomplish great things, but we may also miss out on the collective intelligence offered through collaboration with a group of fellow travelers. The purpose of this post is to highlight the benefits of public forums and gatherings where we can collaborate as an industry toward a common goal.
CXL SIG
The CXL Special Interest Group (SIG) held a special session hosted by Pankaj Mehra at UC Santa Cruz on Wednesday, February 7th, 2024. The agenda for this session was to review two recent industry papers with apparently conflicting views about the value of CXL-based memory pooling. We attended the session to participate in the discussion and to observe the CXL community provide collective feedback to the presenters.
The Google Paper
The first presenter at this SIG meeting was Philip Levis, presenting A Case against CXL Memory Pooling, published in HotNets '23 by Philip Levis, Kun Lin, and Amy Tai (all from Google, which earned it the nickname “The Google Paper”). “A Case against CXL Memory Pooling” is an investigative work evaluating the viability of CXL-attached DRAM as a way to reduce DRAM spend in the data center. Its primary finding is that CXL-based memory pooling faces three challenges to adoption:
- The cost of pooling DRAM over CXL will outweigh the savings from reduced DRAM spend in the data center,
- The latency of memory accesses over CXL is substantially higher than that of CPU-attached DRAM and will have a material impact on application performance,
- And using CXL DRAM will require unattractive changes to application software.
The Azure Paper
The second presenter was Daniel Berger, presenting Design Tradeoffs in CXL-Based Memory Pools for Public Cloud Platforms, published in IEEE Micro by Daniel Berger, et al. (several of the authors are from Microsoft Azure, which earned this paper the name “The Azure Paper”). This paper asserted that for Azure workloads, DRAM stranding was a dominant form of memory waste, leaving up to 30% of DRAM stranded.[1] Prior papers (such as Pond: CXL-Based Memory Pooling Systems for Cloud Platforms) have claimed that DRAM accounts for up to 50% of the Bill of Materials (BOM) cost of a server, so stranding 30% of memory results in substantial financial waste, and any reduction in this waste would materially improve data center economics. With this objective in mind, the authors set out to evaluate whether CXL-attached memory pooling would be a viable approach to reducing DRAM cost. In summary:
- Some workloads (26% of the 158 evaluated) were minimally affected (<1% performance loss) by the additional latency incurred with CXL-attached memory, while others (21% of the 158) were severely affected, with performance degradation exceeding 25%,
- Of the two primary architectural approaches, (1) CXL switch-attached memory and (2) a Multi-Headed Device (MHD) CXL memory controller,[2] the MHD approach would induce less latency overhead, which may be acceptable for some workloads, while CXL switch-attached memory would result in unacceptably high latency,
- Large memory pools will likely have a negative ROI as a solution for stranded DRAM, while small memory pools connected to between 8 and 16 sockets would be ROI positive.
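To make the cost argument concrete, here is a minimal Python sketch of the arithmetic behind the figures above. The host configuration (40 cores, 400 GiB of DRAM, VMs sold at 7 GiB per core) and the function names are hypothetical, chosen only for illustration; the 50% and 30% values are the upper bounds quoted in the text, not measurements.

```python
def stranded_dram_gib(host_cores: int, host_dram_gib: int, vm_gib_per_core: int) -> int:
    """DRAM left unrented once every core on the host has been rented."""
    rented_gib = host_cores * vm_gib_per_core
    return max(host_dram_gib - rented_gib, 0)


def stranded_bom_share(dram_bom_share: float, stranded_fraction: float) -> float:
    """Fraction of the server BOM cost tied up in stranded DRAM."""
    return dram_bom_share * stranded_fraction


# Hypothetical host (assumption, not from either paper):
# 40 cores, 400 GiB DRAM, all cores rented to VMs at 7 GiB per core.
HOST_DRAM_GIB = 400
stranded = stranded_dram_gib(host_cores=40, host_dram_gib=HOST_DRAM_GIB, vm_gib_per_core=7)
print(f"stranded DRAM: {stranded} GiB ({stranded / HOST_DRAM_GIB:.0%} of the host)")

# Upper-bound figures quoted in the text: DRAM up to 50% of BOM, up to 30% stranded.
share = stranded_bom_share(dram_bom_share=0.50, stranded_fraction=0.30)
print(f"server BOM tied up in stranded DRAM: up to {share:.0%}")
```

Under these assumptions, up to roughly 15% of a server's BOM cost sits in DRAM that cannot be rented, which is the waste that pooling aims to recover.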
Commentary
While the papers appear to come to different conclusions about CXL attached memory, neither is holistically wrong. They simply have different assumptions and environments. With different constraints, it isn’t surprising to come to different conclusions.
For example, the authors of the Google Paper noted that for Google workloads, the Virtual Machines (VMs) were typically small when compared to the amount of resources on a typical host and were therefore easy to place (or ‘pack’) with minimal resource stranding.
This isn’t that much of a surprise.
Google’s internal application workloads comprise a suite of web applications designed to sustain hundreds of millions of concurrent users. To handle such a large number of clients, applications must be widely distributed across many nodes to avoid congestion. Small VMs, evenly distributed and easily packed across many servers, are therefore not a surprising environment to evaluate. That environment may, however, be different from (not in conflict with) the workloads assumed by the Azure team.
Where both papers seemed to agree was that there is a limit to how much additional latency can be tolerated with CXL-attached memory. That limit seemed to be around 2x the latency of local CPU-attached memory, or about 240 ns. Beyond that, software applications would suffer material performance degradation. This requirement essentially implies that CXL switch-attached DRAM will not meet performance goals as a solution to mitigate DRAM stranding, but that the MHD approach appears to stay within those limits.
Taking A Step Back
When considering the findings of the two papers, we have to first ensure we clearly identify the problem that is being addressed and avoid the mistake of extrapolating the findings to all other problems that could be addressed with CXL. Both of the papers sought to evaluate CXL attached memory with the objective of addressing DRAM stranding of virtualized workloads in the data center. The objectives of this evaluation were to reduce DRAM spend without materially affecting performance of the running workloads. With these objectives, it was found that a CXL Multi-Headed Device would both reduce DRAM spend and incur acceptable impacts to performance. But this conclusion doesn’t necessarily apply to other workloads and challenges that can be addressed with CXL.
There are some workloads that would benefit significantly from having greater memory capacity within one server (e.g. in-memory databases, ‘pointer chasing’, etc.). These types of workloads are constrained by the memory capacity of a single server today and must be split across multiple servers just to get enough total DRAM to store the data set. Accessing data in memory on another node requires a network traversal, which incurs, at best, around 4 or 5 µs of latency. A single server with several TBs of main memory attached over CXL would eliminate the need to access the memory of another node over a traditional Ethernet network. Yes, the large quantity of CXL-attached memory would need a CXL switch to fan out to all the DRAM devices and would have materially longer access latency (e.g. ~500 ns instead of 120 ns). But it would still be roughly an order of magnitude faster than going out over the network.
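The latency trade-off can be checked with simple arithmetic. The sketch below uses only the approximate figures quoted in this post (120 ns for local DRAM, ~500 ns for switch-attached CXL DRAM, 4 to 5 µs over the network) together with the roughly 2x tolerance discussed in the commentary. The constant and function names are illustrative, and real latencies will vary by platform.

```python
LOCAL_DRAM_NS = 120           # local CPU-attached DRAM (figure from the text)
CXL_SWITCH_NS = 500           # switch-attached CXL DRAM (approximate, from the text)
NETWORK_NS = (4_000, 5_000)   # best-case remote access over the network: 4 to 5 us

# Tolerance for the DRAM-stranding use case: about 2x local latency.
budget_ns = 2 * LOCAL_DRAM_NS


def within_budget(latency_ns: int, limit_ns: int) -> bool:
    """True if an access latency fits inside the tolerated limit."""
    return latency_ns <= limit_ns


# How many times faster switch-attached CXL is than the network path.
speedups = tuple(net_ns / CXL_SWITCH_NS for net_ns in NETWORK_NS)

print(f"stranding latency budget: {budget_ns} ns")
print(f"switch-attached CXL fits the budget: {within_budget(CXL_SWITCH_NS, budget_ns)}")
print(f"switch-attached CXL vs network: {speedups[0]:.0f}x to {speedups[1]:.0f}x faster")
```

The same ~500 ns figure fails the stranding budget yet comes out 8x to 10x ahead of the network path, which is why the conclusion depends on the problem being solved.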
Summary
Different challenges will have different recommended solutions. What Google's researchers found isn’t expected to match what Azure's found, and that is OK. CXL as a technology can be used in a variety of ways to address multiple challenges with different requirements. Because the right solution for one workload may not be the right solution for another, it is important to take a step back, invite commentary from industry participants in forums like this, and listen as the collective intelligence finds consensus.
[1] DRAM Stranding: This is when all the CPU cores of a server have been rented to clients, but due to the selected DRAM-to-core ratio of the VMs, there remains some amount of DRAM unrented on the host. This DRAM cannot be used for any other purpose or shared outside the compute node and is therefore ‘stranded.’

[2] CXL Multi-Headed Device (MHD): This is a CXL DRAM module with multiple CXL (PCIe) ports. Similar to how NVMe SSDs can have dual ports, CXL DRAM devices can have 2 or more ports, which enables them to connect directly (without a switch) to 2 or more hosts.
 
        