Thoughts on Ponds

·782 words·4 mins
Author
Grant Mackey
CTO
Academic Corner - This article is part of a series.
Part 2: This Article

These days the line between industry and academia is heavily blurred for computing. As CXL becomes more mature, the need for an ‘academic corner’ will likely diminish on this blog, but for now we’re going to spend some time here and there talking about cool papers and articles that are CXL-related and CXL-adjacent. Follow the anthropomorphic rabbit at the unsettling typewriter and let’s dive in:
Pond: CXL-Based Memory Pooling Systems for Cloud Platforms

Summary

The authors present a scheme, comprising software and hardware, that they claim reduces the cost of DRAM in Azure datacenters by 7% while sacrificing only 1-5% of performance compared to the default system configuration currently deployed.

What makes this paper stand out?

Personally speaking, what I think stands out in this paper is not the Pond proposal, but the sheer amount of work in profiling 158 distinct workloads within Azure’s cloud infrastructure for their memory and latency requirements. The proposal to build wide-laned CXL devices to cut down on needed CXL fabric hops through switches is somewhat straightforward, and the “machine learning” used to profile how resources should be pre-allocated for classes of VMs feels unnecessary.
  • ‘Straightforward’ in the sense that wide and short means no oversubscribed device bandwidth and lower memory latency, which matters when trying to stay as close to NUMA latency as possible.
  • ‘Unnecessary’ in the sense that the authors do not discuss the value of ML for their data analysis, only that they used it. The number of features fed to their trainer is not very high; I suspect plain, hard data analysis would arrive at a discrete solution rather than an ML approach. It could, however, just be the jaded nature of this reviewer, and the growing ubiquity of the magic pachinko machine over grokking and then optimizing the problem space with a traditional algorithm implementation; given the decision tree in Figure 13, this feels tractable.
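To make the point concrete: the kind of discrete rule a small decision tree encodes can simply be written down by hand. Here is a minimal sketch of such a hand-rolled classifier; the feature names and thresholds are invented for illustration and are not taken from the paper.

```python
# Hypothetical hand-written decision rule: decide whether a VM class can
# tolerate far (CXL pool) memory. Features and thresholds are invented
# for illustration; the paper trains models on production telemetry instead.

def latency_insensitive(vm):
    """Return True if the VM is predicted to tolerate pool memory."""
    # Workloads that barely touch their memory rarely notice extra latency.
    if vm["untouched_memory_frac"] > 0.5:
        return True
    # Memory-bandwidth-hungry VMs should stay on local DRAM.
    if vm["mem_bw_gbps"] > 10:
        return False
    # Otherwise fall back on a size heuristic: small VMs pool fine.
    return vm["vcpus"] <= 4

print(latency_insensitive(
    {"untouched_memory_frac": 0.7, "mem_bw_gbps": 2, "vcpus": 8}))  # True
```

A handful of such thresholds is auditable and debuggable in a way a trained model is not, which is the crux of the complaint above.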

How is this any different from [insert idea]?

For a holistic CXL-based system, this paper is a first. The authors do reference the underpinning related topics of hypervisor memory management and scheduling, memory stranding and efforts to reduce it, as well as some papers on the management and distribution of far memory to applications.

There are, however, a number of papers on or regarding far memory pools and how to manage those resources in a variety of ways. The zNUMA pool mechanism doesn’t get much description of how it is implemented, only of the function it performs. The authors briefly note that the mechanism biases the guest OS toward using local memory over the zNUMA pool because of the kernel’s default scheduling and allocation policies, but that’s about all.
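A toy model of what that bias amounts to: the guest sees a CPU-less NUMA node backed by pool memory, and the default local-preferred allocation policy only spills onto it once local DRAM is exhausted. This sketch models that policy in the abstract; it is not Pond’s implementation, and the sizes are arbitrary.

```python
# Toy model of a zNUMA setup: node "local" is local DRAM, node "znuma" is
# a CPU-less NUMA node backed by the CXL pool. The default policy prefers
# the local node and spills to zNUMA only when local memory runs out.
# Not Pond's actual mechanism; purely illustrative.

class ToyZNumaVM:
    def __init__(self, local_gb, znuma_gb):
        self.free = {"local": local_gb, "znuma": znuma_gb}
        self.used = {"local": 0, "znuma": 0}

    def alloc(self, gb):
        # Prefer local memory, as a guest kernel would by default.
        for node in ("local", "znuma"):
            take = min(gb, self.free[node])
            self.free[node] -= take
            self.used[node] += take
            gb -= take
            if gb == 0:
                return
        raise MemoryError("VM out of memory")

vm = ToyZNumaVM(local_gb=24, znuma_gb=8)
vm.alloc(20)     # fits entirely in local DRAM
vm.alloc(8)      # only the overflow (4 GB) lands on the zNUMA node
print(vm.used)   # {'local': 24, 'znuma': 4}
```

The interesting implementation questions the paper skips, of course, live inside that `alloc`: how the hypervisor sizes the zNUMA node per VM and what happens when the prediction is wrong.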

Workloads considered

The authors consider a broad set of workloads (158): 13 Azure-proprietary workloads, YCSB (A-F) on Redis and Volt, HiBench on Spark, the GAP graph-algorithm benchmark suite, TPC-H, SPEC CPU 2017, PARSEC, and SPLASH. Basically spanning cloud to graph to HPC workloads. Very impressive.

Take-aways for me

Narrowly focused constraints on VM performance, likely aligned to the QoS guarantees Azure cloud makes to its customers, provide modest savings under mixed-use, opaque VM workload scheduling. The paper’s focus is myriad, however, and I wish this were actually at least three papers instead of just one. I.e., I would enjoy reading

  • A workload study paper on the applicability of far memory across this wide set of workloads
  • A paper on just zNUMA/EMC and its design rationale and trade-offs, as well as
  • A systems paper that delved deeper into Pond

rather than all these topics glommed into a single ~12 page work.

Some things that I would like to see addressed in follow-on Pond work:

  • How does VM performance (on average) change with a variable PDM (performance degradation margin) greater than 1-5%, and does that larger PDM afford more DRAM savings/reduced stranded memory?
  • Slabs are locked at 1 GB (with the one-time exception); what are the effects of variable pool slab-size allocation?
  • Why are randomwalk for insensitivity training and GBM for untouched memory the appropriate ML model choices, i.e., do they just converge the fastest over the 75-day eval window? Best training-time-to-inference-time tradeoff? Something else?
  • How does Pond handle failure of hardware (does it)?
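On the slab-size question in particular, the tradeoff is easy to state: coarser slabs mean less allocation metadata but more internal fragmentation, since requests round up to whole slabs. A back-of-envelope sketch (numbers invented, not from the paper):

```python
# Toy illustration of the slab-granularity question: with fixed-size
# slabs a pool request is rounded up to whole slabs, and the remainder
# is wasted (internal fragmentation). Numbers are illustrative only.

import math

def slabs_needed(request_gb, slab_gb):
    # A request always consumes an integral number of slabs.
    return math.ceil(request_gb / slab_gb)

def internal_waste(request_gb, slab_gb):
    # Memory handed out beyond what was actually requested.
    return slabs_needed(request_gb, slab_gb) * slab_gb - request_gb

# The same 2.5 GB pool request under shrinking slab sizes: waste drops,
# but the number of slabs to track (and fragment) grows.
for slab in (1, 0.5, 0.25):
    print(slab, slabs_needed(2.5, slab), internal_waste(2.5, slab))
```

Whether the metadata and fragmentation cost of finer slabs is worth the reclaimed waste at Azure’s fleet scale is exactly the kind of sensitivity study I’d want to see.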

Figures 2, 4, 7, 11, 13, and 21 were the most interesting for me.

How do I do this?

Hi there, we’re Jackrabbit Labs (JRL for short). We build open source CXL things then help you use them. If you’re finding yourself scratching your head on day 2 of your CXL deployment, please reach out. We’d love to get to know you.