
Global OCP Summit: Kubernetes on CXL

Author
Grant Mackey
CTO
OCP 2024 - This article is part of a series.
Part 1: This Article

Come watch our talk from OCP 2024

This year Jackrabbit Labs partnered with Uber, SK hynix, and Intel under the CMS1 group banner at OCP to bring the community something really, really special: Extra Memory™2. In the video below, we explain how CXL memory expanders like the SK hynix Niagara MHD3 or others can transparently plug into Kubernetes for use today.

Have a recent Linux kernel? Run Kubernetes? Want extra memory on your servers so you don’t have to buy more servers? This video is gonna make you happy. Kubernetes and CXL with no special sauce, just our orchestration platform and know-how. Enjoy!

*Small side note: the audio volume varies greatly from mic to mic. You may need to turn up the volume in the second half of the video.

Transcript

[Intro/Agenda]

All right, hello everyone, this is Matthew (SK hynix). I’m happy to share our great open collaboration, thanks to Grant (JRL), Vikrant (Uber), and Reddy (Intel).

So what is this? It’s a collaboration across every layer of the system stack: Uber’s applications and usage, Jackrabbit Labs’ cloud orchestration software, and SK hynix’s CXL-based memory solution, built on Intel’s recent CPU architecture. All of our members play a critical role in proving the feasibility and value of this work in their professional sectors, so, you know, OCP is the right place to bring these technologies into a real system.

Today we are introducing a composable memory architecture with Kubernetes fabric-attached memory orchestration. Today’s agenda promises an exciting blend of groundbreaking solutions and insightful messages about composable memory systems, especially focusing on a multi-host CXL memory solution. Based on the OCP CMS logical and physical architecture, we have implemented a feasible hardware and software solution for memory pooling through this collaboration, and we are showing it now.

An open-source-software-based composable memory systems environment definitely paves the way for straightforward, simple-to-use, and efficient data movement over CXL fabrics. Through this type of infrastructure, by leveraging CXL 2.0 and 3.x beyond CXL 1.1, memory pooling that mitigates stranded memory becomes practical: multiple hosts can dynamically allocate and deallocate their portions of CXL memory according to each node’s memory usage by means of CXL spec features like the Dynamic Capacity Device (DCD).

[CMS – Logical Architecture Overview]

Actually, this work has already been done in the OCP CMS workstream, and this is captured in the OCP CMS white papers. As you know, CXL is a well-recognized interconnect for composable memory architecture, with current and future expected usages. At a higher level there are hosts with firmware, operating systems, and a virtual machine manager, essentially consuming local DRAM, remote CXL direct-attached DRAM, and multi-headed memories. Last but not least, there is the data center memory fabric manager, which stitches all of these pieces together and works with the orchestrator to provision and de-provision memory. You can read more about the logical architecture in the OCP CMS white papers.

[Logical Architecture – Fabric Manager (DCMFM)]

CXL 2.0 provides that device memory can be allocated to multiple hosts through a CXL switch, and for better connectivity CXL 3.x provides directed peer-to-peer device memory access through multi-level CXL switches, which removes a bottleneck because traffic no longer has to go through the host (peer-to-peer endpoint communication). The other piece is the data center memory fabric manager, also known as the CMS platform orchestrator, which focuses on supporting existing data-center-scale resource schedulers via familiar fabric manager APIs, and the CCI for handling composable memory operations.

[CMS Example – Pooled Memory & Kubernetes]

Well, what is the idea behind this collaboration, and what is its impact? To answer that, we can reason through it with the six principles, the five Ws and one H. From Uber’s data center usage you know that stranded memory is a real pain point in Kubernetes environments. So, as one solution to this pain point, we built a composable memory system with SK hynix’s FPGA-based, real CXL pooled memory prototype and Jackrabbit Labs’ cloud orchestration software for memory pooling, under Uber’s data center usage.

[Details of Kubernetes Cluster Environment]

This is a high-level system diagram composed of multi-host servers as Kubernetes workers, an orchestration server as the Kubernetes master, and the CXL pooled memory solution. We have dubbed SK hynix’s CXL pooled memory prototype Niagara. Niagara can connect to a maximum of eight hosts and can support a maximum of one terabyte of capacity using four channels. To support CXL-compatible DCD functionality, Niagara consists of two parts: a pooled memory manager and a pooled memory controller. On request from the separate orchestrator, Niagara’s pooled memory manager sends DCD commands to the pooled memory controller, which in turn plays the important role of supporting DCD functionality to allocate and deallocate memory blocks. While we use the FPGA-based CXL pooled memory prototype to support DCD functionality, we implemented the software stack interfaces for the host, the CMS platform orchestrator, and the pooled memory management controller. I will hand it over to Grant, who will talk about mapping this architecture to Kubernetes with cloud orchestration software and the FPGA-based CXL pooled memory prototype. It’s your turn, Grant.

Yeah, thank you. I want to thank Uber for their help talking through how Kubernetes would like to interface with a composable memory architecture, like, ‘what’s a good way to expose this type of platform?’, and SK hynix for providing the Open Innovation Lab and access to the Niagara and such, so we can show a real, functional demo of how these technologies can be useful for application end users.

[DCMFM – Kubernetes PoC]

So, Vikrant made a comment about how his exposure here is proportional to the number of years he’s been doing this. Last year, he gave a presentation here in this forum and said ‘wouldn’t this be cool’ if Kubernetes could talk to the DCMFM and schedule pods on CMS without having to worry about how the underlying memory technology works. And so we did that. (Audience) Nice. (/audience) This is in the Innovation Village now; we’ve been running it since Tuesday. It’s essentially bone-stock Kubernetes, vanilla Kubernetes, running on a vanilla Linux kernel, running workloads on a CXL expander. (Audience applause) Thanks. So no off-the-shelf modifications. The Niagara does have DCD functionality, and there are some special kernel bits to get that functional because the hardware is not available, but the demo itself is stock. We’re not using those mechanisms, there’s no reason why we can’t, but we are using just standard Linux kernel stuff, if it’s 6.3 or newer you’re good to go.

The implementation bits are pretty straightforward. We have a few things running on the hosts themselves: a couple of orchestrator daemons that just manage the memory visible to the hosts, and a control point for those daemons running on the different worker nodes. The rest of it is just deployed into Kubernetes. There’s a monitor pod, which has a Kubernetes lifecycle, it’s robust, blah blah blah, restarts, and it serves as the communication point between the daemons running on the bare-metal hardware and the Kubernetes scheduler itself. And then your applications, your pods, run unmodified.
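To make that shape a bit more concrete, here is a minimal sketch of what a monitor Deployment along those lines could look like. This is not the actual Jackrabbit Labs manifest; the namespace, image, port, and daemon endpoints below are illustrative placeholders.

```yaml
# Hypothetical sketch only: namespace, image, port, and endpoints are
# placeholders, not the real Jackrabbit Labs orchestrator artifacts.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cxl-monitor
  namespace: cxl-orchestration            # assumed namespace for the orchestration pieces
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cxl-monitor
  template:
    metadata:
      labels:
        app: cxl-monitor
    spec:
      hostNetwork: true                   # lets the pod reach the daemons on the bare-metal hosts
      containers:
      - name: monitor
        image: registry.example.com/cxl-monitor:latest   # placeholder image
        env:
        - name: DAEMON_ENDPOINTS          # hypothetical: where the bare-metal daemons listen
          value: "10.0.0.11:7000,10.0.0.12:7000"
        ports:
        - containerPort: 8080             # placeholder control/status endpoint
```

Because it is a normal Deployment, Kubernetes handles restarts and the rest of the pod lifecycle for it, which is exactly the “robust, restarts” behavior described above.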

Init pods are a very familiar thing in Kubernetes: they’re the setup you need to do before your container applications run. So you just add a few things that say, ‘Hey, I need remote memory, I need CMA memory,’ and then it just gets scheduled. Kubernetes doesn’t know how it’s being scheduled, and it doesn’t care, because it shouldn’t have to, right? You provide a couple of YAML files and everything is good to go. Like I said, come check it out, it’s live. Bring a workload, we’ll run it, it’s awesome.
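As a rough illustration of that flow, a pod spec might carry a small claim request alongside an init step; the annotation key, claim size, and init image below are assumptions for illustration, not the orchestrator’s actual interface.

```yaml
# Hypothetical illustration: the annotation key, claim size, and init image
# are placeholders for whatever the orchestrator actually expects.
apiVersion: v1
kind: Pod
metadata:
  name: my-workload
  annotations:
    cms.example.com/memory-claim: "16Gi"          # assumed way of asking for remote/CXL memory
spec:
  initContainers:
  - name: request-remote-memory
    image: registry.example.com/cms-claim-init:latest   # placeholder init image that files the claim
  containers:
  - name: app
    image: registry.example.com/my-app:latest     # the application itself runs unmodified
```

Whatever the exact keys are, the point stands: the application container is untouched, and only the surrounding metadata and init step change.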

[Composable Memory System for K8s Cluster]

This is the larger overview, so this is the setup that we’re running right now. We’ve got two Emerald Rapids platforms and the Niagara system running. They’re both connected through a (PCIe) Gen 4 x8 link, the cluster orchestration mechanism is Kubernetes, and our bits run in between the applications and the EMRs, and that’s pretty much it. So, we’ve kind of covered this, right?

[CMS Demo – kubectl apply composable memory]

I was going to subject you to a video, and then I decided I would subject you to this instead. We’ll run through it; it’s a lot. Essentially, you deploy a familiar-looking YAML file, and what this does is it pulls our orchestrator container, loads it in, and creates a namespace in Kubernetes. You add a little bit to your pod deployment for the init to ask for the claim, and in the highlighted bit at the top, you’ll see it’s sitting at zero zero memory for the NUMA node.
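For flavor, a “familiar-looking” manifest of that sort might combine a Namespace with a Deployment that pulls the orchestrator image. Again, this is a placeholder sketch under assumed names, not the real manifest or registry path.

```yaml
# Placeholder sketch: names and image are illustrative, not the actual
# Jackrabbit Labs orchestrator manifest.
apiVersion: v1
kind: Namespace
metadata:
  name: cxl-orchestration
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cxl-orchestrator
  namespace: cxl-orchestration
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cxl-orchestrator
  template:
    metadata:
      labels:
        app: cxl-orchestrator
    spec:
      containers:
      - name: orchestrator
        image: registry.example.com/cxl-orchestrator:latest   # placeholder image pulled at deploy time
```

You would apply it the usual way (kubectl apply -f orchestrator.yaml) and then add the claim bits to your own pod deployment as shown earlier.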

[CMS Demo – CMS deployed]

You ‘kube apply’ like you always would, and memory shows up. You can check the status of the pods, everything’s running, everything’s deployed, and you just have memory on your server now. Easy peasy, anyone who’s used Kubernetes can do this.

[CMS Demo – Pod deployed with CXL memory]

And then, you know, you can check the status of all your claims from the bare-metal side, from the orchestrator, to make sure that everything’s connected, everything’s running, everything’s happy. You can see that the orchestrator understands the pods and the lifecycle of the pods, so it can garbage-collect these memory resources when the pods exit or get terminated.

[CMS Demo – Benchmarks]

Last thing for this: we ran actual workloads on it, not just some test stuff. Like I said, it’s a (PCIe) Gen 4 x8 interface to DDR4 memory. We ran two benchmarks that Uber found interesting for this work: gobench from Cloudflare, and a Java benchmark, a higher-bandwidth transactional processing workload that mimics an e-commerce store. Gobench ran at parity. We ran gobench on the DDR5 memory and then ran gobench pinned to the Niagara, and because the Go kernel is so small it was basically cached and running in the CPU; even on the Niagara, all the gobench benchmarks, the compression, the decompression, the regex stuff, the HTML string parsing, ran pretty much the same.
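For reference, one plausible way to pin a benchmark pod’s memory to the CXL-backed NUMA node is to wrap the workload in numactl inside the pod spec. This is just an illustrative sketch, not how the demo actually pinned gobench; the NUMA node number, image, and binary path are assumptions, and it presumes numactl is present in the image.

```yaml
# Illustrative only: image, NUMA node number, and binary path are placeholders.
# Assumes numactl is installed in the benchmark image and the CXL expander
# appears as (CPU-less) NUMA node 2 on the host.
apiVersion: v1
kind: Pod
metadata:
  name: gobench-cxl-pinned
spec:
  restartPolicy: Never
  containers:
  - name: gobench
    image: registry.example.com/gobench:latest    # placeholder benchmark image
    command: ["numactl", "--membind=2", "/usr/local/bin/gobench"]   # bind allocations to the CXL node
```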

And then the Java benchmark was constrained by the interface. But you know, you can do simple spherical-cow math and say that if it hadn’t been constrained by the interface, remote memory would have been suitable for these workloads. So we’re pretty happy with where we are on this. [pass the mic to Reddy]

[Takeaways – Why Composable Memory Fabrics]

So, as Grant and Matthew mentioned, we started this last year. We basically said, let’s put together a solution that essentially ‘works’ in Kubernetes. We had a lot of moving parts that weren’t there yet: the CXL buffer and a working software stack, the silicon being ready, the server being ready, the server memory buffer working, the software being ready. Here we are with a deep collaboration among four different companies, plus lots of others in OCP, which really highlights the openness and the innovation that can bring us together. So we have a working Kubernetes end-to-end solution that not only does the control plane but also does the data plane, and you are able to actually launch the workloads. And this whole thing is very transparent; as far as Kubernetes is concerned, it is completely transparent. That’s what we aimed for.

So our goal is to build on this success, to drive more and more memory pooling use cases, and we also want to focus on high availability on top of that.

[Summary and Call to Action]

So I strongly urge you to participate in the development of what the team has already done within the context of Kubernetes: contributions to the Kubernetes community, as well as building the memory orchestration layer, specs, white papers, implementation proof points, collateral, and everything else you can think of, all the way from development to production implementation. We strongly request that you join. Thank you!


  1. Composable Memory Systems, a subgroup of the OCP Server group ↩︎

  2. Turns out you can download more DRAM now ;) ↩︎

  3. Multi-headed Device, an external memory device that can connect to multiple servers simultaneously ↩︎
