SuperComputing and CXL

Grant Mackey

Just this last week was a great SC show in Denver, where the new Top500 list was released. Frontier remains in first place, but the new Aurora (the first half of it, at least) and Eagle systems have knocked Fugaku from second down to fourth. New systems and more cores; such is the churn of Top500 prestige. For more than a decade, however, the HPC community has argued that the Linpack benchmark is no longer representative of the workloads these machines are built for.

Chasing pointers

The HPCG (High Performance Conjugate Gradient) benchmark stresses the performance of the system as a whole, rather than just raw compute. It’s not yet the primary metric for Top500 rankings, but it has at least been part of the disclosed metrics since 2017. Here’s a really topical example of why CXL fabrics are going to be awesome.

Frontier has a sustained compute performance of 1.194 EFlop/s, which is bananas. But that score comes from running Linpack, which is highly partitionable, meaning you’re really benchmarking how fast cores and sockets can do math. The more interesting HPC problems of this decade are not so partitionable, hence the introduction of HPCG. Under HPCG, Frontier’s sustained performance drops to 14.05 PFlop/s, an 85x reduction, which is a different kind of bananas.1

Once an application has to chase a pointer outside of local DRAM, there are few tolerable ways to do it: InfiniBand, Ethernet, or some proprietary memory fabric. InfiniBand is great, once you finally have it set up. Ethernet is Ethernet: reliable, but weird sometimes. Proprietary memory fabrics? Woo, who doesn’t love vendor lock-in on a technology that isn’t widely proven out?

CXL to the rescue

Compute Express Link (CXL) is positioned to disrupt these choices in a big way. There are already plenty of articles and publications lauding the benefits of this open memory fabric standard, and all of the big names have signed on to contribute. CXL will not replace InfiniBand or Ethernet in HPC, but it will augment them. A significant portion of today’s RDMA use cases will suddenly become far-memory accesses, which are much nicer to program for and generally have orders-of-magnitude better latency per access. The point is: no more 85x drops in effective performance. CXL won’t solve it all, but it will make things much better.

How do I do this?

Hi there, we’re Jackrabbit Labs (JRL for short). We build open source CXL things and then help you use them. If you find yourself scratching your head on day 2 of your CXL deployment, please reach out. We’d love to get to know you.