Did IBM Just Preview The Future of Caches?
by Dr. Ian Cutress on September 2, 2021 9:00 AM EST
At Hot Chips last week, IBM announced its new mainframe Z processor. It’s a big, interesting piece of kit that I want to do a wider piece on at some point, but there was one feature of that core design that I want to pluck out and focus on specifically. IBM Z is known for having big L3 caches, backed with a separate global L4 cache chip that operates as a cache between multiple sockets of processors. With the new Telum chip, IBM has done away with that: there’s no L4, but interestingly enough, there’s no L3 either. What they’ve done instead might be an indication of the future of on-chip cache design.
Caches: A Brief Primer
Any modern processor has multiple levels of cache associated with it. These are separated by capacity, latency, and power – the fastest cache closest to the execution ports tends to be small, and then further out we have larger caches that are slightly slower, and then perhaps another cache before we hit main memory. Caches exist because the CPU core wants data NOW, and if it was all held in DRAM it would take 300+ cycles each time to fetch data.
A modern CPU core will predict what data it needs in advance, bring it from DRAM into its caches, and then the core can grab it a lot faster when it needs it. Once the cache line is used, it is often ‘evicted’ from the closest level cache (L1) to the next level up (L2), or if that L2 cache is full, the oldest cache line in the L2 will be evicted to an L3 cache to make room. It means that if that data line is ever needed again, it isn’t too far away.
An example of L1, L2, and a shared L3 on AMD's First Gen Zen processors
There is also the scope of private and shared caches. A modern processor design has multiple cores, and inside those cores will be at least one private cache (the L1) that only that core has access to. Above that, a cache may either be a private cache still local to the core, or a shared cache, which any core can use. An Intel Coffee Lake processor for example has eight cores, and each core has a 256 KB private L2 cache, but chip wide there is a 16 MB shared L3 between all eight cores. This means that if a single core wants to, it can keep evicting data from its smaller L2 into the large L3 and have a pool of resources if that data wants to be reused. Not only this, but if a second core needs some of that data as well, they can find it in the shared L3 cache without having to write it out to main memory and grab it there. To complicate the matter, a 'shared' cache isn't necessarily shared between all cores, it might only be shared between a specific few.
The end result is that caches help reduce time to execution, and bring in more data from main memory in case it is needed or as it is needed.
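To make the eviction chain from the primer concrete, here is a minimal sketch in Python. It assumes a fully exclusive ('victim') hierarchy with oldest-first replacement, which is a simplification: real designs differ in inclusivity and replacement policy, and the class name and capacities below are made up for illustration.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy exclusive hierarchy: a line evicted from level N moves to level N+1."""

    def __init__(self, capacities):
        # Capacities are in cache lines, nearest level first, e.g. (2, 4, 8).
        self.capacities = list(capacities)
        self.levels = [OrderedDict() for _ in capacities]

    def _insert(self, level, addr):
        if level >= len(self.levels):
            return  # fell out of the last level; only main memory holds it now
        cache = self.levels[level]
        cache[addr] = True
        if len(cache) > self.capacities[level]:
            victim, _ = cache.popitem(last=False)  # oldest line in this level
            self._insert(level + 1, victim)

    def access(self, addr):
        """Return where the line was found: 'L1', 'L2', ... or 'DRAM'."""
        for i, cache in enumerate(self.levels):
            if addr in cache:
                del cache[addr]
                self._insert(0, addr)  # bring the line back next to the core
                return f"L{i + 1}"
        self._insert(0, addr)
        return "DRAM"


caches = VictimHierarchy((1, 1, 1))
for line in ("A", "B", "C"):
    caches.access(line)      # three cold misses push A down to L3
print(caches.access("A"))    # L3
```

The point of the sketch is the last line: a line that was pushed out of L1 and L2 is still found on chip, rather than costing a trip to DRAM.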
Tradeoffs
With that in mind, you might ask why we don’t see 1 GB L1 or L2 caches on a processor. It’s a perfectly valid question. There are a number of elements at play here, involving die area, utility, and latency.
The die area is an easy one to tackle first – ultimately there may only be a defined space for each cache structure. When you design a core in silicon, there may be a best way to lay the components of the core out to have the fastest critical path. But the cache, especially the L1 cache, has to be close to where the data is needed. Designing that layout with a 4 KB L1 cache in mind is going to be very different if you want a large 128 KB L1 cache instead. So there is a tradeoff there – beyond the L1, the L2 cache is sometimes a large consumer of die space, and while it (usually) isn’t as constrained by the rest of the core design, it still has to be balanced with what is needed on the chip. Any large shared cache, whether it ends up as a level 2 cache or a level 3 cache, can often be the biggest part of the chip, depending on the process node used. Sometimes we only focus on the density of the logic transistors in the core, but with super large caches, perhaps the cache density is more important in what process node ends up being used.
Utility is also a key factor – we mostly speak about general purpose processors here on AnandTech, especially those built on x86 for PCs and servers, or Arm for smartphones and servers, but there are lots of dedicated designs out there whose role is for a specific workload or task. If all a processor core needs to do is process data, for example a camera AI engine, then that workload is a well-defined problem. That means the workload can be modelled, and the size of the caches can be optimized to give the best performance/power. If the purpose of the cache is to bring data close to the core, then any time the data isn’t ready in the cache, it’s called a cache miss – the goal of any CPU design is to minimize cache misses in exchange for performance or power, and so with a well-defined workload, the core can be built around the caches needed for an optimum performance/cache miss ratio.
Latency is also a large factor in how big caches are designed. The more cache you have, the longer it takes to access – not only because of the physical size (and distance away from the core), but because there’s more of it to search through. For example, small modern L1 caches can be accessed in as little as three cycles, whereas large modern L1 caches may be five cycles of latency. A small L2 cache can be as low as eight cycles, whereas a large L2 cache might be 19 cycles. There’s a lot more that goes into cache design than simply bigger equals slower, and all of the big CPU design companies will painstakingly work to shave those cycles down as much as possible, because often a latency saving in an L1 cache or an L2 cache offers good performance gains. But ultimately if you go bigger, you have to cater for the fact that the latency will often be larger, but your cache miss rate will be lower. This comes back to the previous paragraph talking about defined workloads. We see companies like AMD, Intel, Arm and others doing extensive workload analysis with their big customers to see what works best and how their core design should develop.
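The bigger-but-slower tradeoff can be put into numbers with a simple average access time model. The sketch below uses the L2 latencies quoted above (8 and 19 cycles) and the 300-cycle DRAM figure from the primer; the hit rates and the 4-cycle L1 are invented purely for illustration and are not measurements of any real chip.

```python
def average_access_cycles(levels, memory_cycles):
    """levels: (hit_latency_cycles, hit_rate) pairs, nearest cache first."""
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return total + reach * memory_cycles  # whatever is left goes to DRAM


MEMORY_CYCLES = 300
L1 = (4, 0.90)  # assumed: 4-cycle L1 that catches 90% of accesses

# Assumed hit rates: the small L2 catches half of the L1 misses, the large one 90%.
small_l2 = average_access_cycles([L1, (8, 0.50)], MEMORY_CYCLES)
large_l2 = average_access_cycles([L1, (19, 0.90)], MEMORY_CYCLES)
print(f"small L2: {small_l2:.2f} cycles, large L2: {large_l2:.2f} cycles")
# small L2: 19.00 cycles, large L2: 8.31 cycles
```

Under these made-up hit rates the slower, larger L2 wins comfortably, because every avoided DRAM trip is worth far more than the extra 11 cycles. With a workload that already fits in the small L2, the result flips, which is why the workload analysis matters.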
So What Has IBM Done That is So Revolutionary?
In the first paragraph, I mentioned that IBM Z is their big mainframe product – this is the big iron of the industry. It’s built better than your government-authorized nuclear bunker. These systems underpin the critical elements of society, such as infrastructure and banking. Downtime of these systems is measured in milliseconds per year, and they have fail safes and fail overs galore – with a financial transaction, when it is made, it has to be committed to all the right databases without fail, or even in the event of physical failure along the chain.
This is where IBM Z comes in. It’s incredibly niche, but has an amazing design.
In the previous generation z15 product, there was no concept of a 1 CPU = 1 system product. The base unit of IBM Z was a five processor system, using two different types of processor. Four Compute Processors (CP) each housed 12 cores and 256 MB of shared L3 cache in 696 mm² built on 14nm running at 5.2 GHz. These four processors were split into two pairs, but both pairs were also connected to a System Controller (SC), also 696 mm² and on 14nm, which held 960 MB of shared L4 cache for data between all four processors.
Note that this system did not have a ‘global’ DRAM, and each Compute Processor had its own DDR backed equivalent memory. IBM would then combine this five processor ‘drawer’ with four others for a single system. That means a single IBM z15 system was 25 x 696 mm² of silicon, with 20 x 256 MB of L3 cache between them, but also 5 x 960 MB of L4 cache, connected in an all-to-all topology.
IBM z15 is a beast. But the next generation IBM Z, called IBM Telum rather than IBM z16, takes a different approach to all that cache.
IBM, Tell’em What To Do With Cache
The new system does away with the separate System Controller with the L4 cache. Instead we have what looks like a normal processor with eight cores. Built on Samsung 7nm and at 530 mm², IBM packages two processors together into one, and then puts four packages (eight CPUs, 64 cores) into a single unit. Four units make a system, for a total of 32 CPUs / 256 cores.
On a single chip, we have eight cores. Each core has 32 MB of private L2 cache, which has a 19-cycle access latency. This is a long latency for an L2 cache, but it’s also 64x bigger than Zen 3's L2 cache, which has a 12-cycle latency.
Looking at the chip design, all that space in the middle is L2 cache. There is no L3 cache. No physical shared L3 for all cores to access. Without a centralized cache chip as with z15, this would mean that in order for code that has some amount of shared data to work, it would need a round trip out to main memory, which is slow. But IBM has thought of this.
The concept is that the L2 cache isn’t just an L2 cache. On the face of it, each L2 cache is indeed a private cache for each core, and 32 MB is stonkingly huge. But when it comes time for a cache line to be evicted from L2, either purposefully by the processor or due to needing to make room, rather than simply disappearing it tries to find space somewhere else on the chip. If it finds a space in a different core’s L2, it sits there, and gets tagged as an L3 cache line.
What IBM has implemented here is the concept of shared virtual caches that exist inside private physical caches. That means the L2 cache and the L3 cache become the same physical thing, and that the cache can contain a mix of L2 and L3 cache lines as needed from all the different cores depending on the workload. This becomes important for cloud services (yes, IBM offers IBM Z in its cloud) where tenants do not need a full CPU, or for workloads that don’t scale exactly across cores.
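As a way of picturing lines tagged L2 and L3 living in the same physical arrays, here is a toy model in Python. It is emphatically not IBM's implementation: the placement rule (pick the emptiest other L2, give up if it is full), the retagging on a local hit, and every name in the code are assumptions made for the sketch.

```python
from collections import OrderedDict


class VirtualL3Chip:
    """Toy chip: private L2s whose spare slots hold other cores' evicted lines."""

    def __init__(self, cores=8, lines_per_l2=4):
        self.capacity = lines_per_l2
        # One array per core, mapping address -> tag ("L2" or "L3").
        self.l2 = [OrderedDict() for _ in range(cores)]

    def _place_as_l3(self, evicting_core, addr):
        others = [c for c in range(len(self.l2)) if c != evicting_core]
        if not others:
            return None
        # Assumption: the emptiest other L2 takes the line if it has a free slot.
        target = min(others, key=lambda c: len(self.l2[c]))
        if len(self.l2[target]) < self.capacity:
            self.l2[target][addr] = "L3"
            return target
        return None  # Assumption: no room anywhere, so the line leaves the chip.

    def load(self, core, addr):
        """Return 'L2', 'virtual L3' or 'miss' for a load issued by `core`."""
        own = self.l2[core]
        if addr in own:
            own.move_to_end(addr)
            own[addr] = "L2"  # Assumption: a local hit retags the line as L2.
            return "L2"
        result = "miss"
        for other in self.l2:
            if other is not own and addr in other:
                del other[addr]  # pull the line over from the other core's L2
                result = "virtual L3"
                break
        if len(own) >= self.capacity:
            victim, _ = own.popitem(last=False)  # oldest local line makes room
            self._place_as_l3(core, victim)
        own[addr] = "L2"
        return result


chip = VirtualL3Chip(cores=2, lines_per_l2=2)
for line in "ABC":
    chip.load(0, line)       # loading C evicts A into core 1's L2, tagged L3
print(dict(chip.l2[1]))      # {'A': 'L3'}
print(chip.load(0, "A"))     # virtual L3
```

Note that core 1 never asked for line A, yet its private L2 ends up holding it on behalf of core 0. That is the whole trick: idle capacity in one core's L2 becomes shared capacity for everyone else.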
This means that the whole chip, with eight private 32 MB L2 caches, could also be considered as having a 256 MB shared ‘virtual’ L3 cache. In this instance, consider the equivalent for the consumer space: AMD’s Zen 3 chiplet has eight cores and 32 MB of L3 cache, and only 512 KB of private L2 cache per core. If it implemented a bigger L2/virtual L3 scheme like IBM, we would end up with 4.5 MB of private L2 cache per core, or 36 MB of shared virtual L3 per chiplet.
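The Zen 3 thought experiment is just a pooling calculation, sketched below. The function name is hypothetical, and this is only the arithmetic from the paragraph above, not anything AMD has proposed.

```python
def pooled_l2_scheme(cores, l2_kb_per_core, shared_l3_mb):
    """Fold a physical L3 into the private L2s.

    Returns (private L2 in MB per core, shared virtual L3 in MB per chiplet).
    """
    total_mb = cores * l2_kb_per_core / 1024 + shared_l3_mb
    return total_mb / cores, total_mb


# Zen 3 chiplet: eight cores, 512 KB of L2 each, 32 MB of shared L3.
print(pooled_l2_scheme(8, 512, 32))  # (4.5, 36.0)
```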
This IBM Z scheme has the lucky advantage that if a core just happens to need data that sits in virtual L3, and that virtual L3 line just happens to be in its private L2, then the latency of 19 cycles is much lower than what a shared physical L3 cache would be (~35-55 cycles). However what is more likely is that the virtual L3 cache line needed is in the L2 cache of a different core, which IBM says incurs an average 12 nanosecond latency across its dual direction ring interconnect, which has a 320 GB/s bandwidth. 12 nanoseconds at 5.2 GHz is ~62 cycles, which is going to be slower than a physical L3 cache, but the larger L2 should mean less pressure on L3 use. And because the L2 and L3 are so large and the split between them so flexible, depending on the workload, overall latency should be lower and workload scope increased.
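The nanoseconds-to-cycles conversion is worth spelling out, since it is what lets the 12 ns figure be compared against cycle counts. The sketch assumes the same 5.2 GHz clock used in the text for the conversion; the helper name is made up.

```python
def ns_to_cycles(latency_ns, clock_ghz):
    """One GHz is one cycle per nanosecond, so cycles = ns * GHz."""
    return latency_ns * clock_ghz


remote_l2 = ns_to_cycles(12, 5.2)   # virtual L3 line held in another core's L2
local_l2 = 19                       # virtual L3 line that happens to be local
physical_l3 = (35, 55)              # typical shared physical L3 range, in cycles

print(round(remote_l2))             # 62
print(local_l2 < physical_l3[0] < physical_l3[1] < remote_l2)  # True
```

So the ordering is: local hit, then a conventional L3, then a remote virtual L3 hit. The scheme pays off only if the bigger L2 makes those remote hits rare enough.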
But it doesn’t stop there. We have to go deeper.
For IBM Telum, we have two chips in a package, four packages in a unit, four units in a system, for a total of 32 chips and 256 cores. Rather than having that external L4 cache chip, IBM is going a stage further and enabling each private L2 cache to also house the equivalent of a virtual L4.
This means that if a cache line is evicted from the virtual L3 on one chip, it will go find another chip in the system to live on, and be marked as a virtual L4 cache line.
This means that from a singular core perspective, in a 256 core system, it has access to:
- 32 MB of private L2 cache (19-cycle latency)
- 256 MB of on-chip shared virtual L3 cache (+12ns latency)
- 8192 MB / 8 GB of off-chip shared virtual L4 cache (+? latency)
Technically from a single core perspective those numbers should probably be 32 MB / 224 MB / 7936 MB because a single core isn’t going to evict an L2 line into its own L2 and label it as L3, and so on.
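Both sets of numbers fall out of the same multiplication, sketched here. The function and key names are hypothetical; the inputs are the figures quoted above (32 MB per core, eight cores per chip, 2 x 4 x 4 chips per system).

```python
def telum_cache_view(l2_mb=32, cores_per_chip=8, chips=2 * 4 * 4):
    """Return (L2, virtual L3, virtual L4) capacities in MB, two ways."""
    virtual_l3 = l2_mb * cores_per_chip   # every L2 on the chip
    virtual_l4 = virtual_l3 * chips       # every L2 in the system
    return {
        # Headline figures, counting a core's own L2 in every pool.
        "nominal": (l2_mb, virtual_l3, virtual_l4),
        # What one core can actually spill into: other cores, other chips.
        "single_core": (l2_mb, virtual_l3 - l2_mb, virtual_l4 - virtual_l3),
    }


view = telum_cache_view()
print(view["nominal"])      # (32, 256, 8192)
print(view["single_core"])  # (32, 224, 7936)
```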
IBM states that using this virtual cache system, there is the equivalent of 1.5x more cache per core than the IBM z15, but also improved average latencies for data access. Overall IBM claims a per-socket performance improvement of >40%. Other benchmarks are not available at this time.
How Is This Possible?
Magic. Honestly, the first time I saw this I was a bit astounded as to what was actually going on.
In the Q&A following the session, Dr. Christian Jacobi (Chief Architect of Z) said that the system is designed to keep track of data on a cache miss, that it uses broadcasts, and that memory state bits are tracked for broadcasts to external chips. These go across the whole system, and when data arrives the system makes sure it can be used and confirms that all other copies are invalidated before working on the data. In the Slack channel as part of the event, he also stated that lots of cycle counting goes on!
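The 'all other copies are invalidated before working on the data' part is the classic write-invalidate idea, which can be sketched in a few lines. To be clear, this is a generic textbook-style illustration and not IBM's protocol: the state tracking, the broadcast mechanics, and all names here are assumptions.

```python
class BroadcastCoherence:
    """Toy write-invalidate scheme: a writer must be the only holder of a line."""

    def __init__(self, chips):
        # Per-chip set of addresses currently cached there.
        self.holders = [set() for _ in range(chips)]

    def read(self, chip, addr):
        # Any number of chips may hold a read-only copy at the same time.
        self.holders[chip].add(addr)

    def write(self, chip, addr):
        """Broadcast an invalidate, then write. Returns the chips that dropped a copy."""
        invalidated = []
        for other, lines in enumerate(self.holders):
            if other != chip and addr in lines:
                lines.discard(addr)
                invalidated.append(other)
        # Only now, with every other copy gone, is it safe to modify the line.
        self.holders[chip].add(addr)
        return invalidated


system = BroadcastCoherence(chips=4)
system.read(0, "X")
system.read(2, "X")
print(system.write(1, "X"))  # [0, 2]
```

In a 32-chip system every such broadcast has to reach every chip and come back, which is presumably where all that cycle counting comes in.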
I’m going to stick with magic.
Truth be told, a lot of work goes into something like this, and there are likely still a lot of considerations to put forward to IBM about its operation, such as active power, or whether caches can be powered down in idle or even be excluded from accepting evictions altogether to guarantee performance consistency of a single core. It makes me wonder what might be relevant and possible in x86 land, or even with consumer devices.
I’d be remiss in talking caches if I didn’t mention AMD’s upcoming V-cache technology, which is set to enable 96 MB of L3 cache per chiplet rather than 32 MB by adding a vertically stacked 64 MB L3 chiplet on top. But what would it mean for performance if that chiplet wasn’t L3, but was considered an extra 8 MB of L2 per core instead, with the ability to accept virtual L3 cache lines?
Ultimately I spoke with some industry peers about IBM’s virtual caching idea, with comments ranging from ‘it shouldn’t work well’ to ‘it’s complex’ and ‘if they can do it as stated, that’s kinda cool’.
149 Comments
FunBunny2 - Thursday, September 2, 2021
"although I don't know if this is the best solutions either"
mostly because I worked on them decades ago, even after they were antiques, the TI-990 machines had a really radical design - no instruction or data registers in the cpu, everything done in memory. the justification at the time was that processor and memory cycle times were closer enough that load/store (no recollection if that term even existed then) was sub-optimal.
the next logical step, of course, is to eliminate memory as temporary store altogether. Nirvana.
while Optane, either in absolute performance or marketing, hasn't gotten traction as a direct single-level datastore (i.e. all reads and writes are to durable storage), a baby step of consolidating caches may hint that some folks in the hardware side are looking in that direction.
OSes will need to be modified, perhaps heavily, to work with such hardware. another olden days machine that provides some guidance is the OS/400 as originally designed. it operated a SQL-ish database as datastore without a filesystem protocol (a filesystem was later grafted on). an 'object' datastore without the filesystem protocol eliminates more impedance matching.
the future may look a lot different.
Threska - Thursday, September 2, 2021
I imagine Pmem is going to have to deal with security.
https://www.snia.org/education/what-is-persistent-...
mode_13h - Thursday, September 2, 2021
Memory-to-memory architectures made more sense when CPUs ran so slowly that DRAM accesses only took a handful of clock cycles. These days, memory-to-memory would be completely uncompetitive, unless your CPU is doing some kind of memory-renaming and internally remapping them to registers.
You can't just forego DRAM and use Optane instead. It would need to be several orders of magnitude more durable than it currently is.
However, Intel has been working on using Optane from userspace (i.e. without kernel filesystem overhead). But, that's in *addition* to DRAM - not as a complete substitute for it.
mode_13h - Thursday, September 2, 2021
> since AMD and Intel moved to the large L3 designs, i've felt that it wasn't ideal.
Do you want lots of cores, with lots of IPC, running at high clocks? If so, then you need bandwidth. And you need it to scale faster than DRAM has been. Scaling the speed and size of caches is the way to do that.
If not, AMD wouldn't have been able to tout such impressive gains, in real-world apps, by simply scaling up L3 from 32 MB to 96 MB.
Of course, another way to do it is with in-package HBM-type memory, which comes at some latency savings and increased bandwidth vs. DDR memory sitting on DIMMs.
A yet more radical approach is to reduce the burden on caches, by using a software-managed on-chip memory. This is something you'll find in some GPUs and more specialty processors, but places a lot more burdens and assumptions on the software. Going to a private, direct-mapped memory avoids the latency and energy tax of cache lookups and maintaining cache coherency.
FunBunny2 - Thursday, September 2, 2021
Question for Ian (or anyone at least as close to Big Blue):
In the olden days, IBM monitored its installs, seeking the most used instructions, data flows, and the like; mostly to optimize the machine for COBOL (the 360 meme, one machine for science and business, died almost immediately) applications.
is this 'radical' cache structure the result of customer monitoring, or (like Apple) IBM is telling the customer base 'we don't care what you think, this is what you need'?
Tomatotech - Thursday, September 2, 2021
"In the olden days, IBM monitored its installs, seeking the most used instructions, data flows, and the like; mostly to optimize the machine for COBOL (the 360 meme, one machine for science and business, died almost immediately) applications.
is this 'radical' cache structure the result of customer monitoring, or (like Apple) IBM is telling the customer base 'we don't care what you think, this is what you need'?"
I suspect Apple has far greater insight into the codepaths* its customers run than IBM ever did. The iOS phones are absolute miracles of OS and hardware working together in lockstep to achieve more with less resources (or consume less battery power) than any other phone maker on the planet.
Apple looks to be repeating this achievement with their M1 Macs.
*staying well away from the whole CSAM issue. We're just talking about IBM / Apple tweaking their OS/ hardware to maximise the efficiency of their customers' most highly used codepaths / dataflows.
Oxford Guy - Friday, September 3, 2021
Miraculous voicemail that won’t delete and auto-defect that can’t be disabled. Want to turn off voicemail transcription? You may be able to use a kludge but there is no actual normal setting to give users direct control. This is normal for Apple in recent times. Strip away user control and be hailed for efficient program CPU utilization. Stripping features from a program is one way to speed it up but what about the efficiency of the user’s workflow?
I’d gladly trade a bit of battery life for a system that has more respect for the user. But Apple has other plans. It even wants to play cop with people’s phone. Warrantless surveillance that Scientific American warns will lead to the abuse of a vulnerable minority. It also warns that this is a brainwashing of young people to believe they have no right to privacy. Big brother Apple is going to monitor you, not just government agencies. The corporate-government complex’s dissolving of the line between corporation and government continues at speed. A $2 trillion valuation doesn’t happen without gifts to those in power.
GeoffreyA - Sunday, September 5, 2021
No matter what Apple does, people will still worship before the Fruit Shrine.
Oxford Guy - Tuesday, September 7, 2021
The same goes for MS.
Windows 11 offers consumers entropy rather than value. It will be a success for MS nonetheless.
GeoffreyA - Tuesday, September 7, 2021
Oh, yes. Lost faith in Microsoft a long time ago, though they have pulled up their socks in the last few years, somewhat. As for 11, I'm just going to stick with 10 as long as I can. If I've got to use it for another decade, I've got no problem with that.