Question DDR5's impact on CPU design

Carfax83 · Apr 22, 2020

With DDR5 on the horizon, we will be seeing a very significant, if not massive boost in available bandwidth for CPUs. Correct me if I'm wrong, but a dual channel DDR5 system should easily eclipse 100GB/s bandwidth. Quad channel memory systems and greater will have even more; over 200GB/s.

So the question, how do you think this is going to impact CPU design? We already have CPUs with tons of cores already that perform very well given the more limited bandwidth they have. What sort of enhancements would CPU architects make to be able to utilize the much higher bandwidth available with DDR5?

More cores is the obvious one, but what about more SIMD units? Or more SMT threads?

VirtualLarry · Apr 22, 2020

Are we intentionally leaving out the question of iGPUs here?

Carfax83 · Apr 22, 2020

VirtualLarry said:
Are we intentionally leaving out the question of iGPUs here?

Kind of yeah. I know that GPUs have an insatiable appetite for bandwidth due to their workloads. But I am more interested in seeing how, or whether CPUs can take advantage of the significant bump in available bandwidth from DDR5.

More cores is the other obvious, but we already have plenty of cores; at least on the consumer side. So I am wondering what CPU architects can do to increase memory bandwidth utilization, with microarchitectural tweaks in particular.

CPUs already have arguably more sophisticated memory performance solutions than GPUs with multi level caches. It would be interesting to see whether DDR5 changes anything.

DrMrLordX · Apr 23, 2020

Carfax83 said:
So the question, how do you think this is going to impact CPU design?

Not that much. Memory bandwidth is not as much of a limiting factor on consumer sockets running consumer workloads. Latency is a much bigger deal.

If we step into the HPC realm, it might affect some workloads. I have pretty limited experience in that department wrt bottleneck profiling for HPC applications. Among the not-exactly-HPC benchmarks that PC nerds like to run on their, I still can't think of many that are heavily bandwidth-dependent.

Soulkeeper · Apr 23, 2020

Over the years I've come to the belief that bandwidth is far more important than people would like to think
Ultimately everything is a matter of reading/writing. A certain percentage of everything is based on how fast things can be read/written. CPU caches and smart designs have hidden much of the fact that memory bandwidth has fallen well behind cpu/gpu design over the past many years.
To answer your question, I doubt it'll impact the cpu design much ouside of the memory controller and a few tweaks here and there. It will help however, Signs of bandwidth holding back the latest 16 thread+ ryzens are already showing.
Comparing memory that is 10% faster than another mem of the same type is completely different than instantly having double the mem bandwidth to play with. I think people will be surprised by certain benchmark results once we see systems with ddr5. Also the changes with burst length, refresh, and bank design should make ddr4->ddr5 much bigger than ddr3->ddr4 was.

Carfax83 · Apr 23, 2020

DrMrLordX said:
Not that much. Memory bandwidth is not as much of a limiting factor on consumer sockets running consumer workloads. Latency is a much bigger deal.

I remember seeing a few posts around here that said that AVX2/AVX-512 was bandwidth limited. I remember expressing incredulity because from what I could see, AVX2/AVX-512 seemed to perform exceptionally when an application was properly optimized with it. Maybe they were referring to AVX2/AVX-512 being bandwidth limited only in high core count CPUs?

So perhaps DDR5 will result in SIMD being more effective and performant, especially in high core count CPUs.

Among the not-exactly-HPC benchmarks that PC nerds like to run on their, I still can't think of many that are heavily bandwidth-dependent.

The only consumer software I can think of that is bandwidth limited is compression software. At any rate, I'm sure DDR5 will deliver some latency improvements as well, just not nearly as much as with bandwidth.

darkswordsman17 · Apr 23, 2020

Will we actually be seeing those bandwidths soon though? It often takes years to see the theoretical bandwidth new generation of DDR offers. And hasn't there been no decrease in latency since DDR2?

Personally I think it'd make more sense to go HBM. The costs are high, but I think it has significant benefits that would make it worthwhile, and most of all it further helps shrink the overall system size and reducing external chip connections also is important for many consumer systems (and Enterprise can more easily afford the HBM cost). Which, I've wondered if they couldn't repurpose RAM slots for SSDs that could offer somewhat close bandwidth but at orders of magnitude larger size plus being non-volatile (with software like OSes being tweaked to limit writes on the SSDs). For enterprise, maybe they move to shared DRAM pools (think of it as like system level I/O die, where you have and an add-in card with multiple sockets sharing access to it, where it serves as cache for inter-chip communication and further improvement to unified memory space). Or maybe they just tweak SSDs to put the RAM there where it serves as cache/buffer for storage. Laptops and OEM systems could shrink. All platforms should see big gains to bandwidth but also latency. You're talking 100-200GB, HBM can bring 1000-2000GB/s. And while there's exceptions, I'd guess that most systems could be tailored for a certain memory size. 16-32GB should be plenty for laptops and desktops for probably another 5 years. With changing the overall system architecture you could probably get away with a relatively modest amount of memory too, so they might be able to stay at 16-32GB HBM for longer period.

My guess is though we see stacked DRAM and traditional DRAM mix. I won't be surprised to see that laptops start losing memory slots entirely (with some stacked and then a single channel soldered on), and maybe see lowering some (Threadripper goes back to quad channel only). And I think that's what DDR5 mostly will do, is let them not have to increase memory channels while other technology develops.

darkswordsman17 · Apr 23, 2020

Soulkeeper said:
Over the years I've come to the belief that bandwidth is far more important than people would like to think
Ultimately everything is a matter of reading/writing. A certain percentage of everything is based on how fast things can be read/written. CPU caches and smart designs have hidden much of the fact that memory bandwidth has fallen well behind cpu/gpu design over the past many years.
To answer your question, I doubt it'll impact the cpu design much ouside of the memory controller and a few tweaks here and there. It will help however, Signs of bandwidth holding back the latest 16 thread+ ryzens are already showing.
Comparing memory that is 10% faster than another mem of the same type is completely different than instantly having double the mem bandwidth to play with. I think people will be surprised by certain benchmark results once we see systems with ddr5. Also the changes with burst length, refresh, and bank design should make ddr4->ddr5 much bigger than ddr3->ddr4 was.

I'm not so sure. I honestly think RAM in general is kinda overstated for most users, and/or lots of software isn't doing much to optimize its memory use as its just relying on outsized memory resources. Which, I'm not saying its not important, and I'd agree, but I'd say that makes HBM or stacked DRAM more important as it'll help latency while boosting overall system throughput and will be key if we want to really take advantage of things like superfast SSDs and even external system communication - think things like Thunderbolt and external GPU.

I think possibly the biggest thing about DDR5 is that it'll seemingly come with ECC by default. Beyond that I think stacked memory and other things are much more interesting as they'll obliterate even DDR5, while enabling benefits like smaller physical system.

I think going HBM (or something offering its benefits) might actually be the key to x86 staying relevant. I don't think its something that ARM could compete with. I think going stacked DRAM like mobile might even be a mistake that hastens ARM taking over more x86 space. I also think that if they don't in enterprise it might give ARM chips the opportunity to do so and eclipse x86.

Carfax83 · Apr 23, 2020

darkswordsman17 said:
Will we actually be seeing those bandwidths soon though? It often takes years to see the theoretical bandwidth new generation of DDR offers. And hasn't there been no decrease in latency since DDR2?

The increase in memory bandwidth should be instantaneous. Even at the same clocks, DDR5 has a significant bandwidth advantage over DDR4. As for latency, it has improved over time with faster memory through speed. DDR4-3200 has lower overall latency than DDR3 2133 for instance. Faster memory speeds have even lower latencies.

Personally I think it'd make more sense to go HBM.

The thing about HBM though is that it's optimized more so for throughput/bandwidth than latency. Latency seems to be more important than bandwidth for consumer applications, so using HBM would be counterintuitive.

And I think that's what DDR5 mostly will do, is let them not have to increase memory channels while other technology develops.

This is a good point. DDR5 will definitely allow more flexibility when it comes to designing memory controllers.

DisEnchantment · Apr 23, 2020

If we exclude RAS and power efficiency gains and concentrate only on performance,
DDR5 will bring big gains to multicore CPUs due to the fact that there are two channels per DIMM. This means that each DIMM can handle two memory access request concurrently. So with a capable IMC it is possible to have quad channel with 2 Memory modules.
The increased clocks will drive down latency considerably provided timings don't increase too much and of course big increase in bandwidth.

HBM and DDR suffer the same issues in terms of too many cycles involved before actual data transfer can happen, too many cycles involved in CAS, RAS and a multitude of constraints like tiny delays after write or read.
HBM has advantage over DDR because it is much wider and has lots of channels. But in terms of absolute latency for very small amounts of data DDR is better due to much higher clocks, but loses when it comes to bulk latency.
HBM is still good for GPUs though because of the overall smaller latency when moving large chunks of data feeding the SIMD units.
However the large number of channels in HBM means than it is better to handle a massive number of cores. If they can ramp up HBM clocks it could be viable in the future.

Atari2600 · Apr 23, 2020

Soulkeeper said:
Over the years I've come to the belief that bandwidth is far more important than people would like to think

There is very little evidence to support that with regards general software.

As others have said, latency is more important to most general applications that display any memory sensitivity.

Back when AMD went from DDR2 to DDR3, the memory controller on the AMD K10 allowed for direct comparisons such as:

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

https://images.anandtech.com/reviews/cpu/amd/phenom2/955/ddr3vsddr2.jpg

So that's a 25% bandwidth improvement delivering an average 2% performance improvement.

Now, bear in mind the context - that was DDR2-1066 MHz & DDR3-1333 MHz servicing a quad core Ph II. DDR4 is routinely running at 3000 MHz now, so your looking at a bandwidth ~2.25x higher. Of course, core counts and core processing power is up, so memory demand is higher - but your only likely to see the 3950X or perhaps also 3900X see the benefit of additional bandwidth - and that will very much depend on application (general consumer applications).

[Of course, move to HPC applications and chuck that out the window - gimme more bandwidth yesterday please!!]

TheELF · Apr 23, 2020

Carfax83 said:
With DDR5 on the horizon, we will be seeing a very significant, if not massive boost in available bandwidth for CPUs. Correct me if I'm wrong, but a dual channel DDR5 system should easily eclipse 100GB/s bandwidth. Quad channel memory systems and greater will have even more; over 200GB/s.

So the question, how do you think this is going to impact CPU design? We already have CPUs with tons of cores already that perform very well given the more limited bandwidth they have. What sort of enhancements would CPU architects make to be able to utilize the much higher bandwidth available with DDR5?

More cores is the obvious one, but what about more SIMD units? Or more SMT threads?

Adding additional ports for data ,additional load/store units and things that compute addresses is what intel/amd will do anyway whenever they feel appropriate.

Intel Sunny Cove Microarchitecture Details

At Intel Architecture Day 2018, we got fresh Intel Sunny Cove microarchitecture details which show why Intel is about to make a much larger jump in per clock performance than we have seen in many generations

www.servethehome.com

Intel Sunny Cover will go from 4-wide to 5-wide allocation while increasing the execution port count from 8 to 10. One of these extra ports is dedicated to storing data (P9) to take advantage of larger caches, the other is for memory access (P2, P8, P3, P7.) Beyond this, there are additional capabilities being added such as another SIMD shuffle (now two) and LEA units (now four.) LEA is often used for computing addresses and general purpose math

moinmoin · Apr 23, 2020

I'm surprised nobody mentioned PCIe so far. When balancing 16 PCIe lanes per channel as Zen systems do up to now, DDR4 is already close at its limit with PCIe 4. PCIe 5 essentially requires DDR5 to not hit a massive bottleneck.

Consumer platforms are not as affected by that so far since AM4 doesn't expose all possible 32 lanes to begin with, leaving some bandwidth headroom.

VirtualLarry · Apr 23, 2020

Atari2600 said:
There is very little evidence to support that with regards general software.

As others have said, latency is more important to most general applications that display any memory sensitivity.

There are DC (distributed computing) science applications, that are bandwidth-limited on some platforms, and with certain core counts. Though, this is basically in the same boat as the HPC crowd, and consumer AM4 platform only being dual-channel, and still supporting 16C/32T CPUs, does get a little bit suffocated at times in terms of memory bandwidth. Hence the reason for ThreadRipper and 4-channel, and later, TRX80 with 8-channel memory bandwidth. (Was TRX80 ever released? I don't remember seeing boards.)

More common applications, such as PC Gaming, are hardly bandwidth-limited, and are instead, generally-speaking, far more latency-bound.

DrMrLordX · Apr 23, 2020

Carfax83 said:
I remember seeing a few posts around here that said that AVX2/AVX-512 was bandwidth limited.

It can be, but it can also be latency limited. It really has to do with how well your CPU and/or platform can effectively prefetch data. Your best performance always comes from loading data in chunks from cache. If the CPU is doing its job well, it has the data you need in L1d by the time the CPU gets around to carrying out calculations, which it can do very quickly thanks to SIMD. So your main memory is only responsible for feeding your prefetch unit. It's not like you're doing one huge aligned sequential read from RAM.

I remember expressing incredulity because from what I could see, AVX2/AVX-512 seemed to perform exceptionally when an application was properly optimized with it. Maybe they were referring to AVX2/AVX-512 being bandwidth limited only in high core count CPUs?

On modern x86 CPUs, it looks like you need enough system bandwidth to feed your prefetch units.

The only consumer software I can think of that is bandwidth limited is compression software. At any rate, I'm sure DDR5 will deliver some latency improvements as well, just not nearly as much as with bandwidth.

It's not that hard to figure out. Take a 3700x system running DDR4-3466 or faster and run some benchmarks with one DIMM in single channel versus normal dual channel and record results. We actually have a lot of notebook benchmarks out there from single-channel systems as well.

Carfax83 · Apr 23, 2020

VirtualLarry said:
More common applications, such as PC Gaming, are hardly bandwidth-limited, and are instead, generally-speaking, far more latency-bound.

As a long time high end PC gamer, when I first read this comment my instinct was to agree with you. But it has been years since I looked at any benchmarks for single channel vs dual channel vs quad channel memory in games. The last time I looked was back in the Sandy bridge days when I had a 3930K, and games were decidedly much more different then being that Xbox 360 and PS3 were still around.

Since that time, games have undergone a revolution with full 64 bit, low level APIs and multithreaded optimization. So I just checked on YouTube for comparisons between dual channel and quad channel memory, and this video really stuck out. He uses a 6800K with either 2x8GB or 4x4GB of DDR4 3000 at CL16, and he tested BF5 and AC Odyssey, two recent popular games. The results astonished me! I'd estimate a 10 to 30% difference in average framerate between dual channel and quad channel modes. That's basically akin to a GPU upgrade in terms of performance.

The 1% lows are even greater, showing more than a 100% increase at times!

Now granted he tested at 1080p with a 1080Ti so he was more CPU limited than anything, but the fact that there is such a huge difference really took me by surprise. More than any other consumer application, modern games tend to push our hardware more than anything else.

So now bandwidth matters for gaming, and not just latency. I've always had in the back of my mind why my PC performs so well in current games despite my aging platform. My 6900K at 4.3ghz with 32GB quad channel DDR4 3400 CL15 CR1 is still more than powerful enough to drive my Titan Xp and make sure everything GPU bottlenecked.

Carfax83 · Apr 23, 2020

DrMrLordX said:
It's not that hard to figure out. Take a 3700x system running DDR4-3466 or faster and run some benchmarks with one DIMM in single channel versus normal dual channel and record results. We actually have a lot of notebook benchmarks out there from single-channel systems as well.

If you look at that YouTube video above, it appears some of the really big modern games are sensitive to bandwidth as well. Very astonishing to me! But I guess I shouldn't really be surprised, as games these days are so much bigger and more parallel than ever before, so memory bandwidth is bound to be more important than it used to be.

DrMrLordX · Apr 23, 2020

Carfax83 said:
If you look at that YouTube video above, it appears some of the really big modern games are sensitive to bandwidth as well. Very astonishing to me! But I guess I shouldn't really be surprised, as games these days are so much bigger and more parallel than ever before, so memory bandwidth is bound to be more important than it used to be.

I would like to see that replicated elsewhere on a modern CPU. I seriously have my doubts as to why quad channel would matter on games that can't even use all the cores of a 6900k. It may have more to do with the number of addressable ranks of memory than actual bandwidth.

For correct testing, I think we should look at a modern system with the same number of memory ranks but a different number of active channels. Like 16GB two ranks single channel vs 32GB two ranks dual channel. It definitely looks like the YT vid didn't take memory ranks into account. He doubled the number of addressable memory ranks going to 4x4GB. Those 8GB DIMMs are single-rank.

Carfax83 · Apr 23, 2020

DrMrLordX said:
I would like to see that replicated elsewhere on a modern CPU. I seriously have my doubts as to why quad channel would matter on games that can't even use all the cores of a 6900k. It may have more to do with the number of addressable ranks of memory than actual bandwidth.

For correct testing, I think we should look at a modern system with the same number of memory ranks but a different number of active channels. Like 16GB two ranks single channel vs 32GB two ranks dual channel.

BF5 uses Frostbite 3 engine which scales up to 8 cores, and AC Odyssey uses AnvilNext which can do up to 10 cores. He also tested with a 6800K, which has 6 cores. So the CPU is being heavily utilized for both games.

Could such a large difference be explained by the number of addressable ranks of memory? I saw a more than 100% difference in 1% lows at times in BF5, which is humongous!

Games are just much more parallel than they used to be, so it makes sense that they would lean on bandwidth more than they used to when they used to be predominantly single threaded.

DrMrLordX · Apr 23, 2020

Carfax83 said:
BF5 uses Frostbite 3 engine which scales up to 8 cores, and AC Odyssey uses AnvilNext which can do up to 10 cores. He also tested with a 6800K, which has 6 cores. So the CPU is being heavily utilized for both games.

Yeah, but unless the CPU is heating up like Prime95 SmallFFTs or whatever, it isn't using all available execution resources. Only other thing I can think of is whether or not the game is doing a metric ton of texture swaps to/from the GPU. At those resolutions, I'm thinking "no".

Could such a large difference be explained by the number of addressable ranks of memory?

Potentially. I first noticed this phenomenon on Kaveri where increasing memory ranks actually improved performance. It was shown again on Summit Ridge where in SOME (but not all) applications, 2x16GB DDR4-2666 performed as well as 2x8GB DDR4-3200 (4 ranks vs 2 ranks).

I saw a more than 100% difference in 1% lows at times in BF5, which is humongous!

That's the big question I have, since 1% and .1% lows are almost guaranteed to be latency-based.

Ajay · Apr 23, 2020

DisEnchantment said:
If we exclude RAS and power efficiency gains and concentrate only on performance,
DDR5 will bring big gains to multicore CPUs due to the fact that there are two channels per DIMM. This means that each DIMM can handle two memory access request concurrently. So with a capable IMC it is possible to have quad channel with 2 Memory modules.
The increased clocks will drive down latency considerably provided timings don't increase too much and of course big increase in bandwidth.

But aren't those 1/2 width fetches? So, not really an advantage on desktop (but would be on low power SoCs). I think the timings will get pushed out, as is typical for each new gen. Since we don't have numbers yet, not sure whether there will be any gain.

Carfax83 · Apr 23, 2020

DrMrLordX said:
Yeah, but unless the CPU is heating up like Prime95 SmallFFTs or whatever, it isn't using all available execution resources. Only other thing I can think of is whether or not the game is doing a metric ton of texture swaps to/from the GPU. At those resolutions, I'm thinking "no".

The CPU decompresses/streams a lot of assets, similar to what happens when a game is being installed. Could that cause it?

I have BF5, and I can attest from playing it, it's very hard on the CPU. It also uses AVX quite heavily, so that could also be another factor. AC Odyssey also uses AVX. Many of the modern big 3D engines have a lot of SIMD optimization.

Potentially. I first noticed this phenomenon on Kaveri where increasing memory ranks actually improved performance. It was shown again on Summit Ridge where in SOME (but not all) applications, 2x16GB DDR4-2666 performed as well as 2x8GB DDR4-3200 (4 ranks vs 2 ranks).

So the performance differences you were seeing, were they similar to what was shown in the YouTube video?

That's the big question I have, since 1% and .1% lows are almost guaranteed to be latency-based.

Well, I'm just as shocked as you are to be quite honest. I'd like to think that it's because games and 3D engines have advanced so far in the past few years, and due to being more parallel in nature, they are becoming more reliant on memory bandwidth for performance.

But perhaps it may be something more simple as you suggest...

DrMrLordX · Apr 23, 2020

Carfax83 said:
So the performance differences you were seeing, were they similar to what was shown in the YouTube video?

Sometimes yes, sometimes no. Nothing in the range of 30%. It would require further testing.

But perhaps it may be something more simple as you suggest...

It would also be helpful to see package power numbers during gameplay to see exactly how many of the CPU's execution resources are engaged during gameplay. As you suggested, bandwidth requirements aren't necessarily related to the need to keep CPU execution resources fed. It may be pulling data out of main memory for the GPU as well. Maybe.

DisEnchantment · Apr 23, 2020

Ajay said:
But aren't those 1/2 width fetches? So, not really an advantage on desktop (but would be on low power SoCs). I think the timings will get pushed out, as is typical for each new gen. Since we don't have numbers yet, not sure whether there will be any gain.

Yes. But it is not a real issue, because it has twice the burst length and also much higher transfer speed.
The bigger advantage is that the DIMM can handle two concurrent requests at different addresses. (Concurrent request to same address are addressed at UMC/cache level)
This means that a thread waiting for data has a lesser chance of having to wait for a current request from another core to finish thereby resulting in overall increase in .... "IPC"

Also we have to consider that the data transfer part of the DRAM access is a just a part of the overall access time. There are also the delays for CAS, RAS, delay after read and write beside the others. Everytime an address is accessed there has to be a small delay before another address can be accessed. Therefore the 2 channels will help.

For AMD, they made some optimizations to the LS subsystem to fuse memory access operations from multiple instructions into a single bigger block transfer. This will help too in keeping memory subsystem slightly less loaded.

Carfax83 · Apr 23, 2020

DrMrLordX said:
Sometimes yes, sometimes no. Nothing in the range of 30%. It would require further testing.

Doesn't increased parallelism result in a greater dependence on memory bandwidth than on latency though? Serial tasks tend to be more latency dependent as I recall, which is why GPUs care much less about latency than CPUs.

It would also be helpful to see package power numbers during gameplay to see exactly how many of the CPU's execution resources are engaged during gameplay. As you suggested, bandwidth requirements aren't necessarily related to the need to keep CPU execution resources fed. It may be pulling data out of main memory for the GPU as well. Maybe.

The MSI Afterburner overlay he had didn't list any power numbers for the CPU, but it did for the GPU. The GPU power numbers were higher with the 4x4 config, but that is easily explainable due to the higher framerates. What I don't get though, is that the 4x4 also had significantly higher RAM usage than the 2x8. I wonder if that is also a result of the higher performance?

Question DDR5's impact on CPU design

Diamond Member

No Lifer

Diamond Member

Lifer

Diamond Member

Diamond Member

Lifer

Lifer

Diamond Member

Golden Member

Golden Member

Diamond Member

Diamond Member

No Lifer

Lifer

Diamond Member

Diamond Member

Lifer

Diamond Member

Lifer

Lifer

Diamond Member

Lifer

Golden Member

Diamond Member