I was curious about how easy or difficult it is to measure RAM speed (and to saturate the RAM bus), so I wrote my own RAM benchmark and spent a couple of evenings optimizing it.
First, about my PC. It's a Lenovo ThinkCentre M920q with a Core i5-8500T and 2x16 GB of DDR4-2666 RAM that decided to run at 2400 because the modules are not identical (at least I think that's why). DDR4-2400 has a theoretical transfer rate of 19200 MB/s ≈ 17.9 GiB/s. But that's per channel, and I have two.
Here are the results of my bench:
Code:
---------------------------------------------------
         Write         Read          Copy
---------------------------------------------------
AVX2     31.6 GiB/s    22.4 GiB/s    23.6 GiB/s
SSE2     29.5 GiB/s    20.1 GiB/s    24.9 GiB/s
---------------------------------------------------
For the purpose of talking about RAM speed we should take the best of the two values in each category. The real speed is decent for the "write" workload (31.6 GiB/s out of a dual-channel peak of about 35.8 GiB/s, so roughly 88%), but the "copy" and especially the "read" workloads barely exceed the single-channel theoretical figure. Is this typical? If so, why can't we get the advertised transfer rates? Or is there a problem with my particular system?
For completeness: the standard `memcpy` provided by MSVC achieves 26 GiB/s while doing the same job as my "copy" workload.
Here is the AIDA64 result:

As you can see, it's better in all workloads (I wonder how they did that), but still below the theoretical numbers.
And here's MaxxMem2:

Interestingly, it agrees very well with my own bench, except that it falls short in the "write" department. This is probably because they used regular store instructions, which go through the cache, rather than streaming AVX / SSE stores, which bypass it. I get the same write speeds if I use regular stores, and arguably that's more indicative of real-world workloads.
So what do you guys think?
Also, can someone explain why streaming stores are that much faster than regular stores? From what I know about CPU architecture, it shouldn't be the case: first, the cache should be transparent to the CPU cores and shouldn't hold them back; and second, since RAM is still much slower than the CPU, the cores should have plenty of free cycles to update the cache between RAM transfer completions.
P.S. Here's my benchmark, if anyone is interested. You can download the .exe and run it yourself; it should only require the MSVC++ redistributable. The C++ source code is also there.