Question Actual RAM transfer rate in a dual-channel system is 1.2x to 1.5x that of a single-channel, is this normal?

Alexium

Junior Member
Aug 18, 2017
13
3
81
I was curious about how easy or how difficult it is to measure RAM speed (and to saturate the RAM bus), so I wrote my own RAM benchmark and spent a couple of evenings optimizing it.
First, about my PC. It's a Lenovo ThinkCentre M920q with a Core i5-8500T and 2x16 GB of DDR4-2666 RAM that decided to work at 2400 because the modules are not identical (at least I think this is why). DDR4-2400 has a theoretical transfer rate of 19200 MB/s (2400 MT/s x 8 bytes per transfer), which is about 17.9 GiB/s. But this is for a single channel, and I have two.
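The theoretical figures can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes the standard 64-bit DDR4 channel width and "MB" meaning 10^6 bytes.

```cpp
#include <cassert>

// Peak rate in bytes per second:
// transfers per second x bytes per transfer x number of channels.
// (Hypothetical helper; DDR4 uses a 64-bit data bus per channel.)
inline double peak_bytes_per_second(double mega_transfers_per_s,
                                    int bus_width_bits,
                                    int channels) {
    assert(bus_width_bits > 0 && bus_width_bits % 8 == 0 && channels > 0);
    return mega_transfers_per_s * 1e6 * (bus_width_bits / 8) * channels;
}

// Convert bytes to binary gigabytes (GiB = 2^30 bytes).
inline double to_gib(double bytes) {
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

// Fraction of the theoretical peak that a measured rate reaches.
inline double efficiency(double measured_gib_per_s, double peak_gib_per_s) {
    assert(peak_gib_per_s > 0.0);
    return measured_gib_per_s / peak_gib_per_s;
}
```

For example, `to_gib(peak_bytes_per_second(2400, 64, 1))` gives roughly 17.9 GiB/s for one channel and twice that for two, so a measured rate can be compared against either figure with `efficiency`.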

Here are the results of my bench:
Code:
---------------------------------------------------
        Write           Read            Copy
---------------------------------------------------
AVX2    31.6 GiB/s      22.4 GiB/s      23.6 GiB/s
SSE2    29.5 GiB/s      20.1 GiB/s      24.9 GiB/s

For the purpose of talking about RAM speed we should take the best of the two values in each category. The real speed is good for the "write" workload (about 88% of the dual-channel theoretical maximum of 2 x 19200 MB/s, or roughly 35.8 GiB/s), but only slightly better than the single-channel theoretical rate in the "copy" and especially "read" workloads. Is this typical? If yes, why can't we get the advertised transfer rates? Or is there a problem with my particular system?
For completeness, the standard `memcpy` function provided by MSVC achieves 26 GiB/s (and does the same job as my "copy" workload).

Here is AIDA64 result:
[Screenshot: AIDA64 result, 1678829601918.png]

As you can see, it's better in all workloads; I wonder how they did it. But it's still below the theoretical numbers.

And here's MaxxMem2:
[Screenshot: MaxxMem2 result, 1678829692189.png]

Interestingly, it agrees very well with my own bench, except that it falls short in the "write" department. This is probably because they didn't use streaming AVX / SSE store instructions (which don't put data into the caches) and instead used regular store instructions that do go through the cache. I get the same write speed results if I use regular stores, and that is more indicative of real-world workloads.
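To make the two kinds of store concrete, here is a minimal sketch of a buffer fill written both ways. It is not the code from my benchmark: the function names are invented for illustration, it uses the 128-bit SSE2 intrinsics rather than AVX2 so that it builds on any x86-64 compiler without extra flags, and on non-x86 targets it simply falls back to `memset`. It assumes a 16-byte-aligned buffer whose size is a multiple of 16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2: _mm_store_si128, _mm_stream_si128
#define SKETCH_HAVE_SSE2 1
#endif

// Precondition shared by both fills: 16-byte alignment and size.
inline bool fill_args_ok(const void* p, std::size_t bytes) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0 && bytes % 16 == 0;
}

// Regular stores: the written lines are allocated in the cache hierarchy.
inline void fill_regular(void* dst, std::size_t bytes, std::uint8_t value) {
    assert(fill_args_ok(dst, bytes));
#ifdef SKETCH_HAVE_SSE2
    const __m128i v = _mm_set1_epi8(static_cast<char>(value));
    __m128i* p = static_cast<__m128i*>(dst);
    for (std::size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(p + i, v);
#else
    std::memset(dst, value, bytes);  // portable fallback, no intrinsics
#endif
}

// Streaming (non-temporal) stores: a hint to write around the caches.
inline void fill_streaming(void* dst, std::size_t bytes, std::uint8_t value) {
    assert(fill_args_ok(dst, bytes));
#ifdef SKETCH_HAVE_SSE2
    const __m128i v = _mm_set1_epi8(static_cast<char>(value));
    __m128i* p = static_cast<__m128i*>(dst);
    for (std::size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(p + i, v);
    _mm_sfence();  // order the non-temporal stores before later accesses
#else
    std::memset(dst, value, bytes);  // portable fallback, no intrinsics
#endif
}
```

Both functions leave the same bytes in memory; only the path the data takes differs. Note that an optimizing compiler may rewrite the regular-store loop (for example into a library `memset` call), so it is worth checking the generated assembly before trusting a timing comparison.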

So what do you guys think?

Also, can someone explain why streaming stores are that much faster than regular stores? As far as I know about CPU architecture, it shouldn't be the case. First, because the cache should be transparent to the CPU cores and shouldn't be holding them back, and second, because RAM is still far slower than the CPU, so the CPU should have enough free cycles to update the cache correctly in between RAM transfer completions.

P. S. Here's my benchmark if anyone is interested. You can download the .exe and run it yourself, it should only require MSVC++ redistributable to run. The C++ source code is also there.
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,731
155
106
It is possible that the modules not being identical is the cause.
How many ranks does each DIMM have?
The memory controller splits accesses between memory channels and ranks. If one DIMM is dual-rank and the other single-rank, the memory controller might not be making optimal use of the memory.
 

Alexium

Thanks for the info. So ranks act somewhat like channels when a module has two ranks?
But my modules are both single-rank, and they have the same number of RAM chips.
 

Soulkeeper

OK. I'm not the best person to dive into the highly technical aspects of the Intel platform, but another factor to consider is the benchmark itself. A single core/thread might not be able to use all that bandwidth effectively. If you threaded all the benchmark functions to run an instance on each available core, you would likely get much better performance. Also, modern CPUs are very complex, and it is difficult to get benchmark results near the theoretical maximums.
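As a rough illustration of what threading the benchmark could look like, here is a sketch that runs a `memcpy` loop on several threads at once and reports the combined rate. The function name and parameters are hypothetical, it is not taken from the benchmark in this thread, and it counts each copied byte once, which is only one possible convention for reporting copy bandwidth.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper: every thread copies its own private buffer `passes`
// times; returns the aggregate rate in GiB/s (copied bytes counted once).
inline double aggregate_copy_gibps(unsigned threads,
                                   std::size_t bytes_per_thread,
                                   int passes) {
    assert(threads > 0 && bytes_per_thread > 0 && passes > 0);

    // Allocate and initialize up front so page faults are not timed.
    std::vector<std::vector<char>> src(threads, std::vector<char>(bytes_per_thread, 1));
    std::vector<std::vector<char>> dst(threads, std::vector<char>(bytes_per_thread, 0));

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&src, &dst, t, bytes_per_thread, passes] {
            for (int p = 0; p < passes; ++p) {
                src[t][0] = static_cast<char>(p);  // make each pass distinct
                std::memcpy(dst[t].data(), src[t].data(), bytes_per_thread);
            }
            // Read the result back so the copies are not optimized away.
            volatile char sink = dst[t][bytes_per_thread - 1];
            (void)sink;
        });
    }
    for (std::thread& th : pool) th.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    const double total_bytes =
        static_cast<double>(threads) * static_cast<double>(bytes_per_thread) * passes;
    return total_bytes / elapsed.count() / (1024.0 * 1024.0 * 1024.0);
}
```

For a meaningful number the per-thread buffers should be much larger than the last-level cache, for example `aggregate_copy_gibps(std::thread::hardware_concurrency(), std::size_t(256) << 20, 10)`; comparing that against the same call with one thread shows whether a single core is the limit.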
 

Alexium

I tried running two parallel benchmark instances, and each got exactly half the score. And it makes sense as far as I know: since memory is a lot slower than the CPU, even a single core should have no problem saturating it as long as it doesn't spend too much time processing the values it's reading or writing.