jandersonlee
Junior Member
I'm trying to get good memory bandwidth for multiple, independent single-threaded batch tasks running on Scalable Xeon processors. One possible configuration would use 12x 16GB DR DDR4 DIMMs per socket at 2667MHz with a Xeon 6126 (Skylake, 12C @ 2.6GHz / 2667MHz), while another, significantly cheaper, configuration would use 6x 32GB DR DDR4 DIMMs per socket at 2400MHz with a Xeon 4214 (Cascade Lake, 12C @ 2.2GHz / 2400MHz). Depending on whether the tasks are CPU-limited or memory-limited, one could posit that the applications might run somewhere between 10% (100% x (1 - 2400MHz/2667MHz) = 10.0%) and 16% (100% x (1 - 2.2GHz/2.6GHz) = 15.4%) slower on the second configuration. However, there is also the issue of having 2x DR vs 1x DR DIMMs per channel.
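For anyone who wants to plug in their own clocks, the two bounds above are just the same relative-slowdown formula applied to memory speed and core clock respectively; a minimal sketch:

```python
# Back-of-envelope slowdown bounds for the two configurations.
# The formula assumes performance scales linearly with the limiting
# resource (memory transfer rate if memory-bound, core clock if CPU-bound).

def slowdown_pct(slow: float, fast: float) -> float:
    """Percent slowdown when the limiting resource drops from `fast` to `slow`."""
    return 100.0 * (1.0 - slow / fast)

mem_bound = slowdown_pct(2400, 2667)  # memory-limited case (MHz)
cpu_bound = slowdown_pct(2.2, 2.6)    # CPU-limited case (GHz)
print(f"memory-limited: {mem_bound:.1f}%  CPU-limited: {cpu_bound:.1f}%")
# -> memory-limited: 10.0%  CPU-limited: 15.4%
```

A real workload will land somewhere between the two bounds depending on its arithmetic intensity.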
In its page on Memory rank, Wikipedia suggests that:
"Subject to some limitations, ranks can be accessed independently, although not simultaneously as the data lines are still shared between ranks on a channel. For example, the controller can send write data to one rank while it awaits read data previously selected from another rank. While the write data is consumed from the data bus, the other rank could perform read-related operations such as the activation of a row or internal transfer of the data to the output drivers. Once the CA bus is free from noise from the previous read, the DRAM can drive out the read data. Controlling interleaved accesses like so is done by the memory controller."
Would a system gain benefit from having 12x 16GB versus 6x 32GB DIMMs in order to have more ranks open at a time, or would the system behave the same with 2x 16GB DR vs 1x 32GB DR per channel? I.e., would using 2x 16GB DIMMs effectively act like having four ranks (QR) instead of two (DR)? The 4214 system has a maximum of 8 DIMMs per socket, so 12x 16GB per socket is not an option, but I could save a bit on the 6126 config by substituting 6x 32GB DIMMs if there is no gain from having more DIMMs. The Skylake processor can have up to 128 memory operations in flight, and I'm assuming the Cascade Lake part is at least the same. Given that the tasks are largely single-threaded and independent, I'd like to spec the memory to give better bandwidth at similar price points.
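Since the rank-interleaving benefit is hard to predict from datasheets alone, one way to settle it is to measure: run one copy of a crude single-thread bandwidth probe per batch task (e.g. under GNU parallel or taskset) on each candidate box and compare aggregate throughput. Below is a rough sketch, not STREAM; the 256 MiB buffer size is an assumption chosen only to dwarf the last-level cache:

```python
# Crude single-thread memory-bandwidth probe (a sketch, not a STREAM replacement).
# Launch N copies concurrently to mimic N independent single-threaded tasks,
# then sum the reported figures to compare configurations.
import time

def copy_bandwidth_gbps(buf_mib: int = 256, passes: int = 2) -> float:
    """Return approximate copy bandwidth in GB/s (counting read + write traffic)."""
    src = bytearray(buf_mib * 1024 * 1024)
    start = time.perf_counter()
    for _ in range(passes):
        dst = bytes(src)  # one full read of src plus one full write of dst
    elapsed = time.perf_counter() - start
    moved = 2 * len(src) * passes  # bytes read + bytes written
    return moved / elapsed / 1e9

if __name__ == "__main__":
    print(f"{copy_bandwidth_gbps():.1f} GB/s")
```

Absolute numbers from an interpreted probe like this will undershoot the hardware limit, but the relative difference between 1 DPC and 2 DPC populations on otherwise identical settings is what matters here.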