Does the memory speed scale with the CPU frequency?

hshen1

Member
May 5, 2013
70
0
66
Hi,
I have an Intel Core i7-3770K and have written a small benchmark myself. Roughly, it uses the memcpy function to move a big chunk of data in memory from one place to another, then sleeps for a fixed amount of time.
I pin the benchmark to one core (see the sketch below) and scale that core's frequency from 3.2GHz down to 1.6GHz. What I found is that the core's utilization, as reported by the 'top' command, also scales, from 73% to 84%. If the memory speed were fixed (I believe the DIMMs in my machine are 1600MHz), the core's utilization should not change much even when the core's frequency changes. Does that mean the memory speed also scales dynamically? Is there any reference that explains how it scales?
This is quite different from previous architectures. I tried the same thing on my Core 2 machine, and there the utilization generally doesn't change. :eek::'(
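For reference, the pinning can be done with the taskset command or inside the program; below is a minimal sketch using Linux's sched_setaffinity (core 0 is just an example, and my real harness may differ):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single core. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);        /* start from an empty CPU mask */
    CPU_SET(core, &set);   /* allow only the chosen core   */
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    if (pin_to_core(0) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... memcpy/sleep benchmark loop goes here ... */
    return 0;
}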
 

Concillian

Diamond Member
May 26, 2004
3,751
8
81
Is the chunk of data small enough to fit into any of the CPU caches? Caching is one thing that has seen a lot of optimization since Core 2: they've added a 3rd tier, and I know they've done a lot to improve how the caches are utilized. Cache does scale with clock speed.

I do not know if there are dynamic latency changes in cache with clock speed, but that's something to look into.

Memory shouldn't scale with any clock changes but BCLK.
 

hshen1

Member
May 5, 2013
70
0
66

Thanks for your reply. The data moved is a 20000*20000 array of int, which I believe should be much bigger than the L3 cache, right? What I thought before was also that the cache would scale while memory would not.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Is the chunk of data small enough to fit into any of the CPU caches? Caching is one thing that has seen a lot of optimization since P6
FTFY :). The P3 didn't have much in the way of speculation, but their chipset RAM controllers did instead (VIA and nVidia, too). The CPU directly got in on the speculation action with the P4 (earlier for AMD, with the Athlon).

The speed of each core does affect memory-bound performance.
http://www.anandtech.com/show/4503/sandy-bridge-memory-scaling-choosing-the-best-ddr3/7
The memory itself is not running at a different speed. However, it's not as simple as just accessing memory. The L3 also effectively acts as a write buffer for RAM, so you should be able to get near-ideal speeds with large memcpy operations, and those will be affected by CPU speed, unless you can turn the caches off entirely.

Short version: if it doesn't fit in L1, and it isn't obviously limited by some on-chip functional unit, it's complicated on any x86 CPU from the last decade. The rules of the language being used could also affect things. Without profiling and inspecting the compiled code, you probably won't get anywhere near figuring out what you're looking at.
 

hshen1

Member
May 5, 2013
70
0
66

Hi Cerb :D. What you said about the cache frequency scaling makes sense.

However, I have done some more experiments. Briefly, I have two programs: one uses busy loops and sleep to reach 50% utilization on one core (sketched below), and the other uses memcpy and sleep to reach 50% utilization, both at 3.2GHz. I run them one at a time and scale the frequency from 3.2GHz to 1.6GHz. I found that both programs' utilization scales to about 66%!! This is not the case on my old Core 2 machine, where only the CPU-loop task scales much.

So this means that the CPU frequency change has the same effect on a CPU-bound task as on a memory-bound task... This doesn't make sense if the memory speed is fixed, I think... :confused: Even allowing for cache frequency scaling, the memory-bound task should be less sensitive to core frequency scaling than the CPU-intensive task, in my opinion.

PS: In my program, 14500*10000 bytes are transferred by every memcpy call.
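The CPU-bound program is roughly the following sketch (simplified; ITERS is a made-up constant that I tune by hand until top shows about 50% at 3.2GHz):

#include <time.h>

#define ITERS 100000000UL   /* hypothetical value, tuned for ~50% utilization */

int main(void)
{
    struct timespec tim = { 0, 10000000 };   /* 10 ms sleep */
    volatile unsigned long sink = 0;         /* volatile keeps the loop from being optimized away */

    while (1) {
        /* busy phase: pure ALU work, no memory traffic */
        for (unsigned long i = 0; i < ITERS; i++)
            sink += i;
        /* idle phase */
        nanosleep(&tim, NULL);
    }
    return 0;
}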
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I wonder whether the L3 cache is also slowing down. I don't recall off-hand whether that behavior came in with Ivy or with the new Haswell, TBH. I also have no clue what effect the clock speed would have on the memory controller itself.

I honestly wonder if a Thuban might make a good control-machine CPU. Based on a reply in the other thread, those CPUs can have each core's clock adjusted (I don't have any and have never tried it, just read of others doing it), and I know their L3/IMC speed can be changed separately from the cores themselves (but get a decent aftermarket cooler, because increasing the NB speed makes them run hotter than usual).
 

hshen1

Member
May 5, 2013
70
0
66

Thanks :\ I will try to do more experiments to verify. Anyway, I think a memory-bound task and a CPU-bound task should respond differently to CPU frequency scaling. :eek:
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
The data moved is a 20000*20000 array of int, which I believe should be much bigger than the L3 cache, right?

That should equate to 1.6GB (or about 1.5GiB; the distinction is pretty much irrelevant here). It is definitely way more than would fit in the cache.


If all you are doing is copying memory, then your CPU utilization should not be changing so much. But you're probably not just copying memory. You probably have a nested loop and a large array of preset data which itself has to be read in order to be copied. Depending on the coding, these operations can take an extremely variable amount of time, and that time is affected by the CPU clock rate.

It is actually amazing how many CPU cycles some basic operations can take. For example, on my machine it takes 70 ns just to set a CString to a single word. The same operation takes just 14 ns using strcat & strcpy. CPU utilization at different clock rates would vary greatly depending on which of these methods I used to fill my memory. With plain integers it shouldn't be as variable, but even then it can still vary. Maybe you should post your source code for the critical section that does the memory copy.
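For reference, per-call timings like those can be taken with something like this sketch (N is an arbitrary repetition count; compile without optimization so the loop isn't removed):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 10000000L   /* arbitrary repetition count */

int main(void)
{
    char buf[64];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        strcpy(buf, "word");             /* the operation under test */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%s: %.1f ns per call\n", buf, ns / N);
    return 0;
}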
 

hshen1

Member
May 5, 2013
70
0
66

Glad you're willing to take a look at my source code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

/* Two ~1.6GB global arrays (20000 * 10000 * 8 bytes each). */
long long data1[20000][10000];
long long data2[20000][10000];

/* Copy a 145MB chunk (14500*10000 bytes) from data2 to data1. */
void run(void)
{
    memcpy(data1, data2, 14500 * 10000);
}

int main(void)
{
    struct timespec tim, tim2;
    tim.tv_sec = 0;
    tim.tv_nsec = 10000000;   /* sleep 10 ms between copies */

    while (1) {
        run();
        nanosleep(&tim, &tim2);
    }

    return 0;
}

I pin the task to a specific core and scale the frequency of that core, along the lines of the commands below.
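(A hypothetical example from a Linux shell, assuming the cpufrequtils package and the userspace governor; core 0 and the target frequency are just examples:)

taskset -c 0 ./a.out &
cpufreq-set -c 0 -f 1.6GHz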
I have also run a benchmark called lmbench. I use the command
bw_mem 2000000000 bcopy
(not sure whether you know this benchmark or not) to test the memory bandwidth. I do see the reported bandwidth drop from about 8000 to 6000 MB/s during the frequency scaling :confused:
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
So you don't fill the arrays at all; you just create two global arrays and then copy one to the other repeatedly. This operation should be totally memory-bandwidth limited.

Are you sure nanosleep is not itself consuming large numbers of CPU cycles? Perhaps you can write a test, like the sketch below, to make sure an endless loop of nanosleeps consumes 0%. I think you will find that it is nanosleep which consumes a varying number of cycles based on clock rate on your newer system.
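A minimal sketch of such a test; it should sit at ~0% in top if nanosleep itself is cheap:

#include <time.h>

int main(void)
{
    struct timespec tim = { 0, 10000000 };   /* 10 ms, same as the benchmark */

    /* nothing but back-to-back sleeps: any CPU usage that shows up
       here is coming from nanosleep itself */
    while (1)
        nanosleep(&tim, NULL);

    return 0;
}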
 

hshen1

Member
May 5, 2013
70
0
66

:eek: No. I just commented out the run() call in main and found the CPU core utilization is 0% in top :(. And as I mentioned before, the benchmark scales as expected on my Core 2 Duo (E8400) Dell workstation, while the behavior is totally different on my new i7-3770K machine.
 

Blandge

Member
Jul 10, 2012
172
0
0
How many memory channels are populated? It sounds like when the CPU is at 3.2GHz you are memory-bandwidth limited, and at 1.6GHz you are CPU limited. If the CPU is running at half the speed you should see half the throughput. What compiler flags are you using? Use the -P 2 flag in bw_mem and bind to two cores (example below), then see what the max bandwidth is. If it's roughly equal to your single-core 3.2GHz bandwidth, then you are memory-BW bottlenecked.
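(For example, something along these lines; binding both worker processes with taskset is my assumption of how you'd pin it:)

taskset -c 0,1 bw_mem -P 2 2000000000 bcopy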
 

hshen1

Member
May 5, 2013
70
0
66

I personally do not think the memcpy() function will be CPU limited even when the CPU frequency is 1.6GHz... :\ What's the role of the CPU during a memory chunk copy? It should not require a lot of CPU computation, right?
 

Blandge

Member
Jul 10, 2012
172
0
0

Writes are blocking, so the CPU cannot issue a write until the previous one completes. Decreasing the CPU frequency (and, as a result, the uncore frequency as well, because they share the same clock domain on Ivy Bridge) increases memory latency, which lowers bandwidth. If you do pure reads, I would guess the decrease in bandwidth would be smaller.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
Well, on a Core 2 the memory controller is external to the CPU, and copy operations might be totally managed by the PCH. So the CPU is probably just sending the memory copy command over the FSB and then going idle until it is done (0% CPU). But on Nehalem/Sandy/Ivy/Haswell, the memory copy operations are surely managed much more intimately by the CPU and may not count as idle time.

Maybe you should do something with the data instead of just copying it. Something simple that takes 1 clock cycle on any architecture. That way the operation can't just be shoveled off onto the PCH on a Core 2 system.
 

hshen1

Member
May 5, 2013
70
0
66
Writes are blocking, so the CPU cannot issue a write until the previous one completes. Decreasing the CPU frequency (and, as a result, the uncore frequency as well, because they share the same clock domain on Ivy Bridge) increases memory latency, which lowers bandwidth. If you do pure reads, I would guess the decrease in bandwidth would be smaller.

Tried reads with the bw_mem command. Similar results.....
 

hshen1

Member
May 5, 2013
70
0
66
Well, on a Core 2 the memory controller is external to the CPU, and copy operations might be totally managed by the PCH. So the CPU is probably just sending the memory copy command over the FSB and then going idle until it is done (0% CPU). But on Nehalem/Sandy/Ivy/Haswell, the memory copy operations are surely managed much more intimately by the CPU and may not count as idle time.

Maybe you should do something with the data instead of just copying it. Something simple that takes 1 clock cycle on any architecture. That way the operation can't just be shoveled off onto the PCH on a Core 2 system.

Thanks. I will try adding that simple operation. But FYI, for memory access it's definitely not "idle": the OS attributes the waiting time as utilized. For disk access, however, the time is attributed as idle. I am pretty sure about this :D
 

Blandge

Member
Jul 10, 2012
172
0
0
Well, on a Core 2 the memory controller is external to the CPU, and copy operations might be totally managed by the PCH. So the CPU is probably just sending the memory copy command over the FSB and then going idle until it is done (0% CPU). But on Nehalem/Sandy/Ivy/Haswell, the memory copy operations are surely managed much more intimately by the CPU and may not count as idle time.

This is incorrect. The CPU issues a write to memory. If the cache line is not in its cache, the CPU sends a read request for the cache line down to the MCH; the PCH is not involved at all. The CPU is now free to continue fetching and executing instructions that don't conflict with that write. If another write instruction reaches the pipeline before the read completes, it blocks. The MCH completes the read and the CPU writes the data to the cache line, but the data may not get written back to memory until the cache line is evicted, depending on the microarchitecture's write-back policies.

Also, I said this: "If the CPU is running at half the speed you should see half the throughput." I take that back; it is inaccurate.
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
Is the goal of this exercise just to max out your memory bandwidth, without caring whether it is read or write bandwidth? If so, then don't bother with memcpy. Instead, just write a single byte to each cache line in your giant data arrays. I assume that the caches in Core 2 are write-allocate, which means that writing even just one byte will cause the line to be read from DRAM and filled into the cache before the store happens. Then, when it's time for that line to be evicted, it will be written back to DRAM because it's dirty.

If that alone doesn't max out your memory bandwidth (because the CPU utilization is too high), then do some loop unrolling.
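A minimal sketch of that idea (assuming 64-byte cache lines; the array size is arbitrary):

#include <stddef.h>

#define LINE_SIZE 64                /* assumed cache line size */
#define ARRAY_SIZE (1UL << 30)      /* 1 GiB, arbitrary */

static char big[ARRAY_SIZE];

int main(void)
{
    while (1) {
        /* one store per cache line: on a write-allocate cache this costs
           a full line read from DRAM now plus a dirty writeback later */
        for (size_t i = 0; i < ARRAY_SIZE; i += LINE_SIZE)
            big[i]++;
    }
    return 0;
}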
 

hshen1

Member
May 5, 2013
70
0
66

I tried everything with lmbench. No matter whether it is read, write, or block copy, they all scale with the CPU frequency... :'(
 

hshen1

Member
May 5, 2013
70
0
66
I have checked my memory. It's 64-bit DDR3 1600MHz DIMMs (4GB x 2). Even older DDR memory can transfer two data words per clock cycle, right? (Because it transfers on both the rising and the falling clock edge.) So DDR3 should be very fast: each DDR3-1600 channel moves 1600M transfers/s x 64 bits = 12.8GB/s, or about 25.6GB/s with both channels populated.

So maybe, even when the CPU is running at a high frequency (e.g., 3.5GHz) with a memory-bound workload (as in my program), the memory does not become the bottleneck (especially considering that I only run the workload on one core of an 8-logical-core machine). The CPU turns out to be the bottleneck, and that's why the performance scales down as the CPU frequency scales down.

The above is my guess, and I am not sure whether it is correct. So I think I should try running the same memory-bound workload on all the cores, to congest the memory bandwidth, and then observe the scaling effect.
 

hshen1

Member
May 5, 2013
70
0
66

Finally, I got it :D:cool: It is exactly the reason I mentioned above: the DDR3 memory is fast enough that one core's memory-bound workload cannot saturate it, so the CPU resource is always the bottleneck.
However, when I run 8 memory-bound workloads on my 8 logical cores at the same time, changing the cores' frequency no longer affects the CPU utilization! This is what I expected :cool::)
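(For reference, the 8 copies can be launched pinned one per logical core from the shell like this, assuming the benchmark binary is named ./membench:)

for i in 0 1 2 3 4 5 6 7; do taskset -c $i ./membench & done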