Hi all,
I have a Core i7-950 with DDR3-1066 RAM in triple channel mode. Hardware sites indicate that SiSoft Sandra reports about 20GB/s of memory bandwidth. I compiled and ran the STREAM benchmark (see below) and I was seeing more like 13GB/s (with icc) and 10GB/s (with gcc).
The STREAM results seem more consistent with what I've seen in my application. The memory-bound part of it is a series of matrix-vector multiplies (on doubles); the peak floating-point performance there is about 3.2 GFLOPS. For an m-by-m matrix-vector product, the # of flops is about 2*m*m, i.e. 2 flops per 8-byte matrix element. So 3.2 GFLOPS corresponds to (3.2/2)*8 = 12.8GB/s bandwidth usage, I think.
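To spell out that arithmetic, here is a minimal sketch in C. The function name implied_bandwidth_gbs and its parameters are made up for illustration, and it assumes the traffic is dominated by streaming the matrix once, with the vector reads and writes ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Estimate the memory bandwidth (in GB/s) implied by a measured flop rate.
 * Assumption: every matrix element is loaded from memory exactly once and
 * costs flops_per_element flops (2 for matrix-vector: one multiply, one add).
 * Vector traffic is ignored, so this is a rough lower bound. */
double implied_bandwidth_gbs(double gflops, double flops_per_element,
                             size_t bytes_per_element)
{
    assert(flops_per_element > 0.0);
    /* Billions of matrix elements consumed per second. */
    double gelements_per_s = gflops / flops_per_element;
    /* Each element is bytes_per_element bytes (8 for a double). */
    return gelements_per_s * (double)bytes_per_element;
}
```

Calling implied_bandwidth_gbs(3.2, 2.0, sizeof(double)) gives about 12.8 on a platform with 8-byte doubles, which is the figure I quoted.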
Can anyone help me understand why there's such a big discrepancy between Sandra and STREAM results? Like the STREAM "copy" benchmark just executes this:
    for (j=0; j<N; j++)
        c[j] = a[j];
I have N set to 5 million doubles. That seems as straightforward as it gets, but the result is quite a bit lower than what Sandra reports. It is, however, more in line with what I see from my application (or what I think I'm seeing, anyway).
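For anyone who wants to try the copy loop without building all of STREAM, here is a rough standalone sketch of how it can be timed. This is not the STREAM source: the names now_seconds and copy_bandwidth_gbs are my own, it assumes a C11 library (for timespec_get), and it is single-threaded. It does follow STREAM's convention for Copy of counting 16 bytes per element (one 8-byte read plus one 8-byte write) and keeping the best time over several runs.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Time c[j] = a[j] over n doubles and return the best rate in GB/s
 * (1 GB = 1e9 bytes) over `repeats` runs. Returns -1.0 if allocation
 * fails or the timer never registers a positive interval. */
double copy_bandwidth_gbs(size_t n, int repeats)
{
    assert(n > 0 && repeats > 0);
    double *a = malloc(n * sizeof *a);
    double *c = malloc(n * sizeof *c);
    if (a == NULL || c == NULL) {
        free(a);
        free(c);
        return -1.0;
    }

    /* Touch both arrays first so page faults are not part of the timing. */
    for (size_t j = 0; j < n; j++) {
        a[j] = 1.0;
        c[j] = 0.0;
    }

    volatile double sink;  /* forces the copied data to be read back */
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        a[n / 2] = (double)(r + 2);  /* change the input on every run */

        double t0 = now_seconds();
        for (size_t j = 0; j < n; j++)
            c[j] = a[j];
        double elapsed = now_seconds() - t0;

        sink = c[n / 2];
        assert(sink == (double)(r + 2));  /* the copy really happened */

        if (elapsed > 0.0 && (best < 0.0 || elapsed < best))
            best = elapsed;
    }

    free(a);
    free(c);
    if (best < 0.0)
        return -1.0;
    /* Bytes counted: one read and one write of a double per element. */
    return 2.0 * sizeof(double) * (double)n / best / 1e9;
}
```

With my settings the call would be copy_bandwidth_gbs(5000000, 10). The number it returns depends heavily on the compiler and flags, just like the icc vs. gcc difference I mentioned.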
STREAM: (includes their very simple source code)
http://www.cs.virginia.edu/stream/ref.html
Thoughts?
-Eric
edit: oddly, STREAM, my matrix-vector test, and Sandra all report about the same 6.5-7GB/s on a Core2 (at 3GHz) testbed.