memory bandwidth benchmark question

eLiu · Feb 28, 2011

Hi all,
I have a Core i7-950 with DDR3-1066 RAM in triple channel mode. Hardware sites indicate that SiSoft Sandra reports about 20GB/s of memory bandwidth. I compiled and ran the STREAM benchmark (see below) and I was seeing more like 13GB/s (with icc) and 10GB/s (with gcc).
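For context, the theoretical peak of this memory configuration can be estimated with simple arithmetic. This is a rough sketch assuming 64-bit (8-byte) channels and about 1066.67 million transfers per second for DDR3-1066; `ddr_peak_gbs` is a name made up for this illustration.

```c
/* Theoretical peak in GB/s for DDR memory: transfers per second,
   times 8 bytes per transfer on a 64-bit channel, times the number
   of channels. Real benchmarks land below this figure. */
double ddr_peak_gbs(double mega_transfers_per_s, int channels)
{
    const double bytes_per_transfer = 8.0; /* assumed 64-bit wide channel */
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9;
}
```

Calling `ddr_peak_gbs(1066.67, 3)` gives roughly 25.6 GB/s, so a 20GB/s reading would be about 80% of that peak.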

The STREAM results seem more consistent with what I've seen in my application. The memory-bound part of that is a series of matrix-vector multiplies (of doubles); the peak floating-point performance there is about 3.2 GFLOPS. For an m-by-m matrix-vector product, the number of flops is about 2*m*m, i.e. 2 flops for every 8-byte matrix element read. So 3.2 GFLOPS corresponds to (3.2/2)*8 = 12.8GB/s of bandwidth usage, I think.
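The conversion above, written out as a small sketch. It assumes each matrix element is fetched from memory exactly once and ignores traffic for the vectors; `matvec_bandwidth_gbs` is a hypothetical helper name.

```c
/* GB/s of matrix data streamed by a dense double-precision
   matrix-vector product running at `gflops` GFLOPS.
   Assumption: every 8-byte matrix element is read from memory once
   and feeds one multiply and one add (2 flops); vector traffic is
   ignored. */
double matvec_bandwidth_gbs(double gflops)
{
    const double flops_per_element = 2.0;
    const double bytes_per_element = 8.0; /* sizeof(double) */
    return gflops / flops_per_element * bytes_per_element;
}
```

With `matvec_bandwidth_gbs(3.2)` this works out to the 12.8GB/s quoted above.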

Can anyone help me understand why there's such a big discrepancy between Sandra and STREAM results? Like the STREAM "copy" benchmark just executes this:
for (j=0; j<N; j++)
c[j] = a[j];
I have N set at 5 million doubles. That seems as straightforward as it gets but the result is quite a bit smaller than what Sandra reports. But the result is more in line with what I see from my application (or what I think I'm seeing anyway).

STREAM: (includes their very simple source code)
http://www.cs.virginia.edu/stream/ref.html

Thoughts?
-Eric

edit: oddly, STREAM, my matrix-vector test, and Sandra all report about the same 6.5-7GB/s for a Core2 (at 3GHz) testbed.

Schmide · Feb 28, 2011

You need to prefetch and maximize sse register usage using prefetchnta, movdqa and movntdq in an unrolled loop.
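A rough sketch of what that can look like with SSE2 intrinsics instead of hand-written assembly. The intrinsics are the standard ones from `<emmintrin.h>` and `<xmmintrin.h>`; the function name `stream_copy_sse2`, the 4x unroll, and the 256-byte prefetch distance are assumptions for illustration and would need tuning. The point of the non-temporal stores is that ordinary stores typically cause each destination cache line to be read before it is overwritten, which is traffic the benchmark's byte count does not include.

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <xmmintrin.h>  /* _mm_prefetch, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Copy `bytes` bytes from src to dst using non-temporal stores.
   Assumptions: both pointers are 16-byte aligned and `bytes` is a
   multiple of 64. Returns 0 on success, -1 if those are not met. */
int stream_copy_sse2(void *dst, const void *src, size_t bytes)
{
    if ((((uintptr_t)dst | (uintptr_t)src) % 16) != 0 || (bytes % 64) != 0)
        return -1;

    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t i = 0; i < bytes; i += 64) {
        /* prefetchnta: hint that data a few lines ahead is used once. */
        if (i + 256 < bytes)
            _mm_prefetch(s + i + 256, _MM_HINT_NTA);

        /* Four aligned 16-byte loads (movdqa). */
        __m128i x0 = _mm_load_si128((const __m128i *)(s + i));
        __m128i x1 = _mm_load_si128((const __m128i *)(s + i + 16));
        __m128i x2 = _mm_load_si128((const __m128i *)(s + i + 32));
        __m128i x3 = _mm_load_si128((const __m128i *)(s + i + 48));

        /* Four stores that bypass the cache (movntdq). */
        _mm_stream_si128((__m128i *)(d + i), x0);
        _mm_stream_si128((__m128i *)(d + i + 16), x1);
        _mm_stream_si128((__m128i *)(d + i + 32), x2);
        _mm_stream_si128((__m128i *)(d + i + 48), x3);
    }

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
    return 0;
}
```

This only builds for x86 targets with SSE2 (the default on x86-64), and the caller is responsible for alignment, e.g. by allocating the arrays with `posix_memalign`.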

eLiu · Mar 1, 2011

Schmide said:
You need to prefetch and maximize sse register usage using prefetchnta, movdqa and movntdq in an unrolled loop.

Well, I know Intel provides a memcpy implementation as a builtin function; I think it's called "intel_fast_memcpy" or something similar to that. At -O2 and higher (I compiled with -O3), icc can recognize simple for loops that just copy one array into another, and replace those with inlined calls to builtin functions.

I would assume that the intel-implemented memcpy() function should do all the necessary things to achieve peak performance? But I guess not?? Do you know of any open-source memory benchmarks in Linux that will achieve the Sandra results? Now I'm just kind of curious as to what that code might look like. I don't really have time now to try it myself... maybe this summer.

But tomorrow I will also try setting "-opt-streaming-stores always" to try to force it to use cache-controlled moves like movntdq and see what happens.

Schmide · Mar 1, 2011

This guy seems to have done a good job.

http://arcticos.googlecode.com/svn-history/r226/trunk/Core/string/memcpy.c~

Or a non-AT&T-style one

http://williamchan.ca/portfolio/assembly/ssememcpy/source/viewsource.php?id=ssememcpy.c

but I don't have anything on hand for what you want.

If you google prefetchnta, movdqa and movntdq you'll find plenty of examples.

eLiu · Mar 1, 2011

Schmide said:
This guy seems to have done a good job.

http://arcticos.googlecode.com/svn-history/r226/trunk/Core/string/memcpy.c~

Or a non-AT&T-style one

http://williamchan.ca/portfolio/assembly/ssememcpy/source/viewsource.php?id=ssememcpy.c

but I don't have anything on hand for what you want.

If you google prefetchnta, movdqa movntdq you'll find plenty of examples.

When I try to compile the first one with gcc, I get:
stream.c:511: Error: symbol `.memcpyA' is already defined
my compile line is simple: gcc -O3 -std=gnu99 stream.c -o stream

I have no idea what's going on there... no familiarity with the AT&T syntax so if you could poke me in the right direction I'd appreciate it.

The second one (after I told it to use 64-bit registers, haha) is a solid 20% faster than Intel's memcpy implementation, which is pretty cool considering how simple that code is. Though I guess when do you really need to copy so much data that you want cache-controlled writes...

Also I realized something stupid I wasn't doing: multicore. I enabled -fopenmp and now the Intel compiler generates code for a*x, x+y and x+a*y with memory throughput that is comparable to Sandra's results. Copying is still relatively slow though, not quite catching up to the serial asm code (like 12.5GB/s vs 13.4GB/s).
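For anyone following along, the multithreaded kernels look roughly like this. It is a sketch rather than STREAM's exact source: `copy_parallel` and `triad_parallel` are made-up names, and the static schedule is an assumption. Built without -fopenmp, the pragmas are ignored and these are just the serial loops.

```c
/* c[j] = a[j], split across threads when compiled with -fopenmp. */
void copy_parallel(double *restrict c, const double *restrict a, long n)
{
    long j;
#pragma omp parallel for schedule(static)
    for (j = 0; j < n; j++)
        c[j] = a[j];
}

/* out[j] = x[j] + s * y[j], the x+a*y case, parallelized the same way. */
void triad_parallel(double *restrict out, const double *restrict x,
                    const double *restrict y, double s, long n)
{
    long j;
#pragma omp parallel for schedule(static)
    for (j = 0; j < n; j++)
        out[j] = x[j] + s * y[j];
}
```

The output array must not overlap the inputs, since the pointers are declared `restrict`.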
