
help collect benchmark results

vnv

Junior Member
Hello, I am developing a low-level benchmark that measures the latency and throughput of each instruction on x86 processors: http://mubench.sourceforge.net/. It is GPL'ed and runs under Linux. I need results from a wider range of processors. Would you be willing to run it and contribute results?

It is very simple: just download the tarball, unpack it, and run "perl mubench.pl --no-pairs" for a fast benchmark, or just "perl mubench.pl" if you have a lot of time. Thanks!

There are also some results already there: http://mubench.sourceforge.net/results.html. Might be of interest to compare processors, or if writing assembly code. 🙂

--Alex
 
How exactly are you measuring the latency? I have a hard time believing inc r32 is a different latency from add r32, r32 on K7 and K8.

For what it's worth, optimization guides published by processor vendors tend to include some of this information. Also, with superscalar and out of order processors, raw instruction latencies have less meaning than you might think. The K7 optimization guide has lots of info (latencies and more) in Appendix F... the K8 optimization guide has the same info in Appendix C. (I think this document is the same as the previous one).

edit: You might find AMD's CodeAnalyst interesting if you're trying to write highly-optimized code - it can do things like pipeline simulation and give you per-instruction performance statistics.
 
CTho9305 - yes, optimization guides do include some of this information. There are links to them from the mubench page actually 😉 Intel's are pretty thin on info recently: nothing on Core2, nothing on (for example) the difference between Prestonia/Nocona/Paxville Xeons (which is considerable)...

How does it measure latency? Well, it's all in the code, but basically it generates a long chain of instructions where each one depends on the result of the previous one. Run with --include='^add ' and then examine the test.c file, you will see exactly how it performs the measurement for "add" for example. You're right, the latency measurement for single operand instructions is now incorrectly the same as the throughput. That said, throughput for add and inc is definitely different so I don't think they run in a similar way.
 
CTho9305 - yes, optimization guides do include some of this information. There are links to them from the mubench page actually 😉
Yeah, I saw that after I posted. Sorry.

You're generating bad code for the add test if you care about throughput.

"add %%ebx, %%eax\n"
"add %%rcx, %%rbx\n"
"add %%edx, %%ecx\n"
"add %%rsi, %%rdx\n"
"add %%edi, %%esi\n"
"add %%r8, %%rdi\n"
These 6 ops are dependent on each other, so they will take 6 cycles to execute, giving you a throughput of 1 and a latency of 1.

"inc %%ebx\n"
"inc %%ecx\n"
"inc %%edx\n"
"inc %%esi\n"
"inc %%edi\n"
"inc %%eax\n"

These are independent, and since K7 is a 3-issue machine, the first 3 will execute in one cycle, and the second 3 will execute in the next cycle, giving you a latency of 1 and a throughput of 3.

However, if you had used independent operands for the add instructions, you would have also found a throughput of 3 for add. I'd suggest testing add throughput with:
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx
et cetera

This will give you a latency of 1 / throughput of 3.
 
Hmm, what you have is the code for mixed 64 and 32-bit adds (since the bench tries all mixes as well as individual instructions, the last test it ran was "add r32, r32 add r64, r64" and you looked at the test.c from that). I'm not sure the generated code always properly takes into consideration the fact that eax and rax are not independent 🙂 But in the particular throughput example above it is ok: the destination is the second register, so the 6 ops are not dependent on each other... "add ebx, eax" uses ebx, and then "add rcx, rbx" destroys the value in ebx (after it is used as source in the previous instruction).
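
The operand-order point can be checked with a toy issue model. This is only a sketch under loud assumptions, not a K7/K8 simulator and not part of mubench: every instruction has latency 1, at most `width` instructions start per cycle, and register renaming hides write-after-read hazards, so only true read-after-write dependencies stall. Registers are plain indices (0 = eax/rax, 1 = ebx/rbx, and so on), and `struct insn` and `cycles_needed` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS   16    /* architectural registers, identified by index */
#define MAX_CYCLES 1024  /* upper bound on the instruction count handled */

/* One two-operand instruction such as add: reads src and dst, writes dst. */
struct insn { int src; int dst; };

/* Toy model (assumptions: latency 1, `width` issue slots per cycle,
 * renaming removes write-after-read hazards). Returns cycles used. */
int cycles_needed(const struct insn *code, size_t n, int width)
{
    int ready[NUM_REGS] = {0};     /* first cycle each register value is usable */
    int issued[MAX_CYCLES] = {0};  /* instructions started in each cycle */
    int total = 0;

    assert(width > 0 && n < MAX_CYCLES);
    for (size_t i = 0; i < n; i++) {
        /* An instruction cannot start before both of its inputs are ready. */
        int c = ready[code[i].src];
        if (ready[code[i].dst] > c)
            c = ready[code[i].dst];
        while (issued[c] >= width)  /* look for a cycle with a free slot */
            c++;
        issued[c]++;
        ready[code[i].dst] = c + 1; /* result is usable one cycle later */
        if (c + 1 > total)
            total = c + 1;
    }
    return total;
}
```

Feeding it the six adds quoted earlier with `width` set to 3: read in AT&T order (source first), for example `{1,0},{2,1},{3,2},{4,3},{5,4},{6,5}` as `{src,dst}` pairs, the model reports 2 cycles. Read as "dest, src" instead (`{0,1},{1,2},...`), each add consumes the previous result and the model reports 6 cycles.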

Have a look at the code when it doesn't mix 64/32, I think it's ok. Try --include='^add r32' and look at test.c again. Throughput is measured like this:
"add %%ebx, %%eax\n"
"add %%ecx, %%ebx\n"
"add %%edx, %%ecx\n"
"add %%esi, %%edx\n"
and latency like this:
"add %%eax, %%ebx\n"
"add %%ebx, %%ecx\n"
"add %%ecx, %%edx\n"
"add %%edx, %%esi\n"

You might say it is weird to "recycle" a register by using it as the destination immediately after it is used as a source in another instruction, but it doesn't seem to make any difference. Also real code will often do this.
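
For anyone who wants to play with the two patterns above, here is a minimal sketch of a generator that reproduces them. It is not the mubench code (that is written in Perl), and `emit_chain` and its parameters are hypothetical names. The output uses a single `%` per register, as the assembler sees it; the doubled `%%` in test.c is the escaping needed inside GCC extended asm.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from mubench. Writes n two-operand
 * instructions in AT&T order ("op %src, %dst") over a ring of registers.
 * chained != 0: each instruction reads the previous destination (latency).
 * chained == 0: each instruction reads the register that the next
 *               instruction overwrites (throughput pattern from the post).
 * Returns 0 on success, -1 if the buffer is too small. */
int emit_chain(char *out, size_t cap, const char *op,
               const char *const *regs, size_t nregs,
               size_t n, int chained)
{
    size_t used = 0;

    assert(cap > 0 && nregs > 1);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        const char *a = regs[i % nregs];
        const char *b = regs[(i + 1) % nregs];
        const char *src = chained ? a : b;
        const char *dst = chained ? b : a;
        int w = snprintf(out + used, cap - used,
                         "%s %%%s, %%%s\n", op, src, dst);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;  /* output would be truncated */
        used += (size_t)w;
    }
    return 0;
}
```

Called with the registers eax, ebx, ecx, edx, esi and n = 4, the chained form yields the latency sequence shown above and the unchained form yields the throughput sequence.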

Anyway, this is basically why I posted it, you caught two bugs already 🙂 (which I will fix very shortly).

Now how about posting some results? 🙂 (with --no-pairs so it doesn't take forever).
 
Oh, I'm used to reading some other assembly format which is add dest, src. You really should get a throughput of 3 adds per cycle then (at least on K7 and K8).
 
C:\DOCUME~1\Chris\LOCALS~1\Temp\New Folder (2)\mubench-0.2.1>perl --version

This is perl, v5.8.6 built for cygwin-thread-multi-64int

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

C:\DOCUME~1\Chris\LOCALS~1\Temp\New Folder (2)\mubench-0.2.1>gcc --version
gcc (GCC) 3.3.3 (cygwin special)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

mubench 0.2.1
running 315 tests
saving results to mubench-results-20061001T163152.xml
error: cannot compile psignb xmm, xmm: /cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/
ccsppCP2.s: Assembler messages:
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:75: Error: bad register name `%xmm8'
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:76: Error: bad register name `%xmm9'
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:77: Error: bad register name `%xmm10'
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:78: Error: bad register name `%xmm11'
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:79: Error: bad register name `%xmm12'
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:80: Error: bad register name `%xmm13'
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:81: Error: bad register name `%xmm14'
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:82: Error: bad register name `%xmm15'
/cygdrive/c/DOCUME~1/Chris/LOCALS~1/Temp/ccsppCP2.s:95: Error: no such instruction: `psignb %xmm1,%xmm0'
Pages more of basically the same thing.
 
Thanks. I'm surprised it gets that far on Windows 😉 Looks like your system or gcc is not 64-bit. Try giving it --mhz=4000 (or whatever the correct number is for you), and also --no-64bit; that should reduce the number of errors. The errors are basically harmless, although they should not be going to the terminal, and aside from that it looks like things are working. Check the contents of the .xml file after it has run for a minute?
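
One plausible reason a clock speed is needed is to turn a wall-clock timing into cycles per instruction. Whether mubench does exactly this is an assumption, not something confirmed in this thread, and `cycles_per_insn` is a hypothetical name. A minimal sketch of the arithmetic:

```c
#include <assert.h>

/* Hypothetical conversion, not taken from mubench: given the elapsed time of
 * a timed loop, the clock speed in MHz, and the number of instructions the
 * loop executed, return the average cycles per instruction. */
double cycles_per_insn(double seconds, double mhz, double insns)
{
    assert(insns > 0.0);
    double cycles = seconds * mhz * 1e6;  /* MHz -> cycles per second */
    return cycles / insns;
}
```

For example, one million instructions taking one millisecond at 2000 MHz works out to about 2 cycles per instruction. A wrong clock figure would scale every result by the same factor, which is why the correct number matters.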
 