O.K. Adul and others that are curious: P4 RC5 & OGR benches...

fkloster

Diamond Member
Dec 16, 1999
4,171
0
0
Adul, bless his heart, PM'd me to do some RC5 & OGR tests on my P4. I thought they should be public. I know P4s prolly get spanked pretty bad in this stuff, and that's why Adul wanted me to run the benches, but the truth needs to be known and no one has tested this yet that I know of, so here goes...:

P4 @ 1800
PC-800 @ 480MHz (PC-960) ECC off TURBO on

RC5 benchmark....

#0 1,745,101
#1 2,245,685
#2 2,367,190

OGR benchmark....

#0 5,968,074
etc.
etc.

...anyways, how bad are these #'s guys? I know nothing about RC5 or OGR except that they are NOT SSE2 optimized right Adul?
 

NFS4

No Lifer
Oct 9, 1999
72,636
47
91
I remember that my Athlon 750 used to get around 2,200,000 in RC5
 

fkloster

Diamond Member
Dec 16, 1999
4,171
0
0
It figures. I knew the Pentium 4 would get slaughtered in this test. Pure raw digit crunching is where the Athlon reigns supreme.
 

fkloster

Diamond Member
Dec 16, 1999
4,171
0
0
I wonder why Adul wanted me to run the tests when P4 results are already in the database?
 

ugh

Platinum Member
Feb 6, 2000
2,563
0
0


<< RC5 benchmark....

#0 1,745,101
#1 2,245,685
#2 2,367,190
>>



WTH? This P3 800 I'm using is already crunching at [2,238,709.03 keys/sec]. There's something REALLY wrong here..
 

MadRat

Lifer
Oct 14, 1999
11,965
278
126
RC5 is fpu dependent but is not optimized for SSE because its decryption cannot be synthesized in SIMD. Isn't SIMD more or less optimized for repetitive operations, not dynamic?
 

dowxp

Diamond Member
Dec 25, 2000
4,568
0
76
[Jun 21 05:52:45 UTC] Automatic processor detection found 2 processors.
[Jun 21 05:52:45 UTC] Loading crunchers with work...
[Jun 21 05:52:45 UTC] Loaded RC5 2*2^28 packet 3FF25AA9:00000000 (57.60% done)
[Jun 21 05:52:45 UTC] Loaded RC5 2*2^28 packet 3FF29686:10000000 (37.60% done)
[Jun 21 05:52:45 UTC] Summary: 21 RC5 packets (41*2^28 keys)
0.00:41:48.51 - [4.38 Mkeys/s]


argh.
 

ugh

Platinum Member
Feb 6, 2000
2,563
0
0


<< actually, ugh, that's nearly exactly what it should be at. >>


It is for my machine, but a P4 @ 1800 getting something similar to my machine is rather odd.
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0


<< RC5 is fpu dependent >>

...*scratches head*
RC5 is entirely integer code (I'm looking through the assembly source code now :)). SIMD ops should help for 128-bit adds...the G4 is nearly architecturally identical to the G3, but the G4 gets nearly 3 times as many keys/clock due to heavy Altivec optimizations. I don't remember what the official word on the P4 is, but they haven't gotten around to doing SSE2 optimizations yet (if they ever plan to). Let me look through the source a little more, I'm trying to figure out what causes the P4 to do so poorly using the P6 core (it may take a while...it's not helping that most of my assembly experience is in MIPS-RISC and not x86 :)).
 

fkloster

Diamond Member
Dec 16, 1999
4,171
0
0


<< You should change your computer's name to "Crunching Monster"! >>



Change my computer's name from...'what'? I didn't start this thread to be teased. I started it so we could discuss the P4's relatively low performance in RC5 & OGR...
 

Eug

Lifer
Mar 11, 2000
24,000
1,620
126
Hmmm... My Celeron at about half that speed destroys that P4.

Intel Celeron 920 Windows 98 2.8010 RC5 2,622,739
Intel Celeron 920 Windows 98 2.8010 OGR 6,359,670

Note also that I was using an older client at the time too. I am now running 880 (less than half the speed), but it still beats that P4@1800. Celeron 533A@880 on BX (110 FSB) CAS 2-2-2.
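To put the "about half that speed" comparison into numbers, here is a rough sketch that divides the RC5 scores quoted so far in this thread by clock speed. The function names and the table layout are just for illustration. It takes the fastest of the three figures fkloster listed for the P4, and the Athlon 750 entry is NFS4's "around 2,200,000" recollection rather than a measured run, so treat that row as approximate.

```python
# Scores quoted earlier in this thread: (label, clock in MHz, RC5 keys/sec).
# Assumption: the P4 row uses the best of the three listed figures, and the
# Athlon 750 row is a from-memory estimate, not a benchmark log.
RESULTS = [
    ("P4 @ 1800 (best of 3)", 1800, 2_367_190),
    ("Athlon 750 (approx.)", 750, 2_200_000),
    ("P3 800", 800, 2_238_709.03),
    ("Celeron 920", 920, 2_622_739),
]


def keys_per_mhz(clock_mhz, keys_per_sec):
    """Keys/sec delivered per MHz of clock: a crude per-clock figure."""
    return keys_per_sec / clock_mhz


def ranked(results):
    """Return (label, keys per MHz) pairs, best per-clock result first."""
    rows = [(label, keys_per_mhz(mhz, kps)) for label, mhz, kps in results]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for label, per_mhz in ranked(RESULTS):
        print(f"{label:24s} {per_mhz:8.1f} keys/sec per MHz")
```

Per clock, the P4 figure works out to well under half of what the P6-family chips and the Athlon manage, which matches what everyone is seeing above.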
 

BurntKooshie

Diamond Member
Oct 9, 1999
4,204
0
0
if you're talking about RC5, that's because the P4's bitwise rotate-left instruction has a high latency. It's not a commonly used instruction, so why waste time optimizing the chip for it? RC5 uses it heavily, and thus, the P4 suffers.
 

Noriaki

Lifer
Jun 3, 2000
13,640
1
71
I believe it has something to do with the bitwise rotate instructions performing poorly on the P4 (but it's a very rarely used instruction anyways)....but don't quote me there.

Ed: Note to self: Read the whole thread silly, BurntKooshie beat your ass to that one ;)

But the P4 isn't so hot in raw number crunching...but the P3 had less than half the FPU power of the Athlon too, and in most things it performs pretty even clock for clock. Raw FPU power isn't everything, so he can't crack RC5...big deal. I have better things to use my CPU cycles for anyways....(and I'm not "just saying that"; I have a Duron @ 1000, I could do some major RC5 crunching, but it's just a waste of time and electricity IMO).

The P4 has lots of good things about its architecture. Look at the Quake3 scores. Now I know everyone is going to say "Well that's just one engine"; maybe it is, but it's one of the most advanced engines we currently have, and I think that future engines are going to show an advantage for the P4 as well.

 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Okay, here's my educated guess:

I looked through the source code (which you can get here)...the RC5 core is actually quite small, only around 500 lines of x86 assembly. Much of the core is three procedures of "key expansion," which contain mostly adds, moves, and rotates using general purpose registers...no problem here. At the end are three procedures: next_iter, next_iter2, and next_inc...within the 50 or so lines of code, there are 10 conditional branches...I think this is the problem.
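For anyone who wants to see why key expansion is all adds and rotates, here is a minimal sketch of the RC5 key schedule as given in Rivest's public description of the cipher, for 32-bit words and 12 rounds. It is not taken from the distributed.net assembly; the names `expand_key` and `rotl32` are made up for illustration, and whether the three mixing passes line up with the three "key expansion" procedures in the client source is only a guess.

```python
MASK32 = 0xFFFFFFFF
P32 = 0xB7E15163  # magic constants from Rivest's RC5 description (32-bit words)
Q32 = 0x9E3779B9


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits (only the low 5 bits of n count)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def expand_key(key, rounds=12):
    """Build the RC5 round-key table S from a byte-string key."""
    # Load the key bytes little-endian into 32-bit words L[0..c-1].
    c = max(1, (len(key) + 3) // 4)
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK32

    # Fill S with an arithmetic progression of the magic constants.
    t = 2 * (rounds + 1)
    S = [(P32 + i * Q32) & MASK32 for i in range(t)]

    # Mix the key into S: three passes over the larger table,
    # using nothing but 32-bit adds and rotates.
    A = B = i = j = 0
    for _ in range(3 * max(t, c)):
        A = S[i] = rotl32((S[i] + A + B) & MASK32, 3)      # fixed rotate by 3
        B = L[j] = rotl32((L[j] + A + B) & MASK32, A + B)  # data-dependent rotate
        i = (i + 1) % t
        j = (j + 1) % c
    return S


if __name__ == "__main__":
    table = expand_key(bytes(8))  # 64-bit all-zero key, as in an RC5-64 style search
    print(len(table), "round-key words")
```

Every trip through that mixing loop does two rotates, and the second one has a rotate count that depends on the data, so a CPU with a slow rotate pays for it on every single key tested.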

edit: apparently it's the rotate instructions. ;)

edit2: you guys are right:


<< Why are PowerPC-based and (most) Intel-based computers so much faster than other platforms on RC5-64?
Integral to the mathematics of the RC5 algorithm are 32-bit rotate operations.
For whatever reason, the designers of the IA32 (32bit Intel x86) and the PowerPC architectures decided to implement the rotate function as a hardware instruction.

Many other CPUs do not have built-in hardware rotate instructions and must emulate the operation by (at the very least) two shifts and a logical OR. This handicap is why many non-32bit-Intel [1] and non-PowerPC computers run RC5 slower than one might expect based on real-world benchmarks. It is also the main reason why the RC5 client is a poor benchmark to use in determining the speed or performance of a particular CPU.

[1] The IA32 architecture is that used by the Intel 80386, 80486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4 processors. The Pentium 4 does not however have a hardware rotate instruction. >>


That's weird....it's not like a rotator is hard to implement, using a barrel shifter with multiplexed shifting (even my lowly 16-bit RISC CPU that I designed for my comp architecture class had a hardware rotator)...
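As a rough illustration of the two ideas above, the sketch below shows a 32-bit rotate emulated with two shifts and a logical OR, next to the same rotate built as a chain of fixed-distance stages the way a multiplexed barrel rotator is laid out. Both function names are hypothetical, and this models the logic only, not any particular CPU's circuit.

```python
MASK32 = 0xFFFFFFFF


def rotl32_shift_or(x, n):
    """Rotate left emulated with two shifts and a logical OR."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotl32_barrel(x, n):
    """Rotate left as five fixed stages (1, 2, 4, 8, 16 bits).

    Each stage either passes the value through or rotates it by its fixed
    distance, selected by one bit of the count, like a mux in a barrel rotator.
    """
    for stage in range(5):
        distance = 1 << stage
        if n & distance:  # mux select line for this stage
            x = ((x << distance) | (x >> (32 - distance))) & MASK32
    return x


if __name__ == "__main__":
    value = 0x12345678
    for count in (0, 1, 8, 13, 31):
        assert rotl32_shift_or(value, count) == rotl32_barrel(value, count)
    print(hex(rotl32_barrel(value, 8)))
```

In software the emulated version costs three operations per rotate instead of one, which is the handicap the FAQ describes for CPUs without a rotate instruction.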

 

Duvie

Elite Member
Feb 5, 2001
16,215
0
71
I applaud you FKloster...No one should tease you for wanting to understand anything...

Adul may have wanted to see if they were true on all systems, and since you have a bit of overclocking done, how that may affect the scores...

If you are interested, I would like you to run a benchmark I think the P4 does better at, not necessarily clock for clock but best of P4 to best of Athlon...let me know if interested, cause I would like to see the scores of your CPU at default and oc'd...LMK thru PM...
 

Soccerman

Elite Member
Oct 9, 1999
6,378
0
0
One thing that I know is that the only instruction set that gets used with RC5 is the Altivec one, for G4s, which is why they do so damned well!

Can someone explain to me why similar things cannot be, or aren't done on x86 CPU's?

what exactly is the difference between 3DNow!, and SSE (both 1 and 2)?
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Soccerman: The idea behind SIMD is to operate on vectors of data...instead of having a 32-bit ALU and 32-bit registers that do a single 32-bit operation, you have a 128-bit ALU and 128-bit registers, allowing you to do, for example, one 128-bit add, two 64-bit adds, four 32-bit adds, eight 16-bit adds, or sixteen 8-bit adds at one time. The challenge in coding for SIMD is to find, for example, four sequential 32-bit adds without interdependent data hazards that can be synthesized into a single 128-bit SIMD add. Compilers are not that good at extracting SIMD parallelism, so the best results come by hand-coding. SIMD instructions include normal integer and floating point operations, as well as a bunch of permute, pack/unpack, merge, and alignment ops.
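Here is a small sketch of the "four 32-bit adds in one 128-bit add" idea, modelling a 128-bit register as a plain Python integer. The helper names `pack`, `unpack`, and `packed_add32` are invented for this example and do not correspond to any real instruction or intrinsic.

```python
LANE_BITS = 32
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1


def pack(lanes):
    """Pack four 32-bit values into one 128-bit integer (lane 0 in the low bits)."""
    value = 0
    for index, lane in enumerate(lanes):
        value |= (lane & LANE_MASK) << (index * LANE_BITS)
    return value


def unpack(value):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(value >> (index * LANE_BITS)) & LANE_MASK for index in range(LANES)]


def packed_add32(a, b):
    """Add four 32-bit lanes independently; carries never cross a lane boundary."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])


if __name__ == "__main__":
    a = pack([1, 2, 3, 0xFFFFFFFF])
    b = pack([10, 20, 30, 1])
    print(unpack(packed_add32(a, b)))  # the last lane wraps around on its own
```

This only works because the four adds are independent. If the second add needed the result of the first, they could not sit in the same packed operation, which is the data hazard problem mentioned above.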

Altivec is the most powerful of the SIMD sets because it can perform 128-bit integer and floating-point ops, and has its own set of 32 128-bit Altivec registers. It also has two dedicated Altivec execution units: one for ALU ops, the other for permutation ops. The units are fast, with latencies of 1 cycle for simple ops and 3-4 for more complex ops, and can operate in parallel with the normal integer and FP units.

MMX has integer instructions only, and operates on 64-bit vectors. The MMX registers share the same register space with the x87 FP register file, so you can't mix MMX instructions with x87 FP instructions.

SSE uses 128-bit FP instructions, and adds its own set of eight 128-bit SSE registers. The problem is that it only has 64-bit SIMD units (one add, one multiply) and datapaths, so it has to break most 128-bit SIMD ops down into two 64-bit ops. It can do a single 128-bit add-multiply (useful for matrix dot-products), or a 64-bit add/64-bit multiply in one cycle; otherwise it has to perform two 64-bit adds or multiplies to simulate a single 128-bit add or multiply. This limitation was incurred because adding full 128-bit execution units and datapaths would have been costly to the die size. Also, the CPU has to switch states to use the SSE register set, so mixing a lot of SSE ops with x87/MMX ops can incur a costly delay.

SSE2 extends the MMX integer instructions to 128-bit using the SSE register set, and adds some more FP instructions. I'm not too sure how SSE2 is implemented on the P4, but I believe it has the same 64-bit execution limitation as the P3. Since SSE/SSE2 are 128-bit instruction sets, it is possible for future Intel or AMD implementations to have 128-bit execution units and datapaths.

3DNow does 64-bit integer and FP instructions, and uses the same register space as x87/MMX. It can simulate 128-bit adds and multiplies the same way the P3 does.
 

Dark4ng3l

Diamond Member
Sep 17, 2000
5,061
1
0
3,447,701.56 keys/sec is what my tbird 900 @ 1 gig has been doing (I've been browsing and using ICQ though, so it's not a definite "score")
 

ugh

Platinum Member
Feb 6, 2000
2,563
0
0


<< I didn't know they were in the database. :eek: >>



I knew there was one, but it's soooooooo outdated and it's not maintained by DNet themselves :D
 

Pabster

Lifer
Apr 15, 2001
16,986
1
0
No flames or anything here, please. Here's a couple numbers from my 1500MHz Palomino box:

[Jun 22 22:35:19 UTC] Benchmark for RC5 core #6 (RG/HB ath)
0.00:00:16.11 [5,531,477.66 keys/sec]

[Jun 22 22:37:13 UTC] Benchmark for OGR core #0 (GARSP 5.13)
0.00:00:16.04 [12,340,507.04 nodes/sec]

I'm not much of an RC5/OGR nut, much prefer SETI, and I've no idea how these numbers are. But they look pretty good to me :D