P4 and RC5 guess work [Updated!]

crYnOid · Nov 17, 2000

I seem to remember a news post on aceshardware.com that said the P4 will have a latency of 4 cycles on the rotl instruction, whereas the Athlon core has a latency of 1 cycle. The fact that the P4 doesn't have an optimised core yet doesn't help either (but I don't believe a better core can make up for the poor latency).

Adding on to that very simplistically:
1000MHz for the ALU
1000MHz Athlon/1 = 1000

1500MHz*2 = 3000MHz ALU
3000MHz P4/4 = 750

That gives you a good idea of the difference between the cpus.

So the P4@1.5GHz can do about 3/4 of what an Athlon@1GHz can do. Looking up the speed page, a 1GHz Athlon gets 3.4Mk/sec, so the P4@1.5GHz would get, using the above figures, 2.55Mk/sec, which is about the same as a P3 900.

And in an even contest that would leave the P4@1.5GHz with 2.55Mk/sec and an Athlon@1.5GHz with 5.1Mk/sec, twice that of the P4.
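
The back-of-the-envelope estimate above can be written out as a short Python sketch. The function name and the assumption that keyrate scales linearly with (ALU clock / rotl latency) are purely illustrative and come from the reasoning in this post, not from any measurement; the 3.4Mk/sec baseline is the 1GHz Athlon figure quoted above.

```python
def estimated_keyrate(core_mhz, alu_multiplier, rotl_latency,
                      baseline_mkeys, baseline_score):
    """Scale a known keyrate by (ALU clock / rotl latency).

    This is only the simplistic model from the post: it ignores
    throughput, pipelining and every instruction other than rotl.
    """
    score = core_mhz * alu_multiplier / rotl_latency
    return baseline_mkeys * score / baseline_score


# Baseline: 1GHz Athlon, ALU at core clock, 1-cycle rotl, 3.4Mk/sec.
ATHLON_1GHZ_MKEYS = 3.4
ATHLON_1GHZ_SCORE = 1000 * 1 / 1

# P4 at 1.5GHz: double-pumped ALU, 4-cycle rotl.
p4_1500 = estimated_keyrate(1500, 2, 4, ATHLON_1GHZ_MKEYS, ATHLON_1GHZ_SCORE)

# Hypothetical Athlon at 1.5GHz for the "even contest".
athlon_1500 = estimated_keyrate(1500, 1, 1, ATHLON_1GHZ_MKEYS, ATHLON_1GHZ_SCORE)

print(round(p4_1500, 2), round(athlon_1500, 2))
```

Under these assumptions the sketch reproduces the 2.55Mk/sec and 5.1Mk/sec figures given above.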

Does that look right? Comments?

blade47 · Nov 17, 2000

Speculation is, as you stated, only a guess. We'll just have to wait and see. 🙂

JHutch · Nov 17, 2000

It is only speculation, but based on what I've seen and read here and there, I think crYnOid has a pretty good ball-park figure...

JHutch

Train · Nov 17, 2000

Don't the FPU and ALU for the P4 run at double speed? So we are talking 2.8 and 3.0GHz respectively for the 1.4 and 1.5GHz parts, so if they get an RC5 core out that utilizes the CPU the same way the PIII is utilized, we should see some really nice keyrates on that CPU...

just my thoughts...

Moose · Nov 17, 2000

RC5 is all integer work; it never goes into the FPU, so that won't help any. I have not heard any numbers yet on the keyrate of the P4, as none of us have access to a P4 yet, but I suspect that sometime soon someone will modify the core to take advantage of P4 features. (Don't hold me to that one, it's all a matter of time and whether someone can help make that happen.)

moose

BurntKooshie · Nov 17, 2000

Train... the ALUs and some other parts are double pumped... not the FPUs. The FPU of the P4 is weaker than the P3's (also, no "free" FXCH... erm... I think that's what it is. It's the instruction that swaps a lower register of the FP stack with the top so it can be worked on). SSE2 will allow it to be as fast as the Athlon is in FPU natively.

Besides, if you look at crYnOid's thoughts, he already took those into account.

sciencewhiz · Nov 17, 2000

There were some P4 benchmarks posted a few days ago.

The link doesn't work anymore, but the reviewer says that RC5 ran slower than on his 800MHz Athlon.

Fandu · Nov 17, 2000

Haha, well this was my second question to the Intel guys @ COMDEX. I talked with most of the marketing team and finally Justin Whitney, who is one of the engineers on the project. Unfortunately they were not familiar with the project and refused to give me a straight answer on the ALU performance, sidestepping my best placed questions. And because of all the NDAs we were under, they carefully guarded the systems while we had a chance to play around on them, allowing us only to take visual benchmarks, no hard numbers. But I had my small but fitting revenge when the entire room of about 25 P4s went down. hehe, we were buggin them about drawing too much power.

Fandu · Nov 17, 2000

Just remembered an old friend who happens to have a 1.5GHz P4 and may be able to run some benchmarks. More to come...

EDIT: E-mail fired off, benchmarks to come. 🙂

crYnOid · Nov 18, 2000

🙂 thanks for the comments guys 😉

Rendus · Nov 18, 2000

If I talk nicely to a certain set of people at a certain place that happens to pay me by the hour, I may be able to get into the room holding our two demo machines we got from a certain large client (Not sure how many NDAs cover this crap) that happen to contain 1.5GHz P4s. I'll see what I can do, but that's unlikely.

Train · Nov 18, 2000

I just looked at the source code for the P3 dnet core,

I guess one of the major ways they optimize for different chips is to use exactly the right # of pipelines, and considering no chip out there has as many pipelines as the P4 does, no core could have taken advantage of that. I think when the P3 core is modified to use all the new pipelines in the P4, we should see some nice numbers.

Fandu · Nov 18, 2000

Well, I (read: he) do have all the benchmarks you could ever want from the d.net client, both RC5 and OGR. Unfortunately he's not going to put his butt on the line and break NDA quite yet. Me, well mine doesn't cover benchmarks from other machines. hehe 🙂

crYnOid · Nov 19, 2000

WOOHOO! proof of sorts 😉 At this link on page 301, go down to ROL/ROR and you will notice that it says that the Latency is 4 (woohoo! I got it right) and the throughput is at 1. So the figures that I put together above are accurate (as they can be 😉).

BTW, an Athlon 800 does 2.7Mkeys/sec, so that explains the figures in the australianIT.com.au article.

ViRGE · Nov 19, 2000

Yikes, unless Dnet has a trick up their sleeve, the P4 really will suck at RC5! :Q

crYnOid · Nov 19, 2000

suck like a cheap hoo........er..... 😱😀

BurntKooshie · Nov 19, 2000

crYnOid, good work. I downloaded the file earlier, but I hadn't had a chance to read through it yet. I know the Athlon's latency is 1 (read that techdoc), but what's the throughput? 1, I assume?

crYnOid · Nov 19, 2000

that is what I assume too.

ElFenix · Nov 19, 2000

when they come out with a client i'll "rent" a p4 for a short while and do a little benchmarking, if need be. rc5, ogr, and then ratbastard's s@h unit...

crYnOid · Nov 19, 2000

sadly I don't believe a new client will help a great deal as the problem is with an instruction in hardware. A new client may get a small performance boost but it won't fix the problem of having such a high latency (unless a work-around is found, which I doubt).

Riv · Nov 19, 2000

Notice how the throughput is 1, so I don't know why you think the P4 will suck. Just increase the number of keys cracked per iteration to hide the latencies (and that's what the pipeline count in the Dnet RC5 source refers to: the number of keys cracked at once). It would also be interesting to see how fast a P4 SIMD core would be. The P5 MMX core sucks on a P3, so I don't have a lot of faith in it even if the number of keys cracked at once is doubled. I wish there was an instruction where multiple data could be shifted with different shift counts...
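
As a rough illustration of the point about cracking more keys per iteration, here is a toy Python model. It assumes each key is one dependent chain of rotl instructions, that chains for different keys are independent, and it uses the P4 figures discussed in this thread (latency 4, throughput 1) as defaults. The function name is made up for this sketch, and the model ignores every other instruction, register pressure and issue-port limits.

```python
def cycles_per_rotl(keys_in_flight, latency=4, throughput=1):
    """Average cycles spent per rotl, per key, when several keys are interleaved.

    One round issues one rotl for each key (keys_in_flight * throughput
    cycles), but the next round cannot start until the first key's result
    is back (latency cycles), so a round costs whichever is larger.
    """
    round_cycles = max(latency, keys_in_flight * throughput)
    return round_cycles / keys_in_flight


for keys in (1, 2, 4, 8):
    print(keys, cycles_per_rotl(keys))
```

In this model the cost per rotl drops from 4 cycles with one key to 1 cycle with four keys in flight, and going beyond four gains nothing further because throughput becomes the limit.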

crYnOid · Nov 19, 2000

This is a quote from the above pdf file.

<< Latency: The number of clock cycles that are required for the execution core to complete the execution of all of the µops that form an IA-32 instruction.

Throughput: The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many IA-32 instructions, the throughput of an instruction can be significantly less than its latency.

Execution units: The names of the execution units in the execution core that are utilized to execute the µops for each instruction. This information is provided only for IA-32 instructions that are decoded into no more than 4 µops. µops for instructions that decode into more than 4 µops are supplied by microcode ROM. Note that several execution units may share the same port, such as FP_ADD, FP_MUL, or MMX_SHFT in the FP_EXECUTE cluster. >>

As I should have said above, I have no idea about the throughput of any other processor. You will need someone else to explain throughput and its effects, as I have no idea 🙁 I assume that the Athlon and P3 have the same throughput.

Riv · Nov 19, 2000

Let's first deal with latency: it's the time it takes from when you issue the instruction until you get the result. But that doesn't mean you can't execute other instructions while that instruction is carried out, and that's where throughput comes into play. The throughput is 1 clock, meaning you're free to issue another ROL (or whatever) the clock after the first one. So if you execute, say, 4 independent ROL instructions, it will take you not 16 clocks but 8 clocks before you have the last result. The point is that if you can execute some other instructions before you need to use the result, the latency is irrelevant. Obviously it would be easier to write code if every instruction had the same latency and throughput, but that's another story...
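
The difference between the two readings of "latency" can be sketched with a small Python timing model. The function name and the idealised pipeline (nothing else competing for the issue port) are assumptions made for this illustration only.

```python
def cycles_until_last_result(n, latency, throughput, dependent):
    """Cycles elapsed from issuing the first instruction to the last result.

    dependent=True:  each instruction needs the previous result, so the
                     full latency is paid every time.
    dependent=False: a new instruction is issued every `throughput`
                     cycles and only the last one's latency is exposed.
    """
    if n == 0:
        return 0
    if dependent:
        return n * latency
    return (n - 1) * throughput + latency


serial = cycles_until_last_result(4, latency=4, throughput=1, dependent=True)
pipelined = cycles_until_last_result(4, latency=4, throughput=1, dependent=False)
print(serial, pipelined)
```

For four ROLs with latency 4 and throughput 1 the model gives 16 elapsed cycles for a dependent chain and 7 for independent ones, meaning the last result is usable in the 8th clock, in line with the figure above.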

crYnOid · Nov 19, 2000

I have interpreted the latency as being the amount of time before the next instruction can be started. My idea of it was that the ALU would be in use for those 4 cycles while the µops that form the instruction are carried out. Once they are finished it can start the next one. I may be wrong; if so, point me in the right direction 😉

Fandu · Nov 19, 2000

Riv is correct, you guys are forgetting how the pipeline works. Last I heard the P4 was around 120% faster than a P3 in pure integer ops.
