
P4 and RC5 guess work [Updated!]

crYnOid

I seem to remember a news post on aceshardware.com that said the P4 will have a latency of 4 cycles on the rotl instruction, whereas the Athlon core has a latency of 1 cycle. The fact that the P4 doesn't have an optimised core for it yet doesn't help either (but I don't believe a better core can make up for the poor latency).

Adding on to that very simplistically:
1000MHz for the ALU
1000MHz Athlon/1 = 1000

1500MHz*2 = 3000MHz ALU
3000MHz P4/4 = 750

That gives you a good idea of the difference between the cpus.

So the P4@1.5GHz can do about 3/4 of what an Athlon@1GHz can do. Looking up the speed page, a 1GHz Athlon gets 3.4Mk/sec, so the P4@1.5GHz would get, using the above figures, 2.55Mk/sec, which is about the same as a P3 900.

And in an even contest that would leave the P4@1.5GHz with 2.55Mk/sec and an Athlon@1.5GHz with 5.1Mk/sec, twice that of the P4.
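For anyone who wants to play with the numbers, the estimate above can be written out as a few lines of Python. This is only the same back-of-the-envelope model (clock, times ALU multiplier, divided by rotl latency); the function name is made up for illustration and the 3.4Mk/sec baseline is the speed-page figure quoted above.

```python
def rotate_rate_mhz(clock_mhz, alu_multiplier, rotl_latency):
    """Millions of back-to-back rotates per second under the naive model.

    Assumes every rotate waits for the previous one to finish, which is
    the assumption behind the estimate (not a measured result).
    """
    return clock_mhz * alu_multiplier / rotl_latency


athlon_1000 = rotate_rate_mhz(1000, 1, 1)  # 1000 MHz ALU, latency 1
p4_1500 = rotate_rate_mhz(1500, 2, 4)      # double-pumped ALU, latency 4

ratio = p4_1500 / athlon_1000              # P4@1.5GHz relative to Athlon@1GHz

ATHLON_1GHZ_MKEYS = 3.4                    # figure quoted from the speed page

p4_1500_mkeys = ATHLON_1GHZ_MKEYS * ratio      # estimated P4@1.5GHz keyrate
athlon_1500_mkeys = ATHLON_1GHZ_MKEYS * 1.5    # Athlon scaled linearly to 1.5GHz

print(f"ratio P4@1.5 / Athlon@1.0: {ratio:.2f}")
print(f"P4@1.5GHz estimate:     {p4_1500_mkeys:.2f} Mk/sec")
print(f"Athlon@1.5GHz estimate: {athlon_1500_mkeys:.2f} Mk/sec")
```

Changing the latency or multiplier arguments shows how sensitive the guess is to those two inputs.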

Does that look right? Comments?

 
It is only speculation, but based on what I've seen and read here and there, I think crYnOid has a pretty good ball-park figure...

JHutch
 
Doesn't the FPU and ALU for the P4 run at double speed? So we are talking 2.8 and 3.0 GHz respectively for the 1.4 and 1.5 GHz parts. If they get an RC5 core out that utilizes the CPU the same way the PIII is utilized, we should see some really nice keyrates on that CPU...

just my thoughts...
 
RC5 is all integer work; it never goes into the FPU, so that won't help any. I have not heard any numbers yet on the keyrate of the P4, as none of us have access to a P4 yet, but I suspect that sometime soon someone will modify the core to take advantage of P4 features. (Don't hold me to that one; it's all a matter of time and whether someone can help make that happen.)

moose
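To show what "all integer work" means here, below is a rough Python sketch of the RC5 round function on 32-bit words. It is not the d.net core: the 12-round default is an assumption about the contest parameters, the key schedule is left out, and `demo_subkeys` is just a filler table built from the RC5 magic constants without any key mixing, so it does not correspond to a real key.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n (only the low 5 bits of n count)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotr32(x, n):
    """Rotate a 32-bit word right by n (only the low 5 bits of n count)."""
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK32


def rc5_encrypt_block(a, b, s, rounds=12):
    """Encrypt one two-word block; s is an already expanded subkey table."""
    a = (a + s[0]) & MASK32
    b = (b + s[1]) & MASK32
    for i in range(1, rounds + 1):
        # Each half-round is xor, data-dependent rotate, add: integer only.
        a = (rotl32(a ^ b, b) + s[2 * i]) & MASK32
        b = (rotl32(b ^ a, a) + s[2 * i + 1]) & MASK32
    return a, b


def rc5_decrypt_block(a, b, s, rounds=12):
    """Inverse of rc5_encrypt_block, used here only as a sanity check."""
    for i in range(rounds, 0, -1):
        b = rotr32((b - s[2 * i + 1]) & MASK32, a) ^ a
        a = rotr32((a - s[2 * i]) & MASK32, b) ^ b
    return (a - s[0]) & MASK32, (b - s[1]) & MASK32


# Filler subkeys (hypothetical, NOT a real key schedule): 2 * (12 + 1) words.
demo_subkeys = [(0xB7E15163 + i * 0x9E3779B9) & MASK32 for i in range(26)]

cipher = rc5_encrypt_block(0x01234567, 0x89ABCDEF, demo_subkeys)
plain = rc5_decrypt_block(cipher[0], cipher[1], demo_subkeys)
print([hex(w) for w in plain])
```

Note that every rotate takes its count from the other half of the block, so within one key each rotate has to wait for the result before it.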
 
train....the ALUs and some other parts are double pumped....not the FPU. The FPU of the P4 is weaker than the P3's (also, no "free" FXCH..erm....I think that's what it is. It's the instruction that pops a lower part of the stack to the top so it can be worked on). SSE2 will allow it to be as fast as the Athlon is in FPU natively.

Besides, if you look at crYnOid's thoughts, he already took those into account.
 
There were some P4 benchmarks posted a few days ago.

The link doesn't work anymore, but the reviewer says that RC5 ran slower than on his 800 MHz Athlon.
 
Haha, well this was my second question to the Intel guys @ COMDEX. I talked with most of the marketing team and finally Justin Whitney, who is one of the engineers on the project. Unfortunately they were not familiar with the project and refused to give me a straight answer on the ALU performance, sidestepping my best placed questions. And because of all the NDAs we were under, they carefully guarded the systems while we had a chance to play around on them, allowing us only to take visual benchmarks, no hard numbers. But I had my small but fitting revenge when the entire room of about 25 P4s went down. hehe, we were buggin' them about drawing too much power.
 
Just remembered an old friend who happens to have a 1.5GHz P4 and may be able to run some benchmarks. More to come...

EDIT: E-mail fired off, benchmarks to come. 🙂
 
If I talk nicely to a certain set of people at a certain place that happens to pay me by the hour, I may be able to get into the room holding our two demo machines we got from a certain large client (Not sure how many NDAs cover this crap) that happen to contain 1.5GHz P4s. I'll see what I can do, but that's unlikely.
 
I just looked at the source code for the P3 dnet core.

I guess one of the major ways they optimize for different chips is to use the exact correct # of pipelines, and considering no chip out there has as many pipelines as the P4 does, no core could have taken advantage of that. I think when the P3 core is modified to use all the new pipelines in the P4, we should see some nice numbers.
 
Well, I (read: he) do have all the benchmarks you could ever want from the d.net client, both RC5 and OGR. Unfortunately he's not going to put his butt on the line and break NDA quite yet. Me, well, mine doesn't cover benchmarks from other machines. hehe 🙂
 
WOOHOO! proof of sorts 😉 At this link on page 301, go down to ROL/ROR and you will notice that it says that the Latency is 4 (woohoo! I got it right) and the throughput is at 1. So the figures that I put together above are accurate (as they can be 😉).

BTW, an Athlon 800 does 2.7Mkeys/sec, so that explains the figures in the australianIT.com.au article.
 
Yikes, unless Dnet has a trick up their sleeve, the P4 really will suck at RC5! :Q
 
crynoid, good work. I downloaded the file earlier, but I hadn't had a chance to read through it yet. I know the Athlon's latency is 1 (read that techdoc), but what's the throughput? 1, I assume?
 
when they come out with a client i'll "rent" a p4 for a short while and do a little benchmarking, if need be. rc5, ogr, and then ratbastard's s@h unit...
 
Sadly I don't believe a new client will help a great deal, as the problem is with an instruction in hardware. A new client may get a small performance boost, but it won't fix the problem of having such a high latency (unless a work-around is found, which I doubt).
 
Notice how the throughput is 1, so I don't know why you think the P4 will suck. Just increase the number of keys cracked per iteration to get rid of the latencies (and that's what the pipeline count in the Dnet RC5 source refers to: the number of keys cracked at once). It would also be interesting to see how fast a P4 SIMD core would be. The P5 MMX core sucks on a P3, so I don't have a lot of faith in it even if the number of keys cracked at once is doubled. I wish there was an instruction where multiple data could be shifted with different shift counts...
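A toy model of that idea, in Python. It only tracks the rotates, pretends nothing else competes for the ALU, and ignores register pressure and every other instruction in the loop, so treat it as a sketch of the scheduling argument and not a prediction. The function name and its defaults (latency 4, throughput 1, taken from the figures discussed in this thread) are illustrative assumptions.

```python
def clocks_for_interleaved_rotates(num_keys, rotates_per_key,
                                   latency=4, throughput=1):
    """Clocks until the last rotate result is ready.

    Each key is a chain of dependent rotates; different keys are
    independent, so their rotates can be issued while others are in flight.
    """
    ready_at = [0] * num_keys               # when each key's last result lands
    remaining = [rotates_per_key] * num_keys
    clock = 0
    finish = 0
    while any(remaining):
        for k in range(num_keys):
            # Issue for the first key that has work and whose input is ready.
            if remaining[k] and ready_at[k] <= clock:
                ready_at[k] = clock + latency
                remaining[k] -= 1
                finish = max(finish, ready_at[k])
                clock += throughput         # issue port is busy this long
                break
        else:
            clock += 1                      # nothing ready: the ALU stalls
    return finish


for keys in (1, 2, 4):
    clocks = clocks_for_interleaved_rotates(keys, 100)
    print(f"{keys} key(s): {keys * 100} rotates in {clocks} clocks")
```

In this model one key leaves the ALU idle three clocks out of four, while four interleaved keys keep it busy almost every clock; going beyond four keys gains nothing further here because the issue rate is already the limit.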
 
This is a quote from the above pdf file.



<< Latency: The number of clock cycles that are required for the execution core
to complete the execution of all of the µops that form a IA-32
instruction.

Throughput: The number of clock cycles required to wait before the issue ports
are free to accept the same instruction again. For many IA-32
instructions, the throughput of an instruction can be significantly less
than its latency.

Execution units: The names of the execution units in the execution core that are
utilized to execute the µops for each instruction. This information is
provided only for IA-32 instructions that are decoded into no more
than 4 µops. µops for instructions that decode into more than 4 µops
are supplied by microcode ROM. Note that several execution units
may share the same port, such as FP_ADD, FP_MUL, or MMX_SHFT in
the FP_EXECUTE cluster.
>>



As I should have said above, I have no idea about the throughput of any other processor. You will need someone else to explain throughput and its effects, as I have no idea 🙁 I assume that the Athlon + P3 have the same throughput.
 
Let's first deal with latency: it's the time it takes from when you issue the instruction until you get the result. But that doesn't mean you can't execute other instructions while that instruction is carried out, and that's where throughput comes into play. The throughput is 1 clock, meaning you're free to issue another ROL (or whatever) the clock after the first one. So if you execute, say, 4 independent ROL instructions, it will take you not 16 clocks but 8 clocks before you have the last result. The point is that if you can execute some other instructions before you need to use the result, the latency is irrelevant. Obviously it would be easier to write code if every instruction had the same latency and throughput, but that's another story...
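The 4-ROL example can be checked with two one-line formulas in Python. The function names are just for illustration. Counting elapsed clocks from the first issue, this simple model gives 7 for the overlapped case; it comes out as 8 if you number the first issue clock as clock 1 and ask at which clock the last result is available.

```python
def dependent_chain_clocks(n, latency):
    """Each instruction needs the previous result: latencies add up."""
    return n * latency


def independent_clocks(n, latency, throughput):
    """Instructions overlap: one issues every `throughput` clocks,
    and the last one still needs `latency` clocks to finish."""
    return (n - 1) * throughput + latency


print(dependent_chain_clocks(4, latency=4))            # 16
print(independent_clocks(4, latency=4, throughput=1))  # 7
```

With latency 1 and throughput 1 both formulas give the same answer, which is why the distinction only starts to matter on a chip where the two numbers differ.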
 
I have interpreted the latency as being the amount of time before the next instruction can be started. My idea of it was that the ALU would be in use for those 4 cycles while the µops that form the instruction are carried out. Once they are finished it can start the next one. I may be wrong; if so, point me in the right direction 😉
 
Riv is correct, you guys are forgetting how the pipeline works. Last I heard the P4 was around 120% faster than a P3 in pure integer ops.
 