Povray Recompilation Project for Pentium 4's

JCholewa · Feb 13, 2001

> I had the opportunity to test the non compiled version about
> a week ago and got 149 seconds

...what? There's an interpreted version of povray? What language? Perl? *_*

-JC

*EDIT*
Sorry, sorry, I jest! ^_^

Incidentally, I put up an impromptu interactive table of povray scores if anybody is interested in submitting. Y'all can use this to get a fuller idea of what these binaries are doing.
The submission url is at
http://www.jc-news.com/ll/htask.cgi?database=povray
and the query url is at
http://www.jc-news.com/ll/htquery.cgi?database=povray

Please be nice to it. It is a very crude program (it isn't compiled, btw<g>) and I have negative (as in less than zero) security implemented, so I put you guys in ultimate trust here (since I think all of you are likely more interested in seeing interesting data than taking advantage of chmod 666).

pm · Feb 13, 2001

Actually, NOS440's scores are incorrect on Remnant's page. No offense to NOS440, but the original mail he sent to both Remnant and me was worded somewhat unclearly. I asked for clarification, and he sent me mail tonight stating that he got 155s (2m35s) on the original code - not 1m55s.

Also, Frustrated2's scores on Remnant's page may be incorrect. He said he got 1m55s at stock, but I believe that "stock" was referring to the part not being overclocked. I don't think frustrated2 ran the test using the original code.

frustrated2 - can you comment on this?

Based on what NOS440 emailed to me (he actually gave me permission to post it, but I can't VPN into work and access the internet simultaneously, so I can't cut and paste it), he saw a more than 2x speed up (a 52% reduction in time) - which matches with what Fkloster posted. Unfortunately, he's been temporarily denied access from Anandtech, so he can't comment directly. Anyone curious, however, can email him for clarification. So my post above seems incorrect - instead of a 35% reduction in raytracing time on Pentium 4's using the recompiled code, we are seeing more like 50%.

Hey, JC, this might make a good news item for your webpage. 😉

fkloster · Feb 13, 2001

This is a very refreshing study to me right here @ Anands. I almost cannot believe what I am seeing...objective people coming to rational conclusions about products without drifting toward subjectivity. Very refreshing indeed. Night, night 🙂

Final fkloster's P4 povray results:

1500mhz + (non-SSE2) + pawntest = 178 seconds
1500mhz + SSE2 + pawntest = 86 seconds
1800mhz + (non-SSE2) + pawntest = 149 seconds
1800mhz + SSE2 + pawntest = 70 seconds
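
As a rough check on how these four times relate to the "50% reduction" and "2x" figures quoted in the thread, here is a minimal sketch. The function names `time_reduction` and `speedup` are made up for illustration; the numbers are the ones posted above.

```python
# fkloster's pawntest times in seconds, copied from the results above.
results = {
    1500: {"non_sse2": 178, "sse2": 86},
    1800: {"non_sse2": 149, "sse2": 70},
}

def time_reduction(before, after):
    """Fraction of the original render time that was saved."""
    return (before - after) / before

def speedup(before, after):
    """How many times faster the new binary renders the same scene."""
    return before / after

for mhz, t in results.items():
    r = time_reduction(t["non_sse2"], t["sse2"])
    s = speedup(t["non_sse2"], t["sse2"])
    print(f"{mhz} MHz: {r:.1%} less time, {s:.2f}x faster")
```

A reduction in time of a bit over 50% and a speedup of a bit over 2x are the same result stated two ways, since speedup = 1 / (1 - reduction).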

pm · Feb 13, 2001

Cool, JC posted on this on his news page here

edit: changed the wording so that my comment actually made sense. 🙂

Remnant2 · Feb 13, 2001

Ah ok, that makes more sense. I knew fkloster's score was right, because we ran it last time, so it seemed odd that nox and frustrated2 would get such a different score. The confusion over "stock" could indeed be the cause.

In that case, the P4 definitely got a very nice boost from recompilation. Bodes well for new programs...

edit: Updated the entries in question. After looking over frustrated2's email, I think he also believed that by stock I meant "non-overclocked". I've changed them all to "default" to eliminate the confusion. Also added a bit of explanation, as we seem to be getting some linkage on this. 🙂

CHHASmatroxuser · Feb 14, 2001

It would be nice to get one of the 1500 MHz T-Bird users to run this for comparison, or any T-Bird for that matter.

I did a quick calculation on per-clock efficiency: multiplying the seconds used by the chip clock (in MHz) gives a quick performance-per-clock index, where a lower index means more work done per clock. This assumes a linear function, but without more data a better estimate cannot be made.

Chip - index
P4 (SSE2 opt.) - 127500
P4 (normal) - 267500
P3 (P3 opt.) - 116000
Duron (P3 opt.) - 100000
Celeron (P3 opt.) - 150000

By dividing the index by the chip clock speed a rough estimate of rendering time is found.

This shows the huge improvement that the P4 receives with SSE2 optimization, but it still lags behind the Duron on a clock-for-clock basis.
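
The index arithmetic can be sketched as follows for the two P4 rows, using fkloster's times from earlier in the thread. Averaging the two clock speeds is an assumption about how the single figure per chip was obtained, and the helper names are hypothetical; the source runs for the P3, Duron and Celeron rows are not in this thread, so they are left out.

```python
from statistics import mean

# (clock in MHz, render time in seconds) pairs from fkloster's P4 results.
p4_runs = {
    "P4 (SSE2 opt.)": [(1500, 86), (1800, 70)],
    "P4 (normal)": [(1500, 178), (1800, 149)],
}

def clock_index(runs):
    """Mean of MHz * seconds over the runs; lower means more work per clock."""
    return mean(mhz * sec for mhz, sec in runs)

def estimated_seconds(index, mhz):
    """Invert the index: rough render time at a given clock, assuming linear scaling."""
    return index / mhz

for chip, runs in p4_runs.items():
    idx = clock_index(runs)
    print(f"{chip}: index {idx:.0f}, about {estimated_seconds(idx, 1800):.1f} s at 1800 MHz")
```

Under that assumption the SSE2 row comes out at exactly 127500, and the normal row at 267600, which suggests the 267500 in the table was rounded.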

borealiss · Feb 14, 2001

my results. too bad it isn't multithreaded, heh. this is on 2 466 celerons @ 581, but effectively 1 celeron at 581.

p3 optimized: 4 min 9 sec
original bin: 4 min 54 sec

andreasl · Feb 14, 2001

fkloster and others,

Thanks, that explains it 🙂

Remnant,

BTW, I tried to run the P3 binary on an old K6 yesterday, but it wouldn't work. Since it runs on a K7, I can only conclude that this binary contains some of those new instructions that were added to the P6 core. Is that correct? I know the K6 didn't include those, but the K7 did.

Is there any chance that you could compile a 3rd version that can be run on plain K6 (and P5) processors ?

Menelaos · Feb 14, 2001

You guys made it up to Ace's Hardware.

Nice going,

Menel.

CHHASmatroxuser · Feb 14, 2001

What I find interesting is that the Athlon T-Bird is not quicker clock for clock than the Duron: the 1200 T-Bird does the rendering in 83 seconds for a total of 99600 Mclock pulses, while the 900 Duron does it in 113 seconds, or 101700 Mclock pulses. Povray must be either memory bandwidth limited or not very cache dependent.
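
For anyone who wants to redo the comparison, a small sketch of the Mclock-pulse arithmetic with the two results quoted above (the function name is made up for illustration):

```python
def mclock_pulses(mhz, seconds):
    """Total clock cycles spent on the render, in millions (MHz * seconds)."""
    return mhz * seconds

tbird = mclock_pulses(1200, 83)   # Athlon T-Bird 1200, 83 s
duron = mclock_pulses(900, 113)   # Duron 900, 113 s

# Relative gap between the two chips, measured against the T-Bird.
gap = (duron - tbird) / tbird
print(f"T-Bird: {tbird} Mclocks, Duron: {duron} Mclocks, gap: {gap:.1%}")
```

The gap works out to roughly 2%, which is small enough that the two chips look essentially equal per clock on this scene.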

andreasl · Feb 14, 2001

If the Duron and Athlon perform identically clock for clock, and scale perfectly with clock speed, it's an indication that the render part of the program runs entirely within the L1 cache. Which means it is not memory bandwidth limited at all.

CHHASmatroxuser · Feb 14, 2001

Yep, that's what I figure as well. As far as the scaling goes, a Duron 750 uses the same amount of Mclock pulses (CPU speed in MHz * seconds to render) as a T-Bird 1200, so a heavily OC'ed Duron would be the processor of choice to run Povray.

Remnant2 · Feb 14, 2001

One interesting note: If you go look at Ace's results, they posted results with the new binaries from several of the systems in the lab, including a 1.2ghz Athlon.

Whereas with the standard compile, a Duron 900 was able to match a P4 1.7ghz, with the P3 optimization alone, it would probably take at least an Athlon 1.3ghz to match, and with the P4 binary, even more than that.

But it's not SSE2 that's providing this speedup, because most of the improvement comes from going from stock->P3 compile.

Here's my guess:

As we all know, the P4's weakest point is its branch prediction. One of the things that is included in the P3 compile is conditional-move instructions, which allow for the elimination of some branching code. This might account for the much-improved score.

If this is true, then the gain with a standard modern compiler (like MSVC) would likely be less pronounced, because code is usually compiled in "blend" mode, which doesn't use the cmov opcodes for compatibility. So it'll take either mixed-optimization compilers like IntelC, or a good deal of time, before this optimization becomes more standard.

Interesting.

AND BTW, Andreas, this is also the reason why the P3 binary isn't working on your K6. The K6 doesn't have cmov support, so it crashes. It shouldn't be difficult at all to replace the P3-only binary with a P3 + Others compile, which shouldn't cut into speed any either.

Sir Fredrick · Feb 14, 2001

Hey, what about us SMPers? It only used one of my procs to render pawns...anyone care to make this multithreaded?

frustrated2 · Feb 15, 2001

PM, it does seem that there is some confusion; let me see if I can straighten it out here. I think that I made a mistake reporting my scores or wasn't clear enough.

The 1.4 ghz score is what NOS reported for his machine. We have very similar systems and I assumed that my score would have been about the same as his, but I didn't actually run it. And yes, stock means stock speed 1.4 ghz, NOT stock Povray 🙂

NO optimization povray scores
179 seconds or 2 minutes 49 seconds @1568 mhz
154 seconds or 2 minutes 24 seconds @1.7 ghz

Optimized scores
P3 @1.7 ghz (122 x 14) 79 seconds or 1 minute 19 seconds
P4 @1.7 ghz (122 x 14) 73 seconds or 1 minute 13 seconds
P4 @1.4 ghz (stock) 115 seconds or 1 minute 55 seconds <<<<<<<< DISREGARD
P4 @ 1568 mhz (112 x 14) (this is stock for me everyday) 1 minute 20 seconds or 80 seconds

So it looks like the recompilation improved my scores ~2 X or cut the time in half. Which is really quite impressive I think and goes right along with your 50% improvement 🙂

Hope that this clarifies my scores and our understanding of the improvements.
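
Since each score above is given both in seconds and in minutes and seconds, a quick consistency check is easy to sketch. The helper name `to_min_sec` is made up for illustration, and the tuples are copied exactly as posted.

```python
def to_min_sec(total_seconds):
    """Split a time in seconds into (minutes, seconds)."""
    return divmod(total_seconds, 60)

# (seconds, minutes, seconds) exactly as posted above.
posted = [
    (179, 2, 49),  # no optimization @1568 mhz
    (154, 2, 24),  # no optimization @1.7 ghz
    (79, 1, 19),   # P3 binary @1.7 ghz
    (73, 1, 13),   # P4 binary @1.7 ghz
    (115, 1, 55),  # P4 @1.4 ghz (marked DISREGARD)
    (80, 1, 20),   # P4 binary @1568 mhz
]

# Collect the entries where the two forms of the same time disagree.
mismatches = [(s, m, r) for s, m, r in posted if to_min_sec(s) != (m, r)]
for s, m, r in mismatches:
    mm, ss = to_min_sec(s)
    print(f"{s} s is {mm} min {ss} s, but the post says {m} min {r} s")
```

The two unoptimized entries disagree by ten seconds between their two forms, so one figure in each pair is presumably a typo; the post alone does not say which. Either way the optimized times come out at roughly half of the unoptimized ones.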

frustrated2 · Feb 15, 2001

Well, did this help? 🙂

Adul · Feb 20, 2001

I missed that ace's hardware post. Was it a review? or a news post?

Degenerate · Feb 20, 2001

Mm. I read the thread, but being not very knowledgeable, I don't understand the links given by PM at the beginning. They go to Intel's site, but what am I supposed to do with the codes?

xtreme2k · Mar 23, 2001

Athlon Tbird 1200/100 at 1333/133 256MB CAS3 RAM Win98SE

P3 Optimised - 1m14s - 74 seconds.

Talk about a Pure X87 cruncher.
