SSE3 in the AMD...Does it even work???

Duvie · Sep 1, 2005

OK....

I ask this because AMD once again mimicked the SSE3 of the Intel chips, but if I remember right they left a few flags out. However, in apps that are known to use SSE3, like SuperPi, I get a whopping 1 second faster with SSE3 than SSE2...how is it the Intel guys see a much larger increase from SSE2 to SSE3???

Did AMD fvck something up, or is there something like Intel's famous compiler crippler not turning SSE3 on correctly for AMD CPUs???

Do you know of other apps we can test to see if we get a boost with SSE3 or not???

1 sec on an approximately 360 sec test sux arse....0.277% increase..WOW, thanks!!!
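For reference, the arithmetic behind that figure can be checked with a small sketch. The function name is made up for illustration; the 360 s and 1 s values are the approximate numbers from the post.

```python
def percent_faster(baseline_s, new_s):
    """Percent reduction in run time relative to the baseline run."""
    return (baseline_s - new_s) / baseline_s * 100.0

# Approximate numbers from the post: about 360 s with SSE2, 1 s less with SSE3.
gain = percent_faster(360.0, 359.0)
print(f"{gain:.3f}% faster")
```

That works out to a bit under 0.28%, in line with the figure quoted above.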

I know for a fact that in some encoding apps Intel Prescotts saw huge gains, which helped offset their lower performance against the Northwoods due to the added pipeline length. Why are the AMDs not getting the same increase in performance???

AkumaX · Sep 1, 2005

most of the time when we think about SSEx applications, we always think about SuperPi. From regular SuperPi to the SSE2 patch i got a 1 sec increase. From the SSE2 to the SSE3 patch i got

Duvie · Sep 1, 2005

Originally posted by: AkumaX
most of the time when we think about SSEx applications, we always think about SuperPi. From regular SuperPi to the SSE2 patch i got a 1 sec increase. From the SSE2 to the SSE3 patch i got

got what??? are you going to finish it????

I have been using SSE2 and went to SSE3...I ran them both one after another and got virtually nothing...Heck, on an 8M test I could run it maybe 2 to 3 times and get 1 sec of variation....
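A rough way to tell whether a 1 second gap means anything when repeat runs already vary by about 1 second is to compare the gap between the averages with the run-to-run spread. This is only a sketch: the function name is invented and the timings are hypothetical placeholders, not measurements from this thread.

```python
import statistics

def difference_exceeds_noise(runs_a, runs_b):
    """True if the gap between mean times is larger than the biggest
    run-to-run spread (max - min) seen within either set of runs."""
    gap = abs(statistics.mean(runs_a) - statistics.mean(runs_b))
    noise = max(max(runs_a) - min(runs_a), max(runs_b) - min(runs_b))
    return gap > noise

# HYPOTHETICAL timings in seconds, not real results.
sse2_runs = [360.0, 361.0, 360.0]
sse3_runs = [359.5, 360.5, 359.5]
print(difference_exceeds_noise(sse2_runs, sse3_runs))
```

With spreads like these the gap is smaller than the noise, so the two builds cannot really be told apart from a handful of runs.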

MercenaryForHire · Sep 1, 2005

Originally posted by: Duvie

Originally posted by: AkumaX
most of the time when we think about SSEx applications, we always think about SuperPi. From regular SuperPi to the SSE2 patch i got a 1 sec increase. From the SSE2 to the SSE3 patch i got

got what??? are you going to finish it????

I have been using SSE2 and went to SSE3...I ran them both one after another and got virtually nothing...Heck, on an 8M test I could run it maybe 2 to 3 times and get 1 sec of variation....

He got ... well ... "nothing"?

- M4H

stevty2889 · Sep 1, 2005

I've noticed the same thing. I didn't get any improvement with the SSE3-patched Super PI, though it made a big difference on my P4. Not sure what else uses SSE3 that I could test with.

Hyperlite · Sep 1, 2005

Originally posted by: Duvie

Did AMD fvck something up, or is there something like Intel's famous compiler crippler not turning SSE3 on correctly for AMD CPUs???

looks like one or the other, unfortunately. There has got to be some other program you can use to test SSE3 functionality...
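One thing that can be checked directly is whether the CPU advertises SSE3 at all. CPUID function 1 reports SSE3 in bit 0 of ECX, and Linux shows the same bit as `pni` (Prescott New Instructions) in the flags line of /proc/cpuinfo. The sketch below only decodes those two forms; the helper names and the sample values are made up for illustration, and a set flag says nothing about whether a given program actually takes its SSE3 code path.

```python
def has_sse3_from_ecx(ecx):
    """CPUID function 1 reports SSE3 in bit 0 of ECX."""
    return bool(ecx & 1)

def has_sse3_from_flags(flags_line):
    """Linux lists SSE3 as 'pni' in the flags line of /proc/cpuinfo."""
    return "pni" in flags_line.split()

def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first flags line on Linux (not called below, so the
    example also runs on systems without /proc/cpuinfo)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""

# HYPOTHETICAL sample values, not read from any real machine.
sample_flags = "fpu vme de pse tsc msr mmx sse sse2 pni"
print(has_sse3_from_flags(sample_flags), has_sse3_from_ecx(0x00000001))
```

If the flag is present but an application still shows no gain, the next suspects are how the program picks its code path and how the hardware executes the instructions.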

Gamingphreek · Sep 1, 2005

It might not see as much benefit on AMD's architecture.

Please correct me if I am wrong, but don't SSE and 3DNow! simply rearrange the way the values are multiplied? In other words, instead of going one by one they sort of go in chunks or blocks. Lol, I know what I want to say but I can't really explain it... I don't have much experience (if any) in architecture and compilers.

-Kevin
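The "chunks or blocks" idea can be sketched in plain Python. This is only a model of the concept: real packed SSE instructions operate on 128-bit registers (four 32-bit floats at a time), and the function names here are invented for the example.

```python
def scalar_multiply(a, b):
    """One multiply per step, the way a scalar instruction works."""
    out = []
    for x, y in zip(a, b):
        out.append(x * y)
    return out

def packed_multiply(a, b, lanes=4):
    """Handle `lanes` elements per step, the way one packed SSE multiply
    handles four 32-bit floats held in a 128-bit register."""
    out = []
    steps = 0
    for i in range(0, len(a), lanes):
        out.extend(x * y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return out, steps

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [2.0] * 8
result, steps = packed_multiply(a, b)
print(result == scalar_multiply(a, b), steps)
```

The results are identical; the packed version just gets there in fewer steps (two here instead of eight), which is where the speedup is supposed to come from.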

Avalon · Sep 1, 2005

I thought SSE3 didn't do squat for Intel either?

Duvie · Sep 1, 2005

No, as stevty states, the increase from SSE3 is pretty big.....It was also rumored to have enhanced Prescott performance in some multimedia apps....

aatf510 · Sep 1, 2005

I believe they are instruction sets; they extend the basic instructions of x86.
For example, say x86 only had ADD and SUBTRACT, then someone decides multiply and divide are also important, so they add a new instruction package that includes multiply and divide and call it SSE3.
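For context, x86 already had multiply and divide long before SSE3. SSE3 itself is a small add-on of 13 instructions, among them horizontal add (HADDPS) and alternating subtract/add (ADDSUBPS). The sketch below models what those two do on four-element vectors using Python lists; the function names are illustrative and element 0 is taken as the lowest lane.

```python
def haddps(a, b):
    """Model of HADDPS: add adjacent pairs within each source vector."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def addsubps(a, b):
    """Model of ADDSUBPS: subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(haddps(x, y))
print(addsubps(x, y))
```

Operations like these mostly help specific patterns (sums across a register, complex arithmetic), so code that does not use those patterns has little to gain from SSE3 over SSE2.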

Unkno · Sep 2, 2005

my guess is that amd's architecture is optimized enough that even with SSE3, it doesn't add much benefit compared to intel's architecture...

Furen · Sep 2, 2005

The P4 has dedicated vector units, while the Athlon64 uses the 3 floating point units for x87 and SIMD so it does vector instructions at the expense of x87 performance.

Duvie · Sep 2, 2005

Originally posted by: Furen
The P4 has dedicated vector units, while the Athlon64 uses the 3 floating point units for x87 and SIMD so it does vector instructions at the expense of x87 performance.

sounds technical, can you explain it a bit more??? Basically it sounds like it was nothing more than a marketing whitewash...what damn good was the Venice then?? Yeah, better memory controller, but it seems SSE3 was touted as a sales feature and now it may be pretty much nothing...

Unkno · Sep 2, 2005

well, AMD has to implement SSE3 or else Intel might do the same as what nvidia did to ati...anyone remember the "power of 3"? So yes, it is a sales feature, and also a feature just in case there happen to be some differences when it is used in future apps

AnandThenMan · Sep 2, 2005

Perhaps the many apps that see increases on Intel and see none on AMD are programs that were compiled using Intel's "optimized" compiler.

Furen · Sep 2, 2005

Originally posted by: Duvie

Originally posted by: Furen
The P4 has dedicated vector units, while the Athlon64 uses the 3 floating point units for x87 and SIMD so it does vector instructions at the expense of x87 performance.

sounds technical, can you explain it a bit more??? Basically it sounds like it was nothing more than a marketing whitewash...what damn good was the Venice then?? Yeah, better memory controller, but it seems SSE3 was touted as a sales feature and now it may be pretty much nothing...

First off, I haven't really looked into SuperPI too much since I consider it (pretty much) a synthetic (or does anyone actually have ANY use for calculating 32M digits of Pi?). Second, since Dothans eat SuperPI for breakfast, I'm guessing it does not rely very heavily on floating point performance (since Dothan sucks at that).

That said, I'll expand a bit more on what I said before. I'm sure everyone here knows that the K8 (a K7 with an integrated northbridge and a few architectural tweaks) has 3 fully-pipelined FPUs: an FADD, an FMUL and an FSTORE (by fully-pipelined I mean that they don't share logic like the P6 does, for example). Each one of these units handles its own x87 operation (FADD, FMUL or FSTORE) but is also used for SIMD. Here comes AMD's huge SSE drawback: SSE/SSE2/SSE3 registers are 128 bits long, while the x87 FPUs on the K8 are built to work with 80-bit x87 operands. This means that some vector instructions take 2 or more passes to complete. Why does it matter, you might ask? Simply because using SSE will, in some cases, lead to lower performance than using straight x87 (especially considering how insanely good the x87 performance of the K8 is), so you end up not gaining as much performance as you would on an Intel CPU that was designed from the ground up to work on these 128-bit operands, with execution units designed for SSE code. It is most certainly NOT marketing whitewash. Being SIMD-capable gives AMD's CPUs a huge benefit in many SSE-intensive applications *cough*games*cough*, but when a program is optimized for SSE in a half-assed manner, or just isn't very "optimizable" (is that even a word?), you won't get much out of AMD's FPU hacks to enable SSE.
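The "2 or more passes" point can be put as a toy calculation: if an execution unit can only process part of a 128-bit operation at a time, the number of passes is the operand width divided by the unit's width, rounded up. The widths below are illustrative assumptions for the sketch, not vendor specifications for any particular CPU, and the names are invented.

```python
import math

def passes_needed(operand_bits, datapath_bits):
    """Passes an execution unit needs when it can only process
    `datapath_bits` of an `operand_bits`-wide operation at a time."""
    return math.ceil(operand_bits / datapath_bits)

# ASSUMPTION: illustrative widths only, not specifications of a real chip.
NARROW_UNIT_BITS = 64
FULL_WIDTH_UNIT_BITS = 128

for label, bits in [("32-bit scalar", 32), ("64-bit scalar", 64), ("128-bit packed", 128)]:
    print(label,
          passes_needed(bits, NARROW_UNIT_BITS),
          passes_needed(bits, FULL_WIDTH_UNIT_BITS))
```

Under these assumptions scalar operations cost one pass either way, and only the full 128-bit packed case doubles on the narrower unit, which is why the benefit depends so heavily on what kind of SSE code a program actually runs.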

stevty2889 · Sep 2, 2005

Dothans only eat smaller SuperPI calculations for breakfast, and that's due to the large, fast L2 cache. Run a 32M calculation and it's a lot slower than a Prescott running the SSE3 version of SuperPI, or even the SSE2 version. And from what I have noticed, Dothan's FPU performance isn't what sucks, it's the integer performance where it lags behind. The only thing I have seen my Dothan ahead in is a very slight lead in gaming, and not enough to make up for it lacking in most other areas. My Prescott machine just sits around turned off now that I have my X2, since it's better than the Prescott in everything, and my Dothan is in a SFF system and makes a nice portable gaming machine.

The only reason SuperPI is brought up when dealing with SSE3 is because I haven't seen anything else that has SSE3 to see if there is any real advantage. It definitely gives a big boost on the Prescott for SuperPI, so I'd like to see if it does the same in real applications, but I don't know of any specifically that support SSE3.

BlingBlingArsch · Sep 2, 2005

anandtech had an article on that topic a while ago
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2350&p=1

Furen · Sep 2, 2005

Originally posted by: stevty2889
Dothans only eat smaller SuperPI calculations for breakfast, and that's due to the large, fast L2 cache. Run a 32M calculation and it's a lot slower than a Prescott running the SSE3 version of SuperPI, or even the SSE2 version. And from what I have noticed, Dothan's FPU performance isn't what sucks, it's the integer performance where it lags behind. The only thing I have seen my Dothan ahead in is a very slight lead in gaming, and not enough to make up for it lacking in most other areas. My Prescott machine just sits around turned off now that I have my X2, since it's better than the Prescott in everything, and my Dothan is in a SFF system and makes a nice portable gaming machine.

That makes sense. Huge low-latency L2 without the insanely huge pipeline of the P4s. Actually, the PMs are somewhat anemic all around but they have a crazy branch predictor. When I said that their FPU performance sucks I meant that it is so-so compared to the P4's, and without the insane vector processing power. But I'll stop talking about the Pentium M since I don't actually own one.

How much the A64 gains from SSE3 depends on the application and the compiler, since it can handle 32/64-bit scalar SSE in a single pass without much trouble but has "problems" with 128-bit vectors. With Pentiums, on the other hand, pretty much any use of SSE turns into a performance improvement.

Joepublic2 · Sep 3, 2005

Originally posted by: Furen

Originally posted by: Duvie

Originally posted by: Furen
The P4 has dedicated vector units, while the Athlon64 uses the 3 floating point units for x87 and SIMD so it does vector instructions at the expense of x87 performance.

sounds technical, can you explain it a bit more??? Basically it sounds like it was nothing more than a marketing whitewash...what damn good was the Venice then?? Yeah, better memory controller, but it seems SSE3 was touted as a sales feature and now it may be pretty much nothing...

First off, I haven't really looked into SuperPI too much since I consider it (pretty much) a synthetic (or does anyone actually have ANY use for calculating 32M digits of Pi?). Second, since Dothans eat SuperPI for breakfast, I'm guessing it does not rely very heavily on floating point performance (since Dothan sucks at that).

That said, I'll expand a bit more on what I said before. I'm sure everyone here knows that the K8 (a K7 with an integrated northbridge and a few architectural tweaks) has 3 fully-pipelined FPUs: an FADD, an FMUL and an FSTORE (by fully-pipelined I mean that they don't share logic like the P6 does, for example). Each one of these units handles its own x87 operation (FADD, FMUL or FSTORE) but is also used for SIMD. Here comes AMD's huge SSE drawback: SSE/SSE2/SSE3 registers are 128 bits long, while the x87 FPUs on the K8 are built to work with 80-bit x87 operands. This means that some vector instructions take 2 or more passes to complete. Why does it matter, you might ask? Simply because using SSE will, in some cases, lead to lower performance than using straight x87 (especially considering how insanely good the x87 performance of the K8 is), so you end up not gaining as much performance as you would on an Intel CPU that was designed from the ground up to work on these 128-bit operands, with execution units designed for SSE code. It is most certainly NOT marketing whitewash. Being SIMD-capable gives AMD's CPUs a huge benefit in many SSE-intensive applications *cough*games*cough*, but when a program is optimized for SSE in a half-assed manner, or just isn't very "optimizable" (is that even a word?), you won't get much out of AMD's FPU hacks to enable SSE.

So the eight (16 in 64-bit mode) SIMD registers use the same 3 floating point pipelines as the 8 x87 registers? Can they pipeline SIMD and regular floating point instructions like the P4 with its dedicated circuitry can (and the P3 couldn't)? The three floating point pipelines are still only 80 bits wide, though? I thought they were widened to 128 bits in the K8 to avoid the problem you just described, and that this was one of the flaws AMD corrected from the K7 line. Or am I thinking about the L1 to L2 cache pathway being widened from 64 bits in the K7 to 128 bits in the K8?

Furen · Sep 3, 2005

Originally posted by: Joepublic2

So the eight (16 in 64-bit mode) SIMD registers use the same 3 floating point pipelines as the 8 x87 registers? Can they pipeline SIMD and regular floating point instructions like the P4 with its dedicated circuitry can (and the P3 couldn't)? The three floating point pipelines are still only 80 bits wide, though? I thought they were widened to 128 bits in the K8 to avoid the problem you just described, and that this was one of the flaws AMD corrected from the K7 line. Or am I thinking about the L1 to L2 cache pathway being widened from 64 bits in the K7 to 128 bits in the K8?

I know for sure that SIMD uses the 3 FP pipelines.

Do you mean whether the different FPUs can process different types of instructions (like having 2 of the FPUs doing SIMD and one doing x87)? If so, then yes. The P6 cannot do an FMUL and an FADD at the same time because FMUL utilizes the FADD logic as well, so they have to be reordered for maximum efficiency.

I'm pretty sure that the only benefits over the K7 architecture are the integrated northbridge and the various improvements at the front end. So the FPUs should still be 80 bits wide. The L2 bus width was indeed doubled from 64 bits to 128 bits (the 64-bit L2 width is probably what kept AMD's CPUs from actually matching P4s at the end of their life), though it's still less than Dothan's 256-bit wide L2 bus.
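To put the bus widths mentioned above side by side, here is a small sketch that converts width to bytes per transfer and to a theoretical peak. It assumes one transfer per core clock and uses a hypothetical 2.0 GHz clock purely for comparison; real L2 interfaces do not necessarily move data every cycle, and the function names are invented.

```python
def bytes_per_transfer(bus_width_bits):
    """Bytes moved in a single transfer over a bus of the given width."""
    return bus_width_bits // 8

def peak_bandwidth_gb_s(bus_width_bits, clock_ghz):
    """Theoretical peak in GB/s, ASSUMING one transfer per core clock."""
    return bytes_per_transfer(bus_width_bits) * clock_ghz

HYPOTHETICAL_CLOCK_GHZ = 2.0  # placeholder, not any specific CPU

for width in (64, 128, 256):
    print(width, bytes_per_transfer(width),
          peak_bandwidth_gb_s(width, HYPOTHETICAL_CLOCK_GHZ))
```

Each doubling of the width doubles the bytes per transfer (8, 16, 32), so at equal clocks the peak figure scales the same way.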

Shenkoa · Sep 4, 2005

imported_wyrmrider · Sep 4, 2005

Methinks that the SSE3 instructions are ROM macros,
placeholders for hardware acceleration which could be added later, if required,
perhaps a little tweak after going to 65nm

Joepublic2 · Sep 7, 2005

Thanks for clarifying that! :beer: I have one more question, since you brought it up:

Originally posted by: Furen
The L2 bus width was indeed doubled from 64 bits to 128 bits (the 64-bit L2 width is probably what kept AMD's CPUs from actually matching P4s at the end of their life), though it's still less than Dothan's 256-bit wide L2 bus.

I've read that the K7/K8 wouldn't benefit as much from a large L1 to L2 bandwidth as Intel's designs do, due to their exclusive cache design. In your opinion, is this really a big bottleneck in AMD's designs?
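As background to the exclusive-cache point, the capacity side of the difference can be sketched as below. This is a simplified model with made-up example sizes: in an exclusive design a line sits in L1 or L2 but not both, so the sizes add, while in a strictly inclusive design L2 also holds a copy of everything in L1. It says nothing about the bandwidth question itself.

```python
def effective_capacity_kb(l1_kb, l2_kb, exclusive):
    """Unique data the L1+L2 pair can hold under a simplified model.
    Exclusive: a line lives in L1 or L2, never both, so sizes add.
    Inclusive: L2 duplicates L1, so L2 alone bounds the total."""
    return l1_kb + l2_kb if exclusive else l2_kb

# ILLUSTRATIVE sizes in KB, chosen only for the example.
L1_KB, L2_KB = 128, 512
print(effective_capacity_kb(L1_KB, L2_KB, exclusive=True))
print(effective_capacity_kb(L1_KB, L2_KB, exclusive=False))
```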
