SiSoft Leaks AMD FX "Zambezi" Scores: Worse Than Intel Core i7


sequoia464

Senior member
Feb 12, 2003
870
0
71
I hope you can recognize the self-defeating aspects of relying on yourself to be the judge of what you think to be an evil corp versus a lesser evil corp.

The truly evil corp will be so good at being evil that it will deftly convince you that IT is the lesser evil.

Unless you view yourself as more intelligent than the collective brains employed at evil corp XYZ, you have to acknowledge the reality that you are led to believe whatever they want you to believe about any given corporation.

In your efforts to be noble you may well be an unwitting supporter of the bigger evil. A pawn is a pawn and nothing leaves you more open to manipulation by others than your own ignorance.

I know I am ignorant of the truth of what goes on between AMD and Intel, at best we are exposed to half-truths from both as they see fit in their marketing efforts to prop themselves up while putting their competition down.

Didn't mean to, even remotely, derail the topic here, I'm usually pretty careful to stay away from politics, religion, etc.

Just trying to figure out where I go from here and watching the BD threads closely - made a bad decision after a recent failure with an AMD setup.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,227
126
I thought we were all in agreement in concluding that tomorrow's announcement is going to be entirely Fusion related (the unlocked Llanos for desktop enthusiasts) and have nothing to do with Bulldozer/Zambezi?

Probably will end up faster than BD too. :p

(Did I just say that? Yes I did.)
 

ocre

Golden Member
Dec 26, 2008
1,594
7
81
BD looks bad in these benchmarks, but I have a gut feeling it's gonna vary, even wildly. There are gonna be benchmarks that BD looks very good on. The bad part is it's gonna be all over the place... I mean, it could be.

So don't write off BD so quick. There will be times when the data will flow great and BD will crunch. Honestly, it's possible. If priced right there is no reason ppl won't buy BD CPUs. BD looks to be getting off to a rocky start. It's not gonna be for everyone. But if it's priced right, it will sell. I have a feeling there will be all kinds of improvements made in the future. BD can only get better and better, if AMD can stay afloat.

This is my concern, and if the price is right and care is taken in marketing, then I see no reason for BD to fail. AMD has got to be careful; I believe this to be the most important thing.
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
JF-AMD was perfectly clear that all benchmarks we will see until the launch will be fakes. The guy is in the know, so to say that publicly, he knows something important that we don't. Why do you guys get so negatively excited?

AMD is probably about to pull another RV770, only better.

The way things are going, it DOES look as if their next round of GPU offerings will be out before BD is...
 

velis

Senior member
Jul 28, 2005
600
14
81
Well, I'm not really following too closely here, but I was under the impression that the following was all true for BD, some of it backed by JF-AMD's statements too:
1. BD core IPC will be > Stars core IPC
2. BD is optimized for higher frequencies (kinda puts in question the first one)
3. Brazos has a stripped down core --> IPC lower than Stars

Taking out all the delays and stuff, one would assume that BD will come out @4GHz-ish at which point it would simply kill any existing AMD offering and offer serious competition to Sandy Bridge CPUs. Not by IPC alone but by IPC + clock.

At least that's what I gathered, at least by repeated quotes of "known facts".
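
To make the "IPC + clock" point concrete: single-thread throughput scales roughly with IPC multiplied by clock speed, so a modest IPC gain and a clock bump compound. A minimal Python sketch follows. The function name and every number in it are made up for illustration and are not measurements of any real CPU.

```python
def relative_throughput(ipc: float, clock_ghz: float) -> float:
    """Rough single-thread throughput: instructions per clock times clock speed."""
    return ipc * clock_ghz


# Hypothetical values only (assumption): an older core normalised to IPC 1.0
# at 3.3 GHz, and a newer core with 5% higher IPC running at 4.0 GHz.
old_core = relative_throughput(ipc=1.00, clock_ghz=3.3)
new_core = relative_throughput(ipc=1.05, clock_ghz=4.0)

speedup = new_core / old_core
print(f"speedup: {speedup:.2f}x")
```

With these invented inputs the combined gain comes out to roughly 1.27x, most of it from the clock rather than the IPC.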

Seeing these benches thus leaves two possibilities for my overloaded brain:
1. AMD seriously failed to deliver (until I see legit reviews, this is the actual option)
2. All the announced SKUs as well as all the BIOSes are fake.

There is no way BD @3GHz or less is or will be competitive with Sandy Bridge, even less so with Ivy Bridge. Additionally, the "improved IPC" core cannot - per JF-AMD's statements - be slower than the Brazos core - at least not in most benchmarks. Additionally, the leaked benchmarks, though probably really run on a BD CPU, show a serious flaw with the CPU / MB combo. That suggests either a BIOS or a chipset flaw - chipset as in *ALL* chips involved.

So while not really holding my breath, I suppose BD will go down in history as either the biggest AMD flop or as the biggest AMD *coverup* + success to date. The latter would really be hilarious though :D

Edit: To those arguing that FP performance is lower in a BD module: AFAIK Stars cores don't have 256-bit support (I may be wrong, but it doesn't really matter for most non-HPC scenarios). When used in 128-bit operation, the FP unit can still function as two units (execute two operations per clock), so a BD FP unit cannot be slower than a Stars FP unit, even if it isn't dedicated to the core. Additionally, those arguing that shared performance is poor also assume that even HPC code is almost 100% FPU code. I must say in all my years as a programmer I have hardly seen a program that would do all its work using only mathematical operations. I usually use some control structures as well and they are not FP code... Don't know about you guys though.
 

velis

Senior member
Jul 28, 2005
600
14
81
The way things are going, it DOES look as if their next round of GPU offerings will be out before BD is...

Well, suits me. As long as 7970 significantly outperforms my 5870, I'm building my next system around it. The CPU will be best price / performance at the time. Of course, if NV manages to pull some kind of a stunt at that time, I don't see an issue about going with them either.
I'm such an infidel bastard :D No loyalty whatsoever.
 

velis

Senior member
Jul 28, 2005
600
14
81
As for that good corp / bad corp debate:
As far as I can remember, AMD also sold $1000 CPUs when they were in the lead.
Currently the fastest consumer CPU costs more like $300 which makes Intel look pretty good in my eyes.
They are still corporations, both of them. And they want to make profit for their owners, that's all. We live in capitalism, so the high prices are not evil, they are just a result of our (not necessarily good) economic setting.
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
There is no way BD @3GHz or less is or will be competitive with Sandy Bridge, even less so with Ivy Bridge. Additionally, the "improved IPC" core cannot - per JF-AMD's statements - be slower than the Brazos core - at least not in most benchmarks. Additionally, the leaked benchmarks, though probably really run on a BD CPU, show a serious flaw with the CPU / MB combo. That suggests either a BIOS or a chipset flaw - chipset as in *ALL* chips involved.
If you followed some of my recent posts, then you should have come across the Bulldozer cache issue (or my follow-up post on the same subject). It might be one of the main causes for recent delays. ;)

So while not really holding my breath, I suppose BD will go down in history as either the biggest AMD flop or as the biggest AMD *coverup* + success to date. The latter would really be hilarious though :D
Conspiracy theory-wise, the current situation reminds me of Barcelona, when AMD made its first shipments to Cray knowing there was a TLB issue (possibly including the patch for the problem). The first shipments of Interlagos go to Cray only (similar scenario). Coincidentally, Cray runs (customized) Linux (thus a patch can easily be applied on installation). Also, I haven't seen any SPEC benchmarks coming from server vendors. :hmm:

Edit: To those arguing that FP performance is lower in a BD module: AFAIK Stars cores don't have 256-bit support (I may be wrong, but it doesn't really matter for most non-HPC scenarios). When used in 128-bit operation, the FP unit can still function as two units (execute two operations per clock), so a BD FP unit cannot be slower than a Stars FP unit, even if it isn't dedicated to the core.
However, there's only one "FP scheduler running per clock" to serve both "cores" in each "module". Thus even though Bulldozer is supposedly able to simultaneously execute 2 x 128-bit operations per clock, each decoded 128-bit FP/SSE/AVX instruction has to go through the single FP scheduler (per clock), which IMHO would negate this advantage. It should also be noted that each Bulldozer "core" has access to a single 128-bit FP unit only when executing 128-bit FP operations. :hmm:

Additionally, those arguing that shared performance is poor also assume that even HPC code is almost 100% FPU code. I must say in all my years as a programmer I have hardly seen a program that would do all its work using only mathematical operations. I usually use some control structures as well and they are not FP code... Don't know about you guys though.
You're right about the "almost 100%" for most programs (in between there must be register/addressing/memory operations, conditional branches and other related instructions), though to the contrary I've seen some that have close to 90% FP/SSE code. It depends on the type of task and the method of coding (e.g. sort/collate data first and pass them to another FP-intensive subroutine/thread to process). ;)
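
A rough way to put numbers on this back-and-forth: if a fraction f of the run time is FP work and only that part is slowed by a factor s when the FPU is shared, total run time grows by (1 - f) + f * s. The Python sketch below assumes exactly that simplified model (no overlap between FP and non-FP work). The function name and the sample values are hypothetical, not measured.

```python
def overall_slowdown(fp_fraction: float, fp_slowdown: float) -> float:
    """Run-time multiplier when only the FP share of the work gets slower."""
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must be between 0 and 1")
    return (1.0 - fp_fraction) + fp_fraction * fp_slowdown


# Hypothetical worst case (assumption): FP work takes twice as long when shared.
for fp_fraction in (0.3, 0.9):
    print(fp_fraction, round(overall_slowdown(fp_fraction, 2.0), 2))
```

Under that invented 2x penalty, code that is 30% FP slows down by about 1.3x overall, while code that is 90% FP slows down by about 1.9x, so the mix of FP and non-FP work matters a lot.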
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Conspiracy theory-wise, the current situation reminds me of Barcelona, when AMD made its first shipments to Cray knowing there was a TLB issue (possibly including the patch for the problem). The first shipments of Interlagos go to Cray only (similar scenario). Coincidentally, Cray runs (customized) Linux (thus a patch can easily be applied on installation). Also, I haven't seen any SPEC benchmarks coming from server vendors. :hmm:

http://www.amd.com/us/press-releases/Pages/amd-ships-bulldozer-processors-2011sep7.aspx

SUNNYVALE, Calif. —9/7/2011

Today, AMD (NYSE: AMD) announced revenue shipments of the first processors based on its new x86 “Bulldozer” architecture. Initial production of the world’s first 16-core x86 processor, codenamed “Interlagos,” began in August and shipping to customers is already underway. Compatible with existing AMD Opteron™ 6100 Series platforms and infrastructure, “Interlagos” is expected to launch and be available in partner systems in the fourth quarter of this year. Many of the initial shipments have been earmarked for large custom supercomputer installations that are now underway.

“This is a monumental moment for the industry as this first ‘Bulldozer’ core represents the beginning of unprecedented performance scaling for x86 CPUs,” said Rick Bergman, senior vice president and general manager, AMD Products Group. "The flexible new ‘Bulldozer’ architecture will give Web and datacenter customers the scalability they need to handle emerging cloud and virtualization workloads.”

No wonder we haven't seen SPEC benches yet.

However, there's only one "FP scheduler running per clock" to serve both "cores" in each "module". Thus even though Bulldozer is supposedly able to simultaneously execute 2 x 128-bit operations per clock, each decoded 128-bit FP/SSE/AVX instruction has to go through the single FP scheduler (per clock), which IMHO would negate this advantage. It should also be noted that each Bulldozer "core" has access to a single 128-bit FP unit only when executing 128-bit FP operations. :hmm:

Everyone knows Intel's CPUs have a single scheduler for both integer and FP units.
Also, a single core inside the BD module can use the entire FP execution unit, meaning it can use both 128-bit FMACs.
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
Then can you explain the lack of SPEC benchmark presentations from AMD themselves? After all, AMD did widely publicise SPEC results during the Barcelona pre-launch. :hmm:

Everyone knows Intel's CPUs have a single scheduler for both integer and FP units.
Also, a single core inside the BD module can use the entire FP execution unit, meaning it can use both 128-bit FMACs.
Perhaps I didn't phrase it properly. I was talking about both cores accessing both the 128-bit FMACs simultaneously (whereas in Sandy Bridge all the FP resources are available to each thread). Inherently there are differences in design philosophies from each company. :hmm:
 

Riek

Senior member
Dec 16, 2008
409
15
76
Then can you explain the lack of SPEC benchmark presentations from AMD themselves? After all, AMD did widely publicise SPEC results during the Barcelona pre-launch. :hmm:

Perhaps I didn't phrase it properly. I was talking about both cores accessing both the 128-bit FMACs simultaneously (whereas in Sandy Bridge all the FP resources are available to each thread). Inherently there are differences in design philosophies from each company. :hmm:

No. All resources are available to both threads too, just as Sandy Bridge's resources are shared by 2 threads.

It is perfectly expected that two 128-bit ops of thread0 are executed in the FPU while there are requests from thread1 in the scheduler. The scheduler will shuffle (OoO) the ops it receives from thread0 and thread1 as it sees fit. The scheduler itself can only receive ops from one thread at a time. But how they are executed is completely up to the scheduler. It can do th0,th0 or th0,th1 or th1,th1 or th1,th0 (if we only look at the FMACs).

The difference between SB's HT and AMD's CMT is that AMD has another scheduler in between that only accepts one input thread per clock. The eventual ordering and execution of the ops is almost the same after that.

The advantage of SB is that it can schedule its FP ops faster. The downside is that the two threads can't use the same port at the same time, so they will block each other even when they use completely different ops (integer ops from thread0 might block FP operations for thread1).
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Then can you explain the lack of SPEC benchmark presentations from AMD themselves? After all, AMD did widely publicise SPEC results during the Barcelona pre-launch. :hmm:

The AMD press release says Interlagos is expected to launch in Q4; we are in Q3, which means NO benches yet.

Perhaps I didn't phrase it properly. I was talking about both cores accessing both the 128-bit FMACs simultaneously (whereas in Sandy Bridge all the FP resources are available to each thread). Inherently there are differences in design philosophies from each company. :hmm:

The only difference is that with HT enabled in SB, the second thread only has access to the FP execution resources that thread number one doesn't use, but in BD the two threads have their own FP resources.

When you have two threads in a module, each thread has access to its dedicated FP execution unit (each thread gets one 128-bit FMAC). The only problem is that we don't know how each 128-bit FMAC performs compared to last-gen FP units.

So without AVX, BD could be better with 8 threads than SB with 8 threads.
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
It is perfectly expected that two 128-bit ops of thread0 are executed in the FPU while there are requests from thread1 in the scheduler. The scheduler will shuffle (OoO) the ops it receives from thread0 and thread1 as it sees fit. The scheduler itself can only receive ops from one thread at a time. But how they are executed is completely up to the scheduler. It can do th0,th0 or th0,th1 or th1,th1 or th1,th0 (if we only look at the FMACs).

The difference between SB HT and CMT of AMD is that AMD has another scheduler in between that only accepts one input thread/clock. Eventual ordering of the ops and executions is almost the same after.
I've also wondered about the way FP scheduling works on each architecture. If two threads each invoke a single SIMD instruction that encompasses two simultaneous 128-bit FP operations, would one thread be able to use both FP units at once (faster, as the FP scheduler passes both 128-bit FP operations into both 128-bit FMACs simultaneously, per clock), or would the two simultaneous 128-bit FP operations of each thread be interleaved (slower, as the FP scheduler splits the 128-bit operations across the 128-bit FMACs)? :hmm:

The AMD press release says Interlagos is expected to launch in Q4; we are in Q3, which means NO benches yet.
I did quote "pre-launch". Heck, Interlagos are shipping now (to Cray). Just a blast from the past >> AMD Shows Off More Quad-Core Server Processor Benchmark Results. and this was before shipping. :eek:

When you have two threads in a module, each thread has access to its dedicated FP execution unit (each thread gets one 128-bit FMAC). The only problem is that we don't know how each 128-bit FMAC performs compared to last-gen FP units.

So without AVX, BD could be better with 8 threads than SB with 8 threads.
We still do not know how well AVX is implemented on Bulldozer (like SSE2 on early Athlons and Phenoms). On a side note, SiSoftware benchmarks are AVX aware though. ;)
 

Riek

Senior member
Dec 16, 2008
409
15
76
I've also wondered about the way FP scheduling works on each architecture. If two threads each invoke a single SIMD instruction that encompasses two simultaneous 128-bit FP operations, would one thread be able to use both FP units at once (faster, as the FP scheduler passes both 128-bit FP operations into both 128-bit FMACs simultaneously, per clock), or would the two simultaneous 128-bit FP operations of each thread be interleaved (slower, as the FP scheduler splits the 128-bit operations across the 128-bit FMACs)? :hmm:

In case of BD:
The scheduler accepts incoming ops from different threads. It works the same way as the front end: only ops from one thread can be accepted per clock.
So logic would suggest that the most common sequence of operations would be:

th0, th0 -> scheduler -> nothing to do
th1, th1 -> scheduler -> execute th0, th0
th0, th0 -> scheduler -> execute th1, th1
nothing -> scheduler -> execute th0, th0

Things might change if there are interdependencies between the ops from the threads. In that case it's up to the OoO scheduler to find a good combination that limits the cycles needed. The most likely scenario with interdependencies is that both th0 and th1 will have ops in the pipeline, because they are not linked to each other. (Note: in the example below I'm assuming the unrealistic scenario that ops are handled in 1 cycle.)

th0, th0d -> scheduler -> nothing to do
th1, th1d -> scheduler -> execute th0
th0, th0 -> scheduler -> execute th0d, th1
nothing -> scheduler -> execute th1d, th0
nothing -> scheduler -> execute th0

If, however, both threads need to execute 1 op on the FPU:

th0 -> scheduler -> nothing to do
th1 -> scheduler -> execute th0
nothing -> scheduler -> execute th1
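
The three sequences above can be reproduced with a toy model. The Python sketch below assumes that every op takes one cycle, that the FPU has two FMAC pipes, that a packet dispatched in one cycle can execute from the next cycle on, and that the scheduler always picks the oldest ready ops. It also assumes that a "d" op such as th0d depends on the th0 op dispatched just before it. `Op` and `simulate` are made-up names for illustration. This is only a way to check the tables, not a description of the real hardware.

```python
from collections import namedtuple

# One FP op: a unique name, the thread it belongs to, and the name of an
# earlier op it has to wait for (None if it has no dependency).
Op = namedtuple("Op", ["name", "thread", "dep"], defaults=[None])


def simulate(packets, pipes=2):
    """Toy model of a shared FP scheduler (illustrative assumption only).

    packets: one list of Ops per cycle; each list must come from one thread.
    Returns a list of (dispatched_names, executed_names), one pair per cycle.
    """
    waiting = []   # ops sitting in the scheduler, oldest first
    done = set()   # names of ops that executed in an earlier cycle
    trace = []
    cycle = 0
    while cycle < len(packets) or waiting:
        # Select up to `pipes` of the oldest ops whose dependency has completed.
        ready = [op for op in waiting if op.dep is None or op.dep in done][:pipes]
        if not ready and waiting and cycle >= len(packets):
            raise ValueError("remaining ops wait on a dependency that never runs")
        for op in ready:
            waiting.remove(op)
        # Accept this cycle's dispatch packet (ops from a single thread only).
        packet = packets[cycle] if cycle < len(packets) else []
        if len({op.thread for op in packet}) > 1:
            raise ValueError("a dispatch packet may hold ops from one thread only")
        waiting.extend(packet)
        # Results become visible to dependent ops from the next cycle on.
        done.update(op.name for op in ready)
        trace.append(([op.name for op in packet], [op.name for op in ready]))
        cycle += 1
    return trace


# The dependency example: a0d waits for a0 (thread 0), b0d waits for b0 (thread 1).
packets = [
    [Op("a0", "th0"), Op("a0d", "th0", "a0")],
    [Op("b0", "th1"), Op("b0d", "th1", "b0")],
    [Op("a1", "th0"), Op("a2", "th0")],
]
for dispatched, executed in simulate(packets):
    print(dispatched or "nothing", "-> scheduler -> execute", executed or "nothing")
```

Under these assumptions the executed ops line up with the "execute" column of the dependency table: nothing, then th0, then th0d with th1, then th1d with th0, then the last th0. Changing `pipes` or the dependency links is a quick way to see how the ordering shifts.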


In the case of Intel it is all up to the OoO logic to make it happen. They use one window for FP and integer, and both run over the same execution resources (ports). So they have to look for dependencies between ops, but also at which pipeline each op needs. From this information they need to choose the highest ILP possible. For Intel it is probably a lot less common to have the following scenario:

port0 -> th0, port1 -> th0, port5 ->th0
port0 -> th1, port1 -> th1, port5 ->th1
port0 -> th0, port1 -> th0, port5 ->th0
port0 -> th1, port1 -> th1, port5 ->th1

Might be great if Dresdenboy or some other with more knowledge would verify this :). (so take note, but don't assume this is correct :))
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
@Riek:
Your detailed explanation of the principles looks good. While ops are being received as dispatch packets belonging to one thread only, the scheduler is free to select which ops to execute depending on available execution resources, ready operands and beginning with the oldest ops.

Let me quote what I wrote in another forum (discussing the AVX execution with drwho):
Dresdenboy said:
Currently it doesn't look like the FPU would be switching between threads (then it would need to save its state). OTOH some BD components like the frontend apply thread switching.

Well, the FPU is handled like a coprocessor, maybe the main difference to Intel's architectures. See this quote out of the optimization manual which I luckily found one day after it appeared in AMD's Dev Central ;)

I think that you will easily understand the following quotes, so I won't comment on them much:

Since the Bulldozer core implements a floating point co-processor model of operation, most scheduling and execution decisions of floating-point operations are handled by the floating point unit. However, the scheduler does track the completion status of all outstanding operations and is the final arbiter for exception processing and recovery.
(page 35)
They're talking about the int scheduler here.

The FPU is a coprocessor model that is shared between the two cores of one AMD Family 15h compute unit. As such it contains its own scheduler, register files and renamers and does not share them with the integer units. This decoupling provides optimal performance of both the integer units and the FPU.
(page 37)

The FPU can receive up to four ops per cycle. These ops can only be from one thread, but the thread may change every cycle. Likewise the FPU is four wide, capable of issue, execution and completion of four ops each cycle. Once received by the FPU, ops from multiple threads can be executed.
(page 37)
So while dispatch packets are transmitted on a per-thread-basis, ops ready for execution are not.

Only 1 256-bit operation can issue per cycle, however an extra cycle can be incurred as in the case of a FastPath Double if both micro ops cannot issue together.
(page 38)
In my eyes this means AVX ops don't necessarily require parallel issue of the 128-bit uops.
 

swilli89

Golden Member
Mar 23, 2010
1,558
1,181
136
I hope you can recognize the self-defeating aspects of relying on yourself to be the judge of what you think to be an evil corp versus a lesser evil corp.

The truly evil corp will be so good at being evil that it will deftly convince you that IT is the lesser evil.

Unless you view yourself as more intelligent than the collective brains employed at evil corp XYZ, you have to acknowledge the reality that you are led to believe whatever they want you to believe about any given corporation.

In your efforts to be noble you may well be an unwitting supporter of the bigger evil. A pawn is a pawn and nothing leaves you more open to manipulation by others than your own ignorance.

I know I am ignorant of the truth of what goes on between AMD and Intel, at best we are exposed to half-truths from both as they see fit in their marketing efforts to prop themselves up while putting their competition down.

I rarely cease lurking to post, but man, is that well written AND the truth. The same can be applied to any large public entity, such as our governance.
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
Let me quote what I wrote in another forum (discussing the AVX execution with drwho):
Thanks, was looking at this section...
The FPU can receive up to four ops per cycle. These ops can only be from one thread, but the thread may change every cycle. Likewise the FPU is four wide, capable of issue, execution and completion of four ops each cycle. Once received by the FPU, ops from multiple threads can be executed.
And where the heck is drwho? I'd like to get more information on the Ivy Bridge GPU... :D
 