My (conspiracy) theory on bulldozer benches so far...


NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
Take OBR's postings with lots of salt. After all, he "punk'd" lots of news sites with his fake graphs. :p

He just wanted to punk FX-57 (Stach). Again, I'm mainly pointing out the MP Ratio; should I use Coolaler, who gets the same score?

The improvements I've seen (IMHO valid leaks, but not OBR's) from the B0 stepping (earliest) to the B2 stepping (latest) are hardly much at all in this Cinebench benchmark (extrapolating results hypothetically with frequency). Whatever fixes are forthcoming will take time (silicon re-spins, packaging, testing, validation and debugging, which causes delays). ;)

Same issue: the application might not support the correct code paths. As the man in my ear says, we will see a 50% (a vague 50%, ±25%) improvement in single-core performance once the application is updated, if it ever is updated.

You should take the man in my ear with a truckload of salt, but I'm pretty sure he knows what he is talking about.

I might as well drop my claim that it's GlobalFoundries' fault, then.

-Nost(r)aSeronx

P.S.
My list so far:
Motherboard Manufacturers (throttling CPUs without consent): 50/50 on this one
Microsoft Windows (scheduler isn't optimized / drivers aren't optimized): I don't really bother with closed-source Windows benchmarks (I was just pointing out MP 6.7 on the FX-8150 (ES) and 4.8 on the i7 2600K)
Applications (not supported, or making the AMD CPU run down 386 and SSE2 paths instead of SSE2 and SSE4.1/4.2/AVX paths) <-- this is probably more realistic

P.S.S.
I really want to get my predictions down.
(My conspiracy theory on Bulldozer Benches)

P.S.S.S.
AMD K15h x87 has higher latencies than Intel's; not sure if that matters though

P.S.S.S.S.
MP Ratios are hard to fake
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
He just wanted to punk FX-57 (Stach). Again, I'm mainly pointing out the MP Ratio; should I use Coolaler, who gets the same score?
Go to his blog and check his previous postings, and you will know what type of person he is. As for Coolaler's result (a highly reliable source of leaks), he did not run a single-threaded test. :hmm:

Same issue: the application might not support the correct code paths. As the man in my ear says, we will see a 50% (a vague 50%, ±25%) improvement in single-core performance once the application is updated, if it ever is updated.

You should take the man in my ear with a truckload of salt, but I'm pretty sure he knows what he is talking about.
And from all the sources (throughout the months until recently), nothing of that sort of improvement has been heard. That information came from people who have handled Bulldozer. ;)

I might as well drop my claim that it's GlobalFoundries' fault, then.

-Nost(r)aSeronx

My list so far:

Motherboard Manufacturers
Microsoft Windows
Applications
You must be talking about drivers (patches) and re-compilation of applications to use Bulldozer's new features? Does this imply that current software will run slower on Bulldozer? :hmm:
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
You must be talking about drivers (patches) and re-compilation of applications to use Bulldozer's new features? Does this imply that current software will run slower on Bulldozer? :hmm:

Well, there is a radical change from K10 to K15:

6 Functional Units (only 3 can work) -> 4 Functional Units (only 4 can work)

Different names for decodes

Different Pipes

Different Rules (latencies)
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
Well, there is a radical change from K10 to K15:

6 Functional Units (only 3 can work) -> 4 Functional Units (only 4 can work)
If you are talking about the 4-issue-wide integer execution unit, it's not the same type as Intel's. There are differences in design and functionality, thus performance is different. So far, for single-threaded performance, the engineering samples don't look impressive. Any improvements and fixes will take time (and delays, as mentioned earlier; my speculation is that AMD is now trying to fix many performance issues, not clock speed issues). Still no official word from AMD about shipping and launch. ;)
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
If you are talking about the 4-issue-wide integer execution unit, it's not the same type as Intel's. There are differences in design and functionality, thus performance is different. So far, for single-threaded performance, the engineering samples don't look impressive.

I wish Coolaler would show the MP Ratio

Intel's Core = GPR, Int, FP
AMD's Cores = GPR; FP Coprocessor = Int, FP

But from K7-ish cores to K15-ish cores there is a huge issue: they are no longer the same, and the Floating Point Coprocessor has also changed

Any improvements and fixes will take time (and delays, as mentioned earlier; my speculation is that AMD is now trying to fix many performance issues, not clock speed issues). Still no official word from AMD about shipping and launch. ;)

MSI Europe has leaked 1 to 2 weeks
Smartidiot says 3 to 4 weeks
 

Riek

Senior member
Dec 16, 2008
409
14
76
If you are talking about the 4-issue-wide integer execution unit, it's not the same type as Intel's. There are differences in design and functionality, thus performance is different. So far, for single-threaded performance, the engineering samples don't look impressive. Any improvements and fixes will take time (and delays, as mentioned earlier; my speculation is that AMD is now trying to fix many performance issues, not clock speed issues). Still no official word from AMD about shipping and launch. ;)

He is not saying it is the same as Intel's.... He is saying BD is a radical change from the current K8-like cores, and therefore the code paths tuned for those K8-like architectures might be (extremely) bad for BD.

Any improvements will be done in the next revision. Some improvements are design choices made to get things going; otherwise you will be stuck with a moving product that never launches.

The only performance issues that can be fixed are bugs, IMO. You can't go haywire on a design to get better improvements; changing functional blocks would take ages (so to speak).

What you can fix are frequency (design goals), power consumption, bugs, and wrong settings, or you can optimize the settings for the current design.
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
I wish Coolaler would show the MP Ratio

Intel's Core = GPR, Int, FP
AMD's Cores = GPR; FP Coprocessor = Int, FP
The "MP Ratio" is deceptive. That's because when running a single core/thread the CPUs have turbo boost, which inflates the single-core result. If you want to really know the actual per-core performance, divide the multi-threaded score by 8 (for 8 threads) and then adjust for base frequency (the frequency at which all cores run when fully utilized) for both CPUs. It's that simple... ;)

Core i7 2600K >> 6.89 / 8 = 0.861
AMD FX-8120 ES >> ( 5.27 / 8 ) * (3.4 / 3.2) = 0.7

Clock-to-clock when running multi-threaded, difference >> ((0.861 / 0.7) - 1) * 100 = 23%
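
A quick sketch of that normalization, for anyone following along (this is just my own Python illustration using the scores and base clocks quoted above; the function name is made up):

# Per-core, clock-normalized Cinebench comparison (toy sketch).
# score: multi-threaded result; threads: threads used;
# base_clock: all-core frequency; ref_clock: the frequency we normalize to.
def per_core_normalized(score, threads, base_clock, ref_clock):
    return (score / threads) * (ref_clock / base_clock)

i7_2600k = per_core_normalized(6.89, 8, base_clock=3.4, ref_clock=3.4)  # ~0.861
fx_8120 = per_core_normalized(5.27, 8, base_clock=3.2, ref_clock=3.4)   # ~0.700

print(f"clock-to-clock difference: {((i7_2600k / fx_8120) - 1) * 100:.0f}%")  # ~23%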

MSI Europe has leaked 1 to 2 weeks
Smartidiot says 3 to 4 weeks
Most of our speculated predictions (based on bits of information) point to somewhere in mid-October (around the 12th). Less than one month to go, and still no shipping announcements.... :hmm:
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
He is not saying it is the same as Intel's.... He is saying BD is a radical change from the current K8-like cores, and therefore the code paths tuned for those K8-like architectures might be (extremely) bad for BD.

Any improvements will be done in the next revision. Some improvements are design choices made to get things going; otherwise you will be stuck with a moving product that never launches.

The only performance issues that can be fixed are bugs, IMO. You can't go haywire on a design to get better improvements; changing functional blocks would take ages (so to speak).
Also, if the program uses generic code (no specific code paths), then that would be more interesting. I hope they do improve the next revision; it's kinda taking too long already (with all the "silence"). :(

What you can fix are frequency (design goals), power consumption, bugs, and wrong settings, or you can optimize the settings for the current design.
Which is a worrisome issue. Should AMD either abandon the current design or launch/release the CPU as it is, so that they can start work on an improved design as soon as possible? :hmm:
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
The "MP Ratio" is deceptive. That's because when using a single core/thread the CPUs have turbo-boost which inflates the single core result. If you want to really know the actual performance per core, then divide the multi-thread score by 8 (for 8 threads) and then adjust for base frequency (the frequency at which all cores running utilized) for both CPUs. Its that simple... ;)

Core i7 2600K >> 6.89 / 8 = 0.861
AMD FX-8120 ES >> ( 5.27 / 8 ) * (3.4 / 3.2) = 0.7

Clock-to-clock when running multi-threaded, difference >> ((0.861 / 0.7) - 1) * 100 = 23%

(6.95 / 8) * (3.4 / 3.6) = .821
(5.27 / 8) * (3.4 / 3.1) = .723

((0.861 / 0.723) - 1) * 100 = 19.09%

((0.861 / 0.821) - 1) * 100 = 4.87%


Weird scaling. It would be cool if it's true, because that would mean the higher the clocks, the more it improves (every 500MHz it improves by ~0.1, but how does the i7 2600K scale?)
(I forgot everything about math, so is that correct?)

(9.5 / 8) = 1.19

(11.4 / 8) * (5.1 / 5.1) = 1.43

((1.43 / 1.19) - 1) * 100 = 20.17% improvement over i7 Sandy Bridge if it scales perfectly (i7 SB 2600K 5.1GHz vs FX 4.9GHz)

Most of our speculated predictions (based on bits of information) point to somewhere in mid-October (around the 12th). Less than one month to go, and still no shipping announcements.... :hmm:

:'(
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
(6.95 / 8) * (3.4 / 3.6) = .821
(5.27 / 8) * (3.4 / 3.1) = .723

((0.861 / 0.723) - 1) * 100 = 19.09%

((0.861 / 0.821) - 1) * 100 = 4.87%

Weird scaling. It would be cool if it's true, because that would mean the higher the clocks, the more it improves.
(I forgot everything about math, so is that correct?)
Yups, that's why I often said "it's all about overclocking". You can check what Chew* said here.....
There are many BIOS options that can affect the outcome of benches.

HPET is one, for example; it stops the CPU from throttling back in multithreaded apps.

Running Pi on a cluster, versus a core (2 threads), versus being able to disable a single cluster in a core (which 99% of boards/BIOSes do not have implemented, so resources are not shared) can all influence the results in single-threaded tests.

Knowing all this tells you one thing for sure: you can make it look worse or make it look better, all depending on your knowledge of the chip and/or your intentions.

As far as Pi goes, it's an antiquated bench and has not been AMD's strong point for quite some time.

Granted, some results shown tend to suggest that 1M times are bad, but looking at the bigger picture we also know that in many cases you can validate 1000MHz higher with BD, which would point to the fact that you can run 1M at a lot faster speeds than current AMD tech.

Things that make you go hmm, like what kind of times we will see at 8 gig, or even comparing BD to Deneb/Thuban when the same cooling is used.
Up to your interpretations. ;)

Not many people noticed that (shipping and distribution takes time). :hmm:
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
Where did "11.4" come from? AFAIK the Core i7 2600K scores 9.42 at 4.85GHz (reference: Sandy Bridge Cinebench and WPrime Performance). That "11.4" is only possible on an 8C/16T Sandy Bridge EP system (for a single socket). :p

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
Where did "11.4" comes from? AFAIK Core i7 2600K scores 9.42 at 4.85GHz (reference Sandy Bridge Cinebench and WPrime Performance). That "11.4" is only possible on a 8C/16T Sandy Bridge EP system (for single socket). :p

I don't remember how I got 11.4; it was too late in the night.

.65 3.1GHz
.85 3.6GHz
1.05 4.1GHz
1.25 4.6GHz
1.45 5.1GHz

1.45 x 8 = 11.6

1.65 5.6GHz
1.85 6.1GHz
2.05 6.6GHz
2.25 7.1GHz
2.45 7.6GHz
2.65 8.1GHz

2.65 x 8 = 21.2

Darn, if only AMD packed LHe coolers in the box as well.

I got the SB result randomly from a forum, and it was @ 5.1GHz.

This is expecting linear scaling... but if this works in the real world, dang...
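
For anyone who wants to check those what-if numbers, here is the same back-of-the-envelope extrapolation as a tiny Python sketch (it assumes the perfectly linear 0.2-per-500MHz scaling from the list above, which real silicon will not deliver; nothing here is measured):

# Toy linear extrapolation of the per-core scores listed above.
def extrapolated_score(freq_ghz, base_freq=3.1, base_per_core=0.65,
                       slope_per_ghz=0.4, cores=8):
    # 0.2 per 500MHz = 0.4 per GHz, per the list above
    per_core = base_per_core + (freq_ghz - base_freq) * slope_per_ghz
    return per_core * cores

for f in (3.1, 5.1, 8.1):
    print(f"{f:.1f} GHz -> {extrapolated_score(f):.1f}")
# 3.1 GHz -> 5.2, 5.1 GHz -> 11.6, 8.1 GHz -> 21.2 (matching the numbers above)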
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
Some quotes from Andy Glew (creator of CMT).....
I will be quite interested to see whether Bulldozer's cluster-private L1
caches (in AMD's swapped terminology, core-private L1 caches) are write
through or write-back. Willamette's L0 was write-through. I leaned
towards write-back, because my goal was to isolate clusters from each
other, to reduce thrashing. Also, because write-back lends itself
better to a speculative versioning cache, useful for SpMT.

With Willamette as background, I leaned towards a relatively small, L0,
cache in the cluster. Also, such a small L0 can often be pitch-matched
with the cluster execution unit datapath. A big L1, such as Bulldozer
seems to have, nearly always has to lie out of the datapath, and
requires wire turns. Wire turns waste area. I have, from time to time,
proposed putting the alignment muxes and barrel shifters in the wire
turn area. I'm surprised that a large cluster L1 makes sense, but that's
the sort of thing that you can only really tell from layout.

Some posters have been surprised by sharing the FP. Of course, AMD's K7
design, with separate clusters for integer and FP, was already half-way
there. They only had to double the integer cluster. It would have been
harder for Intel to go MCMT, since the P6 family had shared integer and
FP. Willamette might have been easier to go MCMT, since it had separate FP.

Anyway... of course, for FP threads you might like to have
thread-private FP. But, in some ways, it is the advent of expensive FP,
like Bulldozer's 2 sets of 128 bit, 4x32 bit, FMAs, that justify integer
MCMT: the FP is so big that the overhead of replicating the integer
cluster, including the OOO logic, is a drop in the bucket.
You'd like to have per-cluster-thread FP, but such big FP workloads are
often so memory intensive that they thrash the shared-between-clusters
L2 cache: threading may be disabled anyways. As it is, you get good
integer threads via MCMT, and you get 1 integer thread and 1 FP thread.
Two FP threads may have some slowdown, although, again, if memory
intensive they may be blocking on memory, and hence allowing the other
FP thread to use the FP. But two purely computational FP threads will
almost undoubtedly block, unless the schedulers are piss-poor and can't
use all of the FP for a single thread (e.g. by being too small).

I certainly want to explore possibilities such as SpMT and other single
thread speedups. But I know that you can't build all the neat ideas in
one project. Apparently MCMT by itself was enough for AMD Bulldozer.
(Actually, I am sure that there are other new ideas in Bulldozer. Just
apparently not SpMT or spreading a single thread across clusters.) Look
at the time-lag: 10-15 years from when I came up with MCMT in
Wisconsin, 1996-2000. It is now 7-5 years from when I was at AMD,
2002-2004, and it will be another 2 years or so before Bulldozer is a
real force in the marketplace.
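
Glew's write-through vs. write-back distinction is easy to pin down with a toy model. The sketch below is purely my own illustration (it does not model Willamette or Bulldozer); it just shows when the backing store gets updated under each policy:

# Toy single-level cache showing write-through vs. write-back behaviour.
# Real caches track sets/ways/tags; this only tracks dirty lines.
class ToyCache:
    def __init__(self, policy, backing):
        self.policy = policy      # "write-through" or "write-back"
        self.backing = backing    # dict standing in for the next level / memory
        self.lines = {}           # addr -> value
        self.dirty = set()

    def write(self, addr, value):
        self.lines[addr] = value
        if self.policy == "write-through":
            self.backing[addr] = value   # next level updated immediately
        else:
            self.dirty.add(addr)         # defer the update until eviction

    def evict(self, addr):
        if addr in self.dirty:           # write-back: flush dirty data on eviction
            self.backing[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)

mem = {}
wb = ToyCache("write-back", mem)
wb.write(0x40, 123)
print(mem.get(0x40))  # None: memory not updated yet
wb.evict(0x40)
print(mem.get(0x40))  # 123: updated only at eviction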
 

VirtualLarry

No Lifer
Aug 25, 2001
56,542
10,167
126
No, AMD's implementation of CMT is an inferior, less elegant version of a wide core with HT. For a BD module and a SB core with Hyperthreading, with two threads both a BD module and SB core can be considered two "Wimpy" cores. However, with one thread, SB becomes a "Brawny" core as the one thread has access to every resource of the core. With one thread on a BD module, it's still only running on a "Wimpy" core and the execution resources of the second integer core are inaccessible to the thread.

Are you sure about that? According to Anand's Ivy Bridge architecture preview article, Sandy Bridge uses static partitioning for chip resources for HT, whereas Ivy Bridge will dynamically partition chip resources, such that when not running with a HT, the current thread will get more resources.

So I think that if you were talking about IB, then you would be correct, but not for SB.
 

blackened23

Diamond Member
Jul 26, 2011
8,548
2
0
OBR just admitted to falsifying BD benchmarks. Good stuff.

I can't believe AMD would release a CPU slower than Phenom II....hopefully real benchmarks will be released soon so that all of this nonsense will end.
 

GammaLaser

Member
May 31, 2011
173
0
0
Are you sure about that? According to Anand's Ivy Bridge architecture preview article, Sandy Bridge uses static partitioning for chip resources for HT, whereas Ivy Bridge will dynamically partition chip resources, such that when not running with a HT, the current thread will get more resources.

So I think that if you were talking about IB, then you would be correct, but not for SB.

It's true for certain structures--the re-order buffer, load/store buffers, uop cache, and probably other queues are statically partitioned, while things like the execution units, reservation stations, and L1/MLC/LLC caches are competitively shared between both threads. So it looks like there could be some gains in IPC in single-threaded scenarios with IVB's more flexible HTT. :thumbsup:
 

StrangerGuy

Diamond Member
May 9, 2004
8,443
124
106
It's true for certain structures--the re-order buffer, load/store buffers, uop cache, and probably other queues are statically partitioned, while things like the execution units, reservation stations, and L1/MLC/LLC caches are competitively shared between both threads. So it looks like there could be some gains in IPC in single-threaded scenarios with IVB's more flexible HTT. :thumbsup:

If you read Anand's article on IB architecture he says that IB increases IPC over SB by 4-6%.
 

blackened23

Diamond Member
Jul 26, 2011
8,548
2
0
It's true for certain structures--the re-order buffer, load/store buffers, uop cache, and probably other queues are statically partitioned, while things like the execution units, reservation stations, and L1/MLC/LLC caches are competitively shared between both threads. So it looks like there could be some gains in IPC in single-threaded scenarios with IVB's more flexible HTT. :thumbsup:

IB will have increases primarily through higher clock speeds. IPC will not improve that much over the existing Sandy Bridge.... What Intel did do, however, was increase the IGP by 60% and lower power consumption.
 

Accord99

Platinum Member
Jul 2, 2001
2,259
172
106
Are you sure about that? According to Anand's Ivy Bridge architecture preview article, Sandy Bridge uses static partitioning for chip resources for HT, whereas Ivy Bridge will dynamically partition chip resources, such that when not running with a HT, the current thread will get more resources.

So I think that if you were talking about IB, then you would be correct, but not for SB.
From a previous article on Nehalem's Hyperthreading:

http://www.anandtech.com/show/2594/8

The execution units are thread unaware and can execute instructions from either thread.
 

intangir

Member
Jun 13, 2005
113
0
76
This article applies to Hyperthreading in the Pentium 4 Xeons, but it should provide some insight into how it must be implemented in other architectures:

http://arstechnica.com/old/content/2002/10/hyperthreading.ars/4


In order to present two logical processors to both the OS and the user, the Xeon must be able to maintain information for two distinct and independent thread contexts. This is done by dividing up the processor's microarchitectural resources into three types: replicated, partitioned, and shared. Let's take a look at which resources fall into which categories:

Replicated
- Register renaming logic
- Instruction Pointer
- ITLB
- Return stack predictor
- Various other architectural registers

Partitioned
- Re-order buffers (ROBs)
- Load/Store buffers
- Various queues, like the scheduling queues, uop queue, etc.

Shared
- Caches: trace cache, L1, L2, L3
- Microarchitectural registers
- Execution Units

Sounds to me like some structures that used to be statically partitioned between threads in Sandy Bridge are now shared in Ivy Bridge.
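
As a toy illustration of why that partitioning choice matters for a lone thread, here is a small sketch, entirely made up and not modelling any real CPU, of a statically partitioned buffer versus a competitively shared one:

# Toy model: a 56-entry buffer either split 28/28 per thread (static)
# or handed out on demand (shared). Not a model of any real CPU.
def usable_entries(policy, demand_t0, demand_t1, total=56):
    if policy == "static":
        half = total // 2
        return min(demand_t0, half), min(demand_t1, half)
    # "shared": a busy thread can take whatever the other thread isn't using
    t0 = min(demand_t0, total - min(demand_t1, total))
    return t0, min(demand_t1, total - t0)

# One busy thread, one idle HT sibling:
print(usable_entries("static", demand_t0=50, demand_t1=0))  # (28, 0): half the buffer sits idle
print(usable_entries("shared", demand_t0=50, demand_t1=0))  # (50, 0): the single thread gets more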
 

Riek

Senior member
Dec 16, 2008
409
14
76
This article applies to Hyperthreading in the Pentium 4 Xeons, but it should provide some insight into how it must be implemented in other architectures:

http://arstechnica.com/old/content/2002/10/hyperthreading.ars/4




Sounds to me like some structures that used to be statically partitioned between threads in Sandy Bridge are now shared in Ivy Bridge.

Since they knew the partitioning and made their design choices and the sizing of those elements with that partitioning in mind, I would suppose there isn't much performance to gain.
 

intangir

Member
Jun 13, 2005
113
0
76
Since they knew the partitioning and made their design choices and the sizing of those elements with that partitioning in mind, I would suppose there isn't much performance to gain.


It's probably hit or miss depending on workload. But research indicates up to 33% throughput gain over static partitioning is possible.

http://ieeexplore.ieee.org/Xplore/l...629335.pdf?arnumber=5629335&authDecision=-203

Simultaneous multithreading (SMT) increases processor throughput by allowing parallel execution of several threads. However, fully sharing processor resources may cause resource monopolization by a single thread or other misallocations, resulting in overall performance degradation. Static resource partitioning techniques have been suggested, but are not as effective as dynamic ones since program behavior does change over the course of its execution. In this paper, we propose an Adaptive Resource Partitioning Algorithm (ARPA) that dynamically assigns resources to threads according to changes in thread behavior. ARPA analyzes the resource usage efficiency of each thread in a given time period and assigns more resources to threads which can use them more efficiently. Its purpose is to improve the efficiency of resource utilization, thereby improving overall instruction throughput. Our simulation results on a set of 42 multiprogramming workloads show that ARPA outperforms the traditional fetch policy ICOUNT by 55.8 percent with regard to overall instruction throughput and achieves a 33.8 percent improvement over Static Partitioning. It also outperforms the current best dynamic resource allocation technique, Hill-climbing, by 5.7 percent. Considering fairness accorded to each thread, ARPA attains 43.6, 18.5, and 9.2 percent improvements over ICOUNT, Static Partitioning, and Hill-climbing, respectively, using a common fairness metric. We also explore the energy efficiency of dynamically controlling the number of powered-on reorder buffer entries for ARPA. Compared with ARPA, our energy-aware resource partitioning algorithm achieves 10.6 percent energy savings, while the performance loss is negligible.
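
For what it's worth, the general idea reads something like the sketch below: every epoch, shift buffer entries toward whichever thread used its share more efficiently. This is my own loose simplification of the abstract, not the paper's actual ARPA algorithm:

# Rough sketch of adaptive (ARPA-style) partitioning for two threads.
def repartition(entries, committed, total=64, step=4, floor=8):
    # entries[i]: ROB entries assigned to thread i
    # committed[i]: instructions thread i committed in the last epoch
    efficiency = [c / max(e, 1) for c, e in zip(committed, entries)]
    winner = efficiency.index(max(efficiency))
    loser = 1 - winner                      # two-thread case only
    if entries[loser] - step >= floor:      # keep a minimum share per thread
        entries[winner] += step
        entries[loser] -= step
    assert sum(entries) == total
    return entries

parts = [32, 32]
for epoch in [(900, 300), (950, 280), (400, 800)]:
    parts = repartition(parts, list(epoch))
    print(parts)   # entries drift toward the thread using them more efficiently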
 

JFAMD

Senior member
May 16, 2009
565
0
0
After reading some of the leaked benchmarks that have been appearing lately regarding Bulldozer (FX-8150), it appears that the 8150 has lower single-thread IPC (clock for clock) than Phenom II (Super Pi = 21 sec at 3.6GHz, maybe with turbo on).

My theory is that the chips (ES samples?) being benchmarked are restricted in the following way. For a module running a single thread, that thread should be able to use all available resources within the module (the module essentially becomes a 4-issue core). However, I think that during these benchmarks there might only be a static allocation of resources (so in the leaked benchmarks the module acts as a 2-issue core). Somebody in these threads also mentioned that, clock for clock, Bulldozer appears to have the IPC of Zacate chips (which are 2-issue out-of-order cores).

your thoughts?

Also, on a side note, I think a better way to look at Bulldozer is not as an 8-core processor, but as a quad-core processor where AMD has implemented a more elegant solution than HT for running simultaneous threads on each core.


My thought is that 90% are fake and the other 10% are not representative of actual performance. Trying to draw any conclusions on performance, in light of how many fakes are out there, is impossible.
 

blackened23

Diamond Member
Jul 26, 2011
8,548
2
0
I'm not sure what you're talking about; AMD releasing a CPU slower than the 486 sounds completely plausible. I mean, that's what all the benchmarks floating around would have us believe. That makes fantastic business sense, and I'm sure that AMD doing years of research would result in a super 486.