Setting performance expectations for Bulldozer (client)


Mopetar

Diamond Member
Jan 31, 2011
8,496
7,752
136
But still, writing efficient message-passing programs is way more complicated than writing efficient threaded programs, and that's way more complicated than writing sequential programs.

Yeah it's a lot more difficult, but if there's money to be made by writing more efficient software, someone is going to do it.

On the other hand, we're just starting to see companies work towards creating APUs. It will be interesting to see how the designs evolve over time. They can also find other stuff to integrate into their chips or move on-die. Intel and AMD should have plenty to do for the next several generations.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
From a performance perspective, it is a Tick IMHO. The only possible rationale for a "Tock" would be the inclusion of a graphics core.

I see your point. But I have a feeling most people are expecting Pentium 4 to Core 2 Duo improvements every Tock. But then you could say Core 2 was a Tick based on laptop improvements: http://www.anandtech.com/show/2056/9

The advancement was only 10-15% there too. But it's a Tock based on the mobile chips, isn't it? You could say that's because the previous chip was crippled by not having 32nm, but so was the Pentium 4. On mobile it achieves significantly better battery life and performance, and with a mere shrink only one of those happens.

Penryn's gains were also about half that, or less: http://www.anandtech.com/bench/Product/56?vs=59

Edrick said:
I agree with your statement on multiple cores and programming. People tend to focus on one metric because marketing makes us that way (it's easier for them). Beyond 4 cores on the desktop, I don't know.

I do not think each process is dependent on the previous.
I have to disagree with that strongly. You can't skip to the next step in process technology when you are already on the cutting edge. For example, Itanium was able to "skip" 45nm because even with that skip it would still be on a last-generation process. Intel has been increasing their lead by marching one step at a time, to reduce risk. Remember that they can fail miserably too, as with Prescott, when they couldn't match the process tech well with the design.

There was a PC Watch article where the Nehalem architect said they could have eked out 5-10% additional performance, but didn't, because it would have added too much risk. This is one case where theory does not match reality at all.

Going to 22nm also requires that the companies supplying the tools can deliver them in volume and with reliability. One of the problems the lithography people pointed out is that the wavelength of the light that prints the circuits is now far larger than the circuits themselves, so complex extra steps are needed to make it work. Go one step beyond that and you are just multiplying the number of problems that need to be solved.

The guys who set world records climbing 60-plus-storey buildings by the stairs say this: "You need to know your limits and go at a certain pace; if you go too fast at the beginning, you'll needlessly waste your energy and never make it to the top."

Same thing. If the process guys don't know their limits, they'll fail just like the amateurs doing the stair climb. All the theory won't help as much as a single real-world application of it, and mass manufacturing is exactly that.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
I don't know much about financial software, but I assume there are some operations that need to be applied to numerous members of a data set. These are the cases where having huge numbers of threads can significantly improve performance. Of course, if there's no good support for something like that in the language you use, it's not going to help you much.

Instruction set improvements like FMA (hopefully in Haswell) will be much more beneficial to financial applications (and games) than adding more cores. And adding these instruction sets requires fewer code changes than adding/managing more threads. In theory, FMA could double the FPU performance per core (using these instructions). That can be huge.

I am not saying more cores is bad. Just saying there are many other IPC improvements that can be just as important, if not better, for some applications.
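
To make the quoted point about applying one operation across numerous members of a data set concrete, here is a minimal sketch (my own illustration, not from either poster; the data, the per-element operation, and the sizes are all made up) that splits the work across hardware threads with std::thread:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Apply a per-element operation to `data`, split across `nthreads` workers.
// Each worker owns a disjoint [begin, end) slice, so no locking is needed.
void parallel_apply(std::vector<double>& data, unsigned nthreads) {
    if (nthreads == 0) nthreads = 1;  // hardware_concurrency() may return 0
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(begin + chunk, data.size());
        if (begin >= end) break;
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = std::sqrt(data[i]) * 1.05;  // stand-in for the real operation
        });
    }
    for (auto& w : workers) w.join();  // wait for every slice to finish
}

int main() {
    std::vector<double> prices(1000000, 2.0);
    parallel_apply(prices, std::thread::hardware_concurrency());
    std::cout << prices[0] << '\n';
}
```

Embarrassingly parallel loops like this are the easy case the quote alludes to; without language or runtime support for expressing them, the slicing-and-joining bookkeeping above is the part that doesn't come for free.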
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
In theory, FMA could double the FPU performance per core (using these instructions).
Well, I've seen some explanations of the performance gains from FMA (e.g. saving one round-and-normalize step; less register pressure), but doubling the performance? Any sources where I can read up on that?
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Well, I've seen some explanations of the performance gains from FMA (e.g. saving one round-and-normalize step; less register pressure), but doubling the performance? Any sources where I can read up on that?

Double performance on FMA instructions, not in the entire FPU. Similar to how Intel says AVX can double certain FPU operations. I understand FMA better than I do AVX, since FMA has been around for a long time on other platforms (IBM POWER, Itanium, MIPS, NVIDIA Fermi, etc.) and most compilers supported FMA long ago, while AVX is new. But again, this is all theory in regard to Intel. I do not expect double, but I do expect a huge gain in performance.

I will try to find some good websites for you to read.
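
To put a number on the theoretical 2x (a back-of-envelope sketch of my own, not Edrick's): one fused multiply-add retires two flops (a multiply and an add) as a single operation, so a pipe that previously issued one multiply or one add per cycle peaks at twice the flops on multiply-add-dominated code such as a dot product. In C++ the fused operation can be spelled portably with std::fma; whether the compiler emits an actual hardware FMA instruction depends on the target (e.g. building with -mfma on x86):

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Dot product built around fused multiply-add. Each std::fma computes
// x[i] * y[i] + acc as a single operation: two flops per call, and one
// instruction on hardware that has FMA units.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        acc = std::fma(x[i], y[i], acc);
    return acc;
}

int main() {
    std::vector<double> a(1000, 0.5), b(1000, 2.0);
    std::cout << dot(a, b) << '\n';  // 1000 * (0.5 * 2.0) = 1000
}
```

The 2x is a peak-throughput figure for mul+add chains only; code that is mostly adds or mostly multiplies gains far less, which is exactly the "FMA instructions, not the entire FPU" distinction above.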
 

OneEng1

Junior Member
Apr 3, 2010
9
0
0
IntelUser2000 said:
I see your point. But I have a feeling most people are expecting Pentium 4 to Core 2 Duo improvements every Tock. But then you could say Core 2 was a Tick based on laptop improvements: http://www.anandtech.com/show/2056/9

The advancement was only 10-15% there too. But it's a Tock based on the mobile chips, isn't it? You could say that's because the previous chip was crippled by not having 32nm, but so was the Pentium 4. On mobile it achieves significantly better battery life and performance, and with a mere shrink only one of those happens.
True. Intel's mobile products only deviated from the basic PPro design to the P4 for a short time, and then were back with Centrino. Going from Centrino to Core 2 was not really a new architecture, but rather a good enhancement of Centrino. Add in a few more tweaks, an IMC, and a point-to-point interconnect protocol, and you have Nehalem.

I am guessing I am simply going to have to redefine my notion of what a "Tock" is :)
Edrick said:
Instruction set improvements like FMA (hopefully in Haswell) will be much more beneficial to financial applications (and games) than adding more cores. And adding these instruction sets requires fewer code changes than adding/managing more threads. In theory, FMA could double the FPU performance per core (using these instructions). That can be huge.
I have nothing against instruction set improvements; however, things like OpenCL have the potential for a much bigger impact across more applications, IMHO. Having hundreds of cores at hand through OpenCL is a powerful game changer. Still, as with the AES instructions, certain applications can get a step jump in performance from instruction set improvements.
 

OS

Lifer
Oct 11, 1999
15,581
1
76
Very interesting discussion... of course I would take one billion-GHz core over X cores at a billion GHz divided by X.

The industry hit the clock-rate wall a long time ago, hence the switch to selling people more cores.

I think IPC can only be squeezed so much without affecting things like pipeline length and parallel execution, which have their own limits.


It did not occur to me that thread/core scaling might also hit a wall. As noted earlier, most educational curricula and programs do a poor job of covering multicore/multithreading topics.
I have a BSEE/MSEE, both in the computer option, and my program did not cover this well.

In that case, I wonder if we might see our entire known computing model/paradigm more or less max out within our lifetimes.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Isn't FMA3 in IB? In any case, FMA4 is in Bulldozer :whistle:

FMA requires a significant change in the FPU. Going from each unit executing a multiply OR an add at a time to a multiply AND an add is a radical change. I don't expect any cache changes either, like for the L2. The L3 could be bigger, but only because there would be more cores. It'll probably be like Westmere was to Nehalem.

OS said:
The industry hit the clock-rate wall a long time ago, hence the switch to selling people more cores.

The next fad I assume is... heterogeneous computing.

I really hope it works. I'd really like to see a few powerful cores + many weaker cores. Not CPU+GPU, because that would only satisfy certain markets. I'd like Haswell with a few Pentium cores on the side, please. :)
 

OneEng1

Junior Member
Apr 3, 2010
9
0
0
I guess what is exciting to me about BD is that it is a major architectural departure from anything we have previously seen.

AMD has always had separate pipelines for FP and integer, and with BD they have extended this to completely separate execution units. Intel, on the other hand, has opted for more uniformity in its core design.

With BD we also see a very different approach to transistor sharing than what Intel has done with SMT.

I think this is a very interesting design that AMD has embarked on. I suspect it is a move toward putting a number of different KINDS of cores on a die and having them share a front end and cache hierarchy.

Perhaps Cell was ahead of its time, but BD certainly reminds me of it :)
 

maddie

Diamond Member
Jul 18, 2010
5,157
5,545
136
A question about a single-threaded workload

Assumptions:
For a given thread and instruction set, you can only overlap the individual processing steps as far as the specific instructions allow, and thus only increase IPC so far.

There appears to be a maximum IPC for any given instruction set, and also for any specific instruction within that set.


Can anyone expand on this or correct it?

Intel might be close to the maximum IPC possible.

Could this be why AMD appears to be emphasizing clock speed increases without the historical increase in power consumption?
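
One concrete way to see the ceiling maddie is describing (a toy illustration of my own, with made-up loop bounds): a loop-carried dependency chain cannot retire faster than one operation per result no matter how wide the core is, while independent work can fill the extra execution ports:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Serial chain: every add needs the previous result, so even an
    // arbitrarily wide core retires at most one of these adds per cycle.
    // IPC here is capped by the dependency, not by execution resources.
    std::uint64_t chain = 0;
    for (std::uint64_t i = 0; i < 100000000ULL; ++i)
        chain += i;  // depends on the previous iteration

    // Independent work: four accumulators with no cross-dependencies,
    // so a wide out-of-order core can retire several adds per cycle.
    std::uint64_t a = 0, b = 0, c = 0, d = 0;
    for (std::uint64_t i = 0; i < 100000000ULL; i += 4) {
        a += i; b += i + 1; c += i + 2; d += i + 3;
    }

    std::cout << chain << ' ' << (a + b + c + d) << '\n';
}
```

Timed with something like perf stat (and auto-vectorization disabled), the second loop typically sustains a much higher IPC than the first; the limit for any given instruction stream is set by its longest dependency chain.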
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I really hope it works. I'd really like to see a few powerful cores + many weaker cores. Not CPU+GPU, because that would only satisfy certain markets. I'd like Haswell with a few Pentium cores on the side, please. :)
Hopefully we'll get both. If all this SoC-on-demand hype comes to fruition, that may be an option. For some things, nearly branch-free compute cores like the ones in our GPUs blow anything else away, being so dense. Throw in too many direct branches (or PITA-to-code facsimiles of them), or just a few indirect branches... and depending on whose you are using, it will either completely fail or be too slow to be usable.
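
A tiny illustration of what "nearly branch-free" means in practice (my own sketch, not Cerb's): GPU lanes in a warp execute in lockstep, so divergent branches make every lane pay for both sides, and compute kernels are therefore usually written as selects that all lanes execute identically. The same transformation shown in plain C++:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Branchy clamp: fine on a CPU with good branch prediction, but on a
// GPU every lane in a warp pays for both sides when lanes disagree.
void clamp_branchy(std::vector<float>& v, float lo, float hi) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] < lo)      v[i] = lo;
        else if (v[i] > hi) v[i] = hi;
    }
}

// Branchless clamp: a pure min/max select that every lane executes the
// same way; this is the shape compilers lower to predicated instructions.
void clamp_branchless(std::vector<float>& v, float lo, float hi) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = std::min(std::max(v[i], lo), hi);
}
```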

GPGPU is the only reasonable way to go for desktop/laptop/cellphone/etc., but for proper workstations and servers, masses of weak CPUs w/ heavy multithreading could be quite good for some workloads, and in a few gens, we might see them as Xeons, or parts of Xeons.

maddie said:
Could this be why AMD appears to be emphasizing clock speed increases without the historical increase in power consumption?
Not likely. While we are probably approaching the maximum IPC for x86 assuming branch mispredictions and cache misses never happen (infinitely large instruction windows don't exist, though, so we won't hit a hard wall), the reality is that Intel has only a single other company with CPU R&D resources in its league: IBM. Nobody else in the entire world comes even close. AMD has to do the best they can with exceptionally limited R&D resources. That they were able to pull off the K7 and K8, even with Intel on psychoactive drugs, is no small feat. If BD can compete with the higher-end SB CPUs, it will be quite a show of what carefully directed ingenuity can accomplish (I'm doubtful, but AMD has done it in the past). Intel pisses away more R&D resources than AMD can bring to bear, and clock speed is an easy way to speed things up.
 

OneEng1

Junior Member
Apr 3, 2010
9
0
0
AMD has a nearly impossible task.
  1. They have a fraction of the R&D resources
  2. Their processors spend the majority of their life competing against Intel processors with 2 times the transistor budget (since Intel is nearly always a die shrink ahead).
  3. AMD struggles to get software, OS and Compiler support while Intel drives the entire industry with regards to these items.
  4. AMD is greatly inferior to Intel in terms of business deals. Intel can package chipsets, motherboards and processors into a deal.

When AMD succeeded in eclipsing Intel, it was only due to three factors. First, AMD moved to copper interconnects, allowing the K7 to surpass the PIII's clock potential while having equal or even slightly better IPC. Second, the P4's basic design dictated very high clocks to achieve high performance, which was in direct conflict with multi-core power requirements. Third, Intel withheld 64-bit from x86 in a vain attempt to promote the Itanium product they had spent so much time and money on.

Without this colossal list of horrible mistakes, is it possible for AMD to ever eclipse Intel again?

Bulldozer is most surely going to be a huge step forward for AMD. At the end of the day, it all comes down to the price you can command for your product compared to the cost of making and developing it.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Yeah cuz they did so well all the other times they attempted to get into the x86 market.

And look at what IBM did for Apple, breathtaking is one word we could use to describe their effect on their business partners and market segments.

Hence: buy AMD rather than trying to do it alone. :)

IBM can offer AMD more development cash as well as facilities. Not to mention AMD's BD design has taken a page from IBM's handbook, so to speak. I think it would be a good fit, but who really knows. Much better than the Oracle-Sun venture, in my opinion.
 

greenhawk

Platinum Member
Feb 23, 2011
2,007
1
71
Quotes from members, looking through the thread:

Black Ops (chewed up all 8 threads of my Xeon)
FSX (chews up all 8 threads of my Xeon as well)

If Hyper-Threading is in use, then I find people's "it uses <x> threads" claims a little pointless. I've seen too many proper Hyper-Threading reviews comparing HT on and HT off for the same (thread-limited) software that show similar performance numbers for both, yet near-100% CPU utilization regardless of how many "cores"/threads are running (when the thread count equals the real core count).