S/A: "AMD outs bulldozer based orochi die"

Martimus · Sep 9, 2010

jvroig said:
If you mean this to be an explanation of the Hot Chips statement, then I will have to disagree. The statement is 80% of the throughput of CMP. By this option you present, the max throughput will be 90% of CMP. We thus end up at square one with the Hot Chips vs JFAMD numbers. I believe this to be an unlikely explanation.

In this case, I meant that it is possible for AMD to design the module so that one thread gets priority over the other thread. If they did this (which I am not sure would be all that useful) the first thread could have priority to all resources available on the module, and the second thread would have to wait until the first thread is done with that particular resource before it can use it.

The other option would be to do it how Intel does it, and just have each resource come on a first-come, first-serve basis where it will slow down both threads on an individual thread basis, but should result in an overall faster application.

TBH, I think the 80% figure comes from the amount of resources that are shared. If only 20% of the resources are shared where only one thread can use the resource at a time, then that would make the 80% figure make sense. It would also make it a very conservative figure, since not all resources will be used at one time. Of course this is just my thought after reading through this recent line of questions about efficiency, since I had not thought about it before.

jvroig · Sep 9, 2010

Martimus said:
TBH, I think the 80% figure comes from the amount of resources that are shared. If only 20% of the resources are shared where only one thread can use the resource at a time, then that would make the 80% figure make sense. It would also make it a very conservative figure, since not all resources will be used at one time. Of course this is just my thought after reading through this recent line of questions about efficiency, since I had not thought about it before.

Yes, it is absolutely because of the amount of shared resources. There's a balance to be struck in "sharing everything" and "sharing nothing", so somewhere along the way they made the decision to share x amount and that shaped the power/size/performance characteristics into what it is now (or will be).

At any rate, It is not about whether 80% is believable or not. It certainly is believable. What started all of these was simply a couple of conflicting estimates - is it 80% of CMP (c/o Hot Chips), or is it 90% of CMP (c/o JFAMD, now and even before Hot Chips)?

As I've said, it's been closed (at least for me) the moment JFAMD opted to cast a wider net. An estimate will always have a low and high figure anyway. Perhaps at Hot Chips AMD decided to play only with the low or middle-ground, while JFAMD is more inclined to play with the high figure. Either way, at least by casting a wider net JFAMD puts it to rest. As you've concluded yourself, 80% is probably conservative, and that seems a reasonable target to me.

Martimus · Sep 9, 2010

jvroig said:
No, Hyperthreading would be SMT

CMP = chip multiprocessor (real cores, traditional, no HT)
CMT = clustered / cluster-based multi-threading (module approach)
SMT = simultaneous multi-threading (HT)

100% is the baseline, so a quad i7 will have a max throughput (we are talking of multi-threads, as I did clearly say; we have already gone over the issue that at single-threads only, there is negligible penalty, so we aren't discussing that anymore) of no less than 400%. With HT on, we can add ~80% to that, so we get ~480%, maybe 500% even. Still, it is above the baseline. Core efficiency goes up, not down.

For CMT, we already accept a penalty. For simple core efficiency metrics (and realize here that "efficiency" in this conversation was misused to mean performance or throughput - not by me, I simply followed through it), we lost performance immediately. In the end, it can still be a win, if the performance loss through CMT will be more than offset by the additional cores made available (and this is part of the picture AMD paints, so yes we can count on this especially on serverland). But as it is, the "efficiency" (throughput) went down per core in multi-threaded workloads, we just count on having more cores to end up with competitive / better performance. Hence, my bewilderment in putting at parity the throughput of a CMP and CMT design when all cores are running.

I think you are looking at a four module BD chip as an 8 core processor. While that is true on the marketing side (since they call it an 8 core processor), it isn't as simple as you are putting it. If you have 4 threads, there should be no immediate penalty as long as each thread uses a different module (which you say already happens with Hyperthreading, so it is reasonable to think that OSes can do this.) There should not be any additional penalties that you won't see on a CMP chip when using only one thread per module.

The biggest difference between the AMD CMT and the Intel SMT is that AMD seems to have gone through the core and figured out which portions of the processor are used the most often, and duplicated those portions. The rest is shared in a very similar way that Intel shares resources in SMT. This is a pretty straight forward evolution to SMT, and I am sure Intel will follow suit in the near future, if for no other reason than it just makes a lot of logical sense.

We are probably just talking past each other, as I am sure we both want to know the same thing. I want to know how much of the module is shared between the two "cores", and which parts are shared and how, so I can better understand what conflicts there may be. This will tell us what kinds of programs will see the best throughput, and what kinds of programs will see the least benefit from CMT.

JFAMD · Sep 9, 2010

Wow, a lot of discussion.

Here's the real question - 4 cores with HT vs. 8 cores.

Up to 4 threads probably not a big difference.

Thread #5 is where everything turns heavily in favor of physical cores. Let's not get caught up in percentages and doing math. Let's all agree that there is a big difference.

I think more physical cores matter.

If you think I am wrong and believe that things will either stay the same as today or go down, then you should be looking at big cores and HT.

When both processors are out, we'll be able to see who is right. I am banking on things being more threaded in the future and workloads getting heavier.

Idontcare · Sep 9, 2010

JFAMD said:
Thread #5 is where everything turns heavily in favor of physical cores. Let's not get caught up in percentages and doing math. Let's all agree that there is a big difference.

If your app is LinX then Thread #5 is about 1 thread past the time when things turn to shit...

oops, did I say that? I mean to say they become less than optimal in this particular niche application...()

Contrast to CMP:

Can't wait to see how Interlagos delivers :thumbsup:

Caribbean Geek · Sep 9, 2010

But is it certainly true that HyperThreading can increase the performance up to 15% per core?

http://www.anandtech.com/bench/Product/47?vs=109

Which would means that in the ideal multithreaded scenario, a similar hyperthreaded CPU like the i7 920 should be 60% faster than the HT less i5 750, and yet, that link proves that the gains due to HT are very slim, which shows that if a thread is using near 100% of execution resources usage, there's little idle resources for a second thread, so for me, Bulldozer approach for Multi Threading makes more sense.

Idontcare · Sep 9, 2010

Caribbean Geek said:
But is it certainly true that HyperThreading can increase the performance up to 15% per core?

http://www.anandtech.com/bench/Product/47?vs=109

Which would means that in the ideal multithreaded scenario, a similar hyperthreaded CPU like the i7 920 should be 60% faster than the HT less i5 750, and yet, that link proves that the gains due to HT are very slim, which shows that if a thread is using near 100% of execution resources usage, there's little idle resources for a second thread, so for me, Bulldozer approach for Multi Threading makes more sense.

I'm no expert on SMT/CMT but I believe that the situation with SMT is this - it's opportunity to add value rests solely on the existence and prevalence of inefficiently compiled code.

As compilers get better at eliminating inefficiencies then the opportunities for SMT to add value to the equation goes away whereas the exact opposite is true for CMT and CMP implementations.

CMP and CMT the more effective the compiler is in generating code that can keep the decoders/schedulers/etc busy then the higher the performance for the same chip.

I can't overstate enough the likelihood of my being wrong about this, but the last pseudo-in depth discussion we had here at AT forums regarding hyperthreading this was the agreed upon conclusion.

Kind of like SSD's and garbage collection...the value-add of garbage collection exists because of the inefficiencies of the OS and storage drivers. As the OS and storage drivers become better (trim, etc) then the value added by implementing a GC algorithm on your SSD diminishes.

Garbage collection and hyper-threading take advantage of existing inefficiencies present in existing software environments. The magnitude and prevalence of those inefficiencies are expected to decrease over time, so the advantages and opportunities afforded by these methods will decline as well.

CMP and CMT are the future. Won't keep Intel from making a mint though while we take our sweet time getting there

And who knows, could be like MCM and Fusion, AMD can talk about CMT all day long but it ain't here yet and for all we know Intel will upstage them with their own CMP solution anyways.

Accord99 · Sep 9, 2010

Caribbean Geek said:
But is it certainly true that HyperThreading can increase the performance up to 15% per core?

It's 15-30% in total throughput. Assuming an application that can scale well to 8 threads.

http://www.anandtech.com/bench/Product/47?vs=109

http://www.anandtech.com/bench/Product/88?vs=146
2 extra real cores but the gains are slim other than applications that scale well. Just shows that most desktop applications can't multi-thread well.

Which would means that in the ideal multithreaded scenario, a similar hyperthreaded CPU like the i7 920 should be 60% faster than the HT less i5 750, and yet, that link proves that the gains due to HT are very slim, which shows that if a thread is using near 100% of execution resources usage, there's little idle resources for a second thread, so for me, Bulldozer approach for Multi Threading makes more sense.

But compare it with a Core 2 Duo or Phenom 2 and you'll see that an i7 920 does have around 50% more throughput in well-threaded applications. Nehalem is a powerful core that with HT gives excellent single-threaded and dual-threaded performance.

Nemesis 1 · Sep 9, 2010

JFAMD said:
Wow, a lot of discussion.

Here's the real question - 4 cores with HT vs. 8 cores.

Up to 4 threads probably not a big difference.

Thread #5 is where everything turns heavily in favor of physical cores. Let's not get caught up in percentages and doing math. Let's all agree that there is a big difference.

I think more physical cores matter.

If you think I am wrong and believe that things will either stay the same as today or go down, then you should be looking at big cores and HT.

When both processors are out, we'll be able to see who is right. I am banking on things being more threaded in the future and workloads getting heavier.

First of all your constantly wanting to compare 8 threads rather than cores.

Than your talking future about good threading and how well real cores will scale . Did intel die and go away in this future. By the time we have BD on the desktop 22nm. Ivy bridge will be ready to appear . Bottom line you want to compare cost constantly which is fine . But lets say your correct BD bitch slaps SB . Suddenly AMD cpus cost more than intel . Now are you going to say Intel is the better value because its cheaper . Or will you say AMD is better value because it offers more performance . This at a time when intel is releasing 22nm so it should be more energy effient no matter the case. I know the ans to my question . I know what you will say already.

Accord99 · Sep 9, 2010

JFAMD said:
Wow, a lot of discussion.

Here's the real question - 4 cores with HT vs. 8 cores.

Up to 4 threads probably not a big difference.

Unless the cores are Westmere and Magny Cours, in which case you're looking at a massive difference.

JFAMD · Sep 9, 2010

Nemesis 1 said:
This at a time when intel is releasing 22nm so it should be more energy effient no matter the case.

And here is Intel's 32nm to our 45nm:

Interesting, I don't see a ton of power savings from that 32nm process.

Plus, the performance is pretty much neck and neck, we both win some we both lose some.

But I am a LOT less expensive.

So, if the power is the same, the performance is the same and the price is a lot higher, what is the advantage of going to that new node?

Face the facts, process node is not an advantage other than bragging rights around the water cooler.

When it comes to power, performance and pricing, customers aren't seeing a benefit.

JFAMD · Sep 9, 2010

Oh, and if you are wondering, that is a 2P config of a 4P system, so there is a bigger power supply. If that were a real 4P it would be a lot lower, right?

Caribbean Geek · Sep 9, 2010

JFAMD said:
And here is Intel's 32nm to our 45nm:

Interesting, I don't see a ton of power savings from that 32nm process.

Plus, the performance is pretty much neck and neck, we both win some we both lose some.

But I am a LOT less expensive.

So, if the power is the same, the performance is the same and the price is a lot higher, what is the advantage of going to that new node?

Face the facts, process node is not an advantage other than bragging rights around the water cooler.

When it comes to power, performance and pricing, customers aren't seeing a benefit.

Wow, that's some interesting findings.

jvroig · Sep 10, 2010

Martimus said:
We are probably just talking past each other, as I am sure we both want to know the same thing.

I suppose that is true. It seems that it was only your misinterpretation of CMP which started our conversation (when you quoted point #4). Obviously, if you read it again with knowledge that CMP is not HT, then we are in agreement. And if you re-read why I said what I said (the post I replied to), you'll see the context, and everything you are saying is exactly what I'm saying now and have been saying over and over for a while now, on this thread and the other thread, about single-thread, dual-thread in a module, and having threads scheduled to separate modules whenever necessary (by the way, since you commented on this topic, too, it is not automatically done just because OSes are HT-aware now. AMD still has to work with Linux and MS people to make sure their schedulers know how to exploit the module design. It is certainly possible, but it is not a given, and we'll see if they succeed with it on time with BD launch)

Janooo · Sep 10, 2010

JFAMD said:
Wow, a lot of discussion.

Here's the real question - 4 cores with HT vs. 8 cores.

Up to 4 threads probably not a big difference.

Thread #5 is where everything turns heavily in favor of physical cores. Let's not get caught up in percentages and doing math. Let's all agree that there is a big difference.

I think more physical cores matter.

If you think I am wrong and believe that things will either stay the same as today or go down, then you should be looking at big cores and HT.

When both processors are out, we'll be able to see who is right. I am banking on things being more threaded in the future and workloads getting heavier.

Before going to 5 threads let me ask about 2 threads first.

Is OS (Windows, Linux, ...) going to be aware of module design or all the cores are equal?
Could it happen that 2 threads end up in one module (if cores are equal) and they will run with 10%-20% penalty?
At the end, what OS schedulers will do?
Thanks.

JFAMD · Sep 10, 2010

Read my blog, I have answered the question there as part of the 20 questions series.

Riek · Sep 10, 2010

Janooo said:
Before going to 5 threads let me ask about 2 threads first.

Is OS (Windows, Linux, ...) going to be aware of module design or all the cores are equal?
Could it happen that 2 threads end up in one module (if cores are equal) and they will run with 10%-20% penalty?
At the end, what OS schedulers will do?
Thanks.

As i understand the penalty is due to high loaded threads which have to wait to use shared info. with HT there is a performance decrease in those situations. And as far as i understand it, bulldozer does not use a HT flag for the second core. So Os should be able to post two threads to one module.

Idontcare · Sep 10, 2010

Janooo said:
Before going to 5 threads let me ask about 2 threads first.

Is OS (Windows, Linux, ...) going to be aware of module design or all the cores are equal?
Could it happen that 2 threads end up in one module (if cores are equal) and they will run with 10%-20% penalty?
At the end, what OS schedulers will do?
Thanks.

Hi Janooo, you don't even have to go so far as worrying about the module level, just look at the performance impact in Linpack on an AthlonII X4 when running one thread if the thread affinity is not locked.

http://forums.anandtech.com/showpost.php?p=29008307&postcount=100

Hyperlite's rig suffered a 14% drop in single-threaded performance (10.76Gflops -> 9.24GFlops) if he let the OS decide where the thread wandered versus locking the affinity to a specific core.

That is with CMP.

Now go CMT and add inter-Module versus intra-module thread migration issues to that and I don't think the OS thread scheduler is going to do any better.

Janooo · Sep 10, 2010

JFAMD said:
Read my blog, I have answered the question there as part of the 20 questions series.

Thanks, it's good to hear that you work with Win and Linux devs and OS should be aware of modules.

Janooo · Sep 10, 2010

Idontcare said:
Hi Janooo, you don't even have to go so far as worrying about the module level, just look at the performance impact in Linpack on an AthlonII X4 when running one thread if the thread affinity is not locked.

http://forums.anandtech.com/showpost.php?p=29008307&postcount=100

Hyperlite's rig suffered a 14% drop in single-threaded performance (10.76Gflops -> 9.24GFlops) if he let the OS decide where the thread wandered versus locking the affinity to a specific core.

That is with CMP.

Now go CMT and add inter-Module versus intra-module thread migration issues to that and I don't think the OS thread scheduler is going to do any better.

Yea, that's why I wrote 'should be aware' in the above post.

I am sure we'll see some tests to check it out.

Idontcare · Sep 10, 2010

Janooo said:
Yea, that's why I wrote 'should be aware' in the above post.
I am sure we'll see some tests to check it out.

I only wish more reviewers were interested in exploring this side of performance scaling.

Techreport does a good job taking a stab at it. They usually include a couple benchmark apps that look at thread-scaling.

But outside of techreport I haven't seen any consumer-SKU reviews that look at this. Have you?

Martimus · Sep 10, 2010

JFAMD said:
Read my blog, I have answered the question there as part of the 20 questions series.

That is a good point about the Turbo mode you made. You have an inherent disadvantage to having two threads on the same module, in that they have to share some resources, but at the same time there is an advantage in that the module is more likely to be running at a higher clockspeed in Turbo mode (since another module would be idle). It is an interesting dynamic.

I like that you are actively working with both Linux and MS to work with the BD architecture correctly. The more I think about it, the more I realize that my conception that AMD does not work well with software developers is based on things that happened more than 15 years ago. I really have no basis to think that AMD currently acts that way.

S/A: "AMD outs bulldozer based orochi die"

Diamond Member

Platinum Member

Diamond Member

Senior member

Elite Member

Banned

Elite Member

Platinum Member

Lifer

Platinum Member

Senior member

Senior member

Banned

Platinum Member

Golden Member

Senior member

Senior member

Elite Member

Golden Member

Golden Member

Elite Member

Diamond Member