
First Steamroller processor core exposure

also L1i

The only thing that has moved from shared to dedicated is decode. If AMD really wanted to, they could have expanded decode and still kept it shared. By beefing up the execution resources they actually show the value of CMT. Go look at Piledriver's die shot, then look at this: we are talking about the doubling of lots of resources, but it's nowhere near double the die size.
 
AMD may see an increase in cost per chip by increasing die size, but they will at least get far better sales than they currently do, and may even have improved pricing power.

It will be interesting to see how SR compares with Haswell.
 
rkf1sCW.jpg


Maybe this diagram is off, or my interpretation is, but it looks like fetch and L1I are separate here.

It makes business sense to go directly for the most single-thread performance: the 8 "core" approach didn't really make a big splash in the server market, and with less worry about die size (GF WSA), rolling back CMT is an R&D-light way of increasing ST performance.

In another thread I did a straightforward (perfect scaling) estimate of +15% IPC and +10% clocks on an FX 6300, and that would be just a bit behind a 3570K in ST but ~20% faster in MT. It stands to reason that a 2-module version would get pretty close to a stock 3570K in both ST and MT (with slightly higher clocks than the 3-module part). http://forums.anandtech.com/showthread.php?t=2321195
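For anyone who wants to redo the arithmetic, here is a minimal sketch of the perfect-scaling estimate. It only compounds the two gains quoted above; the baseline is normalized to 1.0 and the function name is made up for illustration. No benchmark scores are assumed, so it says nothing by itself about where a 3570K lands.

```python
def scaled_performance(baseline: float, ipc_gain: float, clock_gain: float) -> float:
    """Scale a baseline score by IPC and clock gains.

    Assumes perfect scaling: the two gains are independent and multiply.
    Real workloads usually scale worse than this.
    """
    return baseline * (1.0 + ipc_gain) * (1.0 + clock_gain)


# Hypothetical normalization: the current chip (FX 6300 in the post) is 1.0.
combined = scaled_performance(1.0, ipc_gain=0.15, clock_gain=0.10)
print(f"Combined speedup over baseline: {combined:.3f}x")
```

Under these assumptions the two gains compound to roughly 26.5% over the baseline, slightly more than simply adding 15% and 10%.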
 
Quite a few of the structures that are individual in fetch-0 and fetch-1 have two similar structures in the Piledriver die shots, where they sit next to each other. The problem is that the Piledriver/Bulldozer die shots are much lower resolution than this one, so it's quite hard to compare.
 
Compare the original Steamroller as presented at Hot Chips with this "new one". They don't look the same.

Steam_Excavator.jpg
 
There are no Steamroller AM3+ chips on any roadmaps. It seems the socket is dead in favour of FM2+. Single socket servers will use FM2+ as well.

Wow. What an F U to AMD's current customers. That just seals the deal as far as my not going back to AMD for CPUs.
 
Yes, no idea if this is actually Steamroller but it does seem to have regressed more in CMT choices than the Hot Chips one.
 
Slowly but surely, AMD is undoing all of the mistakes they made with Bulldozer, and admitting that CMT has too much of a single threaded performance penalty for it to be worth it.
 
But it doesn't have one. The only people who say that are people who can't separate what CMT is from what Bulldozer is. CMT's bottlenecks only occurred with multithreaded workloads, and even that was caused by design choices, not CMT itself.

Please name one restriction that CMT imposes on single-thread performance.
 
Sorry, I worded that badly. Single threaded performance suffered because they stripped out some of its integer prowess and increased the pipeline length. Multi threaded performance suffered because of this resource sharing idea, which now even AMD admits was a mistake.

It looks like, to redeem FailDozer, AMD is moving away from CMT. Less and less is being shared.
 
And again, AMD is not "moving away" from CMT, they're moving to a different implementation. To have a properly inane comparison, you don't hear that Intel is moving away from x86 decode when they power down the frontend on micro op cache hits.
 
So they are sharing less between cores?

Calling it a different implementation, to me, is an attempt to not own up to the fact that it sucked, badly. Yes, it's a different implementation, and the implementation is closer to full cores than CMT, compared to FailDozer.

Read the Anandtech article on Steamroller's changes.
 
and the implementation is closer to full cores than CMT, compared to FailDozer.

No, not if they can execute 4 threads within a single module.

This implementation could be like a single module, 2 cores (CMT) with 4 threads (SMT), or
a single module, 2 cores, 4 threads (CMT) like BD/PD. I'm leaning towards this one.

Either way, they continue evolving the CMT design.
 
Please post a link supporting this idea that a single module will execute 4 threads?
 
The die pic shows 4 ALUs + 4 AGUs per integer core. I don't believe that they will use 8 pipes per thread. The utilization of all 8 pipes by a single thread would be very low, not to mention that the ratio of performance gain to die area would be even lower. This implementation is surely a 4-thread design.
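The pipe count behind that argument can be written out as a quick sketch. The 4 ALU + 4 AGU figure is this post's reading of the die shot, not a confirmed spec, and the function name is hypothetical.

```python
# Poster's reading of the die shot (an assumption, not a confirmed spec).
ALUS_PER_INT_CORE = 4
AGUS_PER_INT_CORE = 4


def pipes_per_thread(threads_per_core: int) -> float:
    """Integer pipes available to each thread if a core's pipes are split evenly."""
    pipes = ALUS_PER_INT_CORE + AGUS_PER_INT_CORE
    return pipes / threads_per_core


# One thread per integer core (2 threads per module) vs.
# two threads per integer core (4 threads per module).
for threads in (1, 2):
    print(f"{threads} thread(s) per core -> {pipes_per_thread(threads):.0f} pipes per thread")
```

An even split is a simplification; with SMT the threads would compete for pipes dynamically rather than own a fixed share.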
 
That would be contrary to everything AMD has said about Steamroller. I'll believe it when Anandtech writes a detailed article on it.
 
Wow. What an F U to AMD's current customers. That just seals the deal as far as my not going back to AMD for CPUs.
I suspected this, hence my purchase of an 8350. However, in fairness, Intel does this too. It was 1366, then 1156, then 1155, and within a week 1150. Progress. And implementation of faster overall systems.

I would prefer AMD to use a new chipset with all the bells and whistles and release a Steamroller that is powerful enough that even Elvis would sing "I'm a Steamroller Baby, and I'm going to roll over you!":awe:
 
Well, you cannot deny there are 4 ALUs and 4 AGUs in the die shot. And to expect that all of these resources (+ the FP ones) would be effectively utilized without some form of SMT/CMT is very optimistic. 4 threads per module sounds about right for this configuration of execution resources.
 
That would be contrary to everything AMD has said about Steamroller. I'll believe it when Anandtech writes a detailed article on it.

Anandtech talked about the SR core. We don't know if this is SR or XV or anything else. Look at Jaguar: one Compute Unit (CU) has 4 cores/4 threads now.
Bulldozer brought a module with 2 INT clusters processing two threads. Evolve the idea of high throughput/thread parallelization and you'll get one module able to process four threads 😉 CMT/SMT all the way.
 