
Steamroller core


Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
Steamroller will have AVX2 and an extension of XOP.

Post-post edit: Kaveri is coming out Q3 2013 with the 1H of 2013 getting a new stepping of Trinity. The new stepping only includes higher clocks for the CPU and GPU, and no architecture tweaks or bug fixes.
That kind of conflicts with the definition of a stepping, no?

As much as I wish anything from AMD was competitive (because competition drives innovation), I have very little faith in AMD pulling another "Phenom 2"-like release. Bulldozer was a big disappointment, Vishera chips haven't even been released yet, and now AMD is touting Steamroller? As other posters have said, Vishera might end up being another disappointment and with Steamroller being one to two years away from release (in realistic time frames), all I can say is....bring on Haswell. Heck, bring on Ivy Bridge-E.
I think you misunderstand the entire point of Hot Chips. Hot Chips, where this Steamroller information was released, is where tech companies showcase current and future architectures.

AMD's talking about Steamroller before Vishera is out? So what? Piledriver's already out in the form of Trinity — you're being quite brat-like by pretending that APUs don't matter. Steamroller will be coming in Kaveri first... it doesn't matter even remotely in this context that Vishera isn't out.

If the information was ready to go public, why would you care what AMD's current product lineup looks like? Unless you really think it's best to keep it under wraps as to feed your bizarre desires until AMD's Financial Analyst Day...

Also, what was so special about Phenom II? AMD was still in second place by a rather considerable margin. Phenom II was an alternative if you didn't have the money to get a dramatically faster i5 or i7... still a pretty shameful position, and not far worse than where they are right now. It's not hard to improve on garbage, so an AMD return to relevance is definitely possible and very likely as the gains from process shrinks begin to disappear. Steamroller will likely still be behind Haswell by a significant margin, but still put AMD in a vastly improved position over where they are now.
 
Dec 30, 2004
12,553
2
76
What isn't accurately captured or portrayed in those articles is that the state of the art in synthesis (automated layout) is not static. Rather, it is advancing at a blistering rate, driven solely by the increasingly difficult requirements that the foundries place on fabless design houses as process complexity grows at every node (making hand-coded cells all the more arduous node after node).

The commentary regarding bulldozer's reliance on synthesis came from an engineer who left the company 2yrs before bulldozer came out, that is 3yrs ago. 3yrs is an eon in this industry. I'm sure his comments and experience were relevant to the state of synthesis in 2009 with 45nm, not so relevant to the state of synthesis in 2013 with 28nm.

Think of it like this...consider the game of Chess. In 1960 you would not want to bet on a computer competing against world-class chess players, the computer would stink. Same in 1970, and 1980. Computers were slow and not as good as humans.

But what happened in 1997 between Deep Blue and Kasparov? Computer won.

This is what has happened in pretty much every industry that involves engineering. Slowly but surely the software and hardware has evolved to the point where computers can run through millions of simulated models to find more optimal cases than humans could ever hope to achieve - be it with bridges, autos, skyscrapers, or integrated circuits.

It is not that the computers are smarter, it's just that they are faster. So they can run through so many more test designs while filtering out the dead-ends faster than a team of humans ever could.

So the limits are not that of the CPU designers but now the limits are on the people who program the synthesis tools themselves. Very much like the limitations in programming that come at the hands of the people who create the compiler tools.

It was only a matter of time before computers would become better than humans at designing CPUs. And it is a matter of budget as well. Looks like AMD is saying when you factor in the budget considerations, computers have reached that point now.

The rules in the game of chess are static and easily definable. The chess computers lost because they didn't have the horsepower to recursively evaluate every potential move until the end of the game to determine the best one.
The problem with the CPU design is at the software level, not the computing hardware level IMO. So, just saying that I am less inclined to believe time will solve this problem, if by 3 years ago they hadn't already solved it.

Hm, come to think of it, maybe, as you say, with processors getting more complex these days, more R&D is being invested in the CAD routines that do the routing.
In addition, with billions of transistors, maybe running the optimization routines in "fully recursive mode" back then did take too much CPU power. They could have just rented server space though. I can't imagine it would have been that difficult to scale up.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
Steamroller will have AVX2 and an extension of XOP.

Post-post edit: Kaveri is coming out Q3 2013 with the 1H of 2013 getting a new stepping of Trinity. The new stepping only includes higher clocks for the CPU and GPU, and no architecture tweaks or bug fixes.

i remember something about XOP2....(not about avx-2)

....do you have a link?
 

exar333

Diamond Member
Feb 7, 2004
8,518
8
91
VR-Zone make an interesting point in comparing Steamroller not to the desktop Intel platform, but the server. Given Intel's tendency of massively delaying the server/enthusiast versions of their new architectures (see: SB-E), AMD's "time lag" doesn't look quite so bad. Don't forget, Bulldozer is just an Opteron in disguise. Steamroller won't be going up against Haswell, it will be going up against IB-E- or maybe even SB-E, depending on just how bad Intel's execution is this time. (IB-E is currently scheduled for Q3 2013. http://www.tomshardware.co.uk/Ivy_Bridge-E-LGA_2011-X79-cpu-mobo,news-39375.html ) While to us Steamroller is looking like it will still lag well behind Haswell, it might compete very nicely against IB-E.

This is a pretty laughable comment. The money riding on ensuring server offerings are validated, reliable, and as tweaked as possible is mind-boggling. It is critical for their business to put as much effort as possible into releasing a perfect product. Look at Intel SSDs, same situation. They even use the same controller now, but usually come to market later with better NAND and more reliable firmware. For businesses, this is paramount. Especially when your current offerings are kicking the competition in the teeth, you better not have a misstep and have to recall or patch your new products. That loss of confidence is usually calculated in huge losses.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
i remember something about XOP2....(not about avx-2)

....do you have a link?
It's not XOP2. It's more in line with XOP1.1.

XOP is AMD specific for Bulldozer. It basically makes SSE4.1/SSE4.2/AVX/AVX1.1 operations run better on Bulldozer. XOP1.1 includes AVX2 operations in the XOP instruction set.

AVX2:

  • Expansion of most integer AVX instructions to 256 bits
  • 3-operand general-purpose bit manipulation and multiply
  • Gather support, enabling vector elements to be loaded from non-contiguous memory locations
  • DWORD- and QWORD-granularity any-to-any permutes
  • Vector shifts
  • 3-operand fused multiply-accumulate support
XOP is specific to 128-bit so it has no converted 256-bit forms.

(XOP basically mimics SSE4.1/SSE4.2/AVX/AVX1.1 and XOP 1.1 includes AVX2 into that list).
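For anyone unsure what the "gather" item in the AVX2 list above means, here is a rough scalar model in Python. It is only a sketch of the semantics (load elements from arbitrary, non-contiguous positions into one vector); the function name `gather` and the sample data are made up for illustration and are not the actual instruction or any intrinsic.

```python
def gather(base, indices):
    """Scalar model of a vector gather: pick base[i] for each index i.

    A real gather does this for all lanes of a vector register in one
    instruction; here each lane is just one list element.
    """
    return [base[i] for i in indices]


# Hypothetical memory contents and lane indices, chosen for illustration.
memory = [10, 20, 30, 40, 50, 60, 70, 80]
lanes = gather(memory, [7, 0, 3, 3])  # indices need not be contiguous or unique
print(lanes)
```

Without gather, code has to do those loads one element at a time and then pack them into a vector.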
That kind of conflicts with the definition of a stepping, no?
Wikipedia said:
Typically, when an integrated circuit manufacturer such as Intel or AMD invests money to do a stepping (i.e. a revision to the masks), they have found bugs in the logic, have made improvements to the design that allow for faster processing, or have found a way to increase yield or improve the "bin splits" (i.e. create faster transistors and hence faster CPUs). One result of some new steppings is that the CPU design is improved such that it overclocks better than others
There are no bugs to fix with Trinity.
Wikipedia said:
have made improvements to the design that allow for faster processing, or have found a way to increase yield or improve the "bin splits" (i.e. create faster transistors and hence faster CPUs). One result of some new steppings is that the CPU design is improved such that it overclocks better than others
So, you only have faster bins. Trinity 2.0's only improvement is a slight increase in MHz.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
There are no bugs to fix with Trinity.

I've got $100 that says Trinity ships with at least 50 known bugs, and an equal amount of unknown bugs.

Or are you saying there's no bugs in Bulldozer?
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
I've got $100 that says Trinity ships with at least 50 known bugs, and an equal amount of unknown bugs.

Or are you saying there's no bugs in Bulldozer?

Just in the sense that there are bugs in every single CPU that's ever been made. It's just a question of whether the bugs are worth fixing or not. I doubt Trinity has any bugs that are worth fixing. I also doubt any current Intel CPUs have any bugs that are worth fixing. What I mean by this is that even though there are bugs in all CPUs, it's extremely unlikely that you or I (or anyone else in the world) will run code that exposes the bugs in the silicon.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
I've got $100 that says Trinity ships with at least 50 known bugs, and an equal amount of unknown bugs.

Or are you saying there's no bugs in Bulldozer?
http://support.amd.com/us/Processor_TechDocs/48931_15h_Mod_10h-1Fh_Rev_Guide.pdf
Errata list for AMD FM2 Trinity, platform.

To get back to Steamroller:
If Bulldozer & Piledriver have any performance detriment from bugs, it is minimal in comparison to the cost of the architectural design. On average, being able to retire only one macro-op per cycle per core is probably the worst architectural point of all. This occurs in most code, but no worries: AMD designed around that with the high clock.

If Steamroller does have dedicated decode for Thread A and Thread B, we will probably see a twenty to twenty-five percent drop in clocks. If Bulldozer & Piledriver retire one macro-op per core in most code, then it is safe to bet Steamroller will be able to retire at least two to three macro-ops per core.

Steamroller for the MPU platform will be using the Viperfish die, which was going to be used in Komodo, Sepang, and Terramar. (Also, it appears AMD likes to add two cores at a time once they went past four cores.) One core -> Two cores -> Four cores -> Six cores -> Eight cores -> Ten cores.

Equation and precedent:
2.6 GHz x-nm one core -> 2.1 GHz y-nm two core => 500 MHz loss
3.1 GHz x-nm two core -> 2.6 GHz x-nm four core => 500 MHz loss
2.6 GHz x-nm four core -> 3.4 GHz y-nm four core => 800 MHz gain
3.4 GHz x-nm four core -> 3.2 GHz x-nm six core => 200 MHz loss
3.2 GHz x-nm six core -> 3.9 GHz y-nm eight core => 700 MHz gain
3.9 GHz x-nm eight core -> 4.2 GHz x-nm eight core => 300 MHz gain

1200 MHz loss, 1800 MHz gain. Ignoring the gains, adding two cores only had a cost of 200-500 MHz. K8/K10 were sound architectures; they do not have the same problem that Bulldozer -> Steamroller will have.
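To make the totals above easy to check, here is a small Python sketch that tallies the six precedent steps. The clock figures are copied from the list; the variable names and the labels are only for illustration.

```python
# Each step: (clock before in MHz, clock after in MHz, description from the post).
steps = [
    (2600, 2100, "one core -> two core (x-nm -> y-nm)"),
    (3100, 2600, "two core -> four core (same node)"),
    (2600, 3400, "four core -> four core (x-nm -> y-nm)"),
    (3400, 3200, "four core -> six core (same node)"),
    (3200, 3900, "six core -> eight core (x-nm -> y-nm)"),
    (3900, 4200, "eight core -> eight core (same node)"),
]

# Negative delta = clock loss, positive delta = clock gain.
deltas = [after - before for before, after, _ in steps]

total_loss = -sum(d for d in deltas if d < 0)
total_gain = sum(d for d in deltas if d > 0)

for (before, after, label), d in zip(steps, deltas):
    print(f"{label}: {d:+d} MHz")
print(f"total loss: {total_loss} MHz, total gain: {total_gain} MHz")
```

The three losses (500, 500, 200) are the steps that added cores without a clock increase, which is where the 200-500 MHz cost per step comes from.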

At worst:
(4.2 GHz x-nm eight core - 500 MHz) * (7.5/10) = X GHz y-nm ten core.

(4,200 MHz - 500 MHz) * (7.5/10) = 3,700 MHz * 0.75 = 2,775 MHz, which rounds to 2.8 GHz.
Now we add fabrication process benefits. I've been told that, performance-wise, 28-nm bulk only provides a ten percent boost over 32-nm SOI.

2.8 GHz * (11/10) = 3.08 GHz, which rounds to 3.1 GHz.
That gives a hefty loss of 1100 MHz.
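The worst-case estimate above can be written out as a short Python sketch. All inputs are the post's own guesses (500 MHz for two extra cores, a 25% drop for dedicated decode, a 10% process boost); the function and parameter names are made up for illustration, and rounding to the nearest 100 MHz mirrors the "+ rounding" steps.

```python
def worst_case_clock_mhz(base_mhz=4200, core_penalty_mhz=500,
                         decode_factor=0.75, process_gain=1.10):
    """Reproduce the post's worst-case clock estimate, in MHz."""
    after_cores = base_mhz - core_penalty_mhz       # 4200 - 500 = 3700
    after_decode = after_cores * decode_factor      # 3700 * 0.75 = 2775
    rounded = round(after_decode, -2)               # rounded to the nearest 100 MHz
    with_process = rounded * process_gain           # apply the assumed 10% boost
    return round(with_process, -2)                  # rounded again, as in the post


estimate = worst_case_clock_mhz()
print(f"estimated clock: {estimate / 1000:.1f} GHz")
print(f"loss vs 4.2 GHz: {4200 - estimate:.0f} MHz")
```

Changing `decode_factor` to 0.80 gives the other end of the "twenty to twenty-five percent" range.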

TDP:
If, 4.2 GHz * (1.37v)² * 8 cores + uncore = 125 to 130W TDP
Then, 3.1 GHz * (1.1v)² * 10 cores + uncore = 95 to 100W TDP
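As a rough check on the TDP lines above, here is a Python sketch of just the core term, frequency * voltage² * core count. The uncore contribution is not given in the post, so it is left out here; the ratio below applies only to the core part and is not a TDP prediction. Function and variable names are made up for illustration.

```python
def core_power_term(freq_ghz, volts, cores):
    """Relative dynamic power of the cores: f * V^2 * core count (arbitrary units)."""
    return freq_ghz * volts ** 2 * cores


old = core_power_term(4.2, 1.37, 8)    # eight-core part at 4.2 GHz, 1.37 V
new = core_power_term(3.1, 1.10, 10)   # hypothetical ten-core part at 3.1 GHz, 1.1 V

ratio = new / old
print(f"core term ratio (new / old): {ratio:.2f}")
```

Most of the drop comes from the voltage, since it enters squared.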

--off-topic--
TSMC 28-nm, gate length ~33-nm
Intel 22-nm, gate length ~26-nm
GlobalFoundries 28-nm, gate length ~25-nm (Edit: Those that have read about AMD from 2003 through 2009 should know why GloFo 28-nm has 25-nm gate length)
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
How exactly would AMD get a rectangular die out of 5 modules? There'd be too much wasted space, even when accounting for the uncore.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Wait wait wait... I thought THIS was universally known/admitted to be the problem with Bulldozer - the fact that they moved away from hand-tuned/drawn logic and moved to automated crap which ended up costing them in terms of power, performance AND delays. :confused:

(A quick google = first link found)
Not universally... ;-)

C. Maier left AMD during the conceptual phase, give or take a few months. AMD mentioned hand optimization in parts w/ critical timing in their ISSCC papers.


Sent via mobile.
 

nehalem256

Lifer
Apr 13, 2012
15,669
8
0
How exactly would AMD get a rectangular die out of 5 modules? There'd be too much wasted space, even when accounting for the uncore.

Use a more Intel-esque dieplan

M-M-M-M-M
L2-L2-L2-L2-L2
L3-L3-L3-L3-L3
Uncore

Problems solved.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
How exactly would AMD get a rectangular die out of 5 modules? There'd be too much wasted space, even when accounting for the uncore.
Art design time: let's guess what Viperfish would look like...

Not drawn for accuracy or scale: (Here is my entry)
amdviperfish.png

vs Orochi
878c28c38e196048a322573e9070333a.jpg
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
I suppose I could see that making sense... that is if the scaling from 32nm -> 28nm is good enough. The sad thing is, that sketch you just drew is more elegant than AMD's past few core designs, IMO. Never understood AMD's L3 cache placements.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
I suppose I could see that making sense... that is if the scaling from 32nm -> 28nm is good enough. The sad thing is, that sketch you just drew is more elegant than AMD's past few core designs, IMO. Never understood AMD's L3 cache placements.
Ignore the L3; it is just the XBR. I had just woken up when I drew that, so I forgot that the Steamroller architecture has discarded the L3 cache to better align the APU & MPU designs.

I remember someone telling me that AMD's L2 in Steamroller has gotten the same treatment as Jaguar. The L2 for Steamroller is unified, so with the L2 cache the cores will see a 20MB inst & data cache. Also, the new L2 can be two times as big because the L3 cache is gone. Steamroller is also where the CPU dies start looking like GPU dies.

Steamroller MCM 40MB Unified L2 vs Bulldozer MCM 8 * 2 MB L2 & 16 MB L3. Will be interesting.

Viperfish with Piledriver might have had L3 cache but Viperfish with Steamroller doesn't.

slide12.png

I corrected the slide; it had GB instead of MB.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
TSMC 28-nm, gate length ~33-nm
Intel 22-nm, gate length ~26-nm
GlobalFoundries 28-nm, gate length ~25-nm (Edit: Those that have read about AMD from 2003 through 2009 should know why GloFo 28-nm has 25-nm gate length)

Any links for those plz ?? I only know that GloFo's 32nm Lgate is 30nm (From AMD's briefings)
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
What isn't accurately captured or portrayed in those articles is that the state of the art in synthesis (automated layout) is not static. Rather, it is advancing at a blistering rate, driven solely by the increasingly difficult requirements that the foundries place on fabless design houses as process complexity grows at every node (making hand-coded cells all the more arduous node after node).

The commentary regarding bulldozer's reliance on synthesis came from an engineer who left the company 2yrs before bulldozer came out, that is 3yrs ago. 3yrs is an eon in this industry. I'm sure his comments and experience were relevant to the state of synthesis in 2009 with 45nm, not so relevant to the state of synthesis in 2013 with 28nm.

Think of it like this...consider the game of Chess. In 1960 you would not want to bet on a computer competing against world-class chess players, the computer would stink. Same in 1970, and 1980. Computers were slow and not as good as humans.

But what happened in 1997 between Deep Blue and Kasparov? Computer won.

This is what has happened in pretty much every industry that involves engineering. Slowly but surely the software and hardware has evolved to the point where computers can run through millions of simulated models to find more optimal cases than humans could ever hope to achieve - be it with bridges, autos, skyscrapers, or integrated circuits.

It is not that the computers are smarter, it's just that they are faster. So they can run through so many more test designs while filtering out the dead-ends faster than a team of humans ever could.

So the limits are not that of the CPU designers but now the limits are on the people who program the synthesis tools themselves. Very much like the limitations in programming that come at the hands of the people who create the compiler tools.

It was only a matter of time before computers would become better than humans at designing CPUs. And it is a matter of budget as well. Looks like AMD is saying when you factor in the budget considerations, computers have reached that point now.
The rules in the game of chess are static and easily definable. The chess computers lost because they didn't have the horsepower to recursively evaluate every potential move until the end of the game to determine the best one.
The problem with the CPU design is at the software level, not the computing hardware level IMO. So, just saying that I am less inclined to believe time will solve this problem, if by 3 years ago they hadn't already solved it.

Hm, come to think of it, maybe, as you say, with processors getting more complex these days, more R&D is being invested in the CAD routines that do the routing.
In addition, with billions of transistors, maybe running the optimization routines in "fully recursive mode" back then did take too much CPU power. They could have just rented server space though. I can't imagine it would have been that difficult to scale up.

Hardware helps. You can get quite a bit of improvement in QOR ("quality of results", physical-design-engineer speak for achievable clock frequency / die area / power consumption) when you allow optimization tools to run longer, or you can sometimes enable new algorithms that used to take unreasonably long to run. At the very worst, assuming tools give you no improvement, faster hardware lets you iterate the design more quickly, giving you more opportunities to improve the design you're feeding to the tool (e.g. identifying unexpected critical paths so the architects can tweak the pipelining a little).

In practice, the tools also improve. Unfortunately I can't find a public source for this, but basically every new tool release comes with a spiel from the vendor about how it gives x% better timing in y% less area while requiring z% less RAM or cpu time than the previous version for a bunch of example designs. These percents add up... if a human was 10% better five years ago and the tools improved 2% every year, well, they're in trouble today (102%^5 ~= 110%). There's definitely a lot of money being spent on this; 3 of the large EDA vendors have a combined market cap over $10 billion (compare to ~$2.6B for all of AMD).
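The compounding in that example (102%^5 ~= 110%) is easy to play with in Python. This is only a sketch with the illustrative numbers from the paragraph above (a 10% human advantage, 2% tool improvement per year); the function names are made up.

```python
def compound(rate, years):
    """Cumulative improvement factor after `years` of `rate` gain per year."""
    return (1 + rate) ** years


def years_to_close(gap, rate):
    """Smallest whole number of years until compounded gains reach `gap`."""
    if rate <= 0:
        raise ValueError("rate must be positive or the gap never closes")
    years = 0
    while compound(rate, years) < 1 + gap:
        years += 1
    return years


print(f"2% per year for 5 years: x{compound(0.02, 5):.3f}")
print(f"years to erase a 10% lead at 2%/year: {years_to_close(0.10, 0.02)}")
```

A slightly higher yearly rate shortens that noticeably, e.g. try `years_to_close(0.10, 0.03)`.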
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
Any links for those plz ?? I only know that GloFo's 32nm Lgate is 30nm (From AMD's briefings)
http://www.chipworks.com/en/technical-competitive-analysis/resources/technology-blog/2011/07/more-hkmg-hits-the-market-%E2%80%93-gate-first-and-gate-last/ <-- TSMC
http://www.electronicsweekly.com/bl...g/2012/08/st-to-run-28nm-fd-soi-novathor.html <-- Intel

The 25-nm Lgate is from the dead 22-nm SHP, which was FinFETs w/o SOI, if you check back in 2009. If 32-nm SHP has a 30-nm Lgate, it wouldn't be unreasonable that GloFo is doing the same thing with 28-nm SHP. My findings pretty much point to 28-nm SHP being 25-nm Lgate, double-gate planar, SOI-less/FinFET-less, but with properties of both. It has a buried oxide similar to SOI but doesn't have the floating body, and it isn't similar to FD-SOI as it isn't SOI. It also doesn't have FinFETs.

{
http://www.google.com/patents/US8217450 <-- patent maybe? (citations)
http://img36.imageshack.us/img36/8184/amdfoundries.png <-- this maybe?
}<--something like that but it is completely Planar and doesn't have SOI.
 