AMD marketing Steamroller before Vishera launch. Thoughts?


ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
You are the one who believes that TSMC and GloFo will still be at 28nm until 2016, when Intel will be in production at 10nm; are you or are you not exaggerating? :rolleyes:

Read what I wrote again, there is a "might".

And Intel will be in production on 10nm in 2015, with products released in 2016. TSMC and GloFo only need to slip a bit before it happens, which is not uncommon for those two companies.
 

Pilum

Member
Aug 27, 2012
182
3
81
You are the one who believes that TSMC and GloFo will still be at 28nm until 2016, when Intel will be in production at 10nm; are you or are you not exaggerating? :rolleyes:
He is exaggerating, but there already is a significant supply problem from non-Intel foundries for modern processes. The dates of first production are getting irrelevant; AFAIK 28nm production is still lagging behind demand, and that's a year after first production. In contrast, Intel had a record ramp for 22nm, with a quarter of client CPU shipments being 22nm in July, three months after IVB launch.

I don't see how this will improve on future nodes. So while there may be low-volume production of 20nm in 2013, it will likely stay low-volume into 2014 and probably only really ramp in 2015. With all kinds of clients lusting after the newest nodes, the early 20nm output will go to the highest bidders. That includes all the high-end ARM and FPGA manufacturers, Nvidia, and of course AMD - which has to decide whether it wants to prioritize its CPUs or GPUs.

So it may simply be uneconomical for AMD to switch CPU production to 20nm soon after its introduction, because wafer prices will be high, especially in light of AMD's low ASPs. And they'll have to carefully calculate which CPUs they switch to 20nm, and in which order – 'Cat, BD-class APUs or server. That depends on wafer costs, estimated yields and ASPs.

So while AMD may introduce a few CPUs on 20nm in 2014/15, a complete switchover to 20nm may actually happen only after Intel introduces 10nm in 2016 – assuming Intel can keep to their process roadmap.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
What is really bugging me is that Excavator will use a denser library...
...but in doing so it loses max clocks, which is the whole point of this architecture.
 

MLSCrow

Member
Aug 31, 2012
59
0
61
What is really bugging me is that Excavator will use a denser library...
...but in doing so it loses max clocks, which is the whole point of this architecture.

Steamroller will also use a dense library, but honestly, I don't mind lower frequencies if it means much better performance. Since we're already seeing Vishera hitting 5GHz stable on H2O, I wouldn't care about Steamroller only hitting 4.6-4.8GHz if it were 30% faster.
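
Just to put rough numbers on that tradeoff, here's a back-of-the-envelope sketch (my own figures, reading the 30% as a per-clock gain, so purely illustrative):

/* Rough comparison: clock multiplied by relative per-clock performance.
   The 30% per-clock figure is an assumption taken from the post above. */
#include <stdio.h>

int main(void)
{
    double vishera     = 5.0 * 1.00;  /* 5.0 GHz at baseline per-clock perf */
    double steamroller = 4.6 * 1.30;  /* 4.6 GHz, assumed 30% faster per clock */

    printf("Vishera @ 5.0 GHz    : %.2f (relative)\n", vishera);
    printf("Steamroller @ 4.6 GHz: %.2f (relative)\n", steamroller);
    return 0;
}

Even at the lower clock, the hypothetical Steamroller comes out roughly 20% ahead.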
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
Here are some code comments from within the patch that explain a little about how the Bulldozer v3 scheduling is done:

The bdver3 contains three pipelined FP units and two integer units. Fetching and decoding logic is different from previous fam15 processors. Fetching is done every two cycles rather than every cycle and two decode units are available. The decode units therefore decode four instructions in two cycles.

Three DirectPath instructions decoders and only one VectorPath decoder is available. They can decode three DirectPath instructions or one VectorPath instruction per cycle.

The load/store queue unit is not attached to the schedulers but communicates with all the execution units separately instead.

bdver3 belong to fam15 processors. We use the same insn attribute that was used for bdver3 decoding scheme.
:D
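
To make the quoted "three DirectPath or one VectorPath per cycle" rule concrete, here is a toy C model of just that constraint (purely illustrative; this is not how the hardware or GCC actually schedules anything):

/* Toy model: per cycle the decoders handle either up to three DirectPath
   instructions or a single VectorPath instruction, never a mix. */
#include <stdio.h>

enum insn_kind { DIRECTPATH, VECTORPATH };

/* How many instructions from the front of the queue decode this cycle. */
static int decode_one_cycle(const enum insn_kind *queue, int len)
{
    if (len == 0)
        return 0;
    if (queue[0] == VECTORPATH)
        return 1;                  /* one VectorPath insn takes the whole cycle */
    int n = 0;
    while (n < len && n < 3 && queue[n] == DIRECTPATH)
        n++;                       /* up to three DirectPath insns per cycle */
    return n;
}

int main(void)
{
    enum insn_kind q[] = { DIRECTPATH, DIRECTPATH, VECTORPATH, DIRECTPATH };
    int total = sizeof q / sizeof q[0], pos = 0, cycles = 0;
    while (pos < total) {
        pos += decode_one_cycle(q + pos, total - pos);
        cycles++;
    }
    printf("decoded %d instructions in %d cycles\n", total, cycles);
    return 0;
}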

New AMD processors never drop prefetches; if they cannot be performed immediately, they are queued. We set number of simultaneous prefetches to a large constant to reflect this (it probably is not a good idea not to limit number of prefetches at all, as their execution also takes some time)." Additionally, "BDVER3 has optimized REP instruction for medium sized blocks, but for very small blocks it is better to use loop."

The way that the new compiler code determines a "bdver3" processor rather than a previous-generation Bulldozer is based upon the AMD APU/CPU having xsaveopt. The xsaveopt instruction is part of AVX (Advanced Vector Extensions) and is an optimized extended state save instruction similar to xsave. bdver3 is apparently the first time the xsaveopt instruction is being supported by AMD processors.
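
For reference, a runtime check for xsaveopt looks roughly like this (a minimal sketch using GCC's <cpuid.h>; the actual -march=native detection in the GCC driver is more involved):

/* Minimal sketch: CPUID leaf 0x0D, sub-leaf 1, reports XSAVEOPT in EAX bit 0. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Make sure leaf 0x0D exists before querying it. */
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx) == 0 || eax < 0x0D) {
        puts("CPUID leaf 0x0D not available");
        return 1;
    }

    /* Sub-leaf 1 of leaf 0x0D describes the XSAVE extensions. */
    __cpuid_count(0x0D, 1, eax, ebx, ecx, edx);
    puts((eax & 1) ? "xsaveopt supported" : "xsaveopt not supported");
    return 0;
}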

interesting about the prefetches :cool:

http://www.phoronix.com/scan.php?page=news_item&px=MTIwNDY
 

Ajay

Lifer
Jan 8, 2001
16,094
8,116
136
So if I'm reading this correctly, the theoretical peak decode rate is the same as in PD, since there is only one fetch per cycle (though the actual peak will be better due to other factors). So the real bonus will be in multi-threaded apps (read: mainly the server market), since the decode logic has been doubled for better multi-threaded performance per module. Yes/No?
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,692
136
Actually AMD states the very opposite in their presentation on the SR core. It's all about improving single-core performance; at least that is what they state. This in turn could improve MT performance as well (to a lesser extent). How it will turn out in reality is a whole other matter.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,116
136
Actually AMD states the very opposite in their presentation on the SR core. It's all about improving single-core performance; at least that is what they state. This in turn could improve MT performance as well (to a lesser extent). How it will turn out in reality is a whole other matter.

Yes, AMD does state that, but from some of what was posted from Phoronix, the compiler changes address improved MT performance. And AMD has also stated improvements in ST performance, but I'm wondering how that's going to work if they only do one fetch per cycle vs. two?! Well, I assume they mean prefetches - maybe freeing up some bandwidth for loads/stores etc.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,815
1,294
136
-The bdver1 contains four pipelined FP units, two integer units and two address generation units.
-The predecode logic is determining boundaries of instructions in the 64byte cache line. So the cache line straddling problem of K6 might be issue here as well, but it is not noted in the documentation.
-Three DirectPath instructions decoders and only one VectorPath decoder is available. They can decode three DirectPath instructions or one VectorPath instruction per cycle.
-The load/store queue unit is not attached to the schedulers but communicates with all the execution units separately instead.
+(define_cpu_unit "bdver1-decode0" "bdver1")
+(define_cpu_unit "bdver1-decode1" "bdver1")
+(define_cpu_unit "bdver1-decode2" "bdver1")
+(define_cpu_unit "bdver1-decodev" "bdver1")
+(define_cpu_unit "bdver1-ffma0" "bdver1_fp")
+(define_cpu_unit "bdver1-ffma1" "bdver1_fp")
+(define_cpu_unit "bdver1-fmal0" "bdver1_fp")
+(define_cpu_unit "bdver1-fmal1" "bdver1_fp")
+(define_reservation "bdver1-ffma" "(bdver1-ffma0 | bdver1-ffma1)")
+(define_reservation "bdver1-fcvt" "bdver1-ffma0")
+(define_reservation "bdver1-fmma" "bdver1-ffma0")
+(define_reservation "bdver1-fxbar" "bdver1-ffma1")
+(define_reservation "bdver1-fmal" "(bdver1-fmal0 | bdver1-fmal1)")
+(define_reservation "bdver1-fsto" "bdver1-fmal1")
---The bdver3 contains three pipelined FP units and two integer units and two address generation units.
---Fetching and decoding logic is different from previous fam15 processors. Fetching is done every two cycles rather than every cycle and two decode units are available. The decode units therefore decode four instructions in two cycles.
---Two DirectPath instructions decoders and only one VectorPath decoder is available. They can decode two DirectPath instructions or one VectorPath instruction per cycle.
---The load/store queue unit is not attached to the schedulers but communicates with all the execution units separately instead.
(define_cpu_unit "bdver3-decode0" "bdver3")
(define_cpu_unit "bdver3-decode1" "bdver3")
(define_cpu_unit "bdver3-decodev" "bdver3")
(define_cpu_unit "bdver3-ffma0" "bdver3_fp")
(define_cpu_unit "bdver3-ffma1" "bdver3_fp")
(define_cpu_unit "bdver3-fpsto" "bdver3_fp")
(define_reservation "bdver3-ffma" "(bdver3-ffma0 | bdver3-ffma1)")
(define_reservation "bdver3-fcvt" "bdver3-ffma0")
(define_reservation "bdver3-fmma" "bdver3-ffma0")
(define_reservation "bdver3-fxbar" "bdver3-ffma1")
(define_reservation "bdver3-fmal" "(bdver3-ffma0 | bdver3-fpsto)")
(define_reservation "bdver3-fsto" "bdver3-fpsto")
(define_reservation "bdver3-fpshuf" "bdver3-fpsto")
bdver1 6 directpath instructions in two cycles for both cores? (3 per core)
bdver3 4 directpath instructions in two cycles per core? (4 per core)

bdver1 = ffma01, fmal01
bdver3 = ffma01, fpsto
It would appear that the first ffma in bdver3 takes the place of fmal0 from bdver1. fpshuf seems to be new.

http://c-cpp.r3dcode.com/files/gcc/4/6.2/gcc/config/i386/bdver1.md
http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01079/bdver3.md

I'm going to guess that the missing address units are a typo. I believe fetching happening every two cycles might reflect the fact that it is fetching for each core in turn rather than one fetch that addresses both cores.

1 x 16B per cycle per core(end result: 2 x 16B per cycle both cores) -bdver1
1 x 32B per cycle(end result: 1 x 32B per cycle every other core) -bdver3
^-- lengthier pipeline maybe...
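
Putting those fetch figures into a quick sketch (these are my assumptions above, not confirmed AMD numbers), the per-module bandwidth works out the same either way; only the granularity per core changes:

/* Back-of-the-envelope model of the fetch assumptions above. */
#include <stdio.h>

int main(void)
{
    /* bdver1: each core gets a 16-byte fetch every cycle. */
    int bd1_per_core   = 16;
    int bd1_per_module = 2 * bd1_per_core;            /* 32 B/cycle */

    /* bdver3 (assumed): one 32-byte fetch per cycle, serving the two
       cores in alternation, i.e. 32 bytes every other cycle per core. */
    int bd3_per_module = 32;                          /* 32 B/cycle */
    double bd3_per_core = bd3_per_module / 2.0;       /* 16 B/cycle average */

    printf("bdver1: %d B/cycle per module, %d B/cycle per core\n",
           bd1_per_module, bd1_per_core);
    printf("bdver3: %d B/cycle per module, %.0f B/cycle per core (average)\n",
           bd3_per_module, bd3_per_core);
    return 0;
}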

I think the three-DirectPath-decoder line was copied and pasted, because there is a different number of decode units: 3 -> 2. I'm looking at the newest GCC to see if the bdver1 info has been updated. (The 4.8 snapshot has 82,000 items, 10 hours later... why did I put this on the HDD!!!!!!)

Someone should probably update bdver1.md if there are actually four decoders... and bdver3.md if there are actually four decoders...
 

Ajay

Lifer
Jan 8, 2001
16,094
8,116
136
bdver1 6 directpath instructions in two cycles for both cores? (3 per core)
bdver3 4 directpath instructions in two cycles per core? (4 per core)

bdver1 = ffma01, fmal01
bdver3 = ffma01, fpsto
It would appear that the first ffma in bdver3 takes the place of fmal0 from bdver1. fpshuf seems to be new.

http://c-cpp.r3dcode.com/files/gcc/4/6.2/gcc/config/i386/bdver1.md
http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01079/bdver3.md

I'm going to guess that the missing address units are a typo. I believe fetching happening every two cycles might reflect the fact that it is fetching for each core in turn rather than one fetch that addresses both cores.

1 x 16B per cycle per core(end result: 2 x 16B per cycle both cores) -bdver1
1 x 32B per cycle(end result: 1 x 32B per cycle every other core) -bdver3
^-- lengthier pipeline maybe...

I think the three-DirectPath-decoder line was copied and pasted, because there is a different number of decode units: 3 -> 2. I'm looking at the newest GCC to see if the bdver1 info has been updated. (The 4.8 snapshot has 82,000 items, 10 hours later... why did I put this on the HDD!!!!!!)

Someone should probably update bdver1.md if there are actually four decoders... and bdver3.md if there are actually four decoders...

Makes sense. My bad on assuming prefetches would interfere with loads/stores. A longer pipeline is a possibility, since AMD's slide 1 says that SR will still be a high-frequency design - that seems like it would be hard with higher xtor density (higher watts/mm^2), but pipelines don't seem to be simple anymore, so it's not so easy to just add a stage.

It is clear that AMD needs more CPUs per wafer to offset increasing wafer costs, so Excavator will be using an even higher density layout to get die size down even more (will EX be on 20nm SHP?)
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
...so Excavator will be using an even higher density layout to get die size down even more (will EX be on 20nm SHP?)

When I asked this question I was told Excavator will be 28nm bulk-Si, the same as Steamroller, which really makes me concerned about how a 28nm bulk-Si Excavator is going to do in competition with a 14nm Broadwell :( I just don't see a silver lining in that cloud for AMD.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,116
136
When I asked this question I was told Excavator will be 28nm bulk-Si, the same as Steamroller, which really makes me concerned about how a 28nm bulk-Si Excavator is going to do in competition with a 14nm Broadwell :( I just don't see a silver lining in that cloud for AMD.

Thanks IDC. Wow, that is bad news :(

I knew GF was advancing their 20nm process and already had LP test shuttles, so I guess I should have known the answer. If they had 20nm SHP shuttles, maybe they could have made it with some delay - if everything proceeded as planned.

With an even higher xtor density @ 28nm than SR, Excavator won't be able to sustain 4 GHz clocks without hitting really high temps, I would think.

I guess Keller's team will be putting something out @ 20nm at this rate. They'd better have a rabbit up their sleeve or ShintaiDK's predictions of a VIA-like flight plan are going to come true.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
When I asked this question I was told Excavator will be 28nm bulk-Si, the same as Steamroller, which really makes me concerned about how a 28nm bulk-Si Excavator is going to do in competition with a 14nm Broadwell. I just don't see a silver lining in that cloud for AMD.

I haven't seen any Excavator slides noting any node particulars, but if it is indeed 28nm bulk, AMD might as well throw in the towel. There's just no way a 28nm Excavator chip can compete with a 14nm Broadwell design; I don't care how good it is.

Let's hope it's 20nm...
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Actually AMD states the very opposite in their presentation on the SR core. It's all about improving single-core performance; at least that is what they state. This in turn could improve MT performance as well (to a lesser extent). How it will turn out in reality is a whole other matter.

Splitting the front end for the integer cores eases up on the CMT tax, though by how much is still up for debate. The L2 cache is still shared, as is the fetch, while there are now two separate decoders.

AMD did lengthen the pipeline from Thuban/Deneb to BD, but it really wasn't all that long. The bigger issue was the cache latencies -- all three of them, actually.

The shared L1 instruction cache grew in size with Steamroller, although AMD isn’t telling us by how much. Bulldozer featured a 2-way 64KB L1 instruction cache, with each “core” using one of the ways. This approach gave Bulldozer less cache per core than previous designs, so the increase here makes a lot of sense. AMD claims the larger L1 can reduce i-cache misses by up to 30%. There’s no word on any possible impact to L1 d-cache sizes.

Although AMD doesn’t like to call it a cache, Steamroller now features a decoded micro-op queue. As x86 instructions are decoded into micro-ops, the address and decoded op are both stored in this queue. Should a fetch come in for an address that appears in the queue, Steamroller’s front end will power down the decode hardware and simply service the fetch request out of the micro-op queue. This is similar in nature to Sandy Bridge’s decoded uop cache, however it is likely smaller. AMD wasn’t willing to disclose how many micro-ops could fit in the queue, other than to say that it’s big enough to get a decent hit rate.
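
As a toy software model of that micro-op queue idea (my sketch, not from the article; the size and replacement policy are made up, since AMD hasn't disclosed them):

/* Toy model: a tiny decoded micro-op queue indexed by fetch address. */
#include <stdint.h>
#include <stdbool.h>

#define UOP_QUEUE_SLOTS 32          /* assumed size */

struct uop_entry {
    uint64_t fetch_addr;            /* address the x86 bytes came from */
    uint64_t decoded_uop;           /* stand-in for the decoded micro-op */
    bool     valid;
};

static struct uop_entry uop_queue[UOP_QUEUE_SLOTS];

/* On a fetch, look in the queue first; a hit means the decoders can stay
   powered down and the stored micro-op is issued directly. */
bool uop_queue_lookup(uint64_t fetch_addr, uint64_t *uop_out)
{
    for (int i = 0; i < UOP_QUEUE_SLOTS; i++) {
        if (uop_queue[i].valid && uop_queue[i].fetch_addr == fetch_addr) {
            *uop_out = uop_queue[i].decoded_uop;
            return true;            /* hit: skip decode */
        }
    }
    return false;                   /* miss: wake the decoders */
}

/* After the decoders run, record the result for future hits. */
void uop_queue_fill(uint64_t fetch_addr, uint64_t decoded_uop)
{
    static unsigned next = 0;       /* simple FIFO replacement, assumed */
    uop_queue[next] = (struct uop_entry){ fetch_addr, decoded_uop, true };
    next = (next + 1) % UOP_QUEUE_SLOTS;
}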

The L1 to L2 interface has also been improved. Some queues have grown and logic is improved.

Finally on the caching front, Steamroller introduces a dynamically resizable L2 cache. Based on workload and hit rate in the cache, a Steamroller module can choose to resize its L2 cache (powering down the unused slices) in 1/4 intervals. AMD believes this is a huge power win for mobile client applications such as video decode (not so much for servers), where the CPU only has to wake up for short periods of time to run minor tasks that don’t have large L2 footprints. The L2 cache accounts for a large chunk of AMD’s core leakage, so shutting half or more of it down can definitely help with battery life. The resized cache is no faster (same access latency); it just consumes less power.
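
Conceptually, the resizing decision might look something like this (my sketch; the thresholds are invented for illustration, and the real policy is AMD's and undisclosed):

/* Toy policy: pick how many quarters of the L2 stay powered, based on the
   recent hit rate and how much of the cache the workload actually touches. */
#include <stdio.h>

/* Returns the number of active L2 quarters (1..4). */
static int l2_active_quarters(double hit_rate, double occupancy)
{
    if (occupancy < 0.25 && hit_rate > 0.95)
        return 1;   /* tiny working set: keep one quarter powered */
    if (occupancy < 0.50)
        return 2;
    if (occupancy < 0.75)
        return 3;
    return 4;       /* large working set: keep the whole cache on */
}

int main(void)
{
    /* e.g. a mobile video-decode phase with a small L2 footprint */
    printf("video decode: %d quarters active\n", l2_active_quarters(0.98, 0.10));
    /* e.g. a server workload that fills the cache */
    printf("server load : %d quarters active\n", l2_active_quarters(0.80, 0.90));
    return 0;
}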

Steamroller brings no significant reduction in L2/L3 cache latencies. According to AMD, they’ve isolated the reason for the unusually high L3 latency in the Bulldozer architecture, however fixing it isn’t a top priority. Given that most consumers (read: notebooks) will only see L3-less processors (e.g. Llano, Trinity), and many server workloads are less sensitive to latency, AMD’s stance makes sense.

So... we'll still get really crappy big, slow L3 cache... great... Granted, Haswell looks to do the exact same, so I guess it's good that they'll both suck in the same department.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,116
136
Well, in AtenRa's post above it looks like 20nm SHP will be available for production in 2014:
[Image: GlobalFoundries 28nm/20nm roadmap]


That does jibe with GF's announcement that they are bringing technology forward from their 14XM project into 20nm (minus the FinFETs, I believe). Apparently a lot of resources are being poured into this, which they really have to do if they want to stay in the foundry business for the next couple of nodes, and I imagine they are hoping to pull in enough business to establish themselves behind TSMC.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,815
1,294
136
20-nm SHP won't be used till 2015. 28-nm SHP was killed off and 28-nm LPH became the process to use.

Steamroller 28-nm LPH(10 track?)/2013 -> Excavator 28-nm LPH(8/9 track?)/2014 -> 5th gen 20-nm SHP/2015 -> 6th gen 14-nm(20-nm EOL) SHP/2016

20-nm/14-nm SHP -> near same Vt libraries as LPM/XM.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,116
136
20-nm SHP won't be used till 2015. 28-nm SHP was killed off and 28-nm LPH became the process to use.

Steamroller 28-nm LPH(10 track?)/2013 -> Excavator 28-nm LPH(8/9 track?)/2014 -> 5th gen 20-nm SHP/2015 -> 6th gen 14-nm(20-nm EOL) SHP/2016

20-nm/14-nm SHP -> near same Vt libraries as LPM/XM.

What?! 28nm LPH? Holy smokes, what kind of frequencies can be expected from that process? Is this with FD-SOI? I seem to recall there being some tests showing good high-frequency results with FD, but I don't remember what the frequencies were.

GF is clearly gunning for lower power nodes for smart phones & tablets (ARM, since their site is now full of ARM references after their announced strategic alliance @ 20nm w/finfet).

AMD is in a world of sh*t compared to Intel. Really, the news seems worse by the day o_O
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,815
1,294
136
What?! 28nm LPH? Holy smokes, what kind of frequencies can be expected from that process? Is this with FD-SOI? I seem to recall there being some tests showing good high-frequency results with FD, but I don't remember what the frequencies were.
LPH's benefits come from the fact that AMD can produce chips at four fabs.

Germany, Dresden "Fab 1" - GlobalFoundries
USA, NY, Malta "Fab 8" - GlobalFoundries
Giheung, Korea "Fab S1" - Samsung
USA, TX, Austin "Fab S2" - Samsung
--
28-nm LPH is not FD-SOI as SHP is the moniker for SOI.
In a statement, Jay Min, vice president of System LSI foundry marketing at Samsung Electronics, said the 28nm LPH process “will be the first semiconductor technology to truly eliminate the border between desktop computers and mobile devices.”
http://semimd.com/blog/2011/08/31/samsung-globalfoundries-cooperation-growing/

GF is clearly gunning for lower power nodes for smart phones & tablets (ARM, since their site is now full of ARM references after their announced strategic alliance @ 20nm w/finfet).
Just so you know, Intel's nodes have all been low-power nodes for smartphones & tablets.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
I don't believe they will use LPH for high-end desktop and server parts. HPP should be better suited for those high-performance parts.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,815
1,294
136
I don't believe they will use LPH for high-end desktop and server parts. HPP should be better suited for those high-performance parts.
HPP and LPH are closely related but LPH is more advanced.

28-nm LPH over HPP:
Lower leakage.
Faster SRAMs.
More choices.
 

guskline

Diamond Member
Apr 17, 2006
5,338
476
126
After yesterday's horrible day for AMD's stock value, I really wonder if "Steamroller" might have been "steamrolled" by the market?
 

Ajay

Lifer
Jan 8, 2001
16,094
8,116
136
LPH's benefits come from the fact that AMD can produce chips at four fabs.

Germany, Dresden "Fab 1" - GlobalFoundries
USA, NY, Malta "Fab 8" - GlobalFoundries
Giheung, Korea "Fab S1" - Samsung
USA, TX, Austin "Fab S2" - Samsung
--
28-nm LPH is not FD-SOI as SHP is the moniker for SOI.
http://semimd.com/blog/2011/08/31/samsung-globalfoundries-cooperation-growing/

Just so you know, Intel's nodes have all been low-power nodes for smartphones & tablets.

Yes, as to Intel, of course, but not for their desktop CPUs.
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,692
136
After yesterday's horrible day for AMD's stock value, I really wonder if "Steamroller" might have been "steamrolled" by the market?
The SR core is a finished design. Now, having said that, whether GloFo will be able to actually manufacture 28nm Kaveri products is another matter. Let's hope they will, since AMD gave us official compiler support 2 days ago.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,116
136
The SR core is a finished design. Now, having said that, whether GloFo will be able to actually manufacture 28nm Kaveri products is another matter. Let's hope they will, since AMD gave us official compiler support 2 days ago.

I hope so; I was thinking Kaveri would make a nice HTPC with acceptable gaming performance. WMC + a Ceton card would make a nice system with enough tuners to handle the occasional case when we need more than two.