computerbaseAshes of the Singularity Beta1 DirectX 12 Benchmarks

Mahigan · Feb 21, 2016

Leadbox said:
"Bigger instructions buffer for better single-threaded performance"

Yes and then you see this...

And you wonder why a new Command Processor when the previous one could handle more than enough draw calls under DX12.

Then it dawns on you that NVIDIA are using a Gigathread Engine which buffers many instructions (hence the name Giga Thread).

Leading one to believe that AMD lacked this large buffering feature and understanding that the Command Processor executes the commands then most likely it lacked a command buffer large enough to benefit from DX11 multi-threaded command listing and other such features. This leads one to think that the Command Processor was at fault.

parvadomus · Feb 21, 2016

Mahigan said:
Yes and then you see this...

And you wonder why a new Command Processor when the previous one could handle more than enough draw calls under DX12.

Then it dawns on you that NVIDIA are using a Gigathread Engine which buffers many instructions (hence the name Giga Thread).

Leading one to believe that AMD lacked this large buffering feature and understanding that the Command Processor executes the commands then most likely it lacked a command buffer large enough to benefit from DX11 multi-threaded command listing and other such features. This leads one to think that the Command Processor was at fault.

I think all that "NEWs" are mostly components revised for power efficiency. DX11 performance is a driver (software) fault only, and this can be easily backed by the huge gains AMD get from DX12.

Mahigan · Feb 21, 2016

parvadomus said:
I think all that "NEWs" are mostly components revised for power efficiency. DX11 performance is a driver (software) fault only, and this can be easily backed by the huge gains AMD get from DX12.

Power efficiency is achieved by the manufacturing process itself. 2.5x more performance per watt translates directly to Samsung and GoFlo's claim of 60% power reduction.

Samsung’s 14nm FinFET Process Offering ⁃ 14LPE and 14LPP.

14LPE (Early edition) targets the early technology leaders and time-to-market customers such as mobile application SoCs to meet the latest mobile gadgets’ aggressive schedule and improved performance/power requirements. 14LPE is the first foundry process technology manufactured in the foundry industry with the successful volume ramp-up. 14LPE offers 40% faster performance; 60% less power consumption; and, 50% smaller chip area scaling as compared to its 28LPP process.

14LPP (Performance boosted edition) is the 2nd FinFET generation which the performance is enhanced up to 10%. 14LPP is the single platform for every application designs with the improved performance for computing/Network designs and the lowered power consumption for Mobile/Consumer designs. 14LPP will be the main process technology offering in 2016 and after.

Source: http://www.samsung.com/semiconductor/foundry/process-technology/14nm/

So three things stand out,

1. 50% chip reduction means that a Fiji die would be 298mm2 instead of 596mm2 under the 28nm node.

2. Clock speeds are around 50% improved under 14LPP. So a 1,050 MHz Fiji die could be clocked to 1,575 MHz.

3. So if you take a 50% die reduction at 60% power reduction with 50% improved clocks you get 2.5x performance per watt without any architectural changes to Fiji. (50+50+60= 160% or 2.6x).

So the "new" components are performance oriented tweaks rather than power reduction tweaks.

Basically,

Baffin XT could have these specs:
3,200 SIMDs clocked at 1.35Ghz in 50 CUs
64 ROps clocked at 1.35Ghz
200 Texture Units clocked at 1.35Ghz
4 New Polaris Geometry "Processors" (GCN had Units)
4,096-bit Memory Interface on an improved controller
4GB HBM
232mm2 die

And easily outperform both a Fury-X or a GTX 980 Ti. How?

- 3,200 SIMDs at 1.35Ghz = 8.64 TFlops (theoretically the same as Fiji but with improved shader efficiency)
- 64 ROps at 1.35Ghz = 86.4 GPixels (vs 67 for Fiji)
- 200 TMUs at 1.35Ghz = 270 GTexels (same as Fiji)
- Improved Geometry Culling (Conservative Rasterization requires this a.k.a "Primitive Discard Acceleration")
- Better memory throughput from the new controller and memory compression
=
Better than Fury-X performance at less than half the power consumption.

Also note that "Geometry Units", in GCN, was replaced by "Geometry Processor". This is not by mistake. Any accelerator is, by definition, a processor. So a Primitive Discard Accelerator is a Geometry Processor rather than a Geometry Unit. Expect BIG tessellation improvements.

But Polaris won't have these features?

Yes it will.

What is instruction pre-fetch? Fetching instructions from memory and placing them into cache. Basically as stated by the AMD engineer in the video linked above...

Bigger command buffer for better single threaded performance

Since DX12 isn't single threaded then he's evidently referring to DX11 performance. Where would you place a Command Buffer? In the Command Processor of course. Hence "new".

And if you look here: http://wccftech.com/amd-radeon-r9-400-gpus/

You see that an AMD Engineer worked on a 232mm2 die, that shipping manifests are showing a BaffinXT with HBM and the price ball parks BaffinXT as being an R9 390x replacement.

This isn't Greenland, this is BaffinXT. Greenland XT could therefore be a Titan-X class ($999) GPU.

Baffin/Baffin XT are likely both based on Polaris 11.
Greenland and Greenland XT are likely both based on Vega 11.

This leaves Polaris 10 and Vega 10 open to other SKUs.

I think that AMD will be targeting Pascal as such..

Baffin vs GTX 970 successor $349
Baffin XT vs GTX 980 successor $499
Greenland vs GTX 980 Ti successor $650
Greenland XT vs Titan-X successor $999

maddie · Feb 21, 2016

Mahigan said:
Power efficiency is achieved by the manufacturing process itself. 2.5x more performance per watt translates directly to Samsung and GoFlo's claim of 60% power reduction.

Source: http://www.samsung.com/semiconductor/foundry/process-technology/14nm/

So three things stand out,

1. 50% chip reduction means that a Fiji die would be 298mm2 instead of 596mm2 under the 28nm node.

2. Clock speeds are around 50% improved under 14LPP. So a 1,050 MHz Fiji die could be clocked to 1,575 MHz.

3. So if you take a 50% die reduction at 60% power reduction with 50% improved clocks you get 2.5x performance per watt without any architectural changes to Fiji. (50+50+60= 160% or 2.6x).

So the "new" components are performance oriented tweaks rather than power reduction tweaks.

Basically,

Baffin XT could have these specs:
3,200 SIMDs clocked at 1.35Ghz in 50 CUs
64 ROps clocked at 1.35Ghz
200 Texture Units clocked at 1.35Ghz
4 New Polaris Geometry "Processors" (GCN had Units)
4,096-bit Memory Interface on an improved controller
4GB HBM
232mm2 die

And easily outperform both a Fury-X or a GTX 980 Ti. How?

- 3,200 SIMDs at 1.35Ghz = 8.64 TFlops (theoretically the same as Fiji but with improved shader efficiency)
- 64 ROps at 1.35Ghz = 86.4 GPixels (vs 67 for Fiji)
- 200 TMUs at 1.35Ghz = 270 GTexels (same as Fiji)
- Improved Geometry Culling (Conservative Rasterization requires this a.k.a "Primitive Discard Acceleration")
- Better memory throughput from the new controller and memory compression
=
Better than Fury-X performance at less than half the power consumption.

But Polaris won't have these features?

Yes it will.

What is instruction pre-fetch? Fetching instructions from memory and placing them into cache. Basically as stated by the AMD engineer in the video linked above...

Since DX12 isn't single threaded then he's evidently referring to DX11 performance. Where would you place a Command Buffer? In the Command Processor of course. Hence "new".

And if you look here: http://wccftech.com/amd-radeon-r9-400-gpus/

You see that an AMD Engineer worked on a 232mm2 die, that shipping manifests are showing a BaffinXT with HBM and the price ball parks BaffinXT as being an R9 390x replacement.

This isn't Greenland, this is BaffinXT. Greenland XT could therefore be a Titan-X class ($999) GPU.

Baffin/Baffin XT are likely both based on Polaris 11.
Greenland and Greenland XT are likely both based on Vega 11.

This leaves Polaris 10 and Vega 10 open to other SKUs.

I think that AMD will be targeting Pascal as such..

Baffin vs GTX 970 successor $349
Baffin XT vs GTX 980 successor $499
Greenland vs GTX 980 Ti successor $650
Greenland XT vs Titan-X successor $999

Fiji was made on TSMC 28nm process not on Samsung 28LPP. You have to compare the actual processes for the density increase and area reduction.

I don't think you get both the power reduction AND speed increase simultaneously. It's one or the other, or combination thereof.

Mahigan · Feb 21, 2016

maddie said:
Fiji was made on TSMC 28nm process not on Samsung 28LPP. You have to compare the actual processes for the density increase and area reduction.

I don't think you get both the power reduction AND speed increase simultaneously. It's one or the other, or combination thereof.

You do if you can pack twice as many transistors per mm2 and have a significant reduction in leakage.

We're not simply talking about a planar to planar reduction but a planar to Finfet+ reduction.

parvadomus · Feb 21, 2016

I doubt 14nm LP clocks much better than 28nm without being hit by leakage, and losing performance/watt, specially at those 1.35Ghz and taking into account that AMD likes to make very dense chips. Running 14nm LP at that clocks could very well drive efficiency too low to reach the expected 2.5X performance/watt.
My bet is a very dense chip running at much lower frequencies, 1Ghz at most, and a lot of shader cores.
The command processor update could very well simply serve to feed more cores.
I simply expect an updated Fiji on 14nm, maybe with 128Rops, at most. It will be an update similar to Tahiti -> Tonga, with changes in all that "NEWs". But we will see.

raghu78 · Feb 21, 2016

Mahigan said:
Power efficiency is achieved by the manufacturing process itself. 2.5x more performance per watt translates directly to Samsung and GoFlo's claim of 60% power reduction.

WRONG. Logic and I/O scale differently for both area reduction and power efficiency in a process node shrink. Logic scales perfectly (area and power reduction) while I/O does not. btw AMD and Nvidia GPUs are manufactured at TSMC 28nm (gate last) which is a better process than Samsung 28nm (gate first) in terms of electrical characteristics and yields.

So three things stand out,

1. 50% chip reduction means that a Fiji die would be 298mm2 instead of 596mm2 under the 28nm node.

2. Clock speeds are around 50% improved under 14LPP. So a 1,050 MHz Fiji die could be clocked to 1,575 MHz.

3. So if you take a 50% die reduction at 60% power reduction with 50% improved clocks you get 2.5x performance per watt without any architectural changes to Fiji. (50+50+60= 160% or 2.6x).

So the "new" components are performance oriented tweaks rather than power reduction tweaks.

Basically its either power reduction at same transistor performance or higher transistor performance at same power consumption. You don't get both. So if you shrink GPU at same performance you can get the power reduction. If you clock the GPU up to use the performance gain then power reduction is zero.

btw AMD even gave rough ratio of process node vs micro architectural improvements in improving perf/watt

http://www.pcper.com/reviews/Graphi...hnologies-Group-Previews-Polaris-Architecture

"How is Polaris able to achieve these types of improvements? It comes from a combination of architectural changes and process technology changes. Even RTG staff were willing to admit that the move to 14nm FinFET process tech was the majority factor for the improvement we are seeing here, something on the order of a 70/30 split. That doesn’t minimize the effort AMD’s engineers are going through to improve on GCN at all, just that we can finally expect to see improvements across the board as we finally move past the 28nm node"

Glo. · Feb 21, 2016

Shrink from 28 nm TSMC to 14 nm FinFET is more than 2x. It is like 2.2 times.

You get also 50-60% better power efficiency from shrink alone, and another few percent from Architecture of Polaris GPUs again. So we are looking at 60-70% lower power consumption with 65% smaller die sizes for similar tech GPU.

Lets look at FuryX. 598 mm2 shrinked to... around 250mm2 or smaller. Power consumption? Lets factor even that 60% from 275W - we are getting to 125W levels.

Lets hope that finally GPUs will be properly fed thanks to modified Command Scheduler.

I do not believe we would see higher core clocks than 1050 MHz. AMD will want to get maximum from Efficiency, as one of the key selling points.

Mahigan · Feb 21, 2016

raghu78 said:
WRONG. Logic and I/O scale differently for both area reduction and power efficiency in a process node shrink. Logic scales perfectly (area and power reduction) while I/O does not. btw AMD and Nvidia GPUs are manufactured at TSMC 28nm (gate last) which is a better process than Samsung 28nm (gate first) in terms of electrical characteristics and yields.

Basically its either power reduction at same transistor performance or higher transistor performance at same power consumption. You don't get both. So if you shrink GPU at same performance you can get the power reduction. If you clock the GPU up to use the performance gain then power reduction is zero.

btw AMD even gave rough ratio of process node vs micro architectural improvements in improving perf/watt

http://www.pcper.com/reviews/Graphi...hnologies-Group-Previews-Polaris-Architecture

"How is Polaris able to achieve these types of improvements? It comes from a combination of architectural changes and process technology changes. Even RTG staff were willing to admit that the move to 14nm FinFET process tech was the majority factor for the improvement we are seeing here, something on the order of a 70/30 split. That doesnt minimize the effort AMDs engineers are going through to improve on GCN at all, just that we can finally expect to see improvements across the board as we finally move past the 28nm node"

Well those were the first statements I made over at overclock.net before a user named Serandur rebuked me. If I could access overclock.net I'd share what he posted.

My first estimations were of more SIMDs across 6 to 8 shader engines.

I was told that this wouldn't work and that most likely AMD went for less SIMDs at higher clocks.

I had also estimated that at around 0.5x of the speedup seen in 2.5x per watt was due to architectural improvements.

He told me that this was wrong and provided sources. Let me see if I can access it (hard to access from Morocco).

maddie · Feb 21, 2016

This will illustrate the power saving/frequency options. The numbers you quoted are the max gain in frequency IF power used is the same and the reverse. You can't get a 60% power saving at the same time as a 50% frequency increase.

See for yourself. Do you think AMD is lying?

raghu78 · Feb 21, 2016

Mahigan said:
Well those were the first statements I made over at overclock.net before a user named Serandur rebuked me. If I could access overclock.net I'd share what he posted.

My first estimations were of more SIMDs across 6 to 8 shader engines.

I was told that this wouldn't work and that most likely AMD went for less SIMDs at higher clocks.

I had also estimated that at around 0.5x of the speedup seen in 2.5x per watt was due to architectural improvements.

He told me that this was wrong and provided sources. Let me see if I can access it (hard to access from Morocco).

Why would you bother with some forum user when AMD itself has stated the design goal of Polaris was maximizing power efficiency through a combination of process node and microarchitectural improvements.

btw I would say you are roughly correct on 0.5x coming from architectural improvements. If we take the improvement in perf/watt from 1x to 2.5x and take 30% of that its (2.5x - 1x) * 0.3 = 0.45x .

JDG1980 · Feb 21, 2016

parvadomus said:
I doubt 14nm LP clocks much better than 28nm without being hit by leakage

Both Apple and Samsung increased clock rates on their smartphone SoCs when moving to FinFET processes. Because battery life is an issue, we can assume that this wasn't simply due to sacrificing efficiency for higher clocks. FinFET does seem to make a difference.

parvadomus said:
and losing performance/watt, specially at those 1.35Ghz and taking into account that AMD likes to make very dense chips. Running 14nm LP at that clocks could very well drive efficiency too low to reach the expected 2.5X performance/watt.

Well, there's nothing stopping them from clocking the mobile chips at the most efficient rate and the desktop chips at whatever the highest rate is that they can fit into the TDP budget. For instance, even if the 40W laptop Polaris 10 part only sustains 800-900 MHz, the desktop part with a 75W cap might well be able to hit 1300-1400 MHz, considerably higher than what we get from 28nm GCN parts.

parvadomus said:
I simply expect an updated Fiji on 14nm, maybe with 128Rops, at most. It will be an update similar to Tahiti -> Tonga, with changes in all that "NEWs". But we will see.

Tonga was not a very successful chip, either technically (mediocre perf/watt and perf/mm^2) or in the marketplace. I don't expect we will see a repeat of something like that. Truth be told, I have no idea why Tonga was ever released in the first place; my latest theory is that it was originally designed for 20nm and then underwent a "reverse die shrink" when it became clear that was unviable, explaining why it appears to be such a poor use of space for a late-28nm design.

I expect the chips to be genuinely new designs. Of course some IP blocks will be reused, but I don't think they will be just blindly copying layouts over without a lot of thought on how they work on 14nm FinFET. They won't be a direct copy of any 28nm design, certainly not the unsuccessful Fiji.

Mahigan · Feb 21, 2016

raghu78 said:
Why would you bother with some forum user when AMD itself has stated the design goal of Polaris was maximizing power efficiency through a combination of process node and microarchitectural improvements.

btw I would say you are roughly correct on 0.5x coming from architectural improvements. If we take the improvement in perf/watt from 1x to 2.5x and take 30% of that its (2.5x - 1x) * 0.3 = 0.45x .

Because while I'm well versed in many areas of GPU architectures, I am not well versed on process technologies.

So I will argue several aspects of a GPU but things which I am relatively ignorant about, I'll be inclined to take the advice of others. Of course in time I'll learn that as well.

Dygaza · Feb 22, 2016

Anyone else having performance problems with their Fury cards in getting performance out of normal and medium batches? I feel like I'm ramming against 60 fps cap all the time, even vsync is off.

Look how 290 performs here:

http://gamers-review.com/direct-x-12-vs-direct-x11-in-ashes-of-singularity-beta

92,7 in normal, 74,9 medium, 47,3 heavy

My numbers with far more powerful card (and more powerful cpu)

60,4 normal, 53,3 medium, 52 heavy

I hope upcoming patch fixes this issue for me.

Edit, nm ignore this. That review site did the oldschool error again. They're running the test in "CPU test" mode. Can't believe how blind I can be sometimes...

Headfoot · Feb 22, 2016

Clockspeed will likely be as low as the comptetive situation allows. FuryX got extra clocks at the 11th hour in order to try and close the gap on 980 Ti. This pushed GCN well out of its most efficient clock range (~800's-900's), since Pitcairn is still to this day more perf/w competitive than every other GCN revision besides Nano (which again got to come back into the 800s-900s range).

This will depend on how good AMD's information on pascal is -- which in turn depends on how good nVidia's info control is (provided Pascal is not out yet) and how far Pascal is at the time of release.

AMD will likely clock down as far as they can afford to so their perf/w marketing metric can go up while maintaining a competitive position, and also allowing an upclocked refresh / aftermarket factory OC counterplay opportunity. They got mileage out of the 7970 -> 7970 Ghz edition refresh, and again on 290->390, so I anticipate them trying to set up for that play again if the situation allows.

JDG1980 · Feb 23, 2016

Headfoot said:
Clockspeed will likely be as low as the comptetive situation allows. FuryX got extra clocks at the 11th hour in order to try and close the gap on 980 Ti. This pushed GCN well out of its most efficient clock range (~800's-900's), since Pitcairn is still to this day more perf/w competitive than every other GCN revision besides Nano (which again got to come back into the 800s-900s range).

Indeed it is true that 28nm GCN is most efficient between 800 and 900 MHz, but we don't know if the same will be true of the new generation, with a heavily revamped architecture and a 14nm FinFET process. The sweet spot of optimal performance per watt could easily be higher.

Headfoot said:
AMD will likely clock down as far as they can afford to so their perf/w marketing metric can go up while maintaining a competitive position, and also allowing an upclocked refresh / aftermarket factory OC counterplay opportunity. They got mileage out of the 7970 -> 7970 Ghz edition refresh, and again on 290->390, so I anticipate them trying to set up for that play again if the situation allows.

That's not exactly how the first generation of GCN played out. For Cape Verde and Pitcairn, the cut-down SKUs (7750 and 7850) were released at efficiency-focused clock rates (800-860 MHz), but the top bins (7770 and 7870) were clocked at 1000 MHz. Tahiti was a bit different, but I think that may have been because yields of this large chip on the then-new 28nm node were too low at first to reliably hit 1000 MHz and meet demand.

I wouldn't be surprised to see much the same thing happen this time, with the non-"X" cards getting clocks focused on maximum perf/watt and the "X" cards getting higher clocks to focus on raw performance.

flopper · Feb 23, 2016

JDG1980 said:
Indeed it is true that 28nm GCN is most efficient between 800 and 900 MHz, but we don't know if the same will be true of the new generation, with a heavily revamped architecture and a 14nm FinFET process. The sweet spot of optimal performance per watt could easily be higher.

1200mhz isnt out of it for Polaris.

Headfoot · Feb 23, 2016

Regardless of where the sweet spot is, physics dictates lower clock speeds = less power consumption. The design goal can dictate where on that exponential curve it really starts to take off (e.g. shifting curve in a grid of power vs perf), but you're still on the curve regardless.

I stand by my prediction. Im not saying it will be 800-900 again, rather it will be as low as the situation allows which is admittedly fuzzy.

Dygaza · Feb 23, 2016

Zlatan mentioned about new patch that should be deployed this week. Anyone has any idea about current eta? Can't wait to see how their new multigpu code is working.

Headfoot · Feb 23, 2016

Tapoer said:
Remember one of the changes between Nvidia GTX 480 and 580, they used leaky transistors on the critical paths, this can increased the clock speed at the same voltage, or lower voltage for the same speed, overall a nice boost to performance/Watt.
Leaky transistor use more power, but are also faster, it totally justifies to use them where they are most useful.
Intel also said about Nehalem (AFAIR), that they wanted a low leakage process node, but still with some leakage.

AMD also uses a more compact design and not the most focused on high performance (clock speed).
Maxwell might also have a higher pipeline stages than GCN.

AMD also have on Fiji the tech on Carrizo that they use an independent circuit with a simulated critical path that test the stability of the circuit on the given voltage for a specific clock speed, this allow them to have a tight control on what voltage is required for a specific clock frequency, if Tahiti had this they would not ship with a conservative clock speed, or a high voltage.
Of course this will kill any overclock at the stock voltage...

Nvidia will most probably maintain the advantage on clock speed in the 14nm/16nm node.

Sorry for the offtopic.

Good point. I should clarify that when I say shipping clocks, I mean measured clockspeed(s) and not whatever clockspeed they print on the box these days. With all the dynamic boosting and underclocking techs out there these days the default clock speed isn't very meaningful

Mahigan · Feb 23, 2016

Dygaza said:
Mahigan!!! Why you replaced your good informative post about AMD/nvidia arch with letter to Santa?

Too many big words in it. I'm working to simplify the information. The post will be longer but far easier to understand.

The gist of it is that AMD will be increasing the size of the command buffer in Polaris. So more commands can be fetched and queued at a time.

This will help alleviate GPU stalls caused by the CPU being too busy, with other work, to effectively feed commands to the GPU.

Two parts to the fix, software side and hardware side. Software side the Command buffer will be increased in the AMD driver, hardware side the Command Processor will either gain more queues or a larger queue.

This will boost Polaris' performance, relative to previous GCN architectures, under single threaded scenarios (ex. DX11).

In easy speak, more draw calls under DX11 than previous GCN architectures.

Glo. · Feb 23, 2016

Mahigan, not only that. We all know that there will be improved color compression and cache. All of this will have huge impact on performance not only DX11 but also DX12.

The numbers will be interesting.

Mahigan · Feb 23, 2016

Glo. said:
Mahigan, not only that. We all know that there will be improved color compression and cache. All of this will have huge impact on performance not only DX11 but also DX12.

The numbers will be interesting.

Most likely a Larger L2 cache and the ability to perform instruction prefetching. So fetching instructions from memory and onto the R/W L2 cache. This should significantly reduce latencies over the use of the framebuffer for the same operation.

Improved color compression we saw a bit of on Tonga and Fiji. It does help boost Pixel fill rate by compressing framebuffer pixel data thus saving on memory bandwidth. Polaris apparently pushes this further.

guskline · Feb 23, 2016

Will the next Ashes update support muti-gpus?

Spjut · Feb 24, 2016

Swedish site Nordichardware just posted a new performance test for Ashes of Singularity Beta 2

http://www.nordichardware.se/Grafik...12-och-multi-gpu/Prestandatester.html#content

It also includes tests for multi-GPU with Nvidia + AMD combos

computerbaseAshes of the Singularity Beta1 DirectX 12 Benchmarks

Senior member

Senior member

Senior member

Diamond Member

Senior member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Golden Member

Senior member

Member

Diamond Member

Golden Member

Senior member

Diamond Member

Member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member