Full Skylake reveal result? Waiting for Zen.


cytg111

Lifer
Mar 17, 2008
23,216
12,857
136
For single threaded IPC, 40% over Excavator is about 0.5% over Haswell.

It should beat an i5 easily (AMD usually doesn't do much differentiation based on enabled features, so I expect all of their CPUs will have SMT enabled - except, maybe, Athlons and Semprons when/if those arrive with Zen cores).

The real chore for Zen is to match or beat the Haswell i7s in single threaded, and maybe catch up with Skylake in multithreaded loads (AMD's usually scale better than Intel's, and if Zen's data fabric and cache layout are as claimed, this should be no exception).

That comes down to its SMT, though, or willingness to throw more cores into the fray. I'd be very surprised if AMD's SMT was equal to Intel's SMT, but they have probably learned a great deal from the CMT design and may well have kept many elements of Bulldozer's front-end to leverage for SMT. With the instructions being fed and retired single-file they would lose most of the module-induced pipeline stages, so I'd expect a 14~17 cycle pipeline on Zen, much like Sandy Bridge.

What?
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
BTW, what is the base of the 40% increase? The average, lowest, or highest XV IPC across many or a few benchmarks? Or SPEC CPU as usual?

Doing two 256b AVX ops per cycle vs. 128b doesn't increase IPC (technically).

Oh, I wish we knew!

I've been working entirely on IPC being approximately equal to nominal instructions retired per cycle, which is the easiest way for a CPU engineer to envision performance.

Hopefully the design doesn't require SMT to be at play to realize that 40% boost. If so, single threaded performance could see a mere 10% or so boost... but, we'd get that just from dropping the module pipeline overhead... and we'd get more still from going with wider cores... So it would only make sense that AMD was aiming at a true 40% per core, single threaded, per cycle improvement.

On the FPU front, with AMD, doubling the FPU, itself, allows more operations to be executed at once.

It's really all about the addressable execution units, rather than an "FPU." Bulldozer's FPU is said to be 128x2, but is, in reality, 4x64 FMAC + 4x64 FADD...

Each 64-bit unit can perform one operation on two 32-bit floating-point values, or operate on one 64-bit float. But it really all comes down to what the FPU scheduler can support, as well as the capabilities of the FPU's result storage mechanism. That said, these have long been able to handle their execution units pretty optimally.

So, Bulldozer can, in theory, do the following with the same complexity:

8x 32-bit FLOPs
4x 64-bit FLOPs
2x 128-bit FLOPs
1x 256-bit FLOPs

Due to the expense in ganging the two halves of the 256-bit FPU, which might be split between two threads in a CMT design, I would suspect there would be another cycle lost to lock the other half for execution. The rest should have the same cost.

Of course, not all FLOPs are created equal; FADD vs. FMAC units are a whole other discussion.

Zen, if it is 512-bit, will likely not have the ganging overhead, as I'd suspect the SMT overhead would be handled with a flag in the pipeline. It should be capable of the following:

16x 32-bit FLOPs
8x 64-bit FLOPs
4x 128-bit FLOPs
2x 256-bit FLOPs
1x 512-bit FLOPs

This would put it on par with Intel, capability wise. Bulldozer's FPU is on par with Sandy Bridge's, believe it or not, but they have so much module overhead (more than I talk about here) that things can get quite ugly, quickly.
 

Abwx

Lifer
Apr 2, 2011
10,953
3,474
136
Hopefully the design doesn't require SMT to be at play to realize that 40% boost. If so, single threaded performance could see a mere 10% or so boost... but, we'd get that just from dropping the module pipeline overhead... and we'd get more still from going with wider cores... So it would only make sense that AMD was aiming at a true 40% per core, single threaded, per cycle improvement.


That sounds logical. EXV IPC is necessarily measured on one thread; a 1C/1T vs. 1C/2T comparison would be nonsense.

On the FPU front, with AMD, doubling the FPU, itself, allows more operations to be executed at once.

Think about it differently: one EXV core manages to use the FPU at only ~60%, while the two cores together max it out at close to 100%. There's no need to double the FPU to extract more FP throughput out of a single core.

It's really all about the addressable execution units, rather than an "FPU." Bulldozer's FPU is said to be 128x2, but is, in reality, 4x64 FMAC + 4x64 FADD...

What must be doubled is the ALU + AGLU count. FP ops are managed by the ALUs + AGLUs, since these control the load/store unit and check the completion of the processed operations; the FPU is just the unit that executes the mathematical computation.

As for the FPU width, it's four 64-bit units, but they can be used simultaneously. Indeed, no 128-bit unit exists in any CPU; these are exclusively 64-bit units. When two such units can each execute one operation per cycle from a single instruction, it is called 128-bit, which is somewhat inaccurate, technically speaking.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Think about it differently: one EXV core manages to use the FPU at only ~60%, while the two cores together max it out at close to 100%. There's no need to double the FPU to extract more FP throughput out of a single core.

Sure, this is some of the module overhead of which I speak. AMD could see a decent floating-point boost without doing anything other than divorcing the unit from the module. But I doubt they will stop at that; it wouldn't get them anywhere in the server world unless they could push power usage well below Intel's.

What must be doubled is the ALU + AGLU count. FP ops are managed by the ALUs + AGLUs, since these control the load/store unit and check the completion of the processed operations; the FPU is just the unit that executes the mathematical computation.

I don't think FPU performance has been bound by AL[G]U performance with AMD since either the K8 or K10. The FPU has its own scheduler directly from the dispatcher, after decoding. The FPU has been an AMD strong point for some time, only the module design hinders its true performance from being revealed.

As for the FPU width, it's four 64-bit units, but they can be used simultaneously. Indeed, no 128-bit unit exists in any CPU; these are exclusively 64-bit units. When two such units can each execute one operation per cycle from a single instruction, it is called 128-bit, which is somewhat inaccurate, technically speaking.

Yeah, it's not all about the operand and execution unit sizes. K10, IIRC, could perform a max of two FLOPs/cycle since it couldn't pull in any more information at once. Bulldozer has a higher limit, though I can't remember what it was, but I don't think it could fully flood the FPU without SIMD.
 

Abwx

Lifer
Apr 2, 2011
10,953
3,474
136
Sure, this is some of the module overhead of which I speak. AMD could see a decent floating-point boost without doing anything other than divorcing the unit from the module. But I doubt they will stop at that; it wouldn't get them anywhere in the server world unless they could push power usage well below Intel's.

GF's 14nm should almost halve the power numbers for an equal design; there's no doubt that the FPU will be beefed up.


I don't think FPU performance has been bound by AL[G]U performance with AMD since either the K8 or K10. The FPU has its own scheduler directly from the dispatcher, after decoding. The FPU has been an AMD strong point for some time, only the module design hinders its true performance from being revealed.


Yeah, it's not all about the operand and execution unit sizes. K10, IIRC, could perform a max of two FLOPs/cycle since it couldn't pull in any more information at once. Bulldozer has a higher limit, though I can't remember what it was, but I don't think it could fully flood the FPU without SIMD.

The scheduler does only a small part of the work, as an x86 instruction implies several uops. In the diagram below we can see that each core has its own L/S unit; as said, the operands and operations used by the FPU are managed by the thread's parent core, and completion of the math operations is checked by the ALU + AGU.

[Image: AMD Bulldozer block diagram (CPU core block)]
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
The scheduler does only a small part of the work, as an x86 instruction implies several uops. In the diagram below we can see that each core has its own L/S unit; as said, the operands and operations used by the FPU are managed by the thread's parent core, and completion of the math operations is checked by the ALU + AGU.

Well, an x86 instruction can be the uop itself, or a macro-op which gets divided up by the scheduler. But that is beside the point.

The ALU/AGU have effectively no impact on FPU performance, beyond thread/application control logic, of course. If you have an unrolled loop, or a SIMD instruction, the only part the integer block in the module plays for the FPU is in the LSU (load/store unit). This is simply because the FPU can't run its own thread and is just an integrated co-processor - it relies on the thread-control logic built into the integer cluster to reorder (or even discard) results from the FPU.

The instructions don't travel through the ALU or AGU at all, but those are responsible for the logic paths that may (and usually do) need the results from the FPU. I am not sure at what point a pipelined instruction chain is discarded in the event of a failed branch prediction, but I imagine it is the job of the AGU/ALU to create and compare the results used to determine whether the predicted path was valid. In this way, AGU and ALU performance can affect the FPU's utilization, but not its core performance capabilities.

PS: hope that made sense, I'm exhausted :thumbsdown:

Oh, if only sleep wasn't required to live :hmm::sneaky::wub:
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
There are NDAs to cover possible leaks. But as I know a long-time GF'er (working there since the first Fab 30 production), I know that at that stage there isn't much to know about the microarchitecture - it's more about power and frequency. So it likely happens at many design levels. And ASMedia just needs to know the interfaces, nothing more.

14nm production should be covered by AMD as they integrate the ASMedia stuff. The latter might get the design rules and tooling, but don't need to create their own chips. As you've seen in recent years, there are final product designs already with the second stepping. Mask set prices drove pre-production testing efforts.

Well, there will be two different designs: one 14nm FF for the integrated FCH, and a separate GPP chipset design to be installed on motherboards. This creates a problem for ASMedia: they will have to design two different ICs for two different manufacturing processes, because I don't believe the GPP chipset will use the very expensive 14nm FF. That makes the project more than twice as expensive as a single 28nm design.
And since they won't be able to leverage a single design for both the FCH and the GPP chipset, and given the low volume of both (due to the low volume of AMD products), the project might not be financially viable - or they would have to ask a very high price, making the deal unsuitable for AMD's needs.
You don't want to pay more for something you could have designed yourself before the deal.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Just a warning: Whoever put my speculative uarch schematic with added colors on Wikipedia, didn't correct my early wrong assumption about the FP units' functionalities. It's been there for years already. ;)

I thought it was odd that it didn't show any core logic divisions, the "FP Ld Buffer", the MMX units, the BTBs, the uCode ROM, the [D]TLBs... the list goes on, of course ;-)

But if you put in too much, it just becomes o_O and :confused:.

To make it right, the FPU would have to be shown with its symmetry in place, with its two internal pipelines. Though I've taken it as being generally accurate for the number of execution units in the FPU; if this is not the case, it would be nice to know its internal design more accurately.
 

Abwx

Lifer
Apr 2, 2011
10,953
3,474
136
The ALU/AGU have effectively no impact on FPU performance, beyond thread/application control logic, of course. If you have an unrolled loop, or a SIMD instruction, the only part the integer block in the module plays for the FPU is in the LSU (load/store unit). This is simply because the FPU can't run its own thread and is just an integrated co-processor - it relies on the thread-control logic built into the integer cluster to reorder (or even discard) results from the FPU.

The instructions don't travel through the ALU or AGU at all, but those are responsible for the logic paths that may (and usually do) need the results from the FPU. I am not sure at what point a pipelined instruction chain is discarded in the event of a failed branch prediction, but I imagine it is the job of the AGU/ALU to create and compare the results used to determine whether the predicted path was valid. In this way, AGU and ALU performance can affect the FPU's utilization, but not its core performance capabilities.

But it's the thread-control logic, hence the ALU + AGU, that limits the FP single-threaded perf of a core...

In Cinebench 11.5 a Steamroller core scores about 1 at 3.7GHz and a module does 1.83; obviously it's not the FPU that limits a single core's FP throughput. If the core had enough management resources, the score in ST and for a module would be the same.

So they can use the existing FPU almost unchanged. To extract more throughput out of a single core they would have to increase management resources by ~50% to squeeze out the ~35% of FP throughput that goes unused in ST; but as you rightly pointed out, that would leave only 10% better throughput with SMT, so a fourth ALU + AGU is needed to push the theoretical number up to 30-35%...

For integer code that's another matter, as a 3-ALU design would be enough to reach parity with or even beat the competition, but as in the previous case it would limit the eventual improvements brought by SMT.



Just a warning: Whoever put my speculative uarch schematic with added colors on Wikipedia, didn't correct my early wrong assumption about the FP units' functionalities. It's been there for years already. ;)

It may not be physically accurate, but it's right about the logical behaviour. The usual schematics do not highlight the fact that the FPU is under the control of the ALUs, and this has led to extreme confusion, as people are generally unaware that the ALUs are the core of a core.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I thought it was odd that it didn't show any core logic divisions, the "FP Ld Buffer", the MMX units, the BTBs, the uCode ROM, the [D]TLBs... the list goes on, of course ;-)

But if you put in too much, it just becomes o_O and :confused:.

To make it right, the FPU would have to be shown with its symmetry in place, with its two internal pipelines. Though I've taken it as being generally accurate for the number of execution units in the FPU; if this is not the case, it would be nice to know its internal design more accurately.

It just contained information gathered from patent filings and papers, without knowing the exact relevance. Some units weren't mentioned, others were ideas based on papers, and some were simply left out due to the abstraction level.

The MPR or RWT articles or the optimization manual should have better pictures. There are two 128b FMA pipelines and one FMISC since SR.

Back at home I'll look for more.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
But it's the thread-control logic, hence the ALU + AGU, that limits the FP single-threaded perf of a core...

In Cinebench 11.5 a Steamroller core scores about 1 at 3.7GHz and a module does 1.83; obviously it's not the FPU that limits a single core's FP throughput. If the core had enough management resources, the score in ST and for a module would be the same.

That's actually related to FPU scheduler limitations due to the module design - nothing to do with ALU/AGU performance. The cores, even on Bulldozer, are fast enough to have most control logic executed well before the FPU scheduler even gets its instructions. Cinebench, for example, is by no means logic limited. If it were, Bulldozer would score much better than it does, and Intel CPUs would show much smaller improvements in their scores. It is FPU limited, so Bulldozer does poorly.

You are right though, that same FPU, modified for a single core's use, could easily deliver drastically higher performance. As it stands, a single thread can't effectively use the entire FPU, which greatly reduces its ability to perform optimally on a single thread.

AMD could leave the AGUs and ALUs completely alone and see the full benefit of the FPU. You have to think of the FPU as a separate processor, since that is how AMD treats it. Intel does it differently, distributing its different types of execution units across the ports of a unified scheduler. They align their execution units based on application profiling, so you can't access a certain ALU or instruction type while another instruction - possibly a floating-point one - occupies the same port. That doesn't happen in the AMD world: you have free run of integer and floating-point instructions independently.


It may not be physically accurate, but it's right about the logical behaviour. The usual schematics do not highlight the fact that the FPU is under the control of the ALUs, and this has led to extreme confusion, as people are generally unaware that the ALUs are the core of a core.

The FPU is not under the control of the ALU, exactly. Programs are compiled into instructions; the only role the ALU plays for the FPU is executing the arithmetic logic in the thread that lets the pipeline know which instructions are needed. For example:

Code:
L3DObject *object = new L3DObject(gWorld_3d);   // all ALU/AGU
if (object != NULL) {                           // tested by ALU
    double x, y, z, yaw, camber, pitch;

    if (object->GetPosition(&x, &y, &z, &yaw, &camber, &pitch)) {
        // above was just a call, several copies, and a logic test

        // aside from 'call' instructions, and some allocations, everything
        // below is handled by the FPU without any assistance from the ALU/AGUs

        std::pair<double, double> rollDelta = object->RotateZ(
            fmod(x, y) + fmax(x, y) * fdim(x + y, z) / fabs(camber) * acosh(pitch));

        object->MoveTo2D(rollDelta.first + x, rollDelta.second + y);
        object->Transform3D(rollDelta);
    }
}

When the above is profiled, ~90% of its compiled instructions should be floating point instructions that do not require any ALU or AGU help. Due to specific optimizations, it is actually possible that, beyond library logic for memory allocations, the ALU can sit idle while the FPU is fully utilized.

PS: I just made all that code up, it's just an example :whistle:
 

BurnItDwn

Lifer
Oct 10, 1999
26,074
1,554
126
Well, my Phenom II with 4GB of DDR2 is being replaced with a 4690K and 16GB of DDR3. When Zen comes out, if it's what it's been hyped to be, I may replace my other machine (i5-2400).