Design changes in Zen 2 (CPU/core/chiplet only)

DrMrLordX · Jan 23, 2019

amd6502 said:
I think SMT4 would be great for mobile.

I'm not so sure about that. You'd have to be very confident that you could load up at least 3-4 threads at all times to make that one wide SMT4-capable core "worth it". Power gating has its limitations. In the phone/tablet sector, big.LITTLE and DynamiQ have essentially taken over, which is an open repudiation of Intel-style one-core-fits-all and SMT/CMT. Expect to see that technology hitting the laptop sector soon. With SMT4, AMD would be going in the opposite direction, using one core to try and handle multiple thread loads and power states instead of having a collection of cores with simplistic boost tables and power states handling loads based on the demands of the thread(s).

Intel is sort of following suit with Lakefield: one (presumably) SMT-capable Sunny Cove (I think?) and four SMT-incapable Tremonts.

It would make more sense for AMD to have maybe one or two SMT-capable cores running at full power to handle intense tasks and 2-4 lower-power cores with SMT disabled to handle lighter loads. It would take more die space (big.LITTLE and DynamiQ always do) but it would better fit the current direction of mobile computing.

amd6502 · Jan 23, 2019

aSMT and aCMT have all the advantages of big.little (plus more) except for possible area and transistor savings, which for lower thread count (8 or less) won't be a huge disadvantage. (You should ask Nosta what CMT is capable of ; )

DrMrLordX said:
You'd have to be very confident that you could load up at least 3-4 threads at all times to make that one wide SMT4-capable core "worth it".

On the contrary, SMT4 would actually be excellent in sparse thread. With one more ALU and one more AGU I don't know if the single threat IPC uplift would be vastly noticable over Zen1/Zen2's (4+2). On full load, four threads, the 5 ALU and 3 AGUs would struggle, but it sure would have an advantage over 2c/2t Stoney XV that is context switching all the time between the four threads.

If you look at Zen2 (from what we know so far), it has massive number of transistors. Mega FPU and massive caches.

At some point XV/dozer won't be able address the budget segment. So you need a new efficient core without a gigantic number of transistors.

So back to the four threads, on a 1c/4t APU, if we cut out the L3 or at least decrease it, we can assume that for a typical four thread load, one of the threads is going to be very likely stalled by either:

waiting for memory after a cache miss, or
waiting in the FPU queue or waiting for the FPU result

And so for the three remaining active threads, the 5ALU and 3AGU are probably decently matched. In comparison, three XV cores (1.5 modules) would have 6ALU and 6AGU would outperform it, but only slightly.

So you could have a single core APU that would just outclass a 2c/2t Stoney with its 4ALU+4AGU, all while having the same number of Pipes (just more complex pipes due to much more scheduling logic and the quadrupled up registers). It would trade blows with 4c/4t BR winning at sparse loads, and at least roughly match BR all the way up to 3 threads (clock for clock).

NostaSeronx · Jan 23, 2019

Olikan said:
This patent was granted in the same day of new horizon event... and it "surprisingly fits" most front end changes in Zen2

Bulldozer class patent, repurposed for use in Zen1. A lot of things in 14LPP 17h are repurposed from 20HPP/14XM 15h designs.

Zen2 call to fame is the 256-bit FPU datapath, 256-bit FPU PRFs, 32B-width Load/Store. Which was in fact slotted for the 14XM 80CPP/8.25T post-XV design.

It makes sense why the two can re-utilize steps;
Zen is the big core we never got to see in the Module core approach.

Module core approach re-uses;
Front-end for the core(s) and FPU for the core(s).

Which is shown in the priority date of module approach patents:
1998 => Two execution cores (40A-40B are the same)
or
1999 => One integer execution core(40A) and one FPU execution core(40B).

Then you have in later patents which says to have the two smaller twin cores use the big core FPU. Rather, than have a split FPU design between the big core and small core.

If we are looking at Zen's CMT based on the 1998 version;
Each core would have 2 ALUs, 1 Load AGU, 1 Store AGU, 1 FMUL+1FADD, 1 Store Data(for FPU ops)

Later version;
Each core would have 2 ALUs, 1 Load AGU, 1 Store AGU, and share the full Zen FPU.

All modern patents that don't reference the Zen core([4 ALU, 2 AGU], 4 FPU) design are for tweaks/fixes/optimizations of Bulldozer.

somethingclever · Jan 23, 2019

There's no way SMT4 would work well on x86. Decoding x86 is already ridiculously expensive and a decoder/uop cache fast enough for 4 threads per core would be too expensive.

From my perspective it just sounds like Bulldozer all over again.

The reason other architectures were able to implement SMT4 or higher is because they have fixed length instructions which make prefetch and decode extremely easy and relatively cheap.

amd6502 · Jan 23, 2019

somethingclever said:
There's no way SMT4 would work well on x86. Decoding x86 is already ridiculously expensive and a decoder/uop cache fast enough for 4 threads per core would be too expensive.

From my perspective it just sounds like Bulldozer all over again.

The reason other architectures were able to implement SMT4 or higher is because they have fixed length instructions which make prefetch and decode extremely easy and relatively cheap.

I don't see the decode being a big issue. It's so cheap that from PD to SR they just doubled up on decoders even though Piledriver went from being better balanced and narrow in front to being oddly overwidened in the front. XV became very efficient, so they may have tricks to keep idle decoders from over consuming.

NostaSeronx said:
Then you have in later patents which says to have the two smaller twin cores use the big core FPU. Rather, than have a split FPU design between the big core and small core.

So you think they will accompany a Zen2 (or similar successor) core with something like two little (int) cores? That seems like a pretty decent thought actually. Slow cores (16h) or fast cores (15h)?

somethingclever · Jan 23, 2019

amd6502 said:
I don't see the decode being a big issue. It's so cheap that from PD to SR they just doubled up on decoders even though Piledriver went from being better balanced and narrow in front to being oddly overwidened in the front.

Variable length x86 instructions really do make prefetch and decode very expensive because of how difficult it is to tell where the boundaries are between one instruction and the next. That is why both AMD and Intel have uop caches to reduce the load on the decoder, and decode already has the potential to stall the pipeline if there's a cache miss there!

Please correct me if I'm misunderstanding that the point of SMT is to utilize one physical pipeline to execute other threads when a thread has become stalled.

Ofc that means widening cores so there's enough execution units to execute instructions from all the logical threads. But, that also means beefing up the frontend so it can fetch and decode deep enough to keep the (unstalled) threads moving, and execution units fed.

At some point it's cheaper to just add another independent pipeline with it's own decoder, which intel has always defined as another core, and why there's the core count controversy with bulldozer.

NostaSeronx · Jan 23, 2019

somethingclever said:
Variable length x86 instructions really do make prefetch and decode very expensive because of how difficult it is to tell where the boundaries are between one instruction and the next. That is why both AMD and Intel have uop caches to reduce the load on the decoder, and decode already has the potential to stall the pipeline if there's a cache miss there!

AMD does predecode at the L1i and L2. So, the instructions received to the decoder is aligned.

amd6502 said:
So you think they will accompany a Zen2 (or similar successor) core with something like two little (int) cores? That seems like a pretty decent thought actually. Slow cores (16h) or fast cores (15h)?

If its a Zen successor it will be two large int clusters.
Rather than going 6-wide ALUs, 4-wide AGUs. AMD can go 2x the current cluster; 2x4 ALUs and 2x2 AGUs.

However, they will not completely go CMT, instead slide into CSMT. This would allow a single thread to execute on both clusters. The new PRF design in Zen would allow 0-cycle bypass between two integer clusters. Both clusters would share the retire queue. Each cluster would have two (Store-Load Allocation Queues/L0 LSUs) SLAQs between the AGUs and LSU. SLAQ would emulate the two AGUs in the standard design in routing. Thus, allowing no need for two LSUs, etc.

2 AGUs -> SLAQ0
2 AGUs -> SLAQ1

SLAQ0 => AGU0-like buffer and SLAQ1 => AGU1-like buffer. Thus, the LSU can remain the same or better yet... the LSU can do general purpose gather/scatter from the SLAQs. Aka, 32B Load/Store to the SLAQ, then 8B Load/Store to ALU/AGU, or however it is done.

If we are talking about a companion core, built for the budget market. Then, two small Int cores and be fast cores. Faster than 15h, at minimum. Going from speed demon to at least speed devil. However, premium and budget for AMD will never be on the same node. Premium will be bleeding edge nodes; 14LPP, N7, N5/3GAAE... while, Budget would be for low-cost nodes; 22FDX, 12FDX, etc.

somethingclever · Jan 24, 2019

NostaSeronx said:
AMD does predecode at the L1i and L2. So, the instructions received to the decoder is aligned.

After some more reading, you are correct. However that doesn't save the decoders work of decoding the instruction into uops.

Edit: I've conflated prefetching with decoding. Both prefetch/fetch and decode need to know the instruction length and boundaries. Decode on top of that needs to determine what instruction it is and issue the uops, which is made more expensive with variable length instructions.

TheELF · Jan 24, 2019

somethingclever said:
Please correct me if I'm misunderstanding that the point of SMT is to utilize one physical pipeline to execute other threads when a thread has become stalled.

No what you talk about is called Switch on event multithreading and is a software solution that does not need any special hardware, home micros in the 80ies did this on CPUs that where running in Mhz.

somethingclever said:
Ofc that means widening cores so there's enough execution units to execute instructions from all the logical threads. But, that also means beefing up the frontend so it can fetch and decode deep enough to keep the (unstalled) threads moving, and execution units fed.

Why would it mean wider cores if you would still only execute the instructions of one single thread?!
You need wider cores for SMT because you do run instructions from two threads at the same time.
(Plus possibly ooo and ilp instructions)

somethingclever · Jan 24, 2019

TheELF said:
No what you talk about is called Switch on event multithreading and is a software solution that does not need any special hardware, home micros in the 80ies did this on CPUs that where running in Mhz.

My mistake; SMT executes instructions from all threads all the time; not just when one thread stalls.

TheELF said:
You need wider cores for SMT because you do run instructions from two threads at the same time.
(Plus possibly ooo and ilp instructions)

That's what I said. And it doesn't change my point that more smt requires a better frontend that would be more expensive.

Just to clarify though, does x86 have instructions to explicitly execute out of order or in parallel? I was under the impression that oooe and ilp were hidden from the programmer.

NTMBK · Jan 24, 2019

amd6502 said:
On the contrary, SMT4 would actually be excellent in sparse thread. With one more ALU and one more AGU I don't know if the single threat IPC uplift would be vastly noticable over Zen1/Zen2's (4+2). On full load, four threads, the 5 ALU and 3 AGUs would struggle, but it sure would have an advantage over 2c/2t Stoney XV that is context switching all the time between the four threads.

You're not just splitting up the execution resources, you're splitting every resource in the core 4 ways. Issue, branch prediction, D$, I$, TLB, register file. You're going to be heavily impacting per-thread performance if you just load up another two threads on the core without adding more resources.

DrMrLordX · Jan 24, 2019

amd6502 said:
aSMT and aCMT have all the advantages of big.little (plus more) except for possible area and transistor savings, which for lower thread count (8 or less) won't be a huge disadvantage. (You should ask Nosta what CMT is capable of ; )

I'm actually gonna have to disagree here. Despite going to great lengths to figure out how to power gate underutilized/unutilized elements of their Core processors, Intel has never quite gotten mobile power usage low enough for their Core line to be taken seriously in tablets or phones. They weren't able to make multicore Atom work in those applications either, albeit perhaps for slightly different reasons. The appeal of big.LITTLE/DynamiQ is that you can burn up some extra transistors adding extra cores without the intention of ever using them all at the same time (usually). Then if you want to achieve a lower power state during light loads, you gate off entire cores and run the low-power, low frequency cores to handle your code. The Intel equivalent is to try to achieve lower power draw by gating off core elements you aren't using (which isn't perfect) and by running at lower clocks. The proof is in the pudding. You are never going to be able to take a single SMT4 core and get it to run in as low a power state as, let's say, one A76 and four A57 cores or . . . whatever. The DynamiQ implementation is actually worse in terms of area/transistor savings, but you have the added benefit of being able to completely shut off cores that you aren't using instead of having idle silicon burning power that you can't gate off in a SMT4 scenario.

I had thought that SMT4 or SMT8 would be a fun feature for a console CPU. POWER chips have found their way into consoles before (Xbox 360, PS3), so I thought POWER9 or POWER10 in a 1c/8t configuration might make sense. But in mobile? Nah.

NTMBK said:
You're not just splitting up the execution resources, you're splitting every resource in the core 4 ways. Issue, branch prediction, D$, I$, TLB, register file. You're going to be heavily impacting per-thread performance if you just load up another two threads on the core without adding more resources.

Yup. You also have a baseline power draw from a chip like that no matter how many threads you load up, and you can't escape it. So that hurts efficiency when the core isn't fully-committed to 3-4 threads or if you aren't somehow hogging all execution resources in one SIMD thread.

TheELF · Jan 24, 2019

somethingclever said:
Just to clarify though, does x86 have instructions to explicitly execute out of order or in parallel? I was under the impression that oooe and ilp were hidden from the programmer.

Why would they need to be explicit?
What does the programmer have to do with anything?
If the CPU looks ahead and sees an addition that will come up in 100cycles it can execute it right now and have the result ready in cache beforehand,it's the same instructions and the programmer doesn't need to be aware of it.

somethingclever · Jan 24, 2019

TheELF said:
Why would they need to be explicit?
What does the programmer have to do with anything?
If the CPU looks ahead and sees an addition that will come up in 100cycles it can execute it right now and have the result ready in cache beforehand,it's the same instructions and the programmer doesn't need to be aware of it.

Please read my post in its entirety. I'm aware that it has nothing to do with the programmer, but the person I quoted claimed there were instructions to execute stuff in parallel or out of order.

Also, the cpu doesn't "look ahead 100 cycles"; that makes no sense. All out of order execution and ilp happens after decode. For x86 that would be uops in the uop queue.

Toettoetdaan · Jan 24, 2019

This discussion is interesting, even if everything is just hypothetical, it is a good read. Thanks guys!

Question, why no SMT3? I have seen SMT4 and SMT8, but SMT doesn't have to double for each step. If they add another AGU and double the FPU (dual 128-bit like bulldozer), then SMT2 might not be able to fully utilize the pipeline. SMT4 might be a bit too much, but SMT3 could be the sweet spot? (Although I imagine this would be a server feature)

maddie · Jan 24, 2019

Why have more than 1 core? Just increase the functional pipelines, decoders, cache, etc, and use as needed. Keep going wider.

This is not really to be taken seriously, although it did pop into my head.

TheELF · Jan 24, 2019

somethingclever said:
Also, the cpu doesn't "look ahead 100 cycles"; that makes no sense. All out of order execution and ilp happens after decode. For x86 that would be uops in the uop queue.

http://www2.ece.rochester.edu/projects/acal/docs/Thesis/parihar.PhD16.pdf
page 29

On the other hand, a more general form of look-ahead (we call it decoupled look-ahead) executes all or part of the program to ensure the look-ahead control flow and data access streams accurately reflect those of the original program

There are all sorts of look ahead and some reach far farther then just 100 cycles,they can execute the whole program ahead of time.
Calling something out of order if it only applies to what is up for execution right now is what doesn't make sense.

itsmydamnation · Jan 24, 2019

Toettoetdaan said:
This discussion is interesting, even if everything is just hypothetical, it is a good read. Thanks guys!

Question, why no SMT3? I have seen SMT4 and SMT8, but SMT doesn't have to double for each step. If they add another AGU and double the FPU (dual 128-bit like bulldozer), then SMT2 might not be able to fully utilize the pipeline. SMT4 might be a bit too much, but SMT3 could be the sweet spot? (Although I imagine this would be a server feature)

Because binary, you would do all the work and then just not use one permutation.

SMT4 only makes sense if your core is stalled all the time, no other time does it make sense. You are far better off power wise by addeding an extra core and running lower voltage+clock.

somethingclever · Jan 24, 2019

TheELF said:
There are all sorts of look ahead and some reach far farther then just 100 cycles,they can execute the whole program ahead of time.
Calling something out of order if it only applies to what is up for execution right now is what doesn't make sense.

That paper is about researching alternatives to the current method of speculative execution (branch prediction). I didn't see any mention of any commercial cpus that use decoupled-look ahead.

Also please read a textbook. The way OoOE and ILP are implemented in current architectures is based off Tomasulo's algorithm. Instructions are issued in order and then executed/retired out of order.

https://en.m.wikipedia.org/wiki/Tomasulo_algorithm

Edit for pendantic note: it doesn't make sense for a cpu to "look ahead 100 cycles" because the cpu can't know its state in 100 cycles without executing those cycles. If you're talking about predicting execution paths in the future then that's speculative execution.

TheELF · Jan 25, 2019

somethingclever said:
That paper is about researching alternatives to the current method of speculative execution (branch prediction). I didn't see any mention of any commercial cpus that use decoupled-look ahead.

Also please read a textbook. The way OoOE and ILP are implemented in current architectures is based off Tomasulo's algorithm. Instructions are issued in order and then executed/retired out of order.

https://en.m.wikipedia.org/wiki/Tomasulo_algorithm

Edit for pendantic note: it doesn't make sense for a cpu to "look ahead 100 cycles" because the cpu can't know its state in 100 cycles without executing those cycles. If you're talking about predicting execution paths in the future then that's speculative execution.

So what is branch prediction in your mind?
It takes a branch,commands between an if else for example, and executes that whole branch be it 100 or 1000ths cycles worth of instructions.

TheELF · Jan 25, 2019

somethingclever said:
Edit for pendantic note: it doesn't make sense for a cpu to "look ahead 100 cycles" because the cpu can't know its state in 100 cycles without executing those cycles. If you're talking about predicting execution paths in the future then that's speculative execution.

Hello!That's the whole point!To know beforehand something that the CPU wouldn't be able to know otherwise.

somethingclever · Jan 25, 2019

TheELF said:
So what is branch prediction in your mind?
It takes a branch,commands between an if else for example, and executes that whole branch be it 100 or 1000ths cycles worth of instructions.

First of all, the unit of measure here should be instructions. The cpu doesn't execute "cycles" it executes instructions. The cpu can take a variable number of clock cycles to execute different instructions.

Second, everyone wishes branch prediction is that easy. What if you execute the code in the if-else statement but then you discover that the condition wasn't actually true? What if within that if-else statement there was another nested if-else statement?

TheELF said:
Hello!That's the whole point!To know beforehand something that the CPU wouldn't be able to know otherwise.

My point was the CPU can't know with 100% certainty what is supposed to be executed beforehand all the time.

NTMBK · Jan 25, 2019

TheELF said:
Hello!That's the whole point!To know beforehand something that the CPU wouldn't be able to know otherwise.

I think his point is that trying to count it "in cycles" makes no sense. The amount that the CPU can look ahead is limited by instruction count and size (or rather uOp count), not by how long those instructions take to execute. A load may take 100 times longer to complete if it has to go to main memory, instead of hitting L1$, for instance.

NostaSeronx · Jan 25, 2019

https://patents.google.com/patent/US5809273A/en

The core reference is K6.
"To hasten the decoding process, the three short decoders SDec0 410, SDec1 412 or SDec2 414 are operated in parallel by including separate instruction length lookahead logic that quickly, although serially, determines the instruction lengths ILen0 and ILen1 using the predecode bits associated with the instruction bytes in each instruction buffer 408."

Since, that is K6 I assume we have evolved past the above. To something that can do parallel lookahead per decode unit.

Bulldozer has two 408s per module; 16-entry each for 256B per core.
Zen has one 408 per core; 20-entry for single thread for 320B and 10-entry for dual thread for 160B per thread.

https://patents.google.com/patent/US7840786B2
"FIG. 4 illustrates, in block diagram form, a processor that includes an L1, cache, an L2 cache, and an L3 cache that are configured according to an embodiment of the present disclosure; and

FIG. 5 illustrates, in block diagram form, a portion of the processor of FIG. 4."
^-- David Neal Suggs is the same person as David Suggs. Who is the chief architect for Zen2 and Zen5.

There is plenty of lookahead potential.

moinmoin · Jan 25, 2019

DrMrLordX said:
I'm actually gonna have to disagree here. Despite going to great lengths to figure out how to power gate underutilized/unutilized elements of their Core processors, Intel has never quite gotten mobile power usage low enough for their Core line to be taken seriously in tablets or phones. They weren't able to make multicore Atom work in those applications either, albeit perhaps for slightly different reasons. The appeal of big.LITTLE/DynamiQ is that you can burn up some extra transistors adding extra cores without the intention of ever using them all at the same time (usually). Then if you want to achieve a lower power state during light loads, you gate off entire cores and run the low-power, low frequency cores to handle your code. The Intel equivalent is to try to achieve lower power draw by gating off core elements you aren't using (which isn't perfect) and by running at lower clocks. The proof is in the pudding. You are never going to be able to take a single SMT4 core and get it to run in as low a power state as, let's say, one A76 and four A57 cores or . . . whatever. The DynamiQ implementation is actually worse in terms of area/transistor savings, but you have the added benefit of being able to completely shut off cores that you aren't using instead of having idle silicon burning power that you can't gate off in a SMT4 scenario.

I had thought that SMT4 or SMT8 would be a fun feature for a console CPU. POWER chips have found their way into consoles before (Xbox 360, PS3), so I thought POWER9 or POWER10 in a 1c/8t configuration might make sense. But in mobile? Nah.

Yup. You also have a baseline power draw from a chip like that no matter how many threads you load up, and you can't escape it. So that hurts efficiency when the core isn't fully-committed to 3-4 threads or if you aren't somehow hogging all execution resources in one SIMD thread.

My impression is that the Zen cores already do a very good job power gating unused areas, and that the big challenge ahead for AMD is optimizing the whole uncore for mobile usage while their uncore is primarily optimized for scalability first (which is another reason why their monolithic APU designs launch last each gen).

Design changes in Zen 2 (CPU/core/chiplet only)

Lifer

Senior member

Diamond Member

Member

Senior member

Member

Diamond Member

Member

Diamond Member

Member

Lifer

Lifer

Diamond Member

Member

Junior Member

Diamond Member

Diamond Member

Diamond Member

Member

Diamond Member

Diamond Member

Member

Lifer

Diamond Member

Diamond Member