DrMrLordX
Lifer
It only sounds bizarre because most people still haven't grasped how pervasive ML workloads are becoming.
I doubt ML workloads will creep into the desktop/workstation sector very quickly. But we'll see.
Another trick AMD could do is a shared FPU à la Bulldozer. Two cores would be
12x pipes (6x 256-bit FPU) instead of 8+8x pipes (8x FPUs). Such a configuration would save a lot of transistors while producing similar performance. The cost is a radical uarch change (it cannot be done as a Zen 2 evolution). But it's a new uarch, and AMD has experience from Bulldozer, so who knows. Such a shared FPU (and front-end) has one nice advantage: the 4-core CCX becomes an 8-core CCX (the L2$ is shared by two cores). This configuration is less probable IMHO.
IPC calculations of SPECint2006:
- 9900K .... 54.28 / 5 GHz = 10.86 pts/GHz
- 3950X .... 50.02 / 4.6 GHz = 10.87 pts/GHz
- A76 ........ 26.65 / 2.84 GHz = 9.38 pts/GHz
- A77 ........ 33.32 / 2.84 GHz = 11.73 pts/GHz ...... +8% IPC over 9900K
- A11 ........ 36.80 / 2.39 GHz = 15.40 pts/GHz .... +42% IPC over 9900K
- A12 ........ 45.32 / 2.53 GHz = 17.91 pts/GHz .... +65% IPC over 9900K
- A13 ........ 52.82 / 2.65 GHz = 19.93 pts/GHz .... +83% IPC over 9900K
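The arithmetic behind the table is simple enough to check mechanically. A small Python sketch (scores and clocks copied from the list above; the IPC proxy is just score divided by clock):

```python
# SPECint2006 score and clock (GHz) per chip, as quoted above.
chips = {
    "9900K": (54.28, 5.00),
    "3950X": (50.02, 4.60),
    "A76":   (26.65, 2.84),
    "A77":   (33.32, 2.84),
    "A11":   (36.80, 2.39),
    "A12":   (45.32, 2.53),
    "A13":   (52.82, 2.65),
}

# IPC proxy: points per GHz.
ipc = {name: score / ghz for name, (score, ghz) in chips.items()}
base = ipc["9900K"]

for name, pts in ipc.items():
    delta = (pts / base - 1.0) * 100.0
    print(f"{name}: {pts:5.2f} pts/GHz ({delta:+.0f}% vs 9900K)")
```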
View attachment 14993
Allowing FP128 instructions to also execute on FP4-7 is the easier to implement option, with instant IPC growth for FP128/legacy SSE2+ workloads.
I don't recognize Zen2 core looking like this. Source?
Looks interesting, but what is FP4-7?
There is no realistic setup for a test like this right now. But you will never convince the ARMada that they're comparing bananas to oranges every day, saying that these bananas are much tastier bananas than those oranges. Well, they should be! They are bananas, for God's sake. But they are not very good as oranges, and vice versa.

As far as this table goes, it's a bit unfair to the 4 and 5 GHz contenders, because scaling is a very crude approximation. You should run the 9900K and 3950X at 2.5 GHz. Otherwise it's as if you handicapped these competitors with very high-latency memory (about 2x the latency of whatever RAM the A12 was using).
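To put that handicap in concrete terms: DRAM latency is roughly fixed in nanoseconds, so its cost in cycles scales with the clock. A quick Python illustration (the 80 ns round-trip figure is an assumed number for illustration, not a measurement):

```python
DRAM_NS = 80  # assumed round-trip DRAM latency, for illustration only

def miss_cost_cycles(ghz, dram_ns=DRAM_NS):
    """Cycles a core stalls on a DRAM miss at a given clock frequency."""
    return dram_ns * ghz

for ghz in (5.0, 2.53):
    print(f"{ghz} GHz: a DRAM miss costs ~{miss_cost_cycles(ghz):.0f} cycles")
```

At 5 GHz the same miss costs roughly twice as many cycles as at 2.53 GHz, which is the ~2x effective-latency handicap described above.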
It is the FPU; the source is every high-res Matisse die shot ever.

I don't recognize Zen2 core looking like this. Source?
Floating Point Pipe 4: Replicated Floating Point Pipe 0 for bits 128-255 (FP256: Upper 4x32-bit / Upper 2x64-bit)

Looks interesting, but what is FP4-7?
It's not about doing AI badly. AMD has GPUs to do matrix multiplies on workloads that suit higher latency. The point of a matrix-multiply unit on a CPU would be for workloads that require low latency / serial dependency. It would also be transistors that could easily be power-gated off when not in use.

Adding a bunch of hardware to do AI badly on the CPU seems like a silly idea, when anyone doing serious amounts of AI work in server will have accelerators better suited to the task.
If there are 8x 128-bit pipes, it must be visible in performance too: is 128-bit code significantly faster than 256-bit?

I am counting it right.
View attachment 14993
It isn't native FP256 like how Intel does it. It is literally 8x 128-bit datapaths, FP0-3 being the low 128 bits and FP4-7 the high 128 bits. It would be relatively simple to switch from 4x 128-bit / 4x 256-bit to 8x 128-bit / 4x 256-bit. Increasing FPU availability to the rest of the 128-bit datapaths absolutely will give higher IPC than AVX512. AVX512 requires a new ISA and requires full-width 512-bit instructions to max out usage, much like Zen 2, where the instructions must be 256-bit AVX to use all datapaths.
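A toy issue-port model (my own simplification, not a description of AMD's actual scheduler) shows where the claimed win comes from for pure 128-bit code:

```python
import math

def issue_cycles(n_ops, usable_ports):
    """Cycles to issue n_ops independent FP ops when each op occupies one port for one cycle."""
    return math.ceil(n_ops / usable_ports)

n = 64  # a burst of independent 128-bit ops
today = issue_cycles(n, 4)    # Zen 2 today: 128-bit ops limited to FP0-3
widened = issue_cycles(n, 8)  # hypothetical: 128-bit ops also allowed on FP4-7
print(today, widened)  # 16 vs 8 -> up to 2x peak 128-bit issue throughput
```

Real gains would be smaller, since register-file porting and dependency chains cap how often all 8 ports could actually be fed.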
Exactly, this was one of the good things about the otherwise terrible Dozer. And that's one possible explanation of the high FP IPC increase we see in Zen 3 leaks.

That's a very interesting idea. Dozer was actually kind of a hybrid: it was CMT on the integer side and SMT on the FPU side.
Now Zen could mimic this pattern and remain 2x SMT2 on the integer side while going SMT4 on the FPU side. This could be a huge sparse thread boost for very heavy FPU code.
Unfair at iso-clock IPC, but fair as a peak-performance IPC comparison, which is what these CPUs were designed for. The 9900K and 3950X have pipeline depth and a memory subsystem (caches) designed for such a high clock. When we look at the 64-core EPYC2 running at 2.5 GHz (base 2.25 GHz), there is no significant IPC benefit at the lower clock. So the majority of IPC comes from the uarch design. And a +82% IPC advantage is almost twice as fast. Funny that people cannot believe a +17% IPC gain for Zen 3 when there is actually a huge +82% IPC deficit/potential. Should we be happy to get just 17% out of 82%?

As far as this table goes, it's a bit unfair to the 4 and 5 GHz contenders because scaling is a very crude approximation. You should run the 9900K and 3950X at 2.5 GHz. Otherwise it's as if you handicapped these competitors with very high latency memory (about 2x latency of whatever RAM the A12 was using).
A ridiculous comparison: once you bring huge many-core, many-die server chips into the equation, you are talking total MT performance more than ST IPC, and the majority of that performance comes from the 64 cores!

When we look at 64-core EPYC2 running at 2.5 GHz (base 2.25 GHz) there is no significant IPC benefit at lower clock. So the majority of IPC comes from uarch design.
When discussing designated TDP performance, I think a major portion of the emphasis is on task power per stock TDP. This is an area in which the major contenders take a heavy blow from the underdogs (Apple vs. Qualcomm and Intel vs. AMD). Intel, for instance, cuts short the frequency bins when running AVX512. This is good.

Unfair at iso-clock IPC but fair at peak performance IPC comparison where the CPU were designed for. 9900K and 3950X has pipeline depth and memory subsystem designed for such a high clock (caches). When we look at 64-core EPYC2 running at 2.5 GHz (base 2.25 GHz) there is no significant IPC benefit at lower clock. So the majority of IPC comes from uarch design. And +82% IPC advantage is almost twice as fast. Funny that people cannot believe +17% IPC gain of Zen 3 when there is actually huge +82% IPC deficit/potential. Should we be happy to get just 17% out of 82%?
Microsoft is pushing Windows Hello, especially on their Surface products. They now use AMD APUs there as well. So some ML integration coming at some point really is not that far-fetched, unless it takes too much silicon area.

I doubt ML workloads will creep into the desktop/workstation sector very quickly. But we'll see.
I think AMD has a different problem with AVX512, because there are as many as 20 instruction subsets. That's a huge number compared to SSE4 (two subsets, SSE4.1 and 4.2) and AVX1/2 with one subset. Even the newest Ice Lake supports 14 out of 20. https://en.wikipedia.org/wiki/AVX-512

That part is under debate as to whether playing fast follower to Intel is preferable to challenging AVX512 performance straight on.
You are too pessimistic here. It would be even worse for x86 if we compared two 64-core chips. A hypothetical 64-core A13 would be much faster, with much lower TDP on top of that (even the slow A76 in Neoverse N1 gives the x86 world a lot of headache in the 64-core Graviton2 and 80-core Ampere eMAG). IMHO that's exactly the source of motivation for all those engineers who established Nuvia Corp.

A ridiculous comparison: once you bring huge many core, many die server chips into the equation you are talking total MT performance more than ST IPC, and the majority of that performance comes from the 64 cores!
128-bit SIMD is more common in general-purpose code. The average consumer, if they are running SIMD at all, will mostly be running 128-bit.

Is 128-bit code significantly faster than 256-bit?
I guess you wanted to write April 2020. Other than that, Zen 3 = 7nm EUV. It's set in stone. It was designed for that node.

There are a few really noisy elephants in the room that everyone is forgetting about:
- TSMC N5 hits volume production Q1 of 2020 and has a dedicated 'HPC' path that mobile chips won't use.
- 7nm EUV offers very little over 7nm except increased margins for AMD and up to 10% higher clock speeds.
- Intel is rumored (and the rumors have come from a reliable source) to be upping their core count as well as their clock speed, with base/boost numbers for Core i3s, i5s, and i7s far exceeding AMD chips. These chips are rumored for release around April 2019.
- TSMC's guidance appears to state that 7nm customers should transition to 6nm or 5nm as N7+. The N6 node is ramping up slower than N5.
- AMD definitely won't be releasing Zen 3 before July of next year, as July 7th will mark the 1-year anniversary of Zen 2.
Based on the above, I'm calling it (with good fun, I could very likely be wrong): Zen 3 will be a unified chiplet design with a CCD on 5nm and the IO die on 7nm EUV. Expected clock speeds will reach or exceed 5 GHz. I also do not believe that AMD will offer any additional features outside of a 10-15% IPC increase.
CHOOCHOO!
If it is set in stone, it will also have SMT4 and AVX512. Family K18.2 = Zen3 before it was used for Dhyana.

It's set in stone. It was designed for that node.
Man, half your comments are fairy tales; I don't always know what to take seriously. But if you say so, why not.

If it is set in stone it will also have SMT4 and AVX512. Family K18.2 = Zen3 before it was used for Dhyana.
Arden/Nedra/Anaconda X-series => Zen3 AVX512 + RDNA2, yadda yadda. However, it is using DUV 7nm. <== Also used the 18h/24 family in the early Microsoft APUs.
Re-tapeouts (RTO) are only possible between N7 -> N7P -> N6. It is more likely to see a tape-out on 7nm and a partial NTO with 6T N6 (higher density than N7+ 6T). Similar to the Excavator evolution from 13T-28SHP (Steamroller) to 9T-28A (XV).
HPC and Mobile share the same track height in N5; however, HPC fins use more extensive Ge stressors.
We discussed whether the Zen 2 FPU has 8x pipes or 4x pipes. Sorry, I didn't write that question clearly enough. So, corrected question: is Zen 2 running 128-bit code significantly faster than 256-bit?

128-bit SIMDs are more present in general purpose. The average consumer if they are running SIMD it will mostly be 128-bit.
This, but there is more.

It isn't, and shouldn't, because while there are 8 128-bit ports, only 4 of them are attached to the low-order bits of registers. The other 4 only read from the second RF, which contains bits 128..255 of each AVX register.
N7+ requires a new from-scratch design, as AMS (SerDes/IO) is incompatible, SRAM (+PRF/CAM/etc.) is incompatible, and Logic is incompatible. N6 is nothing but a "cheap" node for 7nm customers that don't want to switch to the 7nm+ design rules.
I saw this on another forum. This guy on twitter apparently is listing new AMD patents that belong to Zen 3/RDNA2.
The guy that posted this on the other forum said that, based on this patent dump, Zen 3 will be able to do:
Zen 3: 4x FP Mul+Add
Compared to Zen 2: 2x FP Mul+Add + 2x FP Add
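Taking those pipe lists at face value, the peak-throughput comparison works out as below. Note the assumptions: 256-bit pipes with fp32 lanes, and an FMA counted as 2 FLOPs per lane; the patents don't pin down widths, so this is only a sketch.

```python
LANES = 8  # fp32 lanes per 256-bit pipe (assumed width)

def peak_fp32_flops_per_cycle(fma_pipes, add_pipes):
    # An FMA does a multiply and an add per lane (2 FLOPs); a plain add does 1.
    return fma_pipes * LANES * 2 + add_pipes * LANES

zen2 = peak_fp32_flops_per_cycle(2, 2)  # 2x FP Mul+Add + 2x FP Add
zen3 = peak_fp32_flops_per_cycle(4, 0)  # 4x FP Mul+Add
print(zen2, zen3)  # 48 vs 64 FLOPs/cycle
```

So on this reading, Zen 3's peak fp32 throughput would be about a third higher, and all four pipes could multiply, not just two.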
Bananas and oranges is more like CPU vs. GPU. I don't see any problem comparing performance between two CPUs on different ISAs. If you are a company like Amazon and you run your web/SQL servers on, for example, Linux and MySQL (you have binaries for both ARM and x86), then it's very easy to compare performance under REAL load. Very easy. They did it and decided to create their own server ARM, called Graviton. Do you guys really think that Amazon invested a huge amount of money into something incomparable? Do you think that people at Amazon don't see the huge +82% IPC advantage delivered by Apple's ARM CPUs? Did Apple switch ISA from PowerPC to x86 because it was incomparable?

There is no realistic setup for a test like this right now. But you will never convince the ARMada that they're comparing bananas to oranges every day, saying the these bananas are much tastier bananas than those oranges. Well, they should be! They are bananas, for God's sake. But they are not very good as oranges and vica versa.