I no longer use AMD for motherboards/CPUs or graphics cards these days, which makes it easier to keep up with the Intel and Nvidia products I use and recommend whenever I build a system.
You can pretty much consider both doubled (it varies from instruction to instruction; FP divide, for example, got 0% more), but integer was significantly more bottlenecked to begin with.
Most of the time, the extra multi-threaded performance is all you need. Games and many browser benchmarks, however, do not scale well, if at all, so higher IPC is certainly more desirable.
I saw a 35W Excavator come in at 9.85% better. I think The Stilt's numbers are TDP-limited, so they aren't useful for direct comparison to higher-power Steamroller parts.
Not gonna happen. I don't think AMD could make Zen slower than Excavator if they tried.
Up until examining the pipeline assignments I was certain AMD would have an inferior SMT design, but I seriously think they have the potential to match, or even exceed, Hyperthreading.
That's the reason I brought this up and mentioned $/perf/W, too. As posted earlier, HSW doing heavy HPC work (Livermore loops) has a power consumption comprising ~75% fixed cost/static power (burnt anyway if not power-gated) and ~25% caused by instruction execution itself. A large part of that fixed cost comes from the big core: 256b-wide datapaths, AVX2 PRFs, wide decoders, lots of heavily mixed issue ports, etc.
If AMD created a small core (I'm right in the analysis phase of this hypothesis), this mix could shift to their advantage.
Besides having seen ring buses mentioned somewhere in AMD patents/research publications, I think they might take a hierarchical approach, maybe with rings somewhere. They might create 4C+L3-slice blocks and combine them with the GPU, MCs, NB, and SB via a new XBar or a ring bus.
How do you power gate 1/2 a data path? No, you are not right. We are not in the infancy of power gating; it's already a very established feature, and one at which Intel processors simply excel.
No, what he is saying is that for 64-bit integer, 64-bit FP, MMX, SSE1/2/3, and 128-bit AVX, AMD will have more execution units. For 256-bit AVX and FMA they will have less throughput per core. For server and consumer workloads this is a good trade-off.
Again, how do you power gate 1/2 a data path, 1/2 an adder, or 1/2 a div or mul unit?
Tell me, do you really believe this is even slightly possible? Don't forget it's AMD we're talking about, not Apple, not Samsung, not Qualcomm, not IBM, not Sun. AMD.

Why not? Look at the components from Bulldozer to Excavator: the prefetchers/predictors, decoders, schedulers, PRF, load/store system, etc. are all quality. As I have said before, I'm willing to bet a lot of these components will be evolutions of those parts, or parts from Jaguar.
Oh, meaning a complex cache structure in order to offset the poor interconnect. I'm feeling déjà vu right now.
Unfortunately, I agree: there are too many gotchas you can get caught up in. On graphics, not so much; AMD doesn't work as well, but it's priced better.
No, you are not right. We are not in the infancy of power gating; it's already a very established feature, and one at which Intel processors simply excel. Going with a weaker core and having to add more units to reach the same throughput, doesn't that remind you of something? You are basically saying that by going with a smaller core AMD will have enough advantage to offset Intel's power gating in the cases not optimal for Intel architectures (which will have to be optimal for Zen), and that in the cases where all units on Intel processors are used, Zen won't be at much of a handicap.

Hehe, did I say that I'm right about something? I just brought some points up. Power gating comes with a cost in cycles (as mentioned by Tuna-Fish). Clock gating is the method that can be done on a per-cycle basis.
Isn't using SIMD also a way to reduce the instruction-management-overhead-to-execution ratio?

Just to extend: part of the foundational idea of SIMD, what makes it worth actually implementing in silicon instead of more scalar units despite its inflexibility, is that while each lane of a SIMD array receives identical instructions, there doesn't need to be any cross-lane communication. That means the individual lanes can be physically very far from each other, allowing more efficient data routing and easier layout. This also allows efficient partial use of SIMD arrays.
AMD already had a similar design in BD: each SSE pipe had two floating-point units, one 80 bits wide (for x87) and one a little over 64. Combined, they worked on 128-bit quantities.
There are 80-bit-wide paths (+margin).

I thought it was 64+64, with both ganging for anything larger (including 80-bit), which was why code with x87 instructions performed so relatively poorly. I seem to remember something about AMD no longer making special hardware just for x87.
Hiroshige Goto published the detailed FPU description from an AMD ISSCC paper back then.
Yep. There is another one, more like an overview.

I see, 91+73. Surprised I never noticed that!
This was a reminder, because this ratio could also be changed by reducing the overhead.

Yes.
Wait, 14/16nm Cat cores are still happening? Whaaat
Consoles, embedded, and AM1+! If that is true, AM1+ is still alive, but maybe targeted only at low-cost machines until its successor appears.
Notice that the work on 16nm Zen ended before the summer of 2015. We know it taped out shortly after. The 14nm SOCs are currently under development (work began in May this year).
14nm SOC design/Cheetah is this one
Cheetah With GCN 2.0, HSA+ and Massive Compute Capabilities
Regardless of the process design, AMD's x86 Cheetah architecture would introduce new technologies geared towards compute. It is clearly mentioned that AMD's x86 Cheetah cores alone won't be enough to power the compute needs the APU is geared towards; hence, for each CPU core there would be one dedicated 64-bit ARM core that analyzes incoming tasks and offloads them to the GPU. This is obviously not as efficient as running tasks that use OpenCL (or any GPU acceleration), but it works with every application, so even older programs, like an old video editor or a pure CPU benchmark, would use the GPU to do most of the work.
The x86 Cheetah core architecture would also be highly scalable, with products in 2-, 4-, 6-, and 8-core x86 variants. Since each CPU core has one dedicated ARM core, we are also looking at 2, 4, 6, or 8 ARM cores on the x86 Cheetah-based APUs. The single-threaded performance of these next-generation APUs would be better than past-generation cores, since the ARM cores would be able to offload tasks to the GCN cores while the other cores stay in an idle state to conserve power.
The Cheetah-based APUs would be geared towards consumers first and hence would feature the GCN 2.0 graphics core architecture. The number of GCN 2.0 cores is a multiple of the number of x86 cores, which essentially means that an 8-core model will have 8 x86 Cheetah cores, 8 ARM cores, and several GCN compute units, spanning low-end mobility to the most high-end desktop SKUs. The Athlon equivalent would adopt around 128-384 GCN stream processors, while the high-end variants could include up to 1024 stream processors and clock speeds of around 900-1000 MHz. The specifications for the SKUs aren't finalized yet, but we are looking at a top-to-bottom SKU lineup featuring the Cheetah and GCN 2.0 core design.
The APU will give the ability to set, through AMD drivers, the maximum number of GPU cores used for GPGPU computing, because the more GPU cores assigned to help the CPU, the fewer cores will be available for "real" GPU tasks, although a "dynamic" scaling system is in planning as well. AMD has also developed a bypass so GCN cores can now be used coherently with a discrete graphics card. Previous-generation APUs disable the iGPU when the system is running on dedicated graphics, except for a few Dual Graphics options available on the Radeon R7 lineup. With Cheetah APUs, this limitation will be bypassed, and all integrated GCN 2.0 cores would be scheduled to handle GPGPU computing while the dedicated GPU works alongside them to handle GPU tasks.
AMD is also allegedly planning to introduce an HSA+ or Advanced HSA solution on their next-generation x86- and ARM-based APUs. The new HSA set will enable all x86 cores, ARM cores, and GPU cores to share main memory coherently, rather than just the main x86 cores and GPU.
The x86 Cheetah cores and APU are currently in the research and development phase, and it will be some time before they see the light of day, most probably under a different name. But the recent ambidextrous-project roadmap highlights that AMD is placing strong emphasis on APUs and SOCs featuring both ARM and x86 cores on the same die, and that is where their future lies.
The source was also keen to share that AMD is very much planning a new x86 high-performance core, which will mark the return of the FX series in 2015-2016 with much better performance obtained through an architecture built from the ground up, and that the GCN 2.0 architecture is coming this year. We have an article on that coming up soon, but the details mention that GCN 2.0 is not much more powerful in terms of GPGPU computation yet provides a huge jump in power efficiency. So the next two to three years for AMD really sound great; stay tuned for more information.
