Enjoy...
2.35GHz 64c unknown Rome vs. the highest-frequency 32-core part, the 7601, which tops out at 2.2GHz.
We need to know the TDP, but they can hardly push 4 times as much FP at the same TDP.
That gives 2.4 TeraFLOPS per CPU, which likely means there are additional compute accelerators in the system.
Yeah, 16 SP FLOPS / core / cycle sounds about right: 2.35 GHz x 2 MUL FP units x (256-bit AVX2 / 32-bit SP) x 64 cores = 2.406 TFLOPS
https://www.anandtech.com/show/13598/amd-64-core-rome-deployment-hlrs-hawk-at-235-ghz
At 2.4 TF/CPU that amounts to 37.6 GFLOPS/core, or about 15 FLOPS/cycle/core, and that's not even double precision. This is perfectly achievable with FMA and full-width AVX2 execution units; I guess that the peak is actually 16 FLOPS/cycle/core.
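For anyone who wants to replay the arithmetic from the posts above, here is a minimal Python sketch. The function name `peak_sp_flops` is made up for illustration, and it counts one FLOP per lane per FP unit per cycle, the same way the 2.406 TFLOPS figure was derived (FMA is not counted as two). The 7601 line is an assumption that is not from this thread: it models Zen 1 with 128-bit wide FP units, just to show where a roughly 4x gap could come from.

```python
def peak_sp_flops(freq_ghz, fp_units, simd_bits, cores, elem_bits=32):
    """Peak single-precision FLOPS: one FLOP per SIMD lane per FP unit per cycle."""
    lanes = simd_bits // elem_bits               # 256-bit / 32-bit = 8 lanes
    flops_per_cycle_per_core = fp_units * lanes  # 2 units x 8 lanes = 16
    return freq_ghz * 1e9 * flops_per_cycle_per_core * cores

# Figures quoted in the thread: 2.35 GHz, 2 MUL FP units, 256-bit AVX2, 64 cores.
rome_flops = peak_sp_flops(freq_ghz=2.35, fp_units=2, simd_bits=256, cores=64)
rome_gflops_per_core = rome_flops / 64 / 1e9

# ASSUMPTION (not from the thread): EPYC 7601 modeled as 32 cores at 2.2 GHz
# with 128-bit wide FP units.
naples_flops = peak_sp_flops(freq_ghz=2.2, fp_units=2, simd_bits=128, cores=32)

print(f"Rome peak:      {rome_flops / 1e12:.3f} TFLOPS")
print(f"Rome per core:  {rome_gflops_per_core:.1f} GFLOPS")
print(f"Rome vs. 7601:  {rome_flops / naples_flops:.2f}x")
```

Under those assumptions the gap splits into 2x the cores, 2x the vector width and a small clock bump, which is why the TDP question matters so much.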
Also Vattila, if you could draw this topology, it would be much appreciated.
What's the point of introducing IO chiplet in Rome, only to take it away in Milan?
How, if at all, do these various topologies potentially affect power draw?
Well spotted. You are right ofc - damn, I should have seen it even with my limited knowledge - it makes sense in the grander scheme anyway to keep the same TDP. Well, if 2.35 is base then we are certainly in for some crazy FP numbers.
Considering that supply voltage must go down because of 7nm, there is simply not much room in SP3 to push TDP up.
It still may very well be that 14LPP is good enough for their needs and there is no need for a large L4 cache.
If so, it is curious that they wouldn't use 12LP, isn't it? With 12LP libraries they should get 15% better density and better performance (source). In addition, that they state "14nm" (not 14LPP) is a strong hint that they may be using the 14HP process that GlobalFoundries acquired from IBM.
Perhaps instead of 16 dual-CCX chiplets, Milan will have 12 dual-CCX chiplets. Think of your Rome design with 2 dual-CCX chiplets added on each side.
Here are some alternative topologies for interconnecting the L4 slices:
IC: With all the memory controllers on the IO die we now have a unified memory design such that the latency from all cores to memory is more consistent?
MP: That’s a nice design – I commented on improved latency and bandwidth. Our chiplet architecture is a key enablement of those improvements.
IC: When you say improved latency, do you mean average latency or peak/best-case latency?
MP: We haven’t provided the specifications yet, but the architecture is aimed at providing a generational improvement in overall latency to memory. The architecture with the central IO chip provides a more uniform latency and it is more predictable.
(f) cube with upper and lower sides fully connected, and (g) same cube topology as (f), but with the lower quad flipped vertically.
Topology (f) is used in Naples to connect the chiplets in package and across sockets (illustration). Since it looks like the Rome design is two Naples chips crammed into one socket and optimised, I suspect a similar topology is used for the L4 cache in Rome. Considering the upper and lower halves of the IO chiplet are mirrored in large parts, alternative (g) seems to fit the bill.
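To make the comparison between (f) and (g) a bit more concrete, here is a rough Python sketch that counts links and hops. Everything in it is an assumption for illustration: 8 nodes (one per cube corner), nodes 0-3 as the upper quad and 4-7 as the lower quad, and the "vertical flip" in (g) modeled as swapping which lower node each upper node links to. It says nothing about how the actual IO chiplet is wired.

```python
from collections import deque
from itertools import combinations

def build_topology(cross_links):
    """Two fully connected quads (0-3 and 4-7) plus the given quad-to-quad links."""
    adj = {n: set() for n in range(8)}
    for quad in ([0, 1, 2, 3], [4, 5, 6, 7]):
        for a, b in combinations(quad, 2):
            adj[a].add(b)
            adj[b].add(a)
    for a, b in cross_links:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def hop_counts(adj, start):
    """Breadth-first search: minimum number of hops from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def summarize(adj):
    """Return (link count, worst-case hops, average hops between distinct nodes)."""
    hops = [d for s in adj for t, d in hop_counts(adj, s).items() if t != s]
    links = sum(len(peers) for peers in adj.values()) // 2
    return links, max(hops), sum(hops) / len(hops)

# (f): each upper node links to the lower node directly below it.
topo_f = build_topology([(0, 4), (1, 5), (2, 6), (3, 7)])
# (g): hypothetical flip of the lower quad, so the vertical pairing changes.
topo_g = build_topology([(0, 6), (1, 7), (2, 4), (3, 5)])

print("f:", summarize(topo_f))
print("g:", summarize(topo_g))
```

In this model both variants end up with the same link count and the same hop statistics, so the flip would only change the physical routing on the die, not the logical distance between slices.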
Hmm, a high-speed Naples-based CPU arriving in 2019Q1 and some EPYC PCIe v4 motherboards arriving in Q3. That points pretty clearly towards a possible launch somewhere in Q2 with mass availability in Q3.
Could just be that the Cloud Companies are going to get the first wave exclusively and you won't see any OEM products until Q3.
Edit: I know that IBM has been fabless since 2015, but not too long ago they were still relying on GloFo to develop their own custom smaller lithography processes. And since they like eDRAM so much, it might not be so easy to adapt. I don't see 7 nm processes supporting eDRAM, nor 10 nm processes.
AMD has the experience designing in partially depleted SOI. But do they have the experience and time to design large eDRAM macros? If Global is offering the process, their ASIC team (probably mostly from IBM) would have the designs and AMD could perhaps simply buy them as well.
https://www.globalfoundries.com/new...custom-14nm-finfet-technology-for-ibm-systems
GLOBALFOUNDRIES will also become IBM's exclusive server processor semiconductor technology provider for 22 nanometer (nm), 14nm and 10nm semiconductors for the next 10 years.
IC: AMD has had a strong relationship with TSMC for many years which is only getting stronger with the next generation products on 7nm, however now you are more sensitive to TSMC’s ability to drive the next manufacturing generation. Will the move to smaller chiplets help overcome potential issues with larger dies, or does this now open cooperation with Samsung given that the chip sizes are more along the lines of what they are used to?
MP: First off, the march for high performance has brought us to Zen 2 and the ability to leverage multiple technology nodes. What we’re showing with Rome is a solution with two foundries with two different technology nodes. It gives you an idea of the flexibility in our supply chain that we’ve built in, and gives you an explicit example of how we can work with different partners to achieve a unified product goal. On the topic of Samsung, we know Samsung very well and have done work with them.
As part of this Agreement, GLOBALFOUNDRIES will gain substantial intellectual property including thousands of patents, making GLOBALFOUNDRIES the holder of one of the largest semiconductor patent portfolios in the world.
GLOBALFOUNDRIES also will benefit from an influx of one of the best technical teams in the semiconductor industry, which will solidify its path to advanced process geometries at 10nm and below. Additionally, the acquisition opens up business opportunities in industry-leading radio frequency (RF) and specialty technologies and ASIC design capabilities.
The ring bus is actually far from uniform (which is why Intel switched to mesh with Skylake-X/SP). On Skylake the latency already increases from 34ns at best to 85ns at worst, and that's with up to 18 cores. As far as actual uniform latency goes, the 4-core CCX is about the best possible approach there is; everything beyond that is anything but uniform.
All topologies have their limitations, but Intel's ring bus can sure support more cores (up to a point) in a more uniform way than AMD's CCXs connected through Infinity Fabric.
(Jeez, this updated forum will take time getting used to, so uncomfortable...)
2.5GHz, curiously that's about 13.6% over an Epyc 7601, as much as the alleged IPC improvement stated by BitsChips.
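A quick check of that percentage, as a small Python sketch. The helper name `uplift_percent` is made up, and the last step is purely an assumption: it treats per-core throughput as IPC times clock and takes the alleged IPC gain to be the same 13.6%, which is a rumour and not a confirmed figure.

```python
def uplift_percent(new, base):
    """Relative increase of new over base, in percent."""
    return (new / base - 1) * 100

# Clock figures from the post: 2.5 GHz vs. the 2.2 GHz of the Epyc 7601.
clock_uplift = uplift_percent(2.5, 2.2)

# ASSUMPTION: per-core throughput scales as IPC x clock, and the rumoured IPC
# gain equals the clock gain. Both are illustrative, not confirmed numbers.
ipc_gain = 2.5 / 2.2
combined_uplift = (ipc_gain * (2.5 / 2.2) - 1) * 100

print(f"Clock uplift:    {clock_uplift:.1f}%")
print(f"Clock x IPC:     {combined_uplift:.1f}%")
```

If both gains were real and stacked, the per-core improvement would compound rather than simply add.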