That Rpeak would put it in 10th place in the just-released Top500 (of course it doesn't show the more important Rmax). 10000 chips...
Shouldn't TDP be the same 180 W, assuming socket and motherboard compatibility? Or am I missing something here, and supercomputers have different TDP requirements?

2.35GHz 64c unknown Rome vs. the highest-frequency 32-core 7601, which tops out at 2.2GHz.
We need to know the TDP, but they can hardly push 4 times as much FP throughput at the same TDP.
At 2.4 TF/CPU that amounts to 37.6 GFLOPS/core, or about 15 flops/cycle/core, and that's not even double precision; this is perfectly achievable with FMA and full-width AVX2 execution units. I guess the peak is actually 16 flops/cycle/core.

That gives 2.4 TeraFLOPS per CPU, which likely means there are additional compute accelerators in the system.
Yeah, 16 SP FLOPS / core / cycle sounds about right: 2.35 GHz x 2 MUL FP units x (256-bit AVX2 / 32-bit SP) x 64 cores = 2.406 TFLOPS

https://www.anandtech.com/show/13598/amd-64-core-rome-deployment-hlrs-hawk-at-235-ghz
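As a quick back-of-envelope check of the math above, here is a minimal Python sketch. The clock, FP-unit count, vector width and chip count are the assumptions from the posts above (speculation at this point, not confirmed specifications):

```python
# Back-of-envelope peak single-precision FLOPS estimate for the rumoured
# 64-core Rome part, using the assumptions quoted in the thread.

clock_ghz = 2.35          # assumed base clock (GHz)
fp_units = 2              # assumed FP MUL pipes per core
vector_bits = 256         # AVX2 vector width
sp_bits = 32              # single-precision element size
cores = 64                # cores per CPU
chips = 10_000            # rough chip count mentioned for the system

flops_per_cycle = fp_units * (vector_bits // sp_bits)   # 16 SP FLOPs/cycle/core
gflops_per_cpu = clock_ghz * flops_per_cycle * cores    # ~2406 GFLOPS
system_pflops = gflops_per_cpu * chips / 1e6            # ~24 PFLOPS Rpeak

print(f"{flops_per_cycle} SP FLOPs/cycle/core")
print(f"{gflops_per_cpu:.0f} GFLOPS per CPU (~{gflops_per_cpu / 1000:.2f} TFLOPS)")
print(f"~{system_pflops:.1f} PFLOPS system Rpeak at {chips} chips")
```

All of the inputs are the thread's own guesses; changing clock_ghz or cores lets you test other rumoured configurations.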
If so, it is curious that they wouldn't use 12LP, isn't it? With 12LP libraries they should get 15% better density and better performance (source). In addition, that they state "14nm" (not 14LPP) is a strong hint that they may be using the 14HP process that GlobalFoundries acquired from IBM.

It still may very well be that 14LPP is good enough for their needs and there is no need for a large L4 cache.
Here are some alternative topologies for interconnecting the L4 slices:

Also Vattila, if you could draw this topology, it would be much appreciated.
One way to look at it is that the interposer is the IO chiplet. That said, active interposer and 128 cores in the socket — that is probably pie in the sky for Milan. And when they start designing on active interposers, they may radically change the architecture and topology (ref. research papers by Gabriel Loh). My Milan diagram is just musings on how the current architecture can be extended based on my "quad-tree" ideas — in particular, filling out the current dual-CCX clusters to quads.

What's the point of introducing IO chiplet in Rome, only to take it away in Milan?
This is far beyond my expertise, but I expect a more complex topology to have more overhead due to more complex and more numerous routers. Other than that, it comes down to the type of link and its physical and electrical characteristics (serial/parallel, length, width, impedance, voltage, frequency). Ideally, the network should consume no power if there is no traffic, so power management probably plays a huge part.

How, if at all, do these various topologies potentially affect power draw?
Good spot. You are right, of course. Damn, I should have seen it even with my limited knowledge. It makes sense in the grand scheme anyway to keep the same TDP. Well, if 2.35 GHz is the base, then we are certainly in for some crazy FP numbers.

Considering that supply voltage must go down because of 7nm, there is simply not much room in SP3 to push TDP up.
They didn't utilize that 15% better density in Pinnacle Ridge at all and, as far as we know, the upcoming Polaris 30 doesn't do that either. So they only ever used that node for a little better performance.
Yeah, I see. You may be able to extend the Rome design by cramming in another two 8-core chiplets on each side of the IO chiplet, provided the package substrate has the metal layers and space for the IF links. The L4 cache would grow by 50%, both in size and in the number of slices. With more nodes in the L4 you probably need a change in topology, and you might get worse latency. The size of the L4 would probably no longer fit in a 14nm IO chiplet of the same size, and a bigger chiplet may not fit in the package. One crazy solution may be to move the L4 cache to a 7nm chiplet mounted on top of the 14nm IO chiplet. Then the footprint of the IO chiplet could come down dramatically.

Perhaps instead of 16 dual-CCX chiplets, Milan will have 12 dual-CCX chiplets. Think your Rome design and add 2 dual-CCX chiplets on each side.
Very nice. Choosing the right topology isn't everything, though, and the implementation matters even more. All topologies have their limitations, but Intel's ring bus can support more cores (up to a point) in a more uniform way than AMD's CCXs connected through Infinity Fabric. To be honest, it has more to do with monolithic vs. MCM design than with anything else, although topologies do matter. I'm sure there are many improvements AMD has made in IF 2.0, since there are at least a few weak points in their first take on it.
IC: With all the memory controllers on the IO die we now have a unified memory design such that the latency from all cores to memory is more consistent?
MP: That’s a nice design – I commented on improved latency and bandwidth. Our chiplet architecture is a key enablement of those improvements.
IC: When you say improved latency, do you mean average latency or peak/best-case latency?
MP: We haven’t provided the specifications yet, but the architecture is aimed at providing a generational improvement in overall latency to memory. The architecture with the central IO chip provides a more uniform latency and it is more predictable.
The difference between topologies (f) and (g) comes into the picture when you look at them on a 2D plane. Then (f) should have shorter (and more uniform) physical distances between the longer hops and should therefore likely be preferred on-die. For illustration purposes they are practically the same, if you assume that all L4 slices are equal.

(f) cube with upper and lower sides fully connected, and (g) same cube topology as (f), but with the lower quad flipped vertically.
Right. It's one fairly balanced possibility. Some kind of eDRAM buffer could also be attached directly to the memory controllers/in each MC; we don't know for sure. Hopefully AMD still has the ability to use the 14HP process to do such things without using too much space.

Topology (f) is used in Naples to connect the chiplets in package and across sockets (illustration). Since it looks like the Rome design is two Naples chips crammed into one socket and optimised, I suspect a similar topology is used for the L4 cache in Rome. Considering the upper and lower halves of the IO chiplet are mirrored in large parts, alternative (g) seems to fit the bill.
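To make the (f)/(g) comparison concrete, here is a small, hypothetical Python sketch that builds both as graphs of 8 L4 slices (two fully connected quads joined by vertical links) and reports hop-count statistics via BFS. The slice count, the numbering, and the exact "flipped" pairing in (g) are illustrative assumptions, not AMD's actual implementation:

```python
from collections import deque
from itertools import combinations

def hop_stats(edges, nodes):
    """All-pairs shortest hop counts via BFS; returns (worst, average)."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    dists = []
    for src in nodes:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dists.extend(seen[dst] for dst in nodes if dst != src)
    return max(dists), sum(dists) / len(dists)

slices = list(range(8))                      # 8 hypothetical L4 slices
upper = list(combinations(range(0, 4), 2))   # fully connected upper quad
lower = list(combinations(range(4, 8), 2))   # fully connected lower quad

# (f): slice i in the upper quad links straight down to slice i+4.
topo_f = upper + lower + [(i, i + 4) for i in range(4)]
# (g): lower quad flipped, so the vertical links pair up differently
# (exact pairing assumed here for illustration).
topo_g = upper + lower + [(0, 7), (1, 6), (2, 5), (3, 4)]

for name, edges in [("(f)", topo_f), ("(g) flipped", topo_g)]:
    worst, avg = hop_stats(edges, slices)
    print(f"{name}: worst {worst} hops, average {avg:.2f} hops")
```

As expected from the post above, the two come out identical as abstract graphs (worst case 2 hops); the difference only shows up in physical wire lengths once the slices are placed on a 2D die.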
Could just be that the Cloud Companies are going to get the first wave exclusively and you won't see any OEM products until Q3.
So, to be clear: I know that IBM has been fabless since 2015, but not too long ago they were still relying on GloFo to develop their own custom smaller lithography processes. And since they like eDRAM so much, it might not be so easy to adapt. I don't see 7 nm processes supporting eDRAM, nor 10 nm processes either.
Did some more digging, and IBM planned to use GloFo exclusively for the next ten years (until 2024), down to the 10nm node:

AMD has the experience designing in partially depleted SOI. But do they have the experience and time to design large eDRAM macros? If Global is offering the process, their ASIC team (probably mostly from IBM) would have the designs, and AMD could perhaps simply buy them as well.
That's even more restrictive than AMD's WSA with GloFo. Although 10nm didn't happen, and they quite likely have renegotiated that deal since Power10 is supposed to be 7nm TSMC now. Since everyone except Intel is heading there, there might soon be supply constraints, and Samsung might be a good option for smaller chiplet designs, as Ian suggested.

GLOBALFOUNDRIES will also become IBM's exclusive server processor semiconductor technology provider for 22 nanometer (nm), 14nm and 10nm semiconductors for the next 10 years.
But back to that IBM-GloFo deal, where it also states:

IC: AMD has had a strong relationship with TSMC for many years which is only getting stronger with the next generation products on 7nm, however now you are more sensitive to TSMC’s ability to drive the next manufacturing generation. Will the move to smaller chiplets help overcome potential issues with larger dies, or does this now open cooperation with Samsung given that the chip sizes are more along the lines of what they are used to?
MP: First off, the march for high performance has brought us to Zen 2 and the ability to leverage multiple technology nodes. What we’re showing with Rome is a solution with two foundries with two different technology nodes. It gives you an idea of the flexibility in our supply chain that we’ve built in, and gives you an explicit example of how we can work with different partners to achieve a unified product goal. On the topic of Samsung, we know Samsung very well and have done work with them.
As part of this Agreement, GLOBALFOUNDRIES will gain substantial intellectual property including thousands of patents, making GLOBALFOUNDRIES the holder of one of the largest semiconductor patent portfolios in the world.
I didn't really read the whole text thoroughly, but it doesn't seem like IBM reserved the right to use any of the nodes exclusively; rather, it was the other way around, meaning IBM couldn't have used any other fabs for the next 10 years. Since GloFo canceled both 10nm (10HP?) and 7nm, that's no longer the case.

GLOBALFOUNDRIES also will benefit from an influx of one of the best technical teams in the semiconductor industry, which will solidify its path to advanced process geometries at 10nm and below. Additionally, the acquisition opens up business opportunities in industry-leading radio frequency (RF) and specialty technologies and ASIC design capabilities.
The ring bus is actually far from uniform (which is why Intel switched to a mesh with Skylake-X/SP); on Skylake the latency already increases from 34ns at best to 85ns at worst, and that's with up to 18 cores. As far as actual uniform latency goes, the 4-core CCX is about the best possible approach there is; everything beyond that is anything but uniform.
True. But AMD's current CCX implementation is really only uniform within 4 cores with a reasonable number of direct links between them, and while Infinity Fabric is meant to scale up well, there are latency issues, especially related to AMD's MCM design and IFOS. Also, the cross-CCX latency within each Zeppelin die isn't that good either. But it's true that Intel's ring bus only scales up to a certain number of nodes before there are performance issues, and for the big core-count parts the mesh topology is more predictable. There are pros and cons for all designs and topologies, but Intel's biggest advantage is overall lower latencies, and those are not strictly related to topologies; implementation quite likely matters much more. As many have pointed out, AMD doesn't really like to use a ring bus, so some other topology is more likely. But overall, I agree with you.
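For a rough feel of why the ring stops being "uniform" as it grows, here is a tiny sketch using hop counts as a crude proxy for latency (the real Skylake numbers quoted above depend on far more than hop count; the 18-stop case only loosely mirrors the 18-core part, and the fully connected baseline stands in for a 4-core CCX with direct links):

```python
# Worst-case and average hop counts on a bidirectional ring of n stops,
# compared with a fully connected cluster where every pair is 1 hop apart.

def ring_worst(n):
    # Farthest stop is halfway around the ring.
    return n // 2

def ring_avg(n):
    # Sum of shortest ring distances from one node to the other n-1 nodes.
    total = (n * n) // 4 if n % 2 == 0 else (n * n - 1) // 4
    return total / (n - 1)

for n in (4, 8, 12, 18):
    print(f"{n:2d}-stop ring: worst {ring_worst(n)} hops, "
          f"average {ring_avg(n):.2f} hops (fully connected: always 1)")
```

Under these assumptions the worst case grows linearly with the stop count (9 hops at 18 stops versus 2 at 4 stops), which is the "only up to a certain number of nodes" point made above.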
(Jeez, this updated forum will take time getting used to, so uncomfortable...)
IPC = Instructions Per Clock.

2.5GHz, curiously that's about 13.6% over an Epyc 7601, as much as the alleged IPC improvement stated by BitsChips.