Ryzen, Skylake, and everything that's coming next. (MEGA discussion thread)


ub4ty

Senior member
Jun 21, 2017
749
898
96
How can you compare the two as if they're even close to being the same? IF is way beyond what EMIB attempts to do, which is simply to connect two dies.

I can see someone arguing that EMIB is better than silicon interposers, even though EMIB really uses tiny interposers, but comparing EMIB favorably to IF means you haven't a clue about the deep level of integration IF allows.


Exactly. Because Infinity Fabric really is the interconnect to enable HSA. This is a big game changer and the future of computing. Thus, if you aren't setting up such an interconnect, with the capability to expose it to the outside world, you're dead in the water as far as future computing is concerned. Exposure of the CPU to these kinds of interconnects is already on the rise:
[Image: NVIDIA NVLink 2.0 IBM slide]

It's heading toward CPUs connecting directly to accelerators over data fabrics, not buses.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Yeah. I guess the end result of connecting multiple dies is the most important part of it, though.

Nope, the bigger picture, and the more important one, is HSA, whereby you need a fabric-like connection exposed to the outside world beyond your CPU's dies but tightly integrated with them. Nothing short of doing so will be able to compete on throughput, latency, or performance in the near future of computer architectures. You're either doing something like this and have it on your roadmap, or you're going to get BTFO. There's a reason for Nvidia's push of NVLink. They at least smell what's coming.
 
  • Like
Reactions: stockolicious

ub4ty

Senior member
Jun 21, 2017
749
898
96
No, the EV6 bus wasn't a normal FSB. It was a point-to-point connection. DDR signaling is standard on just about every interconnect, from processor interconnection networks like QPI/HT to networking physical layers like Ethernet. EV6 supports one processor die per bus, connected to a chipset; it doesn't allow two CPU dies to share it. And if you consider HT to be an FSB, then IF is an FSB. HT was used to connect both multiple sockets and multiple dies within a socket. In fact, the network map between 8 Zen dies and 8 Opteron dies looks pretty much the same.
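That "looks pretty much the same" point can be made concrete with a tiny hop-count check over the usual 2-socket, 4-dies-per-socket map (full mesh within a socket, one cross-socket link per die). The topology below is my assumption of how those diagrams are commonly drawn, not an official AMD map:

```python
from collections import deque

def hops(adj, src, dst):
    """Breadth-first search: minimum number of links from src to dst."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

adj = {i: set() for i in range(8)}
for socket in ({0, 1, 2, 3}, {4, 5, 6, 7}):   # full mesh inside a socket
    for a in socket:
        adj[a] |= socket - {a}
for i in range(4):                            # one cross-socket link per die
    adj[i].add(i + 4)
    adj[i + 4].add(i)

# Any die reaches any other in at most 2 hops in this assumed map.
assert max(hops(adj, a, b) for a in range(8) for b in range(8)) == 2
```

The same code models either an 8-die Opteron system or a 2-socket Zen system; only the adjacency sets would change.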



No, that isn't what I'm saying, NOR is it what you said earlier. We've had interfaces that allow cache sharing between CPUs on different sockets for longer than you've likely been alive. This was in fact true for K7, K8, and BD; all allowed cache sharing between CPUs. IBM mainframes have literally allowed it for decades. That is literally the definition of a coherent multiprocessor, which, as I said, has likely existed longer than you've been alive. THAT has nothing to do with what you are claiming makes IF special. Unless you want to say that IF doesn't do anything that wasn't already done back in the '60s.
What's different here is that you have no detailing of how IF is implemented and thus know nothing about its limitations, what runs over it, what can run over it, and what can be addressed over it. I'll tell you this right now from how fabrics are used in certain industries: you can run anything over one as long as you encap/decap what you put on it and as long as you have a pathway to the intended endpoint. It's just a data conduit with addressing. So, if implemented correctly, it is the conduit for full-blown HSA. You can see clearly how it's set up for that. You hang your encap/decap modules off IF, or a forwarding module, and you can implement whatever protocol you want and send whatever you want over it.
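The encap/decap idea, the fabric as an addressed conduit, can be sketched in a few lines. The frame layout below is invented for illustration, since AMD has not published IF's actual packet format:

```python
import struct

# Hypothetical fabric frame: source/destination endpoint IDs, a
# protocol tag identifying the encapsulated traffic, then the payload.
HEADER = struct.Struct(">HHBx")  # src, dst, protocol tag, 1 pad byte

def encap(src: int, dst: int, proto: int, payload: bytes) -> bytes:
    """Wrap an arbitrary payload in an addressed fabric frame."""
    return HEADER.pack(src, dst, proto) + payload

def decap(frame: bytes):
    """Recover addressing, protocol tag, and payload at the endpoint."""
    src, dst, proto = HEADER.unpack(frame[:HEADER.size])
    return src, dst, proto, frame[HEADER.size:]

# Anything can ride the conduit as long as both ends agree on the tag:
frame = encap(src=0x01, dst=0x0A, proto=7, payload=b"cache-probe")
assert decap(frame) == (0x01, 0x0A, 7, b"cache-probe")
```

The fabric itself only needs the addressing; the protocol tag is what lets coherency traffic, DMA, or anything else share the same conduit.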

Yes, this has been done before. It is used in tons of industries for various reasons. Intel has a version of it. It was fundamentally fleshed out in the mainframe era. What you're not pointing out is that no one brought this tech to the desktop computer. It all existed in HPC and enterprise solutions. Yeah, Intel has QPI, and they don't expose it to the outside world in desktop processors. They only package it and expose it, to varying degrees, on insanely priced high-end Xeons, and in very tightly controlled ways.

So, AMD has decided to implement a similar technology, use it throughout their interconnects, and eventually expose and open it up to any third party who wants to tie into it and meet the protocol standards. This is a game changer. Intel could have done this themselves. They chose not to, in order to protect their enterprise profits. So, kudos to AMD for changing the game and bringing enterprise tech to the desktop. Everyone else could have but didn't. So, no one wants to hear about the tech they sat on for all these years, when they could have been pushing the envelope of desktop performance. This is what happens to companies who sit on technology and progress to protect their existing business model: someone comes by and cuts you off at the knees.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
I find this kind of annoying (just like the Ryzen ECC crusade). IF actually has nothing to do with interfaces; IF is interface-agnostic. IF is a data/control-plane specification. But when people say IF, they are talking about the entire methodology of how AMD is building the interconnects between major blocks within SoCs, between SoCs on the same package, and between packages. Now, what really annoys me about your half-truths above is that, as part of that package of technologies, they have different interface and physical implementations for those different interconnects.

The on-SoC level has an interconnect.
The on-package level has the GMI interconnect (appears to be custom 16-bit interfaces); according to The Stilt, these run at 2x the clock rate of GMIx (could be up to 25GHz).
The inter-package level has GMIx, which runs on the DesignWare Enterprise 12G PHY; that PHY can run PCI Express 3.1, 40GBASE-KR4, 10GBASE-KR, 10GBASE-KX4, 1000BASE-KX, 40GBASE-CR4, 100GBASE-CR10, XFI, SFI (SFF-8431), QSGMII, and SGMII.

Now, what AMD is doing that is completely different from Intel is using these common interconnects to connect major building blocks, across a very broad range of them, all with a common data-plane/control, (probably) packet-based transport scheme:

Vega will use GMI to go on-package with a CPU.
Vega uses IF for on-chip interconnects.
Navi will probably use IF for on-package GPU cross-connect.
Navi will probably use IF for on-chip interconnects.
Zeppelin uses it on-chip, on-package, and inter-package.
Any semi-custom SoCs will use it.

That's the differentiation from what has come before (HyperTransport/QPI, etc.).
The other thing to recognize is that even between packages, the extra latency is dominated by cache coherency, not by encode/transmit/receive/decode. Zeppelin is already paying that probe/directory-lookup price between the two CCXs, so it's likely that the only increase in latency will be the physical hop, putting us at something like 15ns. Intel's latency grows with every core added. When the 6-core-CCX, 48-core server parts come out, intra-CCX latency will probably go up a little, and every other latency will stay the same (all things being equal).
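The layering described above (one packet-based transport reused over interchangeable physical links) can be sketched like this; the class names are invented for illustration, since AMD has not published these interfaces:

```python
from abc import ABC, abstractmethod

class Phy(ABC):
    """A physical link; the transport above it never sees which one."""
    @abstractmethod
    def send(self, bits: bytes) -> bytes: ...

class OnDiePhy(Phy):      # SoC-internal wiring
    def send(self, bits): return bits

class GmiPhy(Phy):        # on-package die-to-die link
    def send(self, bits): return bits

class GmixPhy(Phy):       # inter-package link over the 12G SerDes PHY
    def send(self, bits): return bits

class FabricTransport:
    """Same packet format regardless of which PHY carries it."""
    def __init__(self, phy: Phy):
        self.phy = phy
    def transfer(self, packet: bytes) -> bytes:
        return self.phy.send(packet)

# The transport neither knows nor cares which link it rides on:
for phy in (OnDiePhy(), GmiPhy(), GmixPhy()):
    assert FabricTransport(phy).transfer(b"probe") == b"probe"
```

That separation is exactly why the same blocks can be reused on-chip, on-package, and between packages: only the PHY swaps out.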

The on-SoC level has an interconnect.
The on-package level has the GMI interconnect (appears to be custom 16-bit interfaces); according to The Stilt, these run at 2x the clock rate of GMIx (could be up to 25GHz).
The inter-package level has GMIx, which runs on the DesignWare Enterprise 12G PHY; that PHY can run PCI Express 3.1, 40GBASE-KR4, 10GBASE-KR, 10GBASE-KX4, 1000BASE-KX, 40GBASE-CR4, 100GBASE-CR10, XFI, SFI (SFF-8431), QSGMII, and SGMII.
You seem to have some chops and know what you're talking about. I can no doubt assume why and where your chops are centered.

I see similar commentary here :
https://forum.beyond3d.com/threads/amd-ryzen-cpu-architecture-for-2017.56187/page-105

Give a quick summary to the community here of what it's possible to push over this building block, how insanely it can be scaled, and how much can be connected over it. They need to 'see' what this can be used to enable, and how groundbreaking what AMD is doing is, especially for the consumer PC market.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Except that what's come before HAS either done that or can do that. Using a protocol layer on a different physical layer: been done. Using a transport layer on multiple different physical layers: been done. It appears that ignorance of what has already been done is causing people to think that IF is something completely new and revolutionary, but it isn't. It is an interface specification that, like literally every interface specification, has multiple layers with multiple hand-off points.



No, going off-chip involves significantly more than enc/xmit/rcv/dec. As far as on-chip latencies go, CCX also isn't revolutionary; in fact, multiple vendors, including Intel, ship or have shipped clustered-core cache-coherent designs. They have advantages and disadvantages, just like everything else.

PHY: addressing, Rx + Tx roundtrip latency (including clock crossings and playout buffer): ~89.6 ns
[Image: CCX latency plot at DDR4-2133]



Off-chip adds additional latency, but it's far lower than PCIe's. Thus, it can be a game changer for HSA/accelerator tech, as it already is in the enterprise.

https://www.nextplatform.com/2016/10/12/raising-standard-storage-memory-fabrics/
AMD, like madmen, are bringing HPC/enterprise-level microarchitectures to the desktop, and Intel and others are going to get BTFO if they don't get on board and stop making their comparable tech available only on $8,000 Xeon chips. This is why AMD is a game changer in doing what they're doing. They're giving you tech that was locked away in HPC systems. Aspects of the actual compute die are starting to become commoditized.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
22,901
12,967
136
So you think that y-cruncher might be the best tool to stress-test Zen at the moment?

For short-term testing, yes. I only use the pi tests, though the other functions in y-cruncher do have their uses. It's become very popular with some folks over at OCN, for what it's worth.

It all depends on what it is you're really trying to test. I like y-cruncher because I know it will push my CPU's limits wrt SIMD ISAs. There isn't anything much more demanding than it. Now I normally do not use anything but the binary selected by the main executable, so for Summit Ridge, I run the ADX binary all the time, though it is instructive to note that the SSE4.1 binary produces more heat and power draw. So if you're testing a cooling solution for Summit Ridge, I haven't seen anything more brutal than that!
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Locked away from consumers are accelerator/HSA architectures:
[Images: IBM POWER9 Hot Chips slides: acceleration, accelerator bandwidth, die block diagram, bandwidth]


https://en.wikipedia.org/wiki/POWER9

AMD is bringing the juice to the desktop computing platform via Infinity Fabric -> HSA.
This is a game changer. Others haven't done it because it's their enterprise cash cow.

AMD to Desktop consumers : Welcome to the big boy table
 
  • Like
Reactions: Space Tyrant

DrMrLordX

Lifer
Apr 27, 2000
22,901
12,967
136
Nvidia hasn't done it because they can't get anyone with a significant desktop presence to integrate NVLink into their platforms. Of course they could have opened up the standard to PCI-SIG but they didn't. So it stays on POWER platforms, and that's about it.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Nvidia hasn't done it because they can't get anyone with a significant desktop presence to integrate NVLink into their platforms. Of course they could have opened up the standard to PCI-SIG but they didn't. So it stays on POWER platforms, and that's about it.

It's a significant performance technology that they've chosen to restrict to the premium products that bring in more enterprise revenue.

https://www.pcworld.idg.com.au/arti...ro-gp100-gpu-brings-nvlink-windows-computers/

[Image: Quadro GP100 NVLink bridge]


That being said, while remaining fair/honest, it remains to be seen how AMD will externally expose Infinity Fabric to enable their pronounced accelerator/HSA vision. And someone correct me if I'm wrong, but they can't magically enable it on Ryzen/Epyc/Threadripper (TBD) in its current released form, as there is no physical access... So IF usage as a way of enabling HSA will come in future chips/mobos (possibly). Or can AMD pull a rabbit out of the hat and do something wild with ROCm that enables this over PCIe, as they allude to in marketing slides:
[Image: AMD Radeon Instinct Vega slide, via VideoCardz]
 
Last edited:

stockolicious

Member
Jun 5, 2017
80
59
61
Yeah, EMIB. Which is (on paper at least) much better than IF.

Yields at 14 nm are probably so good at this point that it's not that big of a problem. But at 10 (and especially 7) it would be.

JP, if EMIB was all that and a bag of chips, INTC would have been eager to use it and keep the same crazy pricing. It looks like INTC gave up on the MCM approach and really figured AMD was going away anyway; they were looking at other companies as their real competition. AMD took a Hail Mary pass on MCM and seems to have connected for a big win here.
 

stockolicious

Member
Jun 5, 2017
80
59
61
Did many really know this was coming? I remember seeing a lot of negative posts, articles, etc., even after early test results were revealed, and I can't remember reading much at all about the remarkable scaling ability of IF [a huge advantage].

Intel appears to have paid the price of not having had to fight for too long. You see that in all aspects of life: business, the military, sports, etc. The wrong people [who tend to be politically very astute] get put in charge, as even an average leader will still look good. Competitors reorganize, return stronger, and pummel the previous champ. This X platform release is an indication of the chaos happening right now. The next couple of years will be very interesting.

Totally agree, and if you have 3 minutes you will see exactly what is happening to INTC; this dynamic could end up being a business case in college.
https://www.youtube.com/watch?v=_1rXqD6M614
 

Space Tyrant

Member
Feb 14, 2017
149
115
116
For short-term testing, yes. I only use the pi tests, though the other functions in y-cruncher do have their uses. It's become very popular with some folks over at OCN, for what it's worth.

It all depends on what it is you're really trying to test. I like y-cruncher because I know it will push my CPU's limits wrt SIMD ISAs. There isn't anything much more demanding than it...
After reading your post up-thread, I revalidated my various configurations with y-cruncher bench 1b. Five (or fewer) chained executions of '1b' broke them all, so I had to raise the voltages on the manual configs and lower the frequency on my P-state config.
 

stockolicious

Member
Jun 5, 2017
80
59
61
No, I haven't seen anything about IF that is in any way really different from any other interconnection network, and I've worked with and designed multiple CPU interconnection networks and studied them for literally decades. Beyond that, there are no actual details available about IF that are the least bit interesting, just marketing slides. We still don't know internal chip details; we still don't know the coherency protocol; nor do we know actual speeds, feeds, and signaling. There just isn't enough detail to even get to the interesting parts of an interconnection network: what are the packet formats, what are the outstanding-request limitations, what are the routing abilities, what are the simple and complex protocol flows, etc.

"No, I haven't seen anything about IF that is in any way really different than any other interconnection network"

One difference is that AMD is using one, and they are getting really good yields. From a business-model perspective, Intel will have to drop what they are doing "soon" and go with a similar approach, or they will really get creamed.
 

jpiniero

Lifer
Oct 1, 2010
16,799
7,249
136
JP, if EMIB was all that and a bag of chips, INTC would have been eager to use it and keep the same crazy pricing.

Who's to say they won't? The best part is that with the Lego-esque approach they can crank the segmentation machine up to 11.
 

DrMrLordX

Lifer
Apr 27, 2000
22,901
12,967
136
After reading your post up-thread, I revalidated my various configurations with y-cruncher bench 1b. Five (or fewer) chained executions of '1b' broke them all, so I had to raise the voltages on the manual configs and lower the frequency on my P-state config.

Oof. Yeah, it's a stinker. Another thing you can do with razor's-edge setups is to try running the shorter pi tests and work up to the longer ones. The longer tests use more of your memory. So if you're on 16GB of RAM, for example, you want the 2.5g test all in memory. The program is designed to use storage arrays if you run out of memory space, though you need relatively quick arrays to run it at any appreciable speed.

In any case, using more of your RAM tests RAM and IMC stability. Selecting a longer test keeps the CPU working longer, so it stays under heat/stress continuously for that period, making it more likely to crash. All good fun!
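The chained-run technique mentioned up-thread is easy to automate: launch the same benchmark repeatedly and stop at the first failure. A minimal sketch; the demo command is a trivial stand-in, since the real y-cruncher binary name and flags vary by version and platform:

```python
import subprocess, sys

def chained_runs(cmd, n=5):
    """Run `cmd` up to n times; return the 1-based index of the first
    failing run, or None if every run exits cleanly."""
    for i in range(1, n + 1):
        if subprocess.run(cmd).returncode != 0:
            return i
    return None

# Demo with a trivial stand-in command; substitute your actual
# y-cruncher pi-test invocation here.
assert chained_runs([sys.executable, "-c", "pass"], n=3) is None
```

A crash or non-zero exit partway through the chain tells you roughly how marginal the overclock is: failing on run 1 is a different animal from failing on run 5.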
 

wildhorse2k

Member
May 12, 2017
180
83
71
What are the actual problems with Threadripper (X399)? I haven't heard of any.

I used the future tense in my post. Since it's the same thing multiplied, the problems will be the same: VMware ESXi, the GCC compilation bug (caused by the uop cache, apparently). Fast-memory compatibility isn't too great. Even more NUMA, due to having to access remote memory/PCIe through two crossbars instead of just one. People pay Intel to avoid those kinds of problems.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,730
136
I used the future tense in my post. Since it's the same thing multiplied, the problems will be the same: VMware ESXi, the GCC compilation bug (caused by the uop cache, apparently). Fast-memory compatibility isn't too great. Even more NUMA, due to having to access remote memory/PCIe through two crossbars instead of just one. People pay Intel to avoid those kinds of problems.
I'm regularly surprised by this implicit assumption that NUMA issues are somehow unworkable and there's no way of getting around the problems they present, which completely ignores the progress that has been made over the decades in making programs NUMA-aware, especially with respect to HPC applications.
And what is lacking about fast memory on TR? Board specifications across manufacturers mention 3600MHz as the supported maximum. This is AMD's first integrated DDR4 IMC, and they already get near-theoretical bandwidth. Those 4000MHz+ speeds on X299 are only achievable with Kaby Lake-X.
 

wildhorse2k

Member
May 12, 2017
180
83
71
I'm regularly surprised by this implicit assumption that NUMA issues are somehow unworkable and there's no way of getting around the problems presented by it, which completely ignores the progress that has been made over the decades in making programs NUMA-aware, especially with respect to HPC applications.

And what is lacking about fast memory on TR? Board specifications across manufacturers mention 3600MHz as being supported as a maximum. This is AMDs first integrated IMC for DDR4, and they already get near-theoretical bandwidth. Those 4000MHz+ speeds on X299 are only achievable with Kaby Lake-X.

NUMA issues are not unworkable, but if you have the budget to buy a more expensive product to avoid them, then why not?

I know 3600MHz works on Ryzen in a 2x8GB configuration with good memory + a good board + some fiddling + luck. But if you have the budget to buy a CPU with an IMC that supports faster memory than Ryzen's, then why not? I'm considering 4x16GB or 8x8GB, which I highly doubt will work on Threadripper at 3600MHz with good latencies.
 

scannall

Golden Member
Jan 1, 2012
1,960
1,678
136
NUMA issues are not unworkable, but if you have the budget to buy a more expensive product to avoid them, then why not?

I know 3600MHz works on Ryzen in a 2x8GB configuration with good memory + a good board + some fiddling + luck. But if you have the budget to buy a CPU with an IMC that supports faster memory than Ryzen's, then why not? I'm considering 4x16GB or 8x8GB, which I highly doubt will work on Threadripper at 3600MHz with good latencies.
It would appear that your budget is pretty far away from the norm for the majority of people putting together their own system. Nothing wrong with that, but it's not really a valid consideration for most people.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,730
136
NUMA issues are not unworkable, but if you have the budget to buy a more expensive product to avoid them, then why not?

I know 3600MHz works on Ryzen in a 2x8GB configuration with good memory + a good board + some fiddling + luck. But if you have the budget to buy a CPU with an IMC that supports faster memory than Ryzen's, then why not? I'm considering 4x16GB or 8x8GB, which I highly doubt will work on Threadripper at 3600MHz with good latencies.
So more money gives you a monolithic chip, but it doesn't mean you don't have to care about core topology. Making programs aware of the topology of the architecture is one of the biggest challenges in getting the most multi-threading efficiency out of them. Why else do you think Broadwell-E Xeons have Cluster-on-Die and Skylake-SP has Sub-NUMA Clustering, which Intel themselves recommend you use to get the best latencies? Contrary to what you might think, depending on the application there's a better chance that you'll get better parallel efficiency out of a 16-core TR than an 18-core Skylake-X, because desktop Skylake-X doesn't have these features to split the cores into two NUMA-node groups.

Most consumer applications, even the things these high-core-count platforms are being targeted at, aren't topology-aware. Whether it's NUMA on TR or the mesh on Skylake-X, you'll eventually run into the same problems as you keep increasing the core count, if multi-threaded performance is your primary concern.
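Topology-awareness in practice often starts with placement: keep a worker on one node's cores so its memory stays local. A minimal Linux-only sketch using only the standard library; the two-node core map below is an assumption for illustration, the real one comes from /sys/devices/system/node/:

```python
import os

# Assumed two-node core map for illustration; on a real machine read
# /sys/devices/system/node/node*/cpulist instead.
NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def pin_to_node(node: int) -> None:
    """Restrict the calling process (and its future threads) to one
    NUMA node's cores (Linux-only os.sched_setaffinity)."""
    os.sched_setaffinity(0, NODE_CPUS[node])

# Typical use: call pin_to_node(0) in a worker before it allocates and
# first-touches its buffers, so those pages land on node 0's memory.
```

Libraries like numactl/libnuma do the same job with more control (memory policy, interleaving), but the core idea is just this affinity step.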
 
  • Like
Reactions: Drazick

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
I know 3600MHz works on Ryzen in a 2x8GB configuration with good memory + a good board + some fiddling + luck. But if you have the budget to buy a CPU with an IMC that supports faster memory than Ryzen's, then why not? I'm considering 4x16GB or 8x8GB, which I highly doubt will work on Threadripper at 3600MHz with good latencies.

Even with DDR3, there was a trade-off between low-latency dual-DIMM setups and high-latency, dual-rank, high-capacity quad-DIMM setups. There probably was with DDR2 as well, but I cannot remember.
 

DA CPU WIZARD

Member
Aug 26, 2013
117
7
81
Totally agree and if you have 3 minutes you will see exactly what is happening to INTC - this dynamic could end up being a business case in college.
https://www.youtube.com/watch?v=_1rXqD6M614

Bingo. I posted essentially the same thing in reply to a similar comment on r/hardware a few weeks ago. This is such a powerful video, and I have been reminded of it on a daily basis since I first saw it years ago. It might actually be the most powerful video relating to corporations I have ever seen, and I am not exaggerating.

This is especially relevant for Intel, not only because they let much weaker competition catch up, but because of their reaction once this occurred, and their half-assed efforts to succeed in anything that isn't currently a "cash cow". Don't get me wrong, I am not selling AMD short... They have done great work in closing the gap, but I cannot help but wonder what the market would be like had Intel spent the last decade more aggressively widening the gap. Might Krzanich leave Intel with the same legacy as his predecessor? We will see!
 
Last edited:

coercitiv

Diamond Member
Jan 24, 2014
7,354
17,423
136
NUMA issues are not unworkable, but if you have the budget to buy a more expensive product to avoid them, then why not?
So more money gives you a monolithic chip, but it doesn't mean you don't have to care about core topology. Making programs aware of the topology of the architecture is one of the biggest challenges in getting the most multi-threading efficiency out of them. Why else do you think Broadwell-E Xeons have Cluster-on-Die and Skylake-SP has Sub-NUMA Clustering, which Intel themselves recommend you use to get the best latencies?
The following was posted a while ago by Ian Cutress on Twitter:
Intel released a new optimization manual. To get the best local latency, enable on-die NUMA/clustering. 4 NUMA nodes per CPU. Chapter 8.
Somebody even remarked that it's the same NUMA clustering as EPYC's.
 
  • Like
Reactions: moinmoin