Intel "Mesh" vs Intel "Ring"

csbin · Jun 19, 2017

You need to put a description of the image in the post. This is an official warning
Markfw
AT Moderator

Edrick · Jun 19, 2017

Sacrifice latency for more bandwidth. Probably more important in the server level tasks these chips were designed for.

krumme · Jun 19, 2017

Edrick said:
Sacrifice latency for more bandwidth. Probably more important in the server level tasks these chips were designed for.

Yeah. We have seen that before.

Ferzerp · Jun 19, 2017

You're forgetting something a ring bus necessitates: latency increases linearly with each added node. If you tried to scale the ring up to those core counts, the ring latency would be greater, so in fact no, they haven't "sacrificed latency for more bandwidth". It's actually lower latency than a ring would have at that size.

NTMBK · Jun 19, 2017

You're looking at too low a core count. The ring bus was killing them in the 24-core die last time; they had to use two rings tied together with buffered queues.

The mesh is designed to scale much better to those core counts. The mesh latency should scale as sqrt(n), instead of linearly.
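
To put rough numbers on the linear vs sqrt(n) point, here is a minimal Python sketch that counts average hops between stops on a bidirectional ring and on a rectangular 2D mesh. Hop count is only a stand-in for latency, and the function names `ring_avg_hops` and `mesh_avg_hops` are made up for illustration; real chips add per-hop delay, buffering and clock differences that this ignores.

```python
from itertools import product


def ring_avg_hops(n):
    """Average shortest-path hops between distinct stops on a bidirectional ring."""
    # From any stop, the stop d positions away is min(d, n - d) hops away.
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)


def mesh_avg_hops(rows, cols):
    """Average Manhattan-distance hops between distinct stops on a rows x cols mesh."""
    nodes = list(product(range(rows), range(cols)))
    total = 0
    pairs = 0
    for (r1, c1), (r2, c2) in product(nodes, nodes):
        if (r1, c1) != (r2, c2):
            total += abs(r1 - r2) + abs(c1 - c2)
            pairs += 1
    return total / pairs


# Compare a ring and a square mesh with the same number of stops.
for side in (2, 4, 6, 8):
    n = side * side
    print(f"{n:3d} stops: ring {ring_avg_hops(n):5.2f} hops, "
          f"mesh {mesh_avg_hops(side, side):5.2f} hops")
```

At 4 stops the two come out the same, but from 16 stops up the ring averages more hops than the square mesh and the gap widens with the count. That matches the scaling argument, though it says nothing about absolute nanoseconds.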

I'm curious to see what the next Xeon D uses. My guess is that it will stick with the "consumer" cores with no AVX512 and use a ring bus.

moinmoin · Jun 19, 2017

The change from ring to mesh is absolutely a necessity as the number of cores increases. What's concerning is the huge increase in die size that comes along with it. Or is that down to AVX512?

majord · Jun 19, 2017

What is said die size?

jpiniero · Jun 19, 2017

That's the weird thing about the mesh. I was thinking the latency would be lower, not higher, since it could get to any other core in fewer hops.

This is a mess.

moinmoin · Jun 19, 2017

majord said:
What is said die size?

Was thinking of this post:
https://forums.anandtech.com/thread...-19-9-00-am-et.2428363/page-496#post-38943496

Hans de Vries said:
jpiniero said:

BTW, the die size has also grown quite a bit.

Broadwell-E 10C die is like 240; Skylake-X 10C die is around 325.

By my totally reliable estimations from looking at the die picture, the core size (incl. L3) is about 17 mm². Kaby Lake is by the same estimation 12.2 mm².

It has very significantly grown to 17.0 mm² indeed.
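
For a sense of scale, the figures quoted above can be run through some quick arithmetic. This is only a sketch: the thread gives the two die sizes without units, so mm² is an assumption here, the per-core figures are the eyeballed die-shot estimates from the quote, and the variable names are made up for illustration.

```python
# Figures as quoted in the thread; mm^2 is assumed for the two die sizes.
broadwell_e_die_mm2 = 240.0
skylake_x_die_mm2 = 325.0
skylake_x_core_mm2 = 17.0   # core incl. L3, estimated from the die picture
kaby_lake_core_mm2 = 12.2   # same estimation method
cores = 10

# Relative growth of the whole die and of a single core.
die_growth = skylake_x_die_mm2 / broadwell_e_die_mm2 - 1
core_growth = skylake_x_core_mm2 / kaby_lake_core_mm2 - 1

# Area taken by the ten cores and their share of the Skylake-X die.
core_area = cores * skylake_x_core_mm2
core_share = core_area / skylake_x_die_mm2

print(f"die growth:  {die_growth:.1%}")
print(f"core growth: {core_growth:.1%}")
print(f"core area:   {core_area:.0f} mm^2 ({core_share:.1%} of the die)")
```

On these estimates the core grew somewhat more than the die as a whole, and the ten cores account for roughly half of the die area.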

lolfail9001 · Jun 19, 2017

moinmoin said:
Or is that down to AVX512?

Down to AVX512 probably, the register file alone must have increased fourfold or something.

Excessi0n · Jun 19, 2017

lolfail9001 said:
Down to AVX512 probably, the register file alone must have increased fourfold or something.

According to the AnandTech review, the register file takes up more space than an entire Atom core...

T1beriu · Jun 20, 2017

Source: PCPer

nathanddrews · Jun 20, 2017

Also from PCPer, for reference:

Atari2600 · Jun 20, 2017

Hopefully adds some context for the clowns yapping on about Zen, IF and latency.

Shivansps · Jun 20, 2017

Atari2600 said:
Hopefully adds some context for the clowns yapping on about Zen, IF and latency.

The clowns were the ones ignoring or dismissing the issue.

wildhorse2k · Jun 20, 2017

Actually the Ryzen latencies make more sense in the long run. You want very low latencies for low thread count applications and applications needing more threads need to optimize. Giving bad latency to all cores is a bad idea.

The problem with Ryzen is that perhaps the cross CCX latencies are way too high. We will see how it goes for Threadripper, but cross die latencies will probably be even higher than 140ns.

mtcn77 · Jun 20, 2017

wildhorse2k said:
Actually the Ryzen latencies make more sense in the long run. You want very low latencies for low thread count applications and applications needing more threads need to optimize. Giving bad latency to all cores is a bad idea.

The problem with Ryzen is that perhaps the cross CCX latencies are way too high. We will see how it goes for Threadripper, but cross die latencies will probably be even higher than 140ns.

You purchase low latency ram - problem solved.

moinmoin · Jun 20, 2017

mtcn77 said:
You purchase low latency ram - problem solved.

On that matter for Intel, we know that the ring bus didn't improve as much with higher speed/lower latency ram. Do we already have numbers on how the mesh behaves with higher speed/lower latency ram?

wildhorse2k · Jun 20, 2017

moinmoin said:
On that matter for Intel, we know that the ring bus didn't improve much with higher speed/lower latency ram. Do we already have numbers on how the mesh behaves with higher speed/lower latency ram?

The PCPer review shows it. With DDR4-2400 you get about 105ns latency. With DDR4-2800 and the uncore at 2800 (stock uncore is 2.4GHz) you get about 95ns. I saw a comment from der8auer on YouTube that you can run the uncore up to 3-3.2GHz. According to him the highest DDR4 frequency is 3600-4000. With that you could get down to perhaps 85-90ns (absolute best case). But TDP will very likely increase.

It would be unfair to only point out the negatives. According to http://www.tomshardware.com/reviews/intel-core-i9-7900x-skylake-x,5092-3.html the cache has superior multi-threaded throughput.

On Ryzen, using DDR4-3200 reportedly drops the inter-CCX latency to about 105ns, which isn't too bad (not shown in the review).

lolfail9001 · Jun 20, 2017

moinmoin said:
On that matter for Intel, we know that the ring bus didn't improve as much with higher speed/lower latency ram. Do we already have numbers on how the mesh behaves with higher speed/lower latency ram?

Uncore is not really tied to memory clock on Intel CPUs.

Atari2600 · Jun 21, 2017

Shivansps said:
The clowns were the ones ignoring or dismissing the issue.

You'd think that now that the two main x86 design houses have decided to trade off latency for improved bandwidth or scalability, folks would accept that maybe latency isn't as important as they think.

Obviously not.

Well - I guess we'll know when benchmarks come out of SKL-X on games (since a high proportion of clowns use these exclusively as reflective of "performance") as to how important latency is when the software, compiler and scheduler are optimised to keep threads on the same cores where possible and prefetch data where possible.
Oh actually no. What will happen is the same clowns will pick out inappropriate results from inappropriate code and use it to justify their pre-conceived idiocy.

itsmydamnation · Jun 21, 2017

wildhorse2k said:
Actually the Ryzen latencies make more sense in the long run. You want very low latencies for low thread count applications and applications needing more threads need to optimize. Giving bad latency to all cores is a bad idea.

The problem with Ryzen is that perhaps the cross CCX latencies are way too high. We will see how it goes for Threadripper, but cross die latencies will probably be even higher than 140ns.

I think it'll be about 160ns to go inter-die on their test. The big bit of the latency number will be cache coherency, which shouldn't grow compared to a single Zeppelin.

ub4ty · Jun 22, 2017

itsmydamnation said:
I think it'll be about 160ns to go inter-die on their test. The big bit of the latency number will be cache coherency, which shouldn't grow compared to a single Zeppelin.

itsmydamnation · Jun 22, 2017

ub4ty said:

There is a paper that I have a copy of (I don't know where I got it) that has Bulldozer's NUMA latencies. I'm basing my guesses off that. For BD, going intra-package was 41ns. Going one socket hop was only an extra 7ns on top of that. But its interconnects were much slower and its internal system topology is significantly worse.

But even in Ryzen, when going inter-CCX the cache directory will be checked to see if the destination memory address is in the other CCX or if the fetch goes to main memory, so that time shouldn't grow much: it's just the physical transmit latency added when querying a remote cache directory/memory controller, plus the extra physical latency for the returned data/result.

Thus my guess of an extra 20ns.

beginner99 · Jun 23, 2017

Atari2600 said:
Well - I guess we'll know when benchmarks come out of SKL-X on games (since a high proportion of clowns use these exclusively as reflective of "performance") as to how important latency is when the software, compiler and scheduler are optimised to keep threads on the same cores where possible and prefetch data where possible.
Oh actually no. What will happen is the same clowns will pick out inappropriate results from inappropriate code and use it to justify their pre-conceived idiocy.

We will see how latency and mesh vs ring bus affect gaming when Skylake-X and Coffee Lake gaming reviews are here. The 6-core Coffee Lake can then be directly compared to the 7800X, and my bet is Coffee Lake will be the clear winner (even at same clocks).
