Speculation: The CCX in Zen 2

Vattila · Aug 12, 2017

William Gaatjes said:
Well, Intel is still the biggest player and their architecture is the deciding factor to optimize software for and will remain so for some time.

Ok. I see your argument. By aligning themselves to Intel's architecture, AMD will get better performance on software optimized for Intel. And by going to a larger CCX with a mesh interconnect, similar to Intel's, they will hence benefit. Possibly.

But even if AMD moves to a 6-core or 8-core CCX with a mesh topology, you'd still have cross-CCX latency to deal with. It only changes the partition size a little. To really take your argument to its logical conclusion, AMD should eliminate CCXs altogether and create a monolithic die like Intel. Personally, I think the design trend will be the other way around, i.e. Intel moving to modular designs using their EMIB multi-die interconnect technology.

With regards to the 4-core vs 6-core question, you have to consider the system as a whole and the workloads you care about. Say, a 48-core CPU with 4-core CCXs vs a 48-core CPU with 6-core CCXs. The first system will have faster core-to-core latency in the CCX, although some workloads may suffer a little more cross-CCX latency due to a few more CCXs. To evaluate which is the overall best topology, you would have to test/simulate the systems on a variety of workloads you care about and make a judgement. My bet is on a 4-core CCX.

William Gaatjes said:
Of course direct connections are best, and the big question is of course : Will a workload always fit in that cache. I doubt that.

The workload does not need to fit in the cache to avoid cross-CCX latency. As long as it does not need to access remote memory (connected to another CCX), nor need to snoop another remote cache (e.g. due to shared memory/locks), it will be fine. In other words, as long as it accesses memory connected to the local memory controller only, and does not share memory with cores outside the CCX, it will not suffer cross-CCX latency, as I understand it.

William Gaatjes said:
And besides, the whole issue is that threads are migrated from core to core as several kinds of software is running

That's a good point. For optimal performance, the operating system will need to be NUMA-aware and schedule threads according to the system topology. I am not sure how well (or badly) operating systems do on this today.

William Gaatjes said:
Server cpu can be running multiple virtual machines that run programs with lots of threads. And that is where the inter CCX communication hurts.

On the other hand, if the virtual machine is allocated a single CCX, then you should see no penalty. In fact, it should be as good as it gets, since all cores in the virtual machine are direct-connected. For larger partition sizes, it will depend on the workload, I guess. NUMA-aware and latency-insensitive workloads should run well.

William Gaatjes said:
it does seem that the scalable data fabric is also used to connect the different L3 caches. But that still does not explain the average latency explanation between L3 caches. I cannot understand that yet.

The way I understand it, the interleaving scheme, which I described in my last post, ensures a consistent average latency. For a memory address that hits the local cache slice, the latency is lower, and for addresses that hit the remote slices in the other cores, the latency is higher. However, the interleaving ensures that the latency averages out. And it may increase bandwidth as every core will use all 4 slices of the L3. (This is similar to the UMA mode for Threadripper, in which memory is interleaved across all memory controllers, causing a higher average latency, but also higher bandwidth.)

William Gaatjes · Aug 12, 2017

Vattila said:
Ok. I see your argument. By aligning themselves to Intel's architecture, AMD will get better performance on software optimized for Intel. And by going to a larger CCX with a mesh interconnect, similar to Intel's, they will hence benefit. Possibly.

But even if AMD moves to a 6-core or 8-core CCX with a mesh topology, you'd still have cross-CCX latency to deal with. It only changes the partition size a little. To really take your argument to its logical conclusion, AMD should eliminate CCXs altogether and create a monolithic die like Intel. Personally, I think the design trend will be the other way around, i.e. Intel moving to modular designs using their EMIB multi-die interconnect technology.

Well, as i mentioned in my post you have responded to.

"
A best of both worlds is an 8 core CCX connected with more DF links for higher inter CCX communication and more L3 cache of course.
"
I too think what AMD has done is a good solution. Small dies partitioned as core complexes. But that does not mean that while process nodes advance, that the CCX will always remain at 4 cores.
That makes no sense.
As i have written in my other post before, I think that a 4 core CCX will probably remain here for some time. Especially for consumer products. But the DF between the 4 core CCX will probably be upscaled for more bandwidth because it might fall under "low hanging fruit".
But for server cpus it makes sense to go to more cores for a complex.

With regards to the 4-core vs 6-core question, you have to consider the system as a whole and the workloads you care about. Say, a 48-core CPU with 4-core CCXs vs a 48-core CPU with 6-core CCXs. The first system will have faster core-to-core latency in the CCX, although some workloads may suffer a little more cross-CCX latency due to a few more CCXs. To evaluate which is the overall best topology, you would have to test/simulate the systems on a variety of workloads you care about and make a judgement. My bet is on a 4-core CCX.

I still think AMD will go for a power of 2 which is not 6.
It will either remain 4 or 8.

The workload does not need to fit in the cache to avoid cross-CCX latency. As long as it does not need to access remote memory (connected to another CCX), nor need to snoop another remote cache (e.g. due to shared memory/locks), it will be fine. In other words, as long as it accesses memory connected to the local memory controller only, and does not share memory with cores outside the CCX, it will not suffer cross-CCX latency, as I understand it.

To be honest, Cross CCX latency is about 140 ns and gets down to about 110ns when using high speed DDR of 3200. But main memory access through a memory controller connected to the ccx is also between 80ns and 100ns. That is kind of a moot point.

That's a good point. For optimal performance, the operating system will need to be NUMA-aware and schedule threads according to the system topology. I am not sure how well (or badly) operating systems do on this today.
On the other hand, if the virtual machine is allocated a single CCX, then you should see no penalty. In fact, it should be as good as it gets, since all cores in the virtual machine are direct-connected. For larger partition sizes, it will depend on the workload, I guess. NUMA-aware and latency-insensitive workloads should run well.

The problem is that if you run a virtual machine on a single CCX and have several programs and processes running in that virtual machine, you still have the same problem, not optimal using the cpu because some programs need more threads to perform optimally than others.

The way I understand it, the interleaving scheme, which I described in my last post, ensures a consistent average latency. For a memory address that hits the local cache slice, the latency is lower, and for addresses that hit the remote slices in the other cores, the latency is higher. However, the interleaving ensures that the latency averages out. And it may increase bandwidth as every core will use all 4 slices of the L3. (This is similar to the UMA mode for Threadripper, in which memory is interleaved across all memory controllers, causing a higher average latency, but also higher bandwidth.)

I do not know if it is similar.
I am still reading up on threadripper and how it handles its quad ram connection. I cannot comment on that.

moinmoin · Aug 12, 2017

We should keep in mind that we are speculating based on the very first iteration of the Zen design. No doubt AMD already has a list of potential improvements that they work through for the future iterations. Zen+ is the next iteration, and by all indications Raven Ridge will be based on it. Let's see what kind of changes they introduce there, low hanging fruits and all that.

William Gaatjes said:
And besides, the whole issue is that threads are migrated from core to core as several kinds of software is running and that is where AMD is behind Intel.

But we all were told that the Windows Scheduler is doing nothing wrong with that boneheaded behavior! /s
(Never mind that Windows Server does that in a lower frequency than Windows 10.)

Schmide · Aug 12, 2017

William Gaatjes said:
(snip)

"
A best of both worlds is an 8 core CCX connected with more DF links for higher inter CCX communication and more L3 cache of course.
"
I too think what AMD has done is a good solution. Small dies partitioned as core complexes. But that does not mean that while process nodes advance, that the CCX will always remain at 4 cores.
That makes no sense.
As i have written in my other post before, I think that a 4 core CCX will probably remain here for some time. Especially for consumer products. But the DF between the 4 core CCX will probably be upscaled for more bandwidth because it might fall under "low hanging fruit".
But for server cpus it makes sense to go to more cores for a complex.

I still think AMD will go for a power of 2 which is not 6.
It will either remain 4 or 8.

If a key feature of a CCX is the communication between cores (and L3 partitions) requiring a bi-directional link between each of them, how can it still be called a CCX if by force of number it can no longer maintain this level of communication.

Remember. The number of pathways between n objects is the combination C(n,2).

So for objects

2 = 1
3 = 3
4 = 6
5 = 10
6 = 15
7 = 21 <- edit: was 31
8 = 28
16 = 120

In addition for every way in a cache you must search to see if that tag contains the memory requested. For a 4 ccx group this should be 4x16=64 searches assuming 256B cache line. I would imagine that the logic for these searches are buried in the links and thus anything over 4 would lead to massive complexity.

Vattila · Aug 13, 2017

William Gaatjes said:
But that does not mean that while process nodes advance, that the CCX will always remain at 4 cores. That makes no sense.

On the contrary, I posit that the 4-core CCX is a fundamental building block, which makes sense for the topology I outlined in my original post — i.e. a hierarchy of direct-connected units of (max) 4. To me it seems this scalable quad-tree topology has merit, and it looks to be a fundamental part of what AMD is doing, but feel free to disagree.

tamz_msc · Aug 13, 2017

Draw a graph with six vertices in a 2x3 grid and connect each adjacent pair of vertices with an edge. What is the (edge) distance between the farthest pairs of vertices?

That's why I believe that 6-core CCX is sub-optimal, and @Vattila, your objections are based on this same logic, right?

Vattila · Aug 13, 2017

@tamz_msc, yes, more or less. 4 cores are easy to direct-connect, i.e. link every pair of cores (6 links). Beyond that you have to settle for a sub-optimal topology. Intel now uses a mesh. AMD uses what looks to me to be a hierarchical quad-tree. In this topology, you have (max) 4 direct-connected nodes at each level in the tree.

Vattila · Aug 13, 2017

iBoMbY said:
My guess is, Zen2 will look similar to this:

I see 3 CCXs in there. Nice! What else? Any graphics?

William Gaatjes · Aug 13, 2017

Schmide said:
If a key feature of a CCX is the communication between cores (and L3 partitions) requiring a bi-directional link between each of them, how can it still be called a CCX if by force of number it can no longer maintain this level of communication.

Remember. The number of pathways between n objects is the combination C(n,2).

So for objects

2 = 1
3 = 3
4 = 6
5 = 10
6 = 15
7 = 21 <- edit: was 31
8 = 28
16 = 120

I see your point.
I am going to play advocate of the devil here.
But how has Intel or IBM it been doing all this time with multi core cpus and still have such fast cpus ?

In addition for every way in a cache you must search to see if that tag contains the memory requested. For a 4 ccx group this should be 4x16=64 searches assuming 256B cache line. I would imagine that the logic for these searches are buried in the links and thus anything over 4 would lead to massive complexity.

Well, i would like to answer that with a counter question because i do not know everything about how caches caches and how snooping works.
How are the current caches examined for a memory address ?
Is it done sequentially , one cache by one or is it done in parallel ?
It makes sense to me that it is parallel because the L3 is partitioned and multiported, there can also be searched for 4 tags at once.
Is that the 4 of 4x16 searches ?
Why can that not be expanded to 8x16 in parallel ?
It is in its most basic form not duplication of logic that compares bits and sets a flag when equal.
That flag signals the presence of an address that has been searched for.

Vattila said:
On the contrary, I posit that the 4-core CCX is a fundamental building block, which makes sense for the topology I outlined in my original post — i.e. a hierarchy of direct-connected units of (max) 4. To me it seems this scalable quad-tree topology has merit, and it looks to be a fundamental part of what AMD is doing, but feel free to disagree.

If AMD will continue the road of more 4 core CCX as you described , there will be a point of diminishing returns. At that point there will be too much software optimization needed to handle that configuration while Intel can get more work done without needing software to track all those CCX and to make sure that software is run only on CCX and adjacent CCX to reduce the hops in between CCX as much as possible.
It is a good solution for now since most consumers do not even have more than 4 cores. It will be a while before 8 physical cores is the bare minimum on desktop pc.
More multithreaded software that can actually make use of all those logical cores, that is also still in development.
the consoles are also based on a (2x)4 core jaguar complex. Having a sort of global architectural compatibility and good selling point for possible future contracts.
Also, the way it is now is that AMD can produce lots of small dies that makes the chance on an error much smaller and let them sell more dies.
So, from a business perspective, 4core CCX is perfect.
But it will not stay that way.

Topweasel · Aug 13, 2017

William Gaatjes said:
If AMD will continue the road of more 4 core CCX as you described , there will be a point of diminishing returns. At that point there will be too much software optimization needed to handle that configuration while Intel can get more work done without needing software to track all those CCX and to make sure that software is run only on CCX and adjacent CCX to reduce the hops in between CCX as much as possible.

Well you are wrong on that for a couple reasons. 1. You are right on diminishing returns but not in the way you think. AMD can still minimize hops by having every CCX have a point to point IF. It means more transistors. Where if they move to 3, 4 core CCX solutions for Zen 2. They only need to add one IF connection to each CCX. 2. How do you see Intel beating AMD? Have you looked at the SL-X mesh numbers? I don't know what EMIB is going to be like, but Intels current mesh pretty much the same latency as IF between CCX complexes. Which means its the opposite of what you suggest. People are so focused on the differences in latency, from core to core, to CCX to CCX, to Die to Die, that they miss the obvious. The inter CCX communication is fantastic and sure you want to keep as much inter-CCX as possible. But realistically the CCX to CCX communication isn't that bad. It sub optimum for some tasks but not enough to really worry about. I mean even when looking at early Ryzen benches, in the games it wasn't that bad, a little lower than the IPC implied, but really not much and it was easy to work around when people started patching for Ryzen. 3. Know both of these really were is Intel going to make it up? As it is CFL might be the last of the ring bus CPU's. Their Mesh is on level with CCX to CCX latency. Cept for intel it's core to core. In Latency driven tasks AMD will be ahead? Do they have a CCX like module system in the works? Are those modules going to be more than 4 cores? If they go this route are they going to have quicker communication between modules?

William Gaatjes · Aug 13, 2017

Topweasel said:
Well you are wrong on that for a couple reasons. 1. You are right on diminishing returns but not in the way you think. AMD can still minimize hops by having every CCX have a point to point IF. It means more transistors. Where if they move to 3, 4 core CCX solutions for Zen 2. They only need to add one IF connection to each CCX. 2. How do you see Intel beating AMD? Have you looked at the SL-X mesh numbers? I don't know what EMIB is going to be like, but Intels current mesh pretty much the same latency as IF between CCX complexes. Which means its the opposite of what you suggest. People are so focused on the differences in latency, from core to core, to CCX to CCX, to Die to Die, that they miss the obvious. The inter CCX communication is fantastic and sure you want to keep as much inter-CCX as possible. But realistically the CCX to CCX communication isn't that bad. It sub optimum for some tasks but not enough to really worry about. I mean even when looking at early Ryzen benches, in the games it wasn't that bad, a little lower than the IPC implied, but really not much and it was easy to work around when people started patching for Ryzen. 3. Know both of these really were is Intel going to make it up? As it is CFL might be the last of the ring bus CPU's. Their Mesh is on level with CCX to CCX latency. Cept for intel it's core to core. In Latency driven tasks AMD will be ahead? Do they have a CCX like module system in the works? Are those modules going to be more than 4 cores? If they go this route are they going to have quicker communication between modules?

1.
I also think Zen 2 or zen+ will be a 4 core CCX. But what comes after will be an 8 core CCX for the server business.
The consumer version will remain a 4 core CCX for as long 8 cores is not the minimum standard in the consumer world.
That is my view of the future.

2:
Well, AMD has a new architecture and i agree it is fantastic what they have done, but as it seems they have some room for improvement, as to be cheered by the reviewers as the king of cpu computation.
For Intel , they have a new mesh for core to core communication and it is new, so i am sure they also have room for improvement.
If AMD can improve on iterations of their architecture, why should Intel not be allowed to do the same ?
The mesh is new. There are going to be good times coming for us.

3.
I am wondering about that.
I can safely assume that bandwidth and latency is also with the IF not the same.
And that is where i get confused.
I was thinking with an example:
1byte every nanosecond, is still 1 * 10^9 bytes/second.
but 4 bytes at once every 4 nanoseconds is also 10^9 bytes/second.
If i think to understand IF correctly, the latency is fixed for a single link.
Even for the SDF.
So if more links are added in parallel, the latency stays the same only the bandwidth increases.
4 bytes every 1 nanosecond is 4 * 10^9 bytes/second.
But it is still a nanosecond.
With large amounts of bytes, that does not matter, then smart prefetching and caching hides the latency. But for looking up a single address in a tag it does.
I wonder how they solved that with the scalable data fabric in between the 4 L3 caches.
IF is it seems not only a high speed serial link. It is also a parallel link. I cannot imagine that the L3 are connected through serial links in between the CCX.
I like understanding the hardware.

Topweasel · Aug 13, 2017

1. Doubtful AMD chooses another solution. Slightly greater chance that 7nm+ they might pull a coffeelake and use the lack of general improvements to redo the CCX. But I doubt AMD goes to far down the multiple die approach. They need keep their lineups simple if they want to survive. So if anything 7nm+ or something they go 6cores per CCX, 2-3 CCX, 12-18 cores. But there is a lot of diminishing returns there. 12 cores is getting really close to pointless for a general user, 18 would be way too many and would have better performance going with less cores clocked higher.

2. Not saying Intel can't get better. But you can't turn it around in 1-2 generations. This is a 3-5 year piece of work so unless they knew they need to do this they are going to be behind the 8 ball. It probably means having to do redesign the arch something they haven't done since SB. So if the current mesh was the path they are going it's 2020 before we see them possibly catching up.

3. Latency isn't locked in IF it's adaptive. Based on the clockspeed of the memory. Whereas Intel has it locked to a certain clock speed. It is a highly parallel interconnect every link is two way from every connection and just about every link is point to point to every link. Intel's mesh is as well. But that's part of it's problem every core is separate and therefore every core is like a CCX on SL-X. All of this may matter little. Intel does have EMIB in the works. It might be closer to IF then we think.

jpiniero · Aug 13, 2017

The thing I don't get it why the memory speed even matters on the core latency.

Topweasel said:
but Intels current mesh pretty much the same latency as IF between CCX complexes.

Maybe at DDR-5000. At 3200 it's around 110 cross-CCX vs 80-90 for SKX's mesh and like 70 for BWE's ring.

Topweasel · Aug 13, 2017

jpiniero said:
The thing I don't get it why the memory speed even matters on the core latency.

Maybe at DDR-5000. At 3200 it's around 110 cross-CCX vs 80-90 for SKX's mesh and like 70 for BWE's ring.

Easy IF is clocked at memory speed faster speed means faster IF clock speed which means lower latency.

At 2400 IF was about 140 between CCXs. 3400 had last I saw at the mid 90's. I was seeing in to original benchmarks of about 100-110 on SLX. So realistically at 3200 they are about the same. Pushing 4000 RYZEN would be quite a bit faster than SL-X.

Ajay · Aug 13, 2017

Nvm - damn iPads seriously suck sometimes.

maddie · Aug 13, 2017

We know that AMD is migrating to 7nm very early in it's development.

Speculation:
Yield is poor for the first year.
Zen2 on 7nm has an 8 core CCX with a [2 core + L3] shared cache unit replacing a [1 core+L3] leading to an identical topology as Zen
Use only 6 cores/CCX due to expected defects and lower yields including parametric yield on full die.
Can upgrade to full 8 cores/CCX without redesign as process improves.

edit:
The actual area penalty by not using [2 cores + associated L3] /CCX is <10%

swilli89 · Aug 13, 2017

We have plenty of cores now for years to come. AMD will now focus on increasing frequency and latencies.

Schmide · Aug 13, 2017

William Gaatjes said:
I see your point.
I am going to play advocate of the devil here.
But how has Intel or IBM it been doing all this time with multi core cpus and still have such fast cpus ?

Everything in computing is a balance of complexity vs speed. IBMs power 8 seems to use the most efficient core complex of 3. Note - AMD and IBM worked closely together in the past. I think AMDs current design reflects quite a bit on the Power8 arch.

Intel on the other hand used a ring bus until Skylake-SP/X.

The ring bus has an advantage of relatively flat latency up to 8 cores (~80ns). However, it is slower than AMDs 4 ccx (~40ns), yet faster than IF (~90-140ns). However once the ring bus was extended to 16 or more cores it exhibits the same distance type latency that AMD has. I do not know the numbers.

So Intel is going with a mesh bus. Again I do not know the numbers but I would surmise it would be a nearest neighbor factor. The farther you go the more you incur.

William Gaatjes said:
Well, i would like to answer that with a counter question because i do not know everything about how caches caches and how snooping works.
How are the current caches examined for a memory address ?
Is it done sequentially , one cache by one or is it done in parallel ?
It makes sense to me that it is parallel because the L3 is partitioned and multiported, there can also be searched for 4 tags at once.
Is that the 4 of 4x16 searches ?
Why can that not be expanded to 8x16 in parallel ?
It is in its most basic form not duplication of logic that compares bits and sets a flag when equal.
That flag signals the presence of an address that has been searched for.

Oh I'm sure they use parallel hardware searches.

The nature of a cache is you have a block of memory, you subdivide it by way and then map memory into each block. What we have seen the less ways you have, the faster the cache can be. The same can be said for cache size. AMD has gone a step further and subdivided the cache per core. I would imagine that this has the effect of making it like a 4x16 = 64 way cache with many restrictions. The most notable being that each core can only write its its own victim area.

The whole nature of this speculation is - What is the optimal complexity to run at the fastest speed.

scannall · Aug 13, 2017

swilli89 said:
We have plenty of cores now for years to come. AMD will now focus on increasing frequency and latencies.

Exactly. There is a lot more bang for the engineering buck cleaning up this brand new uarch. Without adding more possible complications. I'm not sure why there is this rush to even more cores. This time last year, NOBODY thought we'd have 8 excellent cores at mainstream prices.

Vattila · Aug 13, 2017

maddie said:
Speculation: [...] Zen2 on 7nm has an 8 core CCX with a [2 core + L3] shared cache unit replacing a [1 core+L3] leading to an identical topology as Zen

Interesting take. In essence you are subdividing on the lowest level in the quad-tree. So you would then have a CCX consisting of, not 4 cores like now, but 4 modules — to reuse terminology from Bulldozer — where each module is a dual-core configuration connected to a shared L3 controller. And like before, that L3 controller is direct-connected (6 links) to the L3 controllers of the 3 other modules in the CCX.

Of course, that configuration could have non-uniform characteristics due to the contention for the shared L3 controller owned by a module, hence you would want to schedule threads on different modules if you can (like in Bulldozer). And you would likely have to beef up the links within the CCX to carry traffic from the dual-cores (compared to just single-cores before), as well as increase the size of the L3 cache to support twice the number of cores.

An similar alternative is to beef up a 4-core CCX with a wider core and 4-thread SMT (like Power7), i.e. add virtual cores rather than physical.

maddie · Aug 13, 2017

swilli89 said:
We have plenty of cores now for years to come. AMD will now focus on increasing frequency and latencies.

scannall said:
Exactly. There is a lot more bang for the engineering buck cleaning up this brand new uarch. Without adding more possible complications. I'm not sure why there is this rush to even more cores. This time last year, NOBODY thought we'd have 8 excellent cores at mainstream prices.

Perhaps this slide is why there is this speculative thread? 2018, 7nm, Starship x86, 48/96 cores/threads.

Looks like AMD themselves are saying this.

turtile · Aug 13, 2017

If you look at the slide, it shows the Zen 2 APU still using 4 cores. AMD also showed a high core count future for the construction cores which never happened.

Won't AMD need to add another memory channel if they add another CCX/more cores? They'll have higher clocked and higher IPC cores with the same memory standard. How will that work when current motherboards have 16 memory banks and 8 channel support...

Arachnotronic · Aug 13, 2017

swilli89 said:
We have plenty of cores now for years to come. AMD will now focus on increasing frequency and latencies.

You don't want to increase latencies. Do you mean reduce instruction latencies?

eek2121 · Aug 14, 2017

Early AMD slides stated Ryzen using '14nm+' for 2018. At the time (can't find the exact link these days, but I remember reading it at the time), AMD stated that there was another 10-15% IPC increase to be had by taking care of low hanging fruit. In addition, the 14nm+ was supposed to allow for higher clocks. A couple review sites claimed that AMD told them to expect a 15% IPC increase. I spoke with someone at AMD later on and he stated the target would actually be closer to 20%. That was a few months ago. He could have been pulling my leg, of course, especially since he was likely under NDA. That being said, I doubt the CCX size itself is going to change at all.

2019 will see 7nm Ryzen chips. You won't see much in the way of change for the die shrink. More than likely it will be the Zen/14nm+ chip with faster clocks. This is my speculation only. You won't see a CCX change until the introduction of a new socket, if at all.

maddie · Aug 14, 2017

turtile said:
If you look at the slide, it shows the Zen 2 APU still using 4 cores. AMD also showed a high core count future for the construction cores which never happened.

Won't AMD need to add another memory channel if they add another CCX/more cores? They'll have higher clocked and higher IPC cores with the same memory standard. How will that work when current motherboards have 16 memory banks and 8 channel support...

APU is a different layout to the pure CPU design.
Construction cores, due to their underperformance, and all derivatives, were abandoned to concentrate on Zen. It's fairly certain that AMD will be sticking with Zen designs for a while.
Ryzen already can use DDR4 3600. We already have DDR4 4800 speeds, admittedly as a very expensive item. Might we have DDR4 5000+ by mid-late 2018? Can it work in motherboards?
A large benefit of higher ram speeds is the IF latency reduction. Might they increase the clock of IF relative to memory?

All speculation based on assuming that present motherboards will be usable with Zen2.

edited to change DDR4 speeds [increased]

Speculation: The CCX in Zen 2

How many cores per CCX in 7nm Zen 2?

4 cores per CCX (3 or more CCXs per die)

6 cores per CCX (2 or more CCXs per die)

8 cores per CCX (1 or more CCXs per die)

Senior member

Lifer

Diamond Member

Diamond Member

Senior member

Diamond Member

Senior member

Senior member

Lifer

Diamond Member

Lifer

Diamond Member

Lifer

Diamond Member

Lifer

Diamond Member

Golden Member

Diamond Member

Golden Member

Senior member

Diamond Member

Senior member

Lifer

Diamond Member

Diamond Member