Should AMD spin off high-level SIMD to a separate chip?

superstition · Aug 21, 2016

Consider the following:

1) CPUs used to do the work of GPUs. We can see from the evolution of the discrete GPU the advantages of having them separate:

a) nimbleness – upgrade GPU without having to upgrade CPU/board

b) more chip space – more space to use for transistors from having separate chips

c) better tailoring – people who need GPU power can invest in that without being forced to pay for it on the chip (usually... Broadwell C is an exception, with half the CPU being graphics)

d) more power – separate CPU and GPU makes it easier to cool the same wattage with less noise, generally, thanks to much larger area to work with

e) better hardware tailoring and power – separate VRAM versus system ram with different attributes

2) Intel has to downclock AVX-2/AVX-512 already and Intel enthusiasts are in a quandary about stability testing with Prime (AVX-2 or not?). That doesn't seem very optimal. By having the SIMD chip separately powered and cooled it could be cranked up.

3) Intel seems to be using AVX tech as a way to separate itself from Zen. There is already talk about Zen not being optimal for AVX-2 and not having AVX-512 at all. People say "well, it will be forever before AVX-512 is a thing" but that doesn't change the fact that people said the same thing about regular AVX. It will be a thing and it will take up transistor real estate, power consumption, cooling, etc.

By spinning off AVX-2 and higher to a SIMD chip AMD could let people counter Intel's instructional expansion moves. It could also propel AVX-512 into the mainstream, with AMD being able to lead the pack by cranking up the power instead of having to downclock.

I know CPUs have eaten the MMU and FPU but they have mostly coughed the GPU out. HBM2 may make the APU more viable but there is always going to be the problem of concentrating so much heat into a small area, yields, and the fact that a chip can only be so large. One can put more than one die into a package, though.

4) Should the separate SIMD chip be cooled by the CPU cooler, with a hybrid socket that can fit both chips but also operate with just the main CPU installed? Or, should it go fully separate like a northbridge chip or even have a slot like a GPU?

5) Would the latency penalty be too great, even if the chips are close to each other on the board?

6) Would it make sense to dump all SIMD into the separate chip or leave AVX and below on for mainstream users? It seems most sensible to do the latter to not force people into buying a second chip. However, that also makes for more fragmentation. I can hear the whining now, though: "With Intel I only need to buy one CPU. With AMD I have to buy TWO!!!" (even if both are cheaper together than that one chip is alone).

AMD has the opportunity, with AM4+ at least, to counter Intel's instruction advantage strategy. If Intel follows suit and spins off the SIMD then enthusiasts are the winners regardless because they won't have the overclocking trouble they're faced with now due to AVX-2. As Intel continues to put more power consumption into SIMD it will cause increasing issues with overclocking — as long as it's in the same chip, eh?

It's probably too late for the AM4 boards but Zen+ could go to AM4+. It would be nicer for enthusiasts to not have to switch chipsets/boards but it's not like Intel isn't frequently changing sockets.

VirtualLarry · Aug 21, 2016

This cannot be discussed without mentioning HSA. The whole idea was for SIMD-type / vector-type instruction sequences to be offloaded to the GPU / iGPU, with the iGPU on AMD's APUs being much closer and lower-latency to the CPU cores, rather than a dGPU that requires traversing the PCI-E bus.

superstition · Aug 21, 2016

How successful will AMD be in pushing an alternative standard, though?

Also, since you mentioned the iGPU... removing SIMD to a separate chip gives more space for iGPU transistors.

Imagine consumers being able to buy an inexpensive SIMD chip that has AVX-512 on it while Intel customers have to pay a lot for Xeon.

VirtualLarry · Aug 21, 2016

What you suggest makes no sense. AMD's solution allows software to take advantage of the massive SIMD number-crunching ability of the processor arrays in the iGPU. Why you would want to move the CPU core's intrinsic SIMD support (as defined by Intel) onto another piece of silicon, far away from the CPU cores, is beyond me.

Look at the poor memory performance of a dual-die solution like Clarkdale (I think that's the one, maybe I'm off a little bit). I'm speaking of the dual-cores of the socket 1156 era, with the memory controller / iGPU on a separate piece of silicon from the CPU cores.

superstition · Aug 21, 2016

So you're saying Zen is going to support AVX-512 and have AVX-2 performance that meets or exceeds Intel's — by virtue of AMD's solution?

As for being far away... I did mention the idea of a hybrid socket. How far away is that?

And, putting AMD aside for a moment... Remember the point about Intel downclocking AVX-2/512? What's the solution to that? Also, what about the issue for enthusiasts of having AVX-2 interfering with their overclocking?

The central questions are:

1) Is the latency impact too great, no matter where the chip is located?

2) Is the additional cost of having a separate chip too great?

3) If AMD's alternative is so great, where is the evidence? Also, saying that an iGPU is required for this solution means AMD must dedicate a lot of transistors on every chip it makes to the iGPU going forward. And, since Iris surpassed AMD's iGPU performance, why hasn't Intel done this if it's superior?

VirtualLarry · Aug 21, 2016

superstition said:
So you're saying Zen is going to support AVX-512 and have AVX-2 performance that meets or exceeds Intel's — by virtue of AMD's solution?

Not sure how you got that out of what I wrote. I was speaking of AMD's technologies, and not Zen in particular. In fact, I could not have been speaking about Zen, because it lacks an iGPU.

Edit: The point being that AMD has, and plans to have with the Zen APU, a powerful iGPU on the same die as the CPU cores, and those iGPUs have arrays of vector processors, which can, and arguably should, be used for vector computations, rather than bloating the CPU cores themselves with bigger and bigger AVX pathways.

See, Intel doesn't really have much iGPU tech, so they put their vector-processing tech into their CPU cores themselves. But AMD and NV have much more powerful GPU tech, so they try to leverage that (with HSA and CUDA) where they can. It really comes down to software compatibility.

And yes, going off-chip incurs a horrendous latency penalty, for individual vector computations.
If you have a large batch to do, then you can upload them to a dGPU, but when you have only a few... HSA to the rescue!

NTMBK · Aug 22, 2016

The benefit of using CPU SIMD instead of a GPU is that you have nice low latency and high integration. If you're offloading it to a separate chip, just use a GPU already.

I think AMD has the right idea- have great support for 128-bit SIMD, split 256-bit ops over multiple 128-bit ops to support them for compatibility, and for serious vector-math-heavy workloads point them at the GPU.
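To make the splitting idea concrete, here is a minimal sketch of a 256-bit add carried out as two 128-bit halves. It only models the data flow: vectors are plain Python lists of 32-bit integer lanes, and the function names are hypothetical, not anything from AMD's hardware or documentation.

```python
# Illustrative model only: a 256-bit vector is a list of eight 32-bit lanes.
LANES_256 = 8  # 256 bits / 32 bits per lane
LANES_128 = 4  # 128 bits / 32 bits per lane


def add_128(a, b):
    """One 128-bit-wide add: four 32-bit lanes, wrapping on overflow."""
    assert len(a) == len(b) == LANES_128
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]


def add_256_split(a, b):
    """A 256-bit add executed as two independent 128-bit halves."""
    assert len(a) == len(b) == LANES_256
    low = add_128(a[:LANES_128], b[:LANES_128])
    high = add_128(a[LANES_128:], b[LANES_128:])
    return low + high  # concatenate the halves back into eight lanes
```

The result is the same as a native 256-bit add. What changes is that the work is issued as two narrower operations, which is the compatibility-over-throughput trade described above.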

superstition · Aug 22, 2016

NTMBK said:
The benefit of using CPU SIMD instead of a GPU is that you have nice low latency and high integration. If you're offloading it to a separate chip, just use a GPU already.

What about a hybrid socket or having the chip close to the CPU? A GPU is much further away.

Is there any data that proves that the latency cost is too great?

NTMBK said:
I think AMD has the right idea- have great support for 128-bit SIMD, split 256-bit ops over multiple 128-bit ops to support them for compatibility, and for serious vector-math-heavy workloads point them at the GPU.

So AMD stands to outperform Intel soon with its solution, including support for AVX-512?

superstition · Aug 22, 2016

VirtualLarry said:
Not sure how you got that out of what I wrote. I was speaking of AMD's technologies, and not Zen in particular. In fact, I could not have been speaking about Zen, because it lacks an iGPU.

Zen doesn't have an iGPU, yes. That was one of the points I made in rebuttal.

VirtualLarry said:
AMD has, and plans to have with the Zen APU, a powerful iGPU on the same die as the CPU cores, and those iGPUs have arrays of vector processors, which can, and arguably should, be used for vector computations, rather than bloating the CPU cores themselves with bigger and bigger AVX pathways.

But Summit Ridge doesn't have that.

VirtualLarry said:
Intel doesn't really have much iGPU tech.

Really? I thought Broadwell C outperforms AMD's APUs.

VirtualLarry said:
so they put their vector-processing tech into their CPU cores themselves

So, you think licensing AMD's graphics tech is what Intel will have to do going forward to fix the SIMD problems?

VirtualLarry said:
It really comes down to software compatibility.

So, the solution, which Zen (Summit Ridge, for instance) lacks, isn't a good one because of software compatibility issues — or is that going to change soon?

VirtualLarry said:
And yes, going off-chip incurs a horrendous latency penalty, for individual vector computations.

If, for instance, AMD were to use a hybrid socket and a chip that supports AVX-2 and AVX-512 at high clocks would that be faster than not supporting AVX-512 at all and having to downclock AVX-2? Summit Ridge doesn't have an iGPU, after all.

newkopi · Aug 22, 2016

NTMBK said:
I think AMD has the right idea- have great support for 128-bit SIMD, split 256-bit ops over multiple 128-bit ops to support them for compatibility, and for serious vector-math-heavy workloads point them at the GPU.

agreed with you

Cogman · Aug 22, 2016

This is a bad idea.

Latency would kill all the benefits of moving it off chip. The reason we use SIMD today is because it integrates really well with current programming models. To use it effectively off chip, you have to do what we do with GPUs, that is, create a batch of commands, segment off data to be worked on, blast that over, and wait for a response. That is the reason things like CUDA and OpenCL exist and why current compilers can't just "use the GPU": going to the GPU is disruptive.
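A rough way to see why batching becomes mandatory once you go off chip is a toy cost model in which every offload pays a fixed round-trip cost no matter how much work it carries. The function names and the cycle counts in the usage note are made-up assumptions for illustration, not measurements of any real part.

```python
import math


def offload_cost_per_element(batch_size, round_trip_cycles, remote_cycles_per_element):
    """Average cycles per element when a fixed round-trip cost is spread over a batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return round_trip_cycles / batch_size + remote_cycles_per_element


def break_even_batch(round_trip_cycles, local_cycles_per_element, remote_cycles_per_element):
    """Smallest batch for which offloading is no slower than staying local.

    Returns None when the remote unit is not faster per element, because then
    no batch size can ever pay back the round trip.
    """
    saving = local_cycles_per_element - remote_cycles_per_element
    if saving <= 0:
        return None
    return math.ceil(round_trip_cycles / saving)
```

With an assumed 1000-cycle round trip, 1.0 cycle per element locally and 0.25 remotely, `break_even_batch(1000, 1.0, 0.25)` gives 1334 elements. An 8-element batch of the kind a single SIMD instruction handles would cost about 125 cycles per element under the same assumptions.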

A better idea is to start moving towards a more EPIC-style architecture. I think Intel had it right with Itanium; they just came before the compilers were ready. Today, I think the compilers are ready. Making an instruction batch to be run in parallel is where the biggest bang for your buck is going to come from. Where right now the compilers and CPUs reorder instructions in order to make good use of pipelining, compilers could do even more work to batch up parallel executing batches. It is a more powerful abstraction than SIMD currently is.

superstition said:
Is there any data that proves that the latency cost is too great?

Good ol' engineering know-how. If you've ever looked at a CPU, you'll notice that all of the logic gates are really close together and the caches tend to be on the periphery. But further, L1 tends to be closer than L2, and L2 closer than L3. Why is this? It is simply because of physics limitations. Signals can't travel faster than the speed of light. In one clock of a 3 GHz CPU, a signal can travel roughly 3 centimeters (1/3 speed of light * 1/3 GHz). However, you aren't just sending a signal, you also have to interpret that signal, which means you need decoders. You may need to boost the signal, which means amplifiers, and you'll need to coordinate the signal, which means either another line or reduced throughput. And then there is the back and forth that must happen, which effectively cuts your range in half. Off chip means introducing latency, and a lot of it. There is a reason why RAM is stationed so close to the CPU. There is a reason why CPUs have an IMC. There is a reason why CPUs have caches in the first place. And there is a reason why CPUs request and send data to and from RAM in batches rather than just one byte at a time. It is because latency is a killer of performance.
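The arithmetic behind the "roughly 3 centimeters" figure can be checked with a few lines. This is only a sketch: the one-third-of-light-speed propagation factor is the assumption used in the post, not a measured value for any particular interconnect, and the function name is made up.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def distance_per_cycle_cm(clock_hz, velocity_fraction=1 / 3):
    """Distance in cm a signal covers in one clock period.

    velocity_fraction is the assumed signal speed as a fraction of c.
    """
    metres = SPEED_OF_LIGHT_M_PER_S * velocity_fraction / clock_hz
    return metres * 100


one_way_cm = distance_per_cycle_cm(3e9)  # about 3.3 cm at 3 GHz
round_trip_reach_cm = one_way_cm / 2     # request plus reply halves the reach
```

That is before any time is spent on decoding, amplification or synchronization, all of which shrink the usable distance further.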

The fact that L1 cache has a latency of about 4 cycles should tell you something important about latency. Here is a structure that CPU manufacturers want to be as fast as possible, it lives right next to the logic gates, and it is reduced in size to increase performance. Yet even though it shares such a close proximity, they still need 4 cycles to get and store data to it. This isn't sloppy engineering, this is a physical limitation.

When distance becomes a problem, the only solution is batching (à la OpenCL and CUDA). And that is something that is untenable for SIMD.

superstition · Aug 22, 2016

I'm a little confused about this. You said Intel had the right idea with Itanium by trying to move toward batching but then said batching is untenable for SIMD:

Cogman said:
The reason we use SIMD today is because it integrates really well with current programming models.

A better idea is to start moving towards a more EPIC-style architecture. I think Intel had it right with Itanium; they just came before the compilers were ready. Today, I think the compilers are ready. Making an instruction batch to be run in parallel is where the biggest bang for your buck is going to come from.

So, you're saying SIMD should be killed altogether in favor of a batch-style process that can be handled by a GPU off the chip?

Would having a separate chip on the same die be too much latency?

Cogman said:
There is a reason why RAM is stationed so close to the CPU. There is a reason why CPUs have an IMC

Yes, but RAM isn't in the CPU and FPUs used to be separately socketed. Underclocking is a less than elegant solution so I just wanted to know if there might be a better one. The fragmentation issue is also less than ideal, with segmentation being used to cause it.

Cogman · Aug 23, 2016

superstition said:
I'm a little confused about this. You said Intel had the right idea with Itanium by trying to move toward batching but then said batching is untenable for SIMD:

Yup. There is a difference in order and location. Intel's Itanium batches were around 6 instructions long. It didn't matter too much if most of the batch was nops (even though you didn't see great performance gains).

It is the same reason why SIMD makes sense on a CPU. It is only operating on 4 or 8 values at once (or are they up to 16?) which means you are creating a fairly small batch per execution. If Intel readopted some of the Itanium design, you could expose more functionality in a batch than what is currently supported by the SIMD instruction set, all running in parallel. It works when it is in the same region because the latency is very low. It would not work off chip.
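On the "4 or 8, or up to 16?" question, the lane count is just register width divided by element width. A small sketch for 32-bit elements, using the architectural register widths of SSE (128-bit), AVX/AVX2 (256-bit) and AVX-512 (512-bit); the helper name is made up for illustration.

```python
def lanes(register_bits, element_bits=32):
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits


# Register widths in bits for each instruction-set family.
REGISTER_BITS = {"SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}

LANES_32BIT = {name: lanes(bits) for name, bits in REGISTER_BITS.items()}
# SSE holds 4 lanes, AVX/AVX2 holds 8, AVX-512 holds 16
```

So 16 values at once applies to AVX-512 with 32-bit elements. With 64-bit doubles each of those counts halves.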

superstition said:
Would having a separate chip on the same die be too much latency?

Speed of light. Same answer I gave before. If you add significant distance from one part to the next, you have to worry about introducing communication protocols, synchronization, and of course, the speed of light problem. It isn't a simple matter of taking one part and putting it in a new location.

And then, what would the benefit be? Why are you adding all this extra complexity for the same die? It makes sense to do that with a GPU because it is a very distinct part from the CPU. But taking an instruction and saying "Lolz, we are going to process this over here now" is just madness.

Ultimately, SIMD takes up little die space. It effectively uses existing parts of the CPU (the FPU), and it is low latency. Moving it to a new chip is pointless because you invalidate pretty much all of those benefits.

superstition said:
Yes, but RAM isn't in the CPU and FPUs used to be separately socketed.

And why do you think the FPU is no longer separately socketed and is now deeply integrated in the CPU logic? The reason it was made on a separate chip was twofold. First, there was profit to be made. Second, the node size was so big that an FPU would take up a significant proportion of the CPU die space. The reason was not for performance.

superstition said:
Underclocking is a less than elegant solution so I just wanted to know if there might be a better one.

It is cheap and easy to do. Intel downclocking while AVX instructions are chugging is really not a terrible solution, in fact, it is likely a power saving feature. More than likely, they are running the AVX parts at a full clock rate (because they are fairly separate from everything else) but downclock the rest of the core because nothing can get through while the AVX instructions are processing. In other words, they are reducing power usage of the rest of the CPU because it would be spent busy waiting (which wastes power).

Ideally, an AVX instruction wouldn't take so long to process. It is a sign that they probably don't have enough FPU resources on the CPU to handle AVX instructions (or that the memory bandwidth is too limited and they need to increase L1 sizes to accommodate). If I were betting, I would say it is probably a memory constraint. AVX instructions are likely saturating the line between the CPU and memory.

superstition · Aug 23, 2016

Cogman said:
And why do you think the FPU is no longer separately socketed and is now deeply integrated in the CPU logic? The reason it was made on a separate chip was twofold. First, there was profit to be made. Second, the node size was so big that an FPU would take up a significant proportion of the CPU die space. The reason was not for performance.

The reason was for performance if the node size prohibited adequate FPU circuitry. But, the fact that the FPU was able to be separate was the point I was trying to make. I guess the speed of light didn't make it impossible.

Cogman said:
And then, what would the benefit be? Why are you adding all this extra complexity for the same die? It makes sense to do that with a GPU because it is a very distinct part from the CPU. But taking an instruction and saying "Lolz, we are going to process this over here now" is just madness.

It was a question about latency differences.

NTMBK said:
I think AMD has the right idea- have great support for 128-bit SIMD, split 256-bit ops over multiple 128-bit ops to support them for compatibility, and for serious vector-math-heavy workloads point them at the GPU.

Does that GPU have to be an iGPU or can this serious vector-math-heavy workload be done more distantly?

Lepton87 · Aug 23, 2016

superstition said:
The reason was for performance if the node size prohibited adequate FPU circuitry. But, the fact that the FPU was able to be separate was the point I was trying to make. I guess the speed of light didn't make it impossible.

Do you remember the clock speed then and now? About two orders of magnitude is rather significant.

Of course it can be done; it's all a question of performance, and performance would depend very heavily on the workload. Anyway, for some workloads an integrated GPU with HSA support can be an order of magnitude faster, possibly more.

Cogman · Aug 23, 2016

superstition said:
The reason was for performance if the node size prohibited adequate FPU circuitry.

No, the reason it was separate was physical limitations, not performance. The reason to have a FPU was because any FPU is better than no FPU. However, having it on a separate package was far from ideal.

We are no longer in an era where logic circuits take up a significant portion of die space. Logic circuits are some of the smallest parts of a CPU die, so shipping off different logic portions into the boondocks of a die is, frankly, a stupid idea.

superstition said:
But, the fact that the FPU was able to be separate was the point I was trying to make. I guess the speed of light didn't make it impossible.

Who said anything about it making things impossible? We have server clusters working on the same tasks that aren't in the same chassis. We have GPUs which can work on jobs separate from the CPU. Moving instruction processing to a new location is not impossible. However, if you have a choice "put it in the processor" and "Move it out of the processor", putting it in the processor is always going to be the best choice. It is easier to work with from a hardware perspective, there is less coordination that has to happen, and you have less latency concerns. That is right, less. Even within the small space of the processor, you still have to deal with the fact that switching isn't instantaneous and that data will arrive at different times.

superstition · Aug 23, 2016

Cogman said:
No, the reason it was separate was physical limitations, not performance. The reason to have a FPU was because any FPU is better than no FPU. However, having it on a separate package was far from ideal.

I'm not debating the last statement. I'm just pointing out that you get the performance by having the transistors dedicated to doing FPU. If you don't have space for those transistors in the CPU because of the node as you said then you get more FPU performance by having an external chip.

Cogman said:
We are no longer in an era where logic circuits take up a significant portion of die space. Logic circuits are some of the smallest parts of a CPU die, so shipping off different logic portions into the boondocks of a die is, frankly, a stupid idea.

So, what's the solution to downclocking?

How can AVX-2 support be changed to stop interfering with overclocking stability testing?

Why doesn't Zen include AVX-512 support? Why is Zen rumored to be on the weak side in AVX-2 performance? All of those things suggest that there is a considerable price to be paid for including these transistors.

Cogman said:
Who said anything about it making things impossible?

You said several things can't be done.

Cogman said:
Moving instruction processing to a new location is not impossible. However, if you have a choice "put it in the processor" and "Move it out of the processor", putting it in the processor is always going to be the best choice. It is easier to work with from a hardware perspective, there is less coordination that has to happen, and you have less latency concerns. That is right, less. Even within the small space of the processor, you still have to deal with the fact that switching isn't instantaneous and that data will arrive at different times.

Of course there is less latency and of course there is latency within a processor.

Cogman said:
However, if you have a choice "put it in the processor" and "Move it out of the processor", putting it in the processor is always going to be the best choice.

If that were true then we wouldn't have discrete GPUs, sound hardware on motherboards, NICs on motherboards, etc.

Cogman said:
If I were betting, I would say it is probably a memory constraint. AVX instructions are likely saturating the line between the CPU and memory.

Has anyone noticed improvement from the 128 MB L4 cache in Broadwell C or is it inadequate for this purpose because it's a victim cache?

Also, if memory is the problem would putting HBM2 into the CPU be the best option?

superstition · Aug 23, 2016

Lepton87 said:
Of course it can be done; it's all a question of performance, and performance would depend very heavily on the workload. Anyway, for some workloads an integrated GPU with HSA support can be an order of magnitude faster, possibly more.

So, is it a mistake to release a CPU that doesn't have HSA/iGPU tech?

NTMBK · Aug 24, 2016

superstition said:
So, is it a mistake to release a CPU that doesn't have HSA/iGPU tech?

In the long term, probably yes. Right now? No, because GPU programming models are not built around unified memory pools.

DrMrLordX · Aug 26, 2016

superstition said:
Really? I thought Broadwell C outperforms AMD's APUs.

Take away the eDRAM and . . . ehhhh not really. More specifically, a 512-shader GCN 1.2 iGPU with color compression is probably still the fastest integrated GPU on the market, though Gen10 is now nipping at AMD's heels. BUT talk to anyone who works with OpenCL (particularly 2.0) and they will bitch about how bad Intel's drivers are. Or at least they were about six months ago.

superstition said:
So, you think licensing AMD's graphics tech is what Intel will have to do going forward to fix the SIMD problems?

Intel has no SIMD problem, really.

superstition said:
If, for instance, AMD were to use a hybrid socket and a chip that supports AVX-2 and AVX-512 at high clocks would that be faster than not supporting AVX-512 at all and having to downclock AVX-2? Summit Ridge doesn't have an iGPU, after all.

No, particularly not for AVX2. The bus speed limitations and cache coherency issues alone would make AVX2 on a separate die make zero sense. AVX2 lets you effectively handle 8 32-bit operations in one shot (let's say, 8 non-dependent 32-bit fp adds, for the sake of simplicity). That remote chip would have to receive the instruction block, process the block, and return the results in good order before the thread submitting the block from the main CPU to the vector unit could continue on with other work. Latency would be a killer. A non-SIMD CPU at the same clockspeed with the same uarch could probably retire all 8 instructions without the benefit of a SIMD ISA in the same amount of time it would take to send/receive to/from the external vector unit.

superstition said:
I'm not debating the last statement. I'm just pointing out that you get the performance by having the transistors dedicated to doing FPU. If you don't have space for those transistors in the CPU because of the node as you said then you get more FPU performance by having an external chip.

Who said they didn't have the space? People like Keller have calculated that spending the available chip real-estate on "moar coars" (or more cache) was a better investment. The overall die size has been kept small to improve yields and reduce costs.

superstition said:
So, what's the solution to downclocking?

Umm, not really sure on the Intel side. Their chips have to be kept at lower clockspeeds to avoid overheating during heavy SIMD usage, because maxing out the chip's capabilities shows us in no uncertain terms the consequences of having such high heat flux density on modern chips. AMD may well have had similar problems had they actually gone with 256-bit FMACs.

On the AMD side, one way is to "pre-split" all AVX/AVX2 instructions in software. For the Construction cores, I think this essentially meant asking the software developer to support XOP (more or less), but XOP is gone from Zen so I don't know what viable options are there.

AMD dumped XOP because they seemed to think nobody would support it. Wouldn't you know, DX12 might support it . . . at least on paper anyway.

superstition said:
Why doesn't Zen include AVX-512 support?

Same reason why BD and PD don't support AVX2? It's Intel's baby, and despite cross-licensing it takes a while for AMD to adapt to Intel's SIMD ISAs.

superstition said:
Why is Zen rumored to be on the weak side in AVX-2 performance?

It has no 256-bit FMACs which are generally required for 256-bit vector ISAs. It would need something special in hardware to quickly identify, split, and re-organize 256-bit vector instructions to get them running on their 128-bit FMACs without taking away extra clock cycles to do so. I see no indication that Zen has that, meaning it'll probably take up an extra clock cycle to split AVX/AVX2 instructions for processing.

