Intel Chips With “Vega Inside” Coming Soon?

cbn · Oct 12, 2017

PeterScott said:
I read the NVidia paper a few months ago, and IIRC, they were applying this to compute space, not Graphics, and a major focus of the research was limiting the inter communications between the modules.

So you do still have the memory duplication issue for textures for graphics with this approach. Each GPU tile has its own memory controller, and they all likely need much the same textures. So either you duplicate the textures in each chips memory pool (thus wasting memory) or you treat it as one big pool but with lot more latency, and huge contention issues given the huge appetite for texture memory.

I am not convinced having a SYS + I/O chip solves the need for any SW involvement either. If that was the case, why couldn't standard GPU be built with a slightly more robust SYS + I/O section that was switchable between master/slave to make doing that kind of thing on Dual GPU cards, where one chips SYS + I/O runs the GPU portion from both chips? Instead dual GPU card always ended up requiring CF/SLI software and were just as problematic as dual cards.

Anyway, there certainly wouldn't bother doing this in CPU + GPU package, there would be no need/point for so much GPU power that they would need multiple GPU dies, they would have a hard time supplying the power and cooling such a beast would need.

Here is the paper:

http://research.nvidia.com/sites/default/files/publications/ISCA_2017_MCMGPU.pdf

3.1 MCM-GPU Organization
In this paper we propose the MCM-GPU as a collection of GPMs
that share resources and are presented to software and programmers
as a single monolithic GPU. Pooled hardware resources, and shared
I/O are concentrated in a shared on-package module (the SYS +
I/O module shown in Figure 1). The goal for this MCM-GPU is to
provide the same performance characteristics as a single (unmanu-
facturable) monolithic die. By doing so, the operating system and
programmers are isolated from the fact that a single logical GPU
may now be several GPMs working in conjunction. There are two
key advantages to this organization. First, it enables resource sharing
of underutilized structures within a single GPU and eliminates hard-
ware replication among GPMs.Second, applications will be able to
transparently leverage bigger and more capable GPUs, without any
additional programming effort

Figure 3 shows the high-level diagram of this 4-GPM MCM-
GPU. Such an MCM-GPU is expected to be equipped with 3TB/s
of total DRAM bandwidth and 16MB of total L2 cache. All DRAM
partitions provide a globally shared memory address space across
all GPMs.

This, in contrast, to this...

2.2 Multi-GPU Alternative
An alternative approach is to stop scaling single GPU performance,
and increase application performance via board- and system-level
integration, by connecting multiple maximally sized monolithic
GPUs into a multi-GPU system. While conceptually simple, multi-
GPU systems present a set of critical challenges. For instance, work
distribution across GPUs cannot be done easily and transparently and
requires significant programmer expertise

and this.....

Alternatively, on-package GPMs could be organized as multiple
fully functional and autonomous GPUs with very high speed in-
terconnects. However, we do not propose this approach due to its
drawbacks and inefficient use of resources.

So Nvidia certainly intends for their MCM GPU to be seen transparently by software as one big GPU.

PeterScott · Oct 12, 2017

cbn said:
Here is the paper:

So Nvidia certainly intends for their MCM GPU to be seen transparently by software as one big GPU.

As I said before. For compute problems. Grahpics is a whole other kettle of fish. There are all kinds of noted negative memory impacts to the architecture:

The MCM-GPU memory system is a Non Uniform Memory Access (NUMA) architecture, as its inter-GPM links are not expected to provide full aggregated DRAM bandwidth to each GPM. Moreover, an additional latency penalty is expected when accessing memory on remote GPMs. This latency includes data movement time within the local GPM to the edge of the die, serialization and deserialization latency over the inter-GPM link, and the wire latency to the next GPM.

Even the low memory intensity, limited parallelism compute tasks suffer from the Memory latency issues:

Surprisingly, even the non-scalable applications with limited parallelism and low memory intensity show performance sensitivity to the inter-GPM link bandwidth due to increased queuing delays and growing communication latencies in the low bandwidth scenarios.

This would bring a gaming GPU to it's knees. Graphics is fully parallel, with absurd memory intensity. There would be huge latency and bus contention as all the unit would be trying to move masses of texture data from all other units simultaneously.

This is NOT a suitable memory architecture for gaming GPU.

Also, you don't need SLI/CF drivers for compute tasks. That is fairly trivial to split up, graphics is NOT.

So really, almost nothing in the paper applies to GPUs for graphics. They are completely different problems spaces with completely different impact on memory usage and partitioning and controlling the problem, and recombining results.

I am not saying this is impossible for a Gaming GPU, just that it is a much tougher problem than it is for compute. The details of this paper are not a solution the gaming GPU problems.

Given the difficulties involved I don't really expect to see anyone fully embrace the MCM GPU until it is absolutely necessary, and the first steps will likely be dual chip with wasteful independent memory pools.

And again, totally pointless for iGPU type solution. This is for getting monster GPU solutions too big for one chip.

cbn · Oct 12, 2017

PeterScott said:
As I said before. For compute problems. Grahpics is a whole other kettle of fish.

I didn't see anything in that paper about being specific to compute:

Here is the first quote I listed in post #76:

"In this paper we propose the MCM-GPU as a collection of GPMs
that share resources and are presented to software and programmers
as a single monolithic GPU. Pooled hardware resources, and shared
I/O are concentrated in a shared on-package module (the SYS +
I/O module shown in Figure 1). The goal for this MCM-GPU is to
provide the same performance characteristics as a single (unmanu-
facturable) monolithic die. By doing so, the operating system and
programmers are isolated from the fact that a single logical GPU
may now be several GPMs working in conjunction. There are two
key advantages to this organization. First, it enables resource sharing
of underutilized structures within a single GPU and eliminates hard-
ware replication among GPMs.Second, applications will be able to
transparently leverage bigger and more capable GPUs, without any
additional programming effort"

It operates transparently at the OS and application level as a single GPU.

cbn · Oct 12, 2017

@PeterScott

Also don't forget the idea is to use EMIB, not MCM.

PeterScott · Oct 12, 2017

cbn said:
I didn't see anything in that paper about being specific to compute:
.

Then you didn't read the paper.

The entire problem space they discuss is compute tasks. They run simulations on well they expect to perform in a variety of compute tasks. The discuss the problem of how to best partition compute tasks for memory locality. Etc....

They do nothing with graphics at all.

cbn · Oct 12, 2017

PeterScott said:
Surprisingly, even the non-scalable applications with limited parallelism and low memory intensity show performance sensitivity to the inter-GPM link bandwidth due to increased queuing delays and growing communication latencies in the low bandwidth scenarios.

Click to expand...

This would bring a gaming GPU to it's knees. Graphics is fully parallel, with absurd memory intensity. There would be huge latency and bus contention as all the unit would be trying to move masses of texture data from all other units simultaneously.

This is NOT a suitable memory architecture for gaming GPU.

Here is the part of the paper from which your quote (bolded below) is from. Notice that applications are grouped into not only compute intensive, but memory intensive ones as well....and that memory-intensive category are the most sensitive to link bandwidth, with 12%, 40%, and 57% performance degradation for 1.5TB/s, 768GB/s, and 384GB/s settings respectively.

3.3.2 Performance Sensitivity to On-Package Bandwidth.
Figure 4 shows performance sensitivity of a 256 SM MCM-GPU
system as we decrease the inter-GPM bandwidth from an abun-
dant 6TB/s per link all the way to 384GB/s. The applications are
grouped into two major categories of high- and low-parallelism,
similar to Figure 2. The scalable high-parallelism category is further
subdivided into memory-intensive and compute-intensive applica-
tions (For further details about application categories and simulation
methodology see Section 4).
Our simulation results support our analytical estimations above.
Increasing link bandwidth to 6TB/s yields diminishing or even no
return for an entire suite of applications. As expected, MCM-GPU
performance is significantly affected by the inter-GPM link band-
width settings lower than 3TB/s. For example, applications in the
memory-intensive category are the most sensitive to link bandwidth,
with 12%, 40%, and 57% performance degradation for 1.5TB/s,
768GB/s, and 384GB/s settings respectively. Compute-intensive
applications are also sensitive to lower link bandwidth settings, how-
ever with lower performance degradations. Surprisingly, even the
non-scalable applications with limited parallelism and low memory
intensity show performance sensitivity to the inter-GPM link band-
width due to increased queuing delays and growing communication
latencies in the low bandwidth scenarios.

But notice Nvidia is only shooting for 768 GB/s of bandwidth (resulting in 40% performance degradation in memory intensive applications) due to need for better packaging and signaling technologies:

3.3.3 On-Package Link Bandwidth Configuration.
NVIDIA’s GRS technology can provide signaling rates up to 20
Gbps per wire. The actual on-package link bandwidth settings for
our 256 SM MCM-GPU can vary based on the amount of design
effort and cost associated with the actual link design complexity, the
choice of packaging technology, and the number of package routing
layers. Therefore, based on our estimations, an inter-GPM GRS link
bandwidth of 768 GB/s (equal to the local DRAM partition band-
width) is easily realizable. Larger bandwidth settings such as 1.5
TB/s are possible, albeit harder to achieve, and a 3TB/s link would re-
quire further investment and innovations in signaling and packaging
technology. Moreover, higher than necessary link bandwidth settings
would result in additional silicon cost and power overheads. Even
though on-package interconnect is more efficient than its on-board
counterpart, it is still substantially less efficient than on-chip wires
and thus we must minimize inter-GPM link bandwidth consumption
as much as possible.
In this paper we assume a low-effort, low-cost, and low-energy
link design point of 768GB/s and make an attempt to bridge the
performance gap due to relatively lower bandwidth settings via ar-
chitectural innovations that improve communication locality and
essentially eliminate the need for more costly and less energy effi-
cient links. The rest of the paper proposes architectural mechanisms
to capture data-locality within GPM modules, which eliminate the
need for costly inter-GPM bandwidth solutions

But remember we are thinking about the possibilities of EMIB. What is possible with EMIB vs. MCM?

If 1.5 TB/s bandwith only results in 12% performance degradtion in memory intensive applications, what could EMIB provide? And how much could the latency be reduced with EMIB vs. MCM?

PeterScott · Oct 12, 2017

cbn said:
But remember we are thinking about the possibilities of EMIB. What is possible with EMIB vs. MCM?

If 1.5 TB/s bandwith only results in 12% performance degradtion in memory intensive applications, what could EMIB provide? And how much could the latency be reduced with EMIB vs. MCM?

Memory intensity of compute is has nothing on the memory intensity of gaming. I would be very interested in seeing NVidia paper on a multi chip GPU for gaming, but this isn't it.

EMIB vs MCM? Is NVidia is using EMIB? Because the last time I checked Intel wasn't building multi chip GPUs, and they certainly wouldn't in a APU like the one that is a topic of this thread.

I seriously doubt we are going to see any kind of multi-chip packages for a gaming GPU anytime soon.

Even if we did, it would just be a desperation play from AMD, and NVidia would crush them with a single chip.

Bouowmx · Oct 12, 2017

AMD Navi flagship will be 2x Vega 10/20 (or another 4096-core GCN derivative) at 7 nm. Believe me.

cbn · Oct 12, 2017

PeterScott said:
Memory intensity of compute is has nothing on the memory intensity of gaming. I would be very interested in seeing NVidia paper on a multi chip GPU for gaming, but this isn't it.

EMIB vs MCM? Is NVidia is using EMIB? Because the last time I checked Intel wasn't building multi chip GPUs, and they certainly wouldn't in a APU like the one that is a topic of this thread.

Post #69 and post #70 (below) is how the conversation started.

cbn said:
What about using multiple smaller dies (EMIB together) to make one GPU that is larger than is normally possible with a monolithic die or with multiple dies on an interposer?

PeterScott said:
The problem with multiple GPU dies is you kind of still have the SLI/CF problem, of memory waste (duplicate texture buffers) and software to partition the work, as even today certain games have problems with it. Those problems don't just disappear because they are connected on an interposer.

So if it is true that is AMD selling GPU dies to Intel eventually I would imagine AMD will have one compatible with EMIB as well. This same EMIB compatible die used to make one large GPU (edit: Or a large APU)

That way future AMD GPU EMIB dies get used in more than one way. (Although I do imagine an EMIB compatible die would work fine for single GPU Video card as well).

PeterScott · Oct 12, 2017

cbn said:
So if it is true that is AMD selling GPU dies to Intel eventually I would imagine AMD will have one compatible with EMIB as well. This same EMIB compatible die used to make one large GPU (edit: Or a large APU)

That way future AMD GPU EMIB dies get used in more than one way. (Although I do imagine an EMIB compatible die would work fine for single GPU Video card as well).

It isn't.

maddie · Oct 12, 2017

PeterScott said:
I seriously doubt we are going to see any kind of multi-chip packages for a gaming GPU anytime soon.

Even if we did, it would just be a desperation play from AMD, and NVidia would crush them with a single chip.

By any chance, did you work on the Skylake-X team?

LTC8K6 · Oct 13, 2017

maddie said:
By any chance, did you work on the Skylake-X team?

Were they making GPUs too?

DrMrLordX · Oct 13, 2017

Bouowmx said:
AMD Navi flagship will be 2x Vega 10/20 (or another 4096-core GCN derivative) at 7 nm. Believe me.

And I'm sure it'll be sold out and/or selling at inflated prices by the time it comes out. Ugh.

maddie · Oct 13, 2017

LTC8K6 said:
Were they making GPUs too?

No, not discrete ones, but according to Intel, I'm told that a glued together multiple die solution can't ever be effective compared to a monolithic solution.

IntelUser2000 · Oct 13, 2017

maddie said:
No, not discrete ones, but according to Intel, I'm told that a glued together multiple die solution can't ever be effective compared to a monolithic solution.

They aren't really wrong. They'd be forced to in the future because they have no choice, and gains from process are slowing. Everyone would forever do monolithic if process gains aren't slowing down. You need 3-4 years now for a new process that brings 2x gain in density and ~30% reduction in power.

Moore's Law, by principle seems to favor technologies that allow for proliferation around the globe, and ever smaller, more mobile devices. The companies that did that gained greatly financially, and those that didn't, not so much. Multi-big die devices go against that.

maddie · Oct 13, 2017

IntelUser2000 said:
They aren't really wrong. They'd be forced to in the future because they have no choice, and gains from process are slowing. Everyone would forever do monolithic if process gains aren't slowing down. You need 3-4 years now for a new process that brings 2x gain in density and ~30% reduction in power.

Moore's Law, by principle seems to favor technologies that allow for proliferation around the globe, and ever smaller, more mobile devices. The companies that did that gained greatly financially, and those that didn't, not so much. Multi-big die devices go against that.

I'm not really sure what you're saying here.

You seem to be implying the need to go multi-die in the future and yet claim that the poster who wrote this following statement [which I was replying to], isn't really wrong. Can't have it both ways.

"I seriously doubt we are going to see any kind of multi-chip packages for a gaming GPU anytime soon.
Even if we did, it would just be a desperation play from AMD, and NVidia would crush them with a single chip."

Basically I was trying to point to a recent example of the extreme fallacy in that statement using Skylake-X vs Epyc as proof. Now we have something similar being said of GPUs. In my view, an epically, or should I say, 'Epyc-ally' myopic argument.

PeterScott · Oct 13, 2017

maddie said:
No, not discrete ones, but according to Intel, I'm told that a glued together multiple die solution can't ever be effective compared to a monolithic solution.

Yeah, isn't it great that AMD solved the problem of multi-socket CPU systems, plaguing the industry, that required kludgy drivers (like CF/SLI) to handle using more than one CPU socket... Oh, Wait...

In reality that was never a problem. Different problem spaces have different solutions. CPUs aren't that finicky, even in separate sockets like has been done for decades, or multiple CPU chips sharing a package (My 2008 Q9400 has two CPU chips).

AMD did not break new ground putting multiple CPU chips into one package. Those issues were solved ages ago.

Multiple gaming GPUs OTOH, have management and memory sharing issues that remain significant unsolved problems.

maddie · Oct 13, 2017

PeterScott said:
Yeah, isn't it great that AMD solved the problem of multi-socket CPU systems, plaguing the industry, that required kludgy drivers (like CF/SLI) to handle using more than one CPU socket... Oh, Wait...

In reality that was never a problem. Different problem spaces have different solutions. CPUs aren't that finicky, even in separate sockets like has been done for decades, or multiple CPU chips sharing a package (My 2008 Q9400 has two CPU chips).

AMD did not break new ground putting multiple CPU chips into one package. Those issues were solved ages ago.

Multiple gaming GPUs OTOH, have management and memory sharing issues that remain significant unsolved problems.

Who's talking about 'multiple gaming GPUs'? It seems only you as, AFAIK, the concept is a multi-die integrated GPU. This is a perfect example of a strawman defense.

Technical issues do exist, which I should not need to add, exists in EVERY technical solution to ANY product. Just because we have working examples of a product does not mean that technical hurdles were not involved, just that have been solved. The same is happening to multi-die GPUs. Just because you can't see a solution should not be taken that there can't be one.

IF is scalable to a 512 bit width according to AMD. What is the data capacity of a bus that wide?

PeterScott · Oct 13, 2017

maddie said:
Who's talking about 'multiple gaming GPUs'? It seems only you as, AFAIK, the concept is a multi-die integrated GPU. This is a perfect example of a strawman defense.

I assume most here care about GPUs for gaming and not compute/HPC.

Solving multi-GPU issues for compute/HPC is trivial compared to solving it for gaming.

maddie · Oct 13, 2017

PeterScott said:
I assume most here care about GPUs for gaming and not compute/HPC.

Solving multi-GPU issues for compute/HPC is trivial compared to solving it for gaming.

Your point being what exactly? That it being hard means ......? We leave the hard problems for posterity?

The simple answer is that for continued rapid advancement in performance, the industry is being forced into a multi-die approach for both CPUs and GPUs. The choice is which 'hard' do we tackle first. Continuing on the path for high performance cost-effective yields on smaller nodes or segmenting a product and reassembling it on a SI substrate. I suggest that the latter 'hard' is less difficult than the first. Witness Intel's EMIB upcoming designs.

I will even argue that your HPC not gaming argument is flawed as the data bandwidth requirements for HPC is just as high as gaming and in some cases even higher. Hell, even mining needs more bandwidth/shader than gaming in general. You upclock the memory and downclock the GPU.

PeterScott · Oct 13, 2017

maddie said:
Your point being what exactly? That it being hard means ......? We leave the hard problems for posterity?

I only said don't expect them soon. We are many years away from when backs are really against the wall on die size.

Multi-GPU designs are always going to suffer penalties in gaming, compared to monolithic ones. You could force this move early, but doing it prematurely, will leave you at further disadvantage to a competitor that is still using monolithic designs.

This has been a problem worked on for almost 2 decades, and SLI/CF implementations are still not seamless. You would think after decades, this would be running smoothly at least at the driver level and all games would just work fine. They don't. There are still interactions between games and CF/SLI drivers that have to be worked out on a case by case basis. Do you really think just moving the chips on the same package is going to make those issues vanish?

A lot of people seem to think that Navi will be AMDs move to multi-chip gaming, and seem giddy that this will put them ahead of NVidia.

I think it is a near certainty that both of those things won't happen.

Either AMD knows it is too soon, so they really won't go multi-chip yet. Or if they do go down this road, they will actually create bigger performance deficits vs NVidia, not solve them.

maddie · Oct 13, 2017

PeterScott said:
I only said don't expect them soon. We are many years away from when backs are really against the wall on die size.

Multi-GPU designs are always going to suffer penalties in gaming, compared to monolithic ones. You could force this move early, but doing it prematurely, will leave you at further disadvantage to a competitor that is still using monolithic designs.

This has been a problem worked on for almost 2 decades, and SLI/CF implementations are still not seamless. You would think after decades, this would be running smoothly at least at the driver level and all games would just work fine. They don't. There are still interactions between games and CF/SLI drivers that have to be worked out on a case by case basis. Do you really think just moving the chips on the same package is going to make those issues vanish?

A lot of people seem to think that Navi will be AMDs move to multi-chip gaming, and seem giddy that this will put them ahead of NVidia.

I think it is a near certainty that both of those things won't happen.

Either AMD knows it is too soon, so they really won't go multi-chip yet. Or if they do go down this road, they will actually create bigger performance deficits vs NVidia, not solve them.

For the last time. This is not SLI/XFire. Your statement underlines your misunderstanding. We did not have a cost effective, low latency, low power way to allow high bandwidth between dies. SI tech with micro-bumps is a new tech just mastered, so using the point of it never being done before is ridiculous. Having it allows new solutions as has always happened in the world.

Space-X recovering 1st states with rocket engines was also said to have too much of a weight penalty to possibly work. Throughout the tech world, we have these definitive statements of the unworkable that proved to be false, once resources were allocated to solving them.

Your final statement is very revealing in that it appears to be a partisan argument used for personal goals. Defending 'my' side and belittling the 'other' side. Strangely, all of the major players are progressing in this direction as fast as possible. All I can say is that we'll soon see who is correct.

I'm expecting a reply from you, but this is my last. Carry on.

PeterScott · Oct 13, 2017

maddie said:
For the last time. This is not SLI/XFire.

And hopefully, for the last time. The issues related to managing multiple GPUs still have to be dealt with, even if you put them in the same package. The interconnect may be faster, but at the logical level, it is still the same problem.

maddie said:
Your final statement is very revealing in that it appears to be a partisan argument used for personal goals. Defending 'my' side and belittling the 'other' side.

Said the Pot, to the Kettle.

maddie said:
By any chance, did you work on the Skylake-X team?

Topweasel · Oct 13, 2017

PeterScott said:
Yeah, isn't it great that AMD solved the problem of multi-socket CPU systems, plaguing the industry, that required kludgy drivers (like CF/SLI) to handle using more than one CPU socket... Oh, Wait...

In reality that was never a problem. Different problem spaces have different solutions. CPUs aren't that finicky, even in separate sockets like has been done for decades, or multiple CPU chips sharing a package (My 2008 Q9400 has two CPU chips).

AMD did not break new ground putting multiple CPU chips into one package. Those issues were solved ages ago.

Multiple gaming GPUs OTOH, have management and memory sharing issues that remain significant unsolved problems.

Well not completely true on any level. AMD took a mesh tech which Intel is using on the SL server dies and applied it to a whole package that goes from the inter module communication to multi-socket communication. That is more than fitting two dies on single chip or multi-socket. True their not as finicky which is one of several reasons this is implemented on the CPU's first. But that isn't even why you use this on an MCM like design with the GPU's.

First the problems with SLI and Xfire are due to using an established bus tech to communicate delicate communication and latency intensive communication with secondary devices that have their own driver implementation in the BIOS. Even with their X2 cards they are leveraging their work in that field to include two whole cards into a single package. What a proper implementation like IF on graphics cards would allow is that same mesh tech to allow for all these dies to be talked to as one device. Will they pull if off who knows. But this isn't just about getting the most efficient package. We know this isn't the case even on non-finicky CPU's because AMD had to double up power on the 1950x to keep it's clock speed. But what it would fix is yields. GPU dies are giant dies run at high clockspeeds that have to run at hot temperatures. An arch designed around running slower dies more efficiently would take care of a lot yield issues, both in the clockspeed goal and die size to defect ratio. GPU's tend to have some of the worst yield results. Doing it like this would more than make up for the total increase in silicon.

jpiniero · Oct 13, 2017

PeterScott said:
It wouldn't save them money in any term. Intel has owned the vast majority of the CPU/GPU market with their iGPU chips for years.

But that's only because of the CPU and not the IGP. I have no idea how large the IGP division is; but if they could save a decent amount on money on salaries and the like you have to think they are exploring it. Especially given that Intel is now a server company first.

Intel Chips With “Vega Inside” Coming Soon?

Lifer

Platinum Member

Lifer

Lifer

Platinum Member

Lifer

Platinum Member

Golden Member

Lifer

Platinum Member

Diamond Member

Lifer

Lifer

Diamond Member

Elite Member

Diamond Member

Platinum Member

Diamond Member

Platinum Member

Diamond Member

Platinum Member

Diamond Member

Platinum Member

Diamond Member

Lifer