What's the purpose of integrated GPUs?

OVerLoRDI · Apr 25, 2012

Cost. Power. Form factor.

Cerb · Apr 25, 2012

cytg111 said:
cant find the link, but theres an interview with carmack on youtube (duh), about rage, where he goes into detal about what consoles does much much better than PC's and evens out the factor 10'ish flops due to this design .. and its actually in respect to shared resources.

Dunno, but we've probably all read it. The thing is, consoles still just don't have the performance, due to being behind the times by the time games are able to really utilize them. It all looks great when they start, and they get so much control, but then we have PCs with many times the power, by the times the games are coming out.

Shared RAM is ultimately the superior option, for programmability. Up to now, it's been a bandwidth problem. RDRAM's failure due to oligarchical collusion hasn't helped matters, either. If the GPU and CPU can share enough memory bandwidth, at low enough latency, or if GBs of stacked memristors really turn out cheap and effective as caches, and don't screw up cooling, we could start to move away from dedicated memories (Intel already shares cache, and AMD's iGPU are fast enough it doesn't matter that they don't, quite yet, so it's going pretty well on the chips, IMO).

pelov said:
This is the same argument that people use for defending Bulldozer but it never works. You can't claim something is amazing and then ask the entire planet to gravitate towards the new ISAs and architecture.

But, it's not that at all. OpenGL, OpenCL, and DirectX are the architectures. AVX(2) can hide behind function calls and (H|G)LSL translators. The question is, will it be good enough?

In pure computation, yes, it will be. It will fall short where modern GPUs excel, IMO (large register files and thousands of available vector lanes are going to be hard to beat, when they can all be utilized well), but will be no slouch, and more importantly, will bring x86 to the same level that vector-enhanced RISCs have been at for a long time, and maybe even exceed them.

But, to do well in graphics, compared to the competition, it will still need fast hardware for textures, poly setup, AA, AF, and so on, and grahics tasks don't need all that fancy prefetching, instruction snooping, branch prediction, instruction re-ordering, etc., that eats up so much of our CPUs' space and power. High FLOPS and bandwidth can do a lot, but it will only be as fast as what it can do the slowest. Since die space itself aught to be fairly cheap, and Intel's CPUs are already not too slow at parallel compute tasks (faster is not the goal; good enough that you can use it is), I doubt we'll see much further integration from Intel in the near future. Faster iGPUs and faster scalar cores, each able to run slower or turned off when not needed, just makes more sense, for the time being.

BenchPress · Apr 25, 2012

pelov said:
This is the same argument that people use for defending Bulldozer but it never works. You can't claim something is amazing and then ask the entire planet to gravitate towards the new ISAs and architecture.

Bulldozer is a whole different story. They lowered the IPC, crippled the SIMD performance, and didn't provide any hardware transactional memory technology.

You can count on it that Intel will not lower the IPC for Haswell, they'll double the SIMD throughput with AVX2 (and add mind-blowing features like 8x gather), and they're adding TSX to boot.

So referring to Bulldozer as being the same argument is quite ridiculous.

The same thing was said for AVX implementation, but where is it exactly? A few synthetic benchmarks and a handful of applications :/

AVX is clearly nothing more than the stepping stone to AVX2. The first AVX specification that Intel revealed, included support for FMA4. Then they changed it to FMA3, and then they moved it to AVX2. So the goal has always been to provide four times the SIMD throughput. Basically they wanted AVX2 all along but needed an intermediate step to get there.

So don't try to judge AVX2 based on AVX. AVX is incomplete and only benefits a narrow field of applications. AVX2 on the other hand features 256-bit operations for everything. And together with its gather and vector shift instructions, it's finally a suitable instruction set for SPMD processing.

And SPMD processing is precisely the thing that made GPUs much more powerful at throughput computing all these years. Haswell will be a nail in the coffin for GPGPU.

BenchPress · Apr 25, 2012

IntelUser2000 said:
Come on!

Haswell - 2013
Llano - 2011

Irrelevant. Fusion chips are limited by bandwidth. There won't be a whole lot of progress in the next few years. Of course the bandwidth can be increased, or they can add eDRAM, but each of these things increases the cost considerably.

Meanwhile, Haswell is only the beginning of the CPU's increase in computing throughput. Also keep in mind that theoretical peak performance is meaningless. The GTX 680 loses against a quad-core CPU at certain workloads (graphics nonetheless)! And that's still without AVX, let alone AVX2.

So AMD won't have an answer against a homogenous many-core CPU with AVX2.

CPUarchitect · Apr 25, 2012

OVerLoRDI said:
Cost. Power. Form factor.

Cost isn't an issue. By getting rid of the IGP there's room for twice the CPU cores.

Power is a concern, however the solution could come in the form of AVX-1024. They could keep the 256-bit execution units, but feed them 1024-bit instructions over four cycles. The same amount of useful work per clock would be done, with four times fewer instructions going through the CPU's front-end. And that's a huge power saving.

IntelUser2000 · Apr 25, 2012

You are still ignoring the part where in Notebook would need a significant sacrifice, especially dual cores where it takes up a vast majority of sales. Also they still need to add dedicated units for graphics like what Cerb said.

More importantly, that's not what Intel is doing. They are pushing towards more dedicated accelerators to save power and increase performance.

It may make sense for GPGPU but I doubt it makes sense for 3D graphics.

Cost isn't an issue. By getting rid of the IGP there's room for twice the CPU cores.

Because everyone knows enthusiast desktop chips are vast majority of the market, right? Besides, you can only put two more cores in place of the Ivy Bridge iGPU(entirely forgetting needing fixed function units for graphics).

ShintaiDK · Apr 25, 2012

CPUarchitect said:
Cost isn't an issue. By getting rid of the IGP there's room for twice the CPU cores.

Power is a concern, however the solution could come in the form of AVX-1024. They could keep the 256-bit execution units, but feed them 1024-bit instructions over four cycles. The same amount of useful work per clock would be done, with four times fewer instructions going through the CPU's front-end. And that's a huge power saving.

Have you considered what the power distribution is on Sandy or Ivy between the GPU and CPU?

Also if you payed attention to Larabee. You would know the 512bit units combine 32bit instructions and then execute it in a single cycle. To compare, using 1024bit with 4 cycles or 256bit in one cycle in terms of graphics output is the same.

And a final add, even Larabee got a raster unit added. Go figure.

Olikan · Apr 25, 2012

BenchPress said:
So AMD won't have an answer against a homogenous many-core CPU with AVX2.

why not? they have AVX 1...there is something that will lock them?

pelov · Apr 25, 2012

Olikan said:
why not? they have AVX 1...there is something that will lock them?

They're adding FMA3 in Piledriver so it would be a logical to assume that they would.

alyarb · Apr 25, 2012

Hey if you guys aren't busy I thought we could compare a bunch of theoretical constraints in such a way that they sound worthwhile enough on paper to challenge real world paradigms hardened by practical experience and conventional wisdom, and then share condescending soliloquies with each other on why our game-changing concept hasn't been implemented yet...

BenchPress · Apr 25, 2012

Olikan said:
why not? they have AVX 1...there is something that will lock them?

Not in theory, no. But they've been investing into heterogeneous computing for years now and thereby sacrificing CPU performance. They'd have to abandon Fusion and make a 180 degree turn to focus on homogenous CPU performance and catch up with Haswell and its successors.

Ben90 · Apr 25, 2012

cytg111 said:
cant find the link, but theres an interview with carmack on youtube (duh), about rage, where he goes into detal about what consoles does much much better than PC's and evens out the factor 10'ish flops due to this design .. and its actually in respect to shared resources.

I think its more of an API abstraction issue.

CPUarchitect · Apr 25, 2012

IntelUser2000 said:
You are still ignoring the part where in Notebook would need a significant sacrifice, especially dual cores...

Ivy Bridge is a teeny tiny chip, bringing quad-core to mainstream. And in case you haven't noticed, even chips for mobile phones have started to go quad-core. And the multi-core revolution doesn't end there, it's only getting started. Hardware transactional memory makes multi-threading a whole lot more scalable, so expect to see more cores with every process shrink.

Also they still need to add dedicated units for graphics like what Cerb said.

No. Like I said before, gather support in AVX2 takes care of the most expensive graphics operations, in a generic way. Also, adding a few more instructions (like extending BMI1/2 to AVX) would hardly take extra space. And again those would be generic enough to be useful for a lot of other purposes.

More importantly, that's not what Intel is doing.

Really? So providing a fourfold increase in core throughput is not part of an any well-coordinated plan? They've announced from the beginning that AVX would be extendable to 1024-bit.

It may make sense for GPGPU but I doubt it makes sense for 3D graphics.

Only a year ago most people thought GPGPU was going to go mainstream and the CPU would become less relevant. Then AVX2 was announced, obviously targeting SPDM processing. Suddenly opinions started to shift and now that NVIDIA has sacrificed GPGPU efficiency it's clear that's it's a dead end street.

With that in mind it really shouldn't be that hard to realize that it only takes a few more steps to make the CPU highly efficient at graphics. AVX-1024 is straightforward to implement after AVX2, and neither is vectorizing BMI particularly challenging... I'm sure that anything else you think would be missing, is easy to add as well.

Also keep in mind that it's inevitable that graphics becomes more generic. With every new Direct3D and OpenGL version the GPU has to add new capabilities. And this has brought it closer and closer to the CPU architecture. Efficient support for deep call stacks will require them to become even more like CPUs. Another thing that makes them converge is that graphics doesn't scale infinitely. The pipeline latencies have been going down every generation to keep the thread count manageable, and at some point they'll have to start doing out-of-order scheduling just like a CPU. That would also improve cache locality. And lastly, the applications are also evolving toward more complex workloads.

So CPUs will become suitable for graphics sooner than you might realize.

CPUarchitect · Apr 25, 2012

IntelUser2000 said:
Besides, you can only put two more cores in place of the Ivy Bridge iGPU...

Wrong. It's the size of 3.5 cores, so let's round that up to a nice and even 4.

CPUarchitect · Apr 25, 2012

ShintaiDK said:
Have you considered what the power distribution is on Sandy or Ivy between the GPU and CPU?

What's are you trying to say?

Also if you payed attention to Larabee. You would know the 512bit units combine 32bit instructions and then execute it in a single cycle. To compare, using 1024bit with 4 cycles or 256bit in one cycle in terms of graphics output is the same.

Haswell will have two 256-bit floating-point units per core. And yes I know that executing 1024-bit instructions over 4 cycles results in the same throughput per unit. The point is that it lowers power consumption because the front-end has to deliver fewer instructions and there will be less switching activity in the schedulers.

And a final add, even Larabee got a raster unit added. Go figure.

No, Larrabee performed rasterization in software. And they're not the only ones who've tried it: High-Performance Software Rasterization on GPUs. Rasterization is likely going to become programmable at some point.

ShintaiDK · Apr 25, 2012

CPUarchitect said:
What's are you trying to say?

Haswell will have two 256-bit floating-point units per core. And yes I know that executing 1024-bit instructions over 4 cycles results in the same throughput per unit. The point is that it lowers power consumption because the front-end has to deliver fewer instructions and there will be less switching activity in the schedulers.

No, Larrabee performed rasterization in software. And they're not the only ones who've tried it: High-Performance Software Rasterization on GPUs. Rasterization is likely going to become programmable at some point.

The GPU uses alot less power than the CPU for the same work. The CPU sits on something like 90% of the TDP budget.

Executing individual 16 or 32bit instructions in a 1024 AVX in 256bit parts is pretty useless. Try examine what AVX is used for today aswell as other SSE parts. They dont tend to me "GPGPU" related if you get my drift.

And for Larabee:

http://arstechnica.com/hardware/new...he-confusion-over-intels-larrabee-part-ii.ars

But again, I´m sure you know better than AMD/Intel

By your methods, I´m sure Ivy would need a 200W TDP.

CPUarchitect · Apr 25, 2012

ShintaiDK said:
The GPU uses alot less power than the CPU for the same work. The CPU sits on something like 90% of the TDP budget.

Please stop guessing. Anand shows that Ivy Bridge uses 53.3 Watt at full CPU load, while during running Metro 2033 it consumes 58.7 Watt. And that DX11 game is very light on the CPU (especially when bottlenecked by the GPU). So the GPU consumes quite a bit of power.

Executing individual 16 or 32bit instructions in a 1024 AVX in 256bit parts is pretty useless. Try examine what AVX is used for today aswell as other SSE parts. They dont tend to me "GPGPU" related if you get my drift.

What are you talking about? AVX2 can process 4 x 64-bit, 8 x 32-bit, 16 x 16-bit, or 32 x 8-bit per cycle. And it's perfectly suited for SPMD.

But again, I´m sure you know better than AMD/Intel

No, but I know better than Ars Technica. Larrabee does not have a hardware rasterizer.

IntelUser2000 · Apr 26, 2012

CPUarchitect said:
Please stop guessing.

Really? Because you are not?

http://www.hardware.fr/articles/863-6/hd-graphics-4000-2500-consommation-3d.html

Doesn't it say clearly on that AT review that the measurement is sytem power?

Wrong. It's the size of 3.5 cores, so let's round that up to a nice and even 4.

Sure, just ignore the L3 caches, which is a big benefit for CPU-only workloads. So in the end you get a crappy CPU and a crappy GPU.

By the way, the 15W Haswell SKUs are all dual core. And the indication is that the iGPU(not a CPU wanting to be a GPU frankenstein), is more buffed up than ever. And its obvious the 15W parts are going to be pushed really hard because of Ultrabooks.

And that's probably exactly what's going to happen.

Dual cores - 15-17W
Quad cores - 35/45/55W

The 25W dual cores stopped making any sense with Arrandale, and basically disappeared with Sandy Bridge. With Ivy Bridge even the 35W dual cores are not making sense anymore. Look at the Ivy Bridge chips. I mean the 35W 3520M clocks a mere 100MHz higher than the Sandy Bridge 2640M, Base, 2 core Turbo, and 1 core Turbo!

Really? So providing a fourfold increase in core throughput is not part of an any well-coordinated plan? They've announced from the beginning that AVX would be extendable to 1024-bit.

You said Haswell allows for the CPU to completely replace iGPUs, and they would stick more cores for that purpose. That's exactly the opposite happening in Haswell. Future chips are staying with Gen graphics, and the ones that aren't even Gen is going to/said to have move to Gen graphics.

Valleyview, the next Atom, is moving to Gen graphics based on Ivy Bridge. Xolo X900 review says even future Smartphone chips are moving to Gen graphics: http://www.anandtech.com/show/5770/lava-xolo-x900-review-the-first-intel-medfield-phone/10

lamedude · Apr 26, 2012

It would be nice if Abrash would make an AVX version of Pixomatic so we could find out but he finally accepted Gabe's offer to work at Valve after that Larrabee thing didn't pan out. Maybe he'll sneak a software renderer into HL3.

Hulk · Apr 26, 2012

tweakboy said:
Simply said. It is for people who don't play games. If your a gamer then you disable the onboard and use your dedicated PCIe card. gl

Bingo!
I don't play games and love integrated GPU's.
- Their cost is very low
- They don't take up a PCIe slot
- They don't draw a lot of power generate much heat or make noise
- My experience with Intel drivers has been flawless
- I never waste a moment upgrading graphics drivers, the drivers I start with sometimes just sit there and do their job for years

Since Comcast went encrypted digital and made TV tuners uselessI've been using iGPU's.

CPUarchitect · Apr 26, 2012

IntelUser2000 said:
Really? Because you are not?

http://www.hardware.fr/articles/863-6/hd-graphics-4000-2500-consommation-3d.html

That's the power consumption of just the GPU plane. It's not representative because there's also a lot of power being consumed by cache and RAM accesses. And even if that was included, the GPU still can't operate without running the graphics driver on the CPU. So that portion of the power consumption also has to be attributed to graphics.

Doesn't it say clearly on that AT review that the measurement is sytem power?

Yes, which is why I subtracted the idle power consumption. You have to look at the whole picture, and not isolate a particularly power efficient portion of the chip that is helpless by itself.

Sure, just ignore the L3 caches, which is a big benefit for CPU-only workloads.

So in the end you get a crappy CPU and a crappy GPU.

No. You don't have to increase the LLC cache size linearly with the number of cores. The dual-core Penryn had 6 MB of L2, while a quad-core Core i5 has 6 MB of L3. You can expect to see 8-cores with 6/8 MB in the future.The temporal and spacial locality of the LLC data doesn't change significantly with more cores. And something like graphics actually has highly regular access patterns.

By the way, the 15W Haswell SKUs are all dual core.

Sure, but that's the short-term future. I'm talking about a longer-term future. We'll get 1 TFLOP of computing power out of 15 Watt CPU sooner than you might realize (but still several generations after Haswell). And it would be far more useful if that was fully generic instead of having a limited programming model. Also note that currently we're still stuck with lightly threaded software because of a lack of hardware transactional memory. But Haswell brings us TSX. Of course it won't make the software heavily multi-threaded overnight, but gradually it will become better to have more small cores than few big cores.

And the indication is that the iGPU(not a CPU wanting to be a GPU frankenstein), is more buffed up than ever. And its obvious the 15W parts are going to be pushed really hard because of Ultrabooks.

And that's probably exactly what's going to happen.

Dual cores - 15-17W
Quad cores - 35/45/55W

Yes, but the graphics performance expectations of a 15 Watt part will also be significantly lower than that of a 55 Watt part. And that applies to whether you have heterogeneous or homogeneous graphics.

You said Haswell allows for the CPU to completely replace iGPUs, and they would stick more cores for that purpose.

No, I said it's a significant step toward that. It marks the end of GPGPU, but not the end of IGPs, yet.

It doesn't seem unlikely though that the graphics driver will use the massive computing power of AVX2 to assist the IGP. And so this creates a gradual transition from heterogeneous to homogeneous computing. Once AVX-1024 is implemented, they can probably get rid of the IGP's compute cores, since it's practically the same thing. And then they can start replacing dedicated fixed-function components with new VEX instructions.

Gloomy · Apr 26, 2012

alyarb said:
hey if you guys aren't busy i thought we could compare a bunch of theoretical constraints in such a way that they sound worthwhile enough on paper to challenge real world paradigms hardened by practical experience and conventional wisdom, and then share condescending soliloquies with each other on why our game-changing concept hasn't been implemented yet...

fgrrrykjdslhglkjfa gaghahaha

Anarchist420 · Apr 27, 2012

Well, I was glad to see that some people didn't think I was out of my mind. Both approaches kind of make sense to me, however.

I think that programmable blending and programmable depth could *possibly* eventually be made more efficient than hardware blending/depth. I don't think it's really a black and white issue. Then again, I'm no expert.

BenchPress · Apr 27, 2012

Anarchist420 said:
I think that programmable blending and programmable depth could *possibly* eventually be made more efficient than hardware blending/depth.

NVIDIA already performs all blending in the shader for Tegra (see the GL_NV_shader_framebuffer_fetch extension). And that's a power-restricted mobile chip!

Programmable (depth) culling isn't new either, and can provide a nice speedup.

So these things are absolutely possible, and they're ideal for an implementation on the CPU.

What's the purpose of integrated GPUs?

Diamond Member

Elite Member

Senior member

Senior member

Senior member

Elite Member

Lifer

Platinum Member

Diamond Member

Platinum Member

Senior member

Platinum Member

Senior member

Senior member

Senior member

Lifer

Senior member

Elite Member

Golden Member

Diamond Member

Senior member

Golden Member

Diamond Member

Senior member