
Engineers boost Llano GPU performance by 20% without overclocking

But the work MS put into C++ AMP would suggest that GPGPU has a strong future, it isn't dead.
Meh. C++ AMP is just CUDA in disguise. A Microsoft disguise, that is.
There's more potential in 10,000 shaders than there is in AVX2.
Why? AVX2 includes all the vital instructions a GPU supports. GPUs have a few additional instructions which are useful for graphics, but AVX2 also has instructions for small data types which no GPU supports. So there's no significant difference in potential.
 
AMD's GCN supports x86 instructions, while AVX-1024 seems to be a holy grail.
GCN won't support x86 instructions. It will merely support the x86-64 virtual address space and coherency model. That's an improvement but it's stopping short of full unification.

Instead of trying to add CPU features to the GPU, Intel instead simply adds GPU features to the CPU. AVX2 is a massive leap forward considering that it will double the CPU's peak throughput with a negligible increase in transistor count. And the gather instruction could even be up to 18 times faster than the old alternative.
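To make the gather point concrete, here is an illustrative Python model (not real intrinsics) of what an AVX2 gather like `vgatherdps` does: fetch eight floats from arbitrary indices in one instruction, where pre-AVX2 code needs eight separate scalar loads plus shuffles to build the vector.

```python
# Illustrative sketch of AVX2 gather semantics, modeled in plain Python.
# A real vgatherdps loads 8 single-precision floats from base + index*scale
# in a single instruction; the old alternative is 8 scalar loads + inserts.

def gather(base, indices):
    """Model of a vector gather: one 'instruction' filling all 8 lanes."""
    return [base[i] for i in indices]

table = [x * 0.5 for x in range(100)]   # some lookup table in memory
idx = [3, 17, 42, 8, 99, 0, 56, 21]     # one index per SIMD lane (8 lanes)

print(gather(table, idx))               # all 8 lanes filled at once
```

The "up to 18 times faster" figure in the article refers to exactly this pattern: replacing a long scalar load/insert sequence with one instruction.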

After that Intel will likely focus on making the CPU as power efficient as the GPU by executing wider SIMD instructions over multiple clock cycles, just like a GPU.

It's ironic that AMD was the first to unify the vertex and pixel processing cores (ATI at that time), but it's resisting the unification of the CPU and GPU cores.
 
C++ AMP is just CUDA in disguise. A Microsoft disguise, that is.
It's an open specification. But yes, it looks like Microsoft is the only one implementing it, and it doesn't offer any substantial benefits over the alternatives.

What worries me the most is that future GPUs will likely have more features to better support GPGPU workloads. But that means that just like we went through Shader Model 1.0 to 5.0 already (including sub-versions), and six CUDA versions, there would have to be various versions of C++ AMP. Developers really don't like that fragmentation.

So why not skip all those GPU-specific limitations and just support C++ as-is (or any other language, for that matter)? AVX2 can be used to auto-vectorize any code loop with independent iterations.
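As a sketch of what "independent iterations" means: in a loop like the first one below, each output element depends only on the same-index inputs, so a compiler targeting AVX2 can execute eight iterations per instruction. The second loop carries a dependence across iterations, which blocks straightforward vectorization. (Python is used here purely to show the dependency structure, not as the compiled language.)

```python
# Vectorizable: each c[i] depends only on a[i] and b[i], so a compiler
# can process 8 iterations at once with 256-bit AVX2 registers.
def saxpy(alpha, a, b):
    return [alpha * ai + bi for ai, bi in zip(a, b)]

# Not trivially vectorizable: acc feeds from one iteration into the next,
# so the iterations are not independent.
def prefix_sum(a):
    out, acc = [], 0
    for x in a:
        acc += x
        out.append(acc)
    return out

print(saxpy(2.0, [1, 2, 3, 4], [10, 20, 30, 40]))  # [12.0, 24.0, 36.0, 48.0]
print(prefix_sum([1, 2, 3]))                        # [1, 3, 6]
```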
 
Yeah, I'm pretty sure you're right. My old laptop had more than enough GPU to play WoW on medium settings, but even at ultra-low it could only muster about 13 fps due to the bandwidth bottleneck. Sometimes I wish they would integrate even 64 MB of video memory... would it be that hard? Then you could at least get some mildly serious gaming done with older games.



😕

My experience as of late with an A6 APU has been quite good, and the idea has certainly shown its validity in making decent entry-level gaming performance cheap for the masses, for the most part.

Easily the three things that hold AMD's APUs back (especially in laptops) are 1) memory bandwidth, 2) GPU clock speed, and 3) CPU clock speed.

Graphics performance has been hit and miss across titles. Surprisingly, Crysis performed quite well: despite the popular assertion that it is badly optimized, the A6 managed just over 20 FPS average at 1366 x 768, no AA, and medium settings, with physics and shaders at high. Left 4 Dead 2 and Team Fortress 2, as expected, could be run on just about max settings, though I would leave AA off since it is a very bandwidth-hungry feature.

The main "miss" was the lack of acceptable performance in Battlefield 3 and Call of Duty 4, though they were probably just too hard on the CPU cores of the A6, which run at only 1.5 GHz and, even with turbo, reach only 2.4 GHz if the TDP is within acceptable limits. CoD4 in my experience depends heavily on clock speed rather than architecture. My CoD4 test was on a 40-player server with lots going on, at max settings without AA or AF, but that is something a modern system should be able to handle without issue.

Between more memory bandwidth and on-chip video memory, I think it's more feasible to keep boosting the bandwidth, especially since that also benefits the CPU, unless on-die video memory could reach 512 MB and at least 40 GB/s for something like a Trinity-class GPU while remaining cost effective. Using that on-chip RAM as a cache for the CPU is an interesting idea, but that's just another layer programmers would have to figure out how to exploit, when PC systems already vary quite wildly.
 
AVX2 is a massive leap forward considering that it will double the CPU's peak throughput with a negligible increase in transistor count.

Surely it doubles throughput, just as AVX did... that is, on a few percent of the code...

You should be more cautious before writing such Intel propaganda.

Basically Intel is late and afraid of the potential of GPGPU leverage, so they try their best to downplay such possibilities, which is the main purpose of the rest of your post that I didn't care to quote since it's just technical garbage.
 
To cover all the bases, their OpenCL SDK is already taking advantage of AVX. It's based on the LLVM compiler framework, so the auto-vectorization can be used for just about any programming language.

Well, that didn't help much; the only benefit for OpenCL would be to cover the scatter instruction, and that is not really that important... (I did my homework 😎)

Regardless of OpenCL adoption, I still think that in raw performance, GPUs are going to be better.

EDIT:

I got ninja'd by Abwx.
 
Surely it doubles throughput, just as AVX did... that is, on a few percent of the code...
No, the problem with AVX1 is that it doesn't have 256-bit integer instructions, no vector-vector shifts, and no gather. That means it can't execute scalar code in each lane the way a GPU does because there is no vector equivalent of each scalar instruction. AVX2 fixes all of that, and adds in FMA as a bonus.

So you can't downplay the importance of AVX2 without also downplaying GPGPU.
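To make those gaps concrete, here is a lane-wise Python model of two of the AVX2/FMA additions (function names borrow the instruction mnemonics; this is an illustration, not real bindings): per-lane variable shift counts and fused multiply-add, neither of which AVX1 offers.

```python
# Per-lane semantics of two instructions AVX1 lacks, modeled in Python.

def vpsllvd(a, counts):
    """Variable (vector-vector) left shift, new in AVX2: each 32-bit lane
    is shifted by its own count (count >= 32 cases ignored in this sketch)."""
    return [(x << c) & 0xFFFFFFFF for x, c in zip(a, counts)]

def vfmadd(a, b, c):
    """Fused multiply-add: a*b + c per lane in one instruction,
    doubling floating-point operations per instruction."""
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]

print(vpsllvd([1, 1, 1, 1], [0, 1, 2, 3]))        # each lane uses its own count
print(vfmadd([2.0, 3.0], [4.0, 5.0], [1.0, 1.0]))  # one mul + one add per lane
```

With AVX1, the shift count had to be the same for every lane, which is exactly what breaks the "scalar code in each lane" execution model described above.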
You should be more cautious before writing such intel propaganda.
How is it my fault that Intel will implement AVX2 before AMD does?
Basically intel is late and afraid of the gpgpu potential leveraging,
so they try their best to downplay such possibilities
Why would Intel be late? They sell four times more CPUs than AMD does (and not every AMD CPU is an APU). Also, AMD can't stay behind and will have to implement AVX2 as well. So a couple of years from now, there will be more AVX2-capable systems than systems with an APU. And that's not all:

[Image: AMD HSA roadmap slide]


"In 2014 AMD plans to deliver HSA compatible GPUs that allow for true heterogeneous computing where workloads will run, seamlessly, on both CPUs and GPUs in parallel. The latter is something we've been waiting on for years now but AMD seems committed to delivering it in a major way in just two years." - Anand

Apparently it will take AMD at least two more years to complete its heterogeneous architecture. By that time we'll have Haswell in the majority of new desktops, laptops, ultrabooks, workstations and servers, and its 14 nm shrink will be launched that same year.
 
Well, that didn't help much; the only benefit for OpenCL would be to cover the scatter instruction, and that is not really that important... (I did my homework 😎)
I'm sorry but clearly you didn't do it very well because there is no scatter instruction. Not in AVX2, and not in any GPU.
Regardless of OpenCL adoption, I still think that in raw performance, GPUs are going to be better.
Effective performance is the only thing that matters. The very topic of this thread shows that GPUs leave performance on the floor when not assisted by the CPU. And that's only from cache misses. There's lots of additional overhead from passing things back and forth and there are GPU-specific bottlenecks that make GPGPU applications only reach a fraction of the theoretical GFLOPS number.

Also, in games the GPU is busy doing graphics, so you can't actually burden it with a lot of extra work without affecting graphics performance. A quad-core Haswell CPU will have close to 500 GFLOPS of raw computing power, so developers will find it a much more attractive source of generic computing power, not least because it only requires a mere recompilation of their code.
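The ~500 GFLOPS figure follows from straightforward peak arithmetic. A sketch, assuming a quad-core part at roughly 3.5 GHz (an assumption; actual Haswell clocks had not been announced at the time) with two 256-bit FMA ports per core:

```python
# Peak single-precision FLOPS arithmetic behind the "close to 500 GFLOPS" claim.
cores         = 4
clock_ghz     = 3.5    # assumed clock; not confirmed at time of writing
fma_ports     = 2      # Haswell: two 256-bit FMA units per core
lanes         = 8      # 256-bit register / 32-bit single-precision floats
flops_per_fma = 2      # one multiply + one add, fused into one operation

peak_gflops = cores * clock_ghz * fma_ports * lanes * flops_per_fma
print(peak_gflops)     # 448.0 GFLOPS, i.e. "close to 500"
```

The same style of multiplication is how GPU vendors arrive at their headline GFLOPS numbers, which is the point made later in the thread about peak vs. sustained throughput.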
 
No, the problem with AVX1 is that it doesn't have 256-bit integer instructions, no vector-vector shifts, and no gather. That means it can't execute scalar code in each lane the way a GPU does because there is no vector equivalent of each scalar instruction. AVX2 fixes all of that, and adds in FMA as a bonus.

The widening of parallelism will still not be enough. It's not only a matter of instruction window width but firstly of efficient execution resources.

CPUs, Haswell included, are not adequate for massively parallel execution.
You can bet that Haswell will just make use of OpenCL at best.


Why would Intel be late? They sell four times more CPUs than AMD does (and not every AMD CPU is an APU). Also, AMD can't stay behind and will have to implement AVX2 as well. So a couple of years from now, there will be more AVX2-capable systems than systems with an APU. And that's not all:

"In 2014 AMD plans to deliver HSA compatible GPUs that allow for true heterogeneous computing where workloads will run, seamlessly, on both CPUs and GPUs in parallel. The latter is something we've been waiting on for years now but AMD seems committed to delivering it in a major way in just two years." - Anand

Apparently it will take AMD at least two more years to complete its heterogeneous architecture. By that time we'll have Haswell in the majority of new desktops, laptops, ultrabooks, workstations and servers, and its 14 nm shrink will be launched that same year.

Since AMD has a better GPU, Intel's position is to create false flags until they have a decent GPGPU; as such, their AVX-512 or even AVX-32768 is just a smoke screen. AVX was developed at a time when both AMD and Intel thought that CPUs would prevail over GPUs, to the point of progressively implementing GPU instructions on actual CPU microarchitectures, but they realized it would be impossible without designing a behemoth CPU with poor overall efficiency, so they have to rely on improved integrated GPGPUs, which was the reason for the Larrabee project.

Larrabee being ill conceived, they have to improve their GPU and in the meantime wage a disinformation campaign according to which AVX will be the holy grail and cure for all illnesses...
 
The widening of parallelism will still not be enough. It's not only a matter of instruction window width but firstly of efficient execution resources.
What makes you think it won't be enough? A quad-core Haswell CPU will reach close to 500 GFLOPS. Llano's GPU can do 480 GFLOPS. Regardless of how Llano's successors will perform, that's nothing to sneeze at.
CPUs, Haswell included, are not adequate for massively parallel execution.
That's just a blanket statement based on preconception. The numbers show otherwise.
 
What makes you think it won't be enough? A quad-core Haswell CPU will reach close to 500 GFLOPS. Llano's GPU can do 480 GFLOPS. Regardless of how Llano's successors will perform, that's nothing to sneeze at.

That's just a blanket statement based on preconception. The numbers show otherwise.

If you think that a Haswell core will have as much FPU capability as a full quad-core i7 then you really are a die-hard fanboy...

You're confusing average throughput with peak throughput, which can occur only in a sequence of a few instructions merged into the overall code; that code can't consist exclusively of computation instructions, and moreover only AVX-related computations count...

The difference with Llano's GPU is that it can sustain its 480 GFLOPS, while it's unlikely that a CPU can sustain a theoretical peak, as it must execute a lot of other instructions besides pure computing ones...

Btw, by the time Haswell is launched, Llano's FLOPS numbers will be old history, not because of Haswell but rather due to its replacement.
 
Because decoding instructions is one thing, and executing them efficiently is another, for which a CPU is ill-equipped given its huge lack of execution resources...
Now you're just clueless. Haswell cores should have a computing density of about 8 GFLOPS/mm² (including the L3 cache). For comparison, a GeForce GTX 580 has a computing density of 3 GFLOPS/mm².
 
If you think that a Haswell core will have as much FPU capability as a full quad-core i7 then you really are a die-hard fanboy...
If you mean Nehalem then yes, Haswell will have the same throughput in a single core. AVX doubled the vector width, and AVX2 adds fused multiply-add (FMA) support which doubles the number of floating-point operations per instruction. Hence there's a fourfold increase in computing power. I'm not a "fanboy", I'm simply stating the facts.

If AMD was the first to implement AVX2 I would applaud them instead of Intel. Unfortunately they brought us Bulldozer instead...
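The fourfold figure decomposes into two separate doublings, which a quick tally makes explicit (per core, single precision, assuming Nehalem's separate 128-bit ADD and MUL ports and Haswell's two 256-bit FMA ports):

```python
# Per-core single-precision FLOPS per cycle, Nehalem vs. Haswell.
sse_width = 4            # 128-bit register / 32-bit floats (Nehalem)
avx_width = 8            # 256-bit register / 32-bit floats (Haswell, via AVX)

nehalem = sse_width * 2  # separate ADD port + MUL port: 4 + 4 = 8 FLOPS/cycle
haswell = avx_width * 2 * 2  # two FMA ports, 2 flops per FMA: 32 FLOPS/cycle

print(haswell / nehalem)     # 4.0 -> the "fourfold increase" claim
```

One doubling comes from AVX widening the registers, the other from FMA fusing a multiply and an add into a single operation per port.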
You're confusing average throughput with peak throughput, which can occur only in a sequence of a few instructions merged into the overall code; that code can't consist exclusively of computation instructions, and moreover only AVX-related computations count...
I'm not confusing anything. I'm using the exact same approach by which GPU manufacturers calculate their FLOPS ratings. And rest assured that Intel will provide Haswell's cores with plenty of bandwidth to achieve a high sustainable throughput.
The difference with Llano's GPU is that it can sustain its 480 GFLOPS, while it's unlikely that a CPU can sustain a theoretical peak, as it must execute a lot of other instructions besides pure computing ones...
Llano can't sustain 480 GFLOPS in any real-world application. And what are these "other instructions" you're talking about?
Btw, by the time Haswell is launched, Llano's FLOPS numbers will be old history, not because of Haswell but rather due to its replacement.
Yes, but like I said before only effective performance is relevant. AMD can throw more GPU cores at it but they'll face diminishing returns due to the bandwidth bottleneck and Amdahl's Law.
 
If you mean Nehalem then yes, Haswell will have the same throughput in a single core. AVX doubled the vector width, and AVX2 adds fused multiply-add (FMA) support which doubles the number of floating-point operations per instruction. Hence there's a fourfold increase in computing power. I'm not a "fanboy", I'm simply stating the facts.

If AMD was the first to implement AVX2 I would applaud them instead of Intel. Unfortunately they brought us Bulldozer instead...

BD has FMA, so by your own reasoning its FP capability is currently unused, since it obviously doesn't leverage this advantage in FP-dependent software, but still, Haswell will...

I guess you will applaud FMA once it's an Intel feature...
I'm not confusing anything. I'm using the exact same approach by which GPU manufacturers calculate their FLOPS ratings. And rest assured that Intel will provide Haswell's cores with plenty of bandwidth to achieve a high sustainable throughput.

Llano can't sustain 480 GFLOPS in any real-world application. And what are these "other instructions" you're talking about?

Yes, but like I said before only effective performance is relevant. AMD can throw more GPU cores at it but they'll face diminishing returns due to the bandwidth bottleneck and Amdahl's Law.

So a peak throughput can't be sustained by specialized execution units and architecture, but it can be sustained once the compute gear is an Intel one...

As for Amdahl's law, you know that it's an empirical law based on past computers and can't be extended to future microarchitectures...
 
On real GP code? Link please!

The first AMD slide I find when I Google it says 480 GFLOPS peak, not sustained.

http://www.legitreviews.com/article/1649/1/

That would translate into about 160 GFLOPS sustained best case, less on GP code.


To demonstrate the capability of C++ AMP, Microsoft showed a rigid body simulation program that ran on multiple computers and devices from a single executable file and was able to scale in performance from 3 GFLOPS on the x86 cores of Llano to 650 GFLOPS on the combined APU power.

http://pcper.com/news/General-Tech/AFDS11-Microsoft-Announces-C-AMP-Competitor-OpenCL
 
BD has FMA, so by your own reasoning its FP capability is currently unused, since it obviously doesn't leverage this advantage in FP-dependent software, but still, Haswell will...
I never disregarded Bulldozer's FMA support. It too doubles the peak throughput. But it effectively only has one AVX-256 unit per two cores. And while the FMA support gives it the same peak performance as Sandy Bridge, in practice it's less powerful because every ADD or MUL has to be executed separately. On average only about 1/3 of floating-point instructions can be an FMA.

So Intel made a wise decision to first widen the execution units and then add FMA. AMD has not yet announced when it will double its execution width to catch up.
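The "only about 1/3 of floating-point instructions can be an FMA" point can be illustrated with a toy model, where a MUL immediately followed by the ADD that consumes it stands in for a fusable pair (a real compiler checks the actual data dependence, of course):

```python
# Toy model: count instructions after greedily fusing MUL->ADD pairs into FMAs.
# Standalone ADDs, MULs, and other FP ops still execute as separate instructions,
# which is why FMA alone can't double *effective* throughput.

def count_ops_with_fma(ops):
    """ops: list of 'mul'/'add'/other mnemonics; fuse adjacent mul, add pairs."""
    count = i = 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "mul" and ops[i + 1] == "add":
            count += 1          # fused: one FMA replaces two instructions
            i += 2
        else:
            count += 1          # unfusable op executes on its own
            i += 1
    return count

stream = ["mul", "add", "add", "mul", "sub", "mul", "add"]
print(count_ops_with_fma(stream), "instructions instead of", len(stream))
```

On a mixed stream like this, fusion removes only the paired operations, which is why Bulldozer's FMA gives it Sandy Bridge's peak on paper but less in practice when ADDs and MULs must issue separately.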
 