AMD summit today; Kaveri cuts out the middle man in Trinity.


CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Your hypothesis sounds very interesting. Do you have some numbers to back it up? I'd like to see if an AVX2 CPU is faster at different typical parallel compute tasks than a compute optimized GPU (like GK110) at the same power and a comparable process node.
Neither Haswell nor GK110 is out yet. But who needs the numbers when the technology speaks for itself? GPUs achieve a high theoretical throughput by using wide vector units with gather and FMA support. AVX2 offers the exact same features, but integrates them into the CPU cores themselves thus avoiding any heterogeneous latency overhead or bandwidth bottlenecks. The CPU also benefits from higher cache hit rates due to a lower thread count (in turn thanks to out-of-order execution).
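To make that concrete, here is a rough sketch of the kind of inner loop I have in mind, using the AVX2/FMA intrinsics (the function and array names are made up purely for illustration, and it assumes a compiler targeting AVX2+FMA3): the gather pulls eight indexed floats from cache and feeds them straight into a single FMA, all inside the CPU core.

```cpp
#include <immintrin.h>

// Hypothetical kernel: y[i] += a[idx[i]] * b[i], eight lanes at a time.
// Assumes n is a multiple of 8 and idx[] holds valid 32-bit indices.
void scaled_gather_fma(float* y, const float* a, const int* idx,
                       const float* b, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i*)(idx + i));
        // AVX2 gather: eight loads from a[] driven by the index vector.
        __m256 va = _mm256_i32gather_ps(a, vidx, 4);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        // FMA: vy = va * vb + vy in a single instruction.
        vy = _mm256_fmadd_ps(va, vb, vy);
        _mm256_storeu_ps(y + i, vy);
    }
}
```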

Either way I don't think GK110 can be included in a fair comparison. NVIDIA describes it as "The Fastest, Most Efficient HPC Architecture Ever Built". Clearly this behemoth is not aimed at the average consumer. AVX2, on the other hand, will be in every consumer desktop/laptop/ultrabook chip, starting with Intel and later also AMD, and there's a high likelihood of equivalent homogeneous throughput computing technology appearing for other architectures sooner or later.

Also note that a discrete GPU can't work by itself! It still needs a CPU, and so the power consumption of that CPU has to be taken into account as well. Therefore a fair comparison should test against APUs instead. See for instance these results: Handbrake OpenCL. The i7-2820QM is running the OpenCL code on the CPU, and outperforms the A10-4600M. With AVX2's doubling of the throughput and addition of gather support, the CPU should be able to greatly increase its lead.

A fair comparison also requires running optimized code. AVX2 can run OpenCL, but OpenCL is not aimed at homogeneous computing. It has restrictions to be able to run on the GPU. AVX2 can run more aggressively optimized code that doesn't have to live by OpenCL's restrictions. So homogeneous computing offers more capabilities than heterogeneous computing, enabling more applications than what the latter would be able to support.
 

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
Better yet, AVX2 can run on HSA, so consumers get the choice of superior, more efficient GPGPU performance. It's a win-win for HSA.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Neither Haswell nor GK110 is out yet. But who needs the numbers when the technology speaks for itself? GPUs achieve a high theoretical throughput by using wide vector units with gather and FMA support. AVX2 offers the exact same features, but integrates them into the CPU cores themselves thus avoiding any heterogeneous latency overhead or bandwidth bottlenecks. The CPU also benefits from higher cache hit rates due to a lower thread count (in turn thanks to out-of-order execution).
I don't see an obvious bottleneck in current GPU architectures:
[attached image: haswell_vs_gpus.png]

More on that here:
http://forums.anandtech.com/showpost.php?p=33600506&postcount=137

Either way I don't think GK110 can be included in a fair comparison. NVIDIA describes it as "The Fastest, Most Efficient HPC Architecture Ever Built". Clearly this behemoth is not aimed at the average consumer. AVX2, on the other hand, will be in every consumer desktop/laptop/ultrabook chip, starting with Intel and later also AMD, and there's a high likelihood of equivalent homogeneous throughput computing technology appearing for other architectures sooner or later.
GK110 is not the one and only GPGPU architecture; it is just one of the most optimized ones. And both low-threaded fat OoO cores with wider SIMDs and scalar/SIMD-capable, power-usage-optimizing GPGPU SIMD cores (the big blocks) will grow toward each other. Why? Because, growing out of their original, rather suboptimal use cases, they become more and more adapted to the wide field of possible use cases. But they still can't substitute for each other perfectly.

Also note that a discrete GPU can't work by itself! It still needs a CPU, and so the power consumption of that CPU has to be taken into account as well. Therefore a fair comparison should test against APUs instead. See for instance these results: Handbrake OpenCL. The i7-2820QM is running the OpenCL code on the CPU, and outperforms the A10-4600M. With AVX2's doubling of the throughput and addition of gather support, the CPU should be able to greatly increase its lead.

A fair comparison also requires running optimized code. AVX2 can run OpenCL, but OpenCL is not aimed at homogeneous computing. It has restrictions to be able to run on the GPU. AVX2 can run more aggressively optimized code that doesn't have to live by OpenCL's restrictions. So homogeneous computing offers more capabilities than heterogeneous computing, enabling more applications than what the latter would be able to support.
But you always seem to forget that to use the full AVX2 throughput, the data has to be in place, and some algorithms need either complex data structures or some shuffling to achieve that. Such necessities lower the achieved throughput significantly compared to the theoretical throughput. And under today's power limits it makes a difference whether a chip falls short of maximum throughput because of masking while being able to clock-gate unused shader cores, or falls short because it is busily executing instructions just to get data in place for the actual calculations.
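As a toy illustration of that point (my own made-up example, nothing measured): with an array-of-structs layout the useful math below is a single FMA per iteration, but a good part of the instruction stream is spent just moving the fields into vector lanes before that FMA can run.

```cpp
#include <immintrin.h>

struct Particle { float x, y, z; };   // hypothetical array-of-structs layout

// Sums x*y over all particles (n assumed to be a multiple of 8).
// With AoS data, each iteration first has to gather the x and y fields
// into contiguous lanes; only the final FMA is the "real" work.
float dot_xy(const Particle* p, int n)
{
    const __m256i xsel = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);  // x offsets in floats
    const __m256i ysel = _mm256_setr_epi32(1, 4, 7, 10, 13, 16, 19, 22); // y offsets in floats
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        const float* base = &p[i].x;                      // structs are tightly packed floats
        __m256 vx = _mm256_i32gather_ps(base, xsel, 4);   // data movement
        __m256 vy = _mm256_i32gather_ps(base, ysel, 4);   // data movement
        acc = _mm256_fmadd_ps(vx, vy, acc);               // the actual computation
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]
         + lanes[4] + lanes[5] + lanes[6] + lanes[7];
}
```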

In the other thread I linked two Intel presentations shedding some light on what's going on when turning vectorized AVX code into parallelized OpenCL code.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I don't see an obvious bottleneck in current GPU architectures:

one very obvious bottleneck is the shared memory / L1D bandwidth per FLOP, for example: a measly 0.33 for Kepler versus an estimated 3.0 for Haswell (based on the IDF Spring disclosures [1])

...the data has to be in place, and some algorithms need either complex data structures or some shuffling to achieve that.

indeed, that's where the AVX2 gather instructions come in handy


[1]: BJ12_ARCS002_102_ENGf.pdf downloadable from intel.com/go/idfsessionsBJ
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
I attended ISCA last week and the Intel CTO gave one of the keynotes. He talked about exa-scale computing, and how the DoE and DARPA want to see a 50-75x increase in FLOPs/W by 2019. He showed data that this will be impossible with homogeneous cores, and pushed for heterogeneity in the architecture (a la MIC). Sure, it uses the "same" ISA as x86, but what does that really get you? Also, Intel researchers published a paper on task scheduling on heterogeneous architectures ("big" and "little" cores on the same die, again with the same ISA).

The fact is, everyone (including Intel it seems) is going for heterogeneous architectures. You just can't achieve 75 GFLOPs/W (which is what DARPA wants by 2019) using a programmable homogeneous architecture. If AMD is leading the pack for now, then good for them, but it is in everyone's future.

Also, what do people expect the real benefit of AVX2 to be? I've heard a lot of talk about "gather," but that doesn't sound like a great feature to me, other than perhaps reducing code size a little bit. All that means is that your operation takes as long as the longest memory access, right (which could be super long)? It actually could lead to lower IPC or FLOPs, compared to a combination of using software pre-fetching and older, narrower AVX instructions.

EDIT: I should clarify that scatter/gather works so well for GPUs because they have MASSIVE thread parallelism that lets them hide the disadvantages of scatter/gather operations. CPUs running single-threaded code will have to swallow the entire, variable memory penalty, and it very well may not lead to the higher performance everyone here seems to be expecting.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
and pushed for heterogeneity in the architecture (a la MIC). Sure,

MIC in itself is a very homogeneous architecture: a lot of strictly similar cores, all able to execute both scalar and vector code


Intel researchers published a paper on task scheduling on heterogeneous architectures ("big" and "little" cores on the same die, again with the same ISA).

I missed this paper; where can I find it?


It actually could lead to lower IPC or FLOPs, compared to a combination of using software pre-fetching and older, narrower AVX instructions.

gather should provide sizable speedups for the numerous cases with low cache miss %
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
MIC in itself is a very homogeneous architecture: a lot of strictly similar cores, all able to execute both scalar and vector code




I missed this paper; where can I find it?




gather should provide sizable speedups for the numerous cases with low cache miss %

I know everyone on this board seems to think that homogeneous ISA = homogeneous processors, but Intel doesn't seem to think so. They would consider a Xeon + Xeon Phi setup as a heterogeneous compute environment, because you code each one differently.

Paper here: http://www.jaleels.org/ajaleel/publications/isca2012-PIE.pdf

You're right about how gather is great when the caches work correctly. I'm just saying the caches can't be relied upon in real life. If you have a 16-wide vector operation, and just one of those operands has to be fetched from main memory (i.e., a cache miss), then all the other parallel operations have to wait too. GPUs hide this by just switching to another thread, but what can a CPU do? Wait? Hope this never happens?

Also, in general, computers aren't slow because their floating point vectors aren't wide enough. They're slow because of the cache/memory system. Wide gather might make this problem worse. That's all I'm saying.

EDIT: paper link disappeared after posting. Fixed now.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I know everyone on this board seems to think that homogeneous ISA = homogeneous processors
isn't that the common way to look at it?

, but Intel doesn't seem to think so. They would consider a Xeon + Xeon Phi setup as a heterogeneous compute environment, because you code each one differently.
MIC can't execute MMX/SSEx/AVX code and Xeon can't execute MIC vector code, so they are two distinct ISAs (at the moment), thus heterogeneous compute

thanks


and just one of those operands has to be fetched from main memory (i.e., a cache miss), then all the other parallel operations have to wait too.
only when you run out of load buffers; OoO execution + SMT are generally quite good at covering cache-miss latencies



Also, in general, computers aren't slow because their floating point vectors aren't wide enough. They're slow because of the cache/memory system. Wide gather might make this problem worse. That's all I'm saying.
well, if software-synthesized gather is faster in some difficult cases, we will use it; right now the equivalent of an 8x vgather is 18 instructions, lacking the mask feature, so I don't think it will really be a useful optimization
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I don't see an obvious bottleneck in current GPU architectures:
[attached image: haswell_vs_gpus.png]
First and foremost there's the PCIe bottleneck. It can have a devastating effect on performance, pretty much ruling out discrete GPUs for a whole class of heterogeneous computing.

Secondly there's the DRAM bandwidth issue. Even though a CPU and integrated GPU share the same bandwidth, GPU computing is more easily bottlenecked. That's because there's very little cache memory per work item on a GPU, and thus it has to reach out to DRAM memory much more often.

Also I believe Haswell should be capable of 64B / clock between all cache levels. That's a full cache line, which simplifies the design (think about TSX support), and improves performance/Watt. Core 2 had 32B / clock at 65 nm so there should be plenty of room for 64B at 22 nm, plus they save on logic. Furthermore, there will most likely be two 32B L1 read ports and one 32B L1 write port.
GK110 is not the one and only GPGPU architecture; it is just one of the most optimized ones. And both low-threaded fat OoO cores with wider SIMDs and scalar/SIMD-capable, power-usage-optimizing GPGPU SIMD cores (the big blocks) will grow toward each other. Why? Because, growing out of their original, rather suboptimal use cases, they become more and more adapted to the wide field of possible use cases. But they still can't substitute for each other perfectly.
Power efficiency can be achieved by extending AVX to 1024-bit, for which it was designed from the start. By executing such instructions on AVX2's existing 256-bit execution units, the instruction rate goes down by a factor of four, while keeping the throughput the same. Hence the power-hungry front-end of the CPU can be clock gated more often, and there's less switching activity in the out-of-order logic.

The GPU has a much longer way to go to become anywhere near capable of running CPU tasks. Hence I'm suggesting to make the GPU focus solely on graphics workloads, like mainstream Kepler, and have the CPU handle all general purpose workloads, including high throughput ones. This way any heterogeneous overhead is completely avoided and it doesn't sacrifice graphics performance.
But you always seem to forget that to use the full AVX2 throughput, the data has to be in place, and some algorithms need either complex data structures or some shuffling to achieve that.
That's why AVX2 adds gather and powerful shuffle instructions.
 

Smoblikat

Diamond Member
Nov 19, 2011
5,184
107
106
Don't get me started on AMD's garbage IMC... it's probably one of the biggest reasons why they're behind Intel right now.

Well, hold on. Back in the Ph2 days AMD had a bulletproof IMC; you could pump comical amounts of volts into it and it wouldn't so much as utter a laugh. I had it at 2.2 V DDR3 on air for a while and never saw degradation. Intel's SB IMC is pathetically wimpy compared to the old AMD one; it can barely do 1.5 V.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
First and foremost there's the PCIe bottleneck. It can have a devastating effect on performance, pretty much ruling out discrete GPUs for a whole class of heterogeneous computing.

Secondly there's the DRAM bandwidth issue. Even though a CPU and integrated GPU share the same bandwidth, GPU computing is more easily bottlenecked. That's because there's very little cache memory per work item on a GPU, and thus it has to reach out to DRAM memory much more often.

well, trinity doesn't use PCIe anymore....
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
well, trinity doesn't use PCIe anymore....

Sure about that? Trinity is bolted on like Llano. So if it's not PCIe, it's another bus, but it might still be PCIe in nature, like DMI, A-Link Express, etc. But we are at potatoes vs potatoes.

[attached image: Trinity_unb.png]
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
well, trinity doesn't use PCIe anymore....
Which is why I said "ruling out discrete GPUs".

Kaveri appears to represent AMD's best attempt at tackling this issue. But even so, there will still be a bandwidth bottleneck...

You see, when a CPU switches from a sequential scalar workload to a parallel vector workload, all or part of the data may still reside in the L1 or L2 cache. Hence you get massive bandwidth and very low latency access. For a heterogeneous system, the data has to leave the CPU core caches and be transferred to the GPU caches, and back again. This means it always has to pass through some bandwidth bottleneck, and the latency is considerably higher.

Of course AMD will tell you that this overhead can be avoided through careful scheduling of tasks, so that both the CPU and GPU can be kept busy doing something else while squeezing the data through the bottleneck and waiting for it to appear on the other end. But this really isn't as easy as it sounds, and it puts restrictions on what you can do efficiently. Basically it forces developers to divide the work into large enough chunks to cover the overhead, while more often than not neither the CPU nor the GPU is optimal for the entirety of the chunks. The CPU will have to execute code that can be partially parallel, while the GPU will have to execute code that can be partially sequential.

To solve that dilemma you need a homogeneous architecture that can execute both sequential and parallel code efficiently. AVX2 is a massive leap in that direction. And with AVX-1024 they would basically be able to 'morph' the CPU into behaving much more like a GPU architecture, fully dynamically, with zero effort from software developers. That is the future of computing.

It is the only future for computing. It is a widely known fact that as the computing density increases, bandwidth and latency don't improve at the same rate. So things will only get worse for heterogeneous systems. Kaveri improves the situation over Trinity, but does not fix it the way AVX2+ does.
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
CPUarchitect, what is this bottleneck you're talking about? Why do you assume a low-bandwidth, high latency interconnect between CPU and GPU cores? Intel doesn't have that in Sandy/Ivy Bridge, so why do you think AMD has that, or will always have that?

Also, do you have any comment on the inefficiencies of wide vector execution that I brought up earlier?
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
CPUarchitect, what is this bottleneck you're talking about? Why do you assume a low-bandwidth, high latency interconnect between CPU and GPU cores? Intel doesn't have that in Sandy/Ivy Bridge, so why do you think AMD has that, or will always have that?
There is always a lower bandwidth and higher latency for data transferred between cores versus within cores.

A CPU core can execute some scalar code and then a few vector instructions and then some scalar code again, without any sort of hitch. Executing the vector instructions on a heterogeneous core instead would require transferring the data and synchronizing the processing between the two. You easily lose more than you might gain.
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
There is always a lower bandwidth and higher latency for data transferred between cores versus within cores.

A CPU core can execute some scalar code and then a few vector instructions and then some scalar code again, without any sort of hitch. Executing the vector instructions on a heterogeneous core instead would require transferring the data and synchronizing the processing between the two. You easily lose more than you might gain.

OK, I see what you're saying. So you think that it's unlikely for there to be enough vectorizable code to make GPUs worth it? Instead, you're arguing that a processor would be better served by being able to switch between vector and scalar mode very quickly, right? If this is what you're arguing, then I just disagree with your prediction for the future, but at least I understand your argument.

I still think a lot of people are going to be surprised when high-degree vectorization introduces its own set of problems that a CPU will have a tough time dealing with. GPUs don't mind at all if they encounter cache misses and instead have to read from DRAM, but CPUs slow down a lot when that happens.

BTW, are you actually a computer architect?
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
You're right about how gather is great when the caches work correctly. I'm just saying the caches can't be relied upon in real life. If you have a 16-wide vector operation, and just one of those operands has to be fetched from main memory (i.e., a cache miss), then all the other parallel operations have to wait too. GPUs hide this by just switching to another thread, but what can a CPU do? Wait? Hope this never happens?
First of all it is actually very rare in practice, especially for throughput oriented workloads where the access patterns are typically quite regular and prefetching can do an excellent job. Secondly there is Hyper-Threading, so the CPU can switch between two threads. And thirdly there is out-of-order execution, which is the CPU's primary way to stay busy during cache misses.
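As a trivial illustration of why regular access patterns are so forgiving (a made-up streaming kernel, not a benchmark): the hardware prefetchers pick up a pattern like this on their own, and an explicit software prefetch a few cache lines ahead is cheap insurance.

```cpp
#include <immintrin.h>

// Hypothetical streaming kernel: out[i] = in[i] * scale, with n a multiple of 8.
// The access pattern is perfectly regular, so the loads essentially always hit
// in cache once prefetching (hardware or software) has warmed things up.
void scale_stream(float* out, const float* in, float scale, int n)
{
    const __m256 vscale = _mm256_set1_ps(scale);
    for (int i = 0; i < n; i += 8) {
        if (i + 128 < n)                                            // stay inside the array
            _mm_prefetch((const char*)(in + i + 128), _MM_HINT_T0); // ~16 iterations ahead
        __m256 v = _mm256_loadu_ps(in + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(v, vscale));
    }
}
```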

Note also that this isn't exclusive to gather. A cache miss in scalar code is equally problematic. Yet CPUs are remarkably effective at preventing it and dealing with it when it does occur. Note that Hyper-Threading typically only helps up to 30%, thus there is no point in having many more threads (and not having HT isn't disastrous). The low thread count itself helps ensure high cache hit rates.

Last but not least I'm afraid you're overly optimistic about how well a GPU can deal with it. Because it has so many threads contending for cache space, a somewhat irregular access pattern can easily lead to the majority of requests resulting in a miss. Hence the GPU is forced to read from DRAM all the time, and it slows down to a crawl due to the very low DRAM bandwidth per FLOP!

In a nutshell:
  • Low access irregularity: CPU suffers a tiny bit, GPU doesn't suffer at all
  • Medium access irregularity: CPU suffers a little, GPU takes a considerable hit
  • High access irregularity: CPU can still cope, GPU is choking
Also, in general, computers aren't slow because their floating point vectors aren't wide enough. They're slow because of the cache/memory system. Wide gather might make this problem worse. That's all I'm saying.
Gather doesn't make it worse at all. You have to look at how it helps the average case. Gather replaces 18 legacy instructions, so in the best case we're looking at an 18x speedup, while in the worst case there is no speedup over scalar code. But in the average case it's still considerably faster! In other words, while gather may not help with cache misses, it doesn't worsen things either and does help with making cache hits faster.
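For reference, here is roughly what that comparison looks like in intrinsics (a sketch with made-up helper names; the exact legacy instruction count depends on how the compiler expands the emulated version):

```cpp
#include <immintrin.h>

// Pre-AVX2 emulation of an 8-element gather: eight scalar loads combined
// with set/insert operations, where the combining steps depend on each other.
static __m256 gather8_emulated(const float* base, const int idx[8])
{
    __m128 lo = _mm_set_ps(base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
    __m128 hi = _mm_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]]);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

// AVX2: the whole thing collapses into a single vgatherdps.
static __m256 gather8_avx2(const float* base, const int idx[8])
{
    __m256i vidx = _mm256_loadu_si256((const __m256i*)idx);
    return _mm256_i32gather_ps(base, vidx, 4);
}
```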
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
OK, I see what you're saying. So you think that it's unlikely for there to be enough vectorizable code to make GPUs worth it? Instead, you're arguing that a processor would be better served by being able to switch between vector and scalar mode very quickly, right? If this is what you're arguing, then I just disagree with your prediction for the future, but at least I understand your argument.
No, there's definitely sufficient vectorizable code. The problem is the granularity of it. A heterogeneous system incurs a penalty every time you switch from CPU to GPU processing and back. So you need large enough chunks of parallel code to keep the number of penalties low. In the ideal case a low enough number of penalties can be hidden by doing other tasks. This has several consequences though. The developer has to be very aware of what is executing where, and has to reorder code so that scalar portions and parallel portions form large chunks. He may even have to rewrite algorithms to make them suitable for parallel processing, or include some parallel code in the CPU chunk and some scalar code in the GPU chunk. And finally he's also being tasked with finding ways to hide the heterogeneous penalties by doing other things in parallel.

The reality is that developers really don't like being faced with additional complexity. It's not that they're lazy, it's just that software development is highly complex as it is, without having to deal with hardware specific optimizations. Complex manual optimizations also increase the risk of creating subtle bugs (or not so subtle bugs). So it adds more development time, more testing time, and more support time. And time is money. The ROI has to be worth it for developers to consider heterogeneous computing. But homogeneous computing quite clearly has a much higher ROI. Little to no developer effort, and up to eightfold parallelization.

Also note that changing the granularity of scalar and parallel code changes the flow of data. Take for example a physics engine where objects are checked for collision against their environment, and based on the result a complex response is computed. Assuming that the collision detection is highly parallel and the response is scalar, a heterogeneous architecture would beg for all of the collision detections to be computed on the GPU before all the responses are computed on the CPU. But this means that the result of multiple collisions has to be stored in a buffer. With a homogeneous architecture you don't need that buffer. You just perform one collision detection using parallel code, then immediately after that compute the response using scalar code. The compiler and CPU may even be able to partially overlap those calculations a bit, all automatically. Furthermore, the homogeneous solution is much easier to debug because you're only dealing with one collision and one response at a time. And all the data can stay in a close cache level, without consuming additional bandwidth for temporary buffers.
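A rough sketch of the two control flows (every type and function name here is a hypothetical placeholder, just to show the structure; nothing is taken from a real engine):

```cpp
#include <vector>

// Placeholder types and helpers, purely illustrative.
struct World      { /* broad-phase data, etc. */ };
struct Object     { float pos[3], vel[3]; };
struct ContactSet { int count; float normal[3]; };

ContactSet detect_collisions_simd(const Object&, const World&);   // vectorized CPU code
void       resolve_response_scalar(Object&, const ContactSet&);   // branchy scalar code
void       gpu_detect_collisions(const Object*, int, const World&, ContactSet*);
void       gpu_copy_back_and_wait(ContactSet*, int);

// Homogeneous flow: parallel and scalar code alternate per object,
// so each intermediate result can stay in registers or a close cache level.
void step_homogeneous(Object* objs, int n, const World& world)
{
    for (int i = 0; i < n; ++i) {
        ContactSet c = detect_collisions_simd(objs[i], world);
        resolve_response_scalar(objs[i], c);
    }
}

// Heterogeneous flow: all collision results are batched into a buffer,
// pushed through the CPU<->GPU link, and only then resolved on the CPU.
void step_heterogeneous(Object* objs, int n, const World& world)
{
    std::vector<ContactSet> contacts(n);                     // the extra temporary buffer
    gpu_detect_collisions(objs, n, world, contacts.data());
    gpu_copy_back_and_wait(contacts.data(), n);              // bandwidth + latency cost
    for (int i = 0; i < n; ++i)
        resolve_response_scalar(objs[i], contacts[i]);
}
```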

So homogeneous computing has numerous advantages. It just needed wider vectors and gather support for higher parallel throughput.
I still think a lot of people are going to be surprised when high-degree vectorization introduces its own set of problems that a CPU will have a tough time dealing with. GPUs don't mind at all if they encounter cache misses and instead have to read from DRAM, but CPUs slow down a lot when that happens.
As detailed in my previous post, I don't think they slow down "a lot". And they can avoid many cache misses in the first place by having lots of cache space per thread and by prefetching. The GPU can deal with a certain amount of cache misses without any slowdown, but its performance drops sharply once the amount of cache misses exceeds a threshold.
BTW, are you actually a computer architect?
I'm actually a nuclear engineer, but I'm highly involved in the computational side of things.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
more like a 4x best-case speedup I'd say; don't forget we're talking about superscalar cores able to sustain 2 scalar loads per clock
I was looking at how one gather port improves things over one scalar port. Since it's highly unlikely for Haswell to have two gather ports, we have to ignore the second load port when evaluating the effect of gather in isolation.

Also, while that means it's an 8x improvement for one load/gather port, not 18x, there are also the many insert/extract instructions that were previously needed and are now eliminated. Those even had many dependencies. So under ideal circumstances, gather can increase the effective work 18-fold.

Anyway, the exact number doesn't matter to the conclusion that gather never makes things worse but speeds up the average case a great deal.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I was looking at how one gather port improves things over one scalar port. Since it's highly unlikely for Haswell to have two gather ports, we have to

thus my 4x best-case speedup, based on *actual timings* on Ivy Bridge; the extract/insert version can sustain 2 loads per clock
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Anyway, the exact number doesn't matter to the conclusion that gather never makes things worse but speeds up the average case a great deal.

if the gather instructions are serializing (which will be the case with a single "gather port", as you say), then in some difficult cases (*1) fine-grained software-synthesized gather will probably be a better option; anyway, I'd suggest waiting for the real chips before drawing deep conclusions

*1: for example, two running threads competing for the gather resource, with one thread having a high cache-miss rate
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
thus my 4x best-case speedup, based on *actual timings* on Ivy Bridge; the extract/insert version can sustain 2 loads per clock
But that would be a test where different port counts are being used. Assuming one gather port and one regular load port for Haswell, the second port would be available for more work while in the extract/insert version you're occupying both.
if the gather instructions are serializing (which will be the case with a single "gather port", as you say), then in some difficult cases (*1) fine-grained software-synthesized gather will probably be a better option; anyway, I'd suggest waiting for the real chips before drawing deep conclusions
Well we can at least assume that Haswell's gather implementation will be the same as that of Knights Corner, which can collect any number of elements from one cache line each cycle (as described in section 4.5.3 of the Instruction Set Reference Manual).

It would be incredible to have two such gather ports, but even with only one it's never slower than two regular load ports. And that's because the insert instructions are dependent. So even if you could fetch eight elements in four cycles, you still need nine cycles to get them into the vector. The gather only takes ten in the worst case (eight loads plus one for the mask and one for the blend).
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Well we can at least assume that Haswell's gather implementation will be the same as that of Knights Corner, which can collect any number of elements from one cache line each cycle

sure, thus the 4x best speedup figure, otherwise it will be less
 