CPU-Assisted GPGPU on Fused CPU-GPU Architectures

DrPizza

http://news.ncsu.edu/releases/wmszhougpucpu/

I had a thought: if this increases performance by over 20% while cutting energy needs, could it push computer architecture toward fewer options for separate graphics cards?

In other words, CPUs and GPUs fetch data from off-chip main memory at approximately the same speed, but GPUs can execute the functions that use that data more quickly. So, if a CPU determines what data a GPU will need in advance, and fetches it from off-chip main memory, that allows the GPU to focus on executing the functions themselves – and the overall process takes less time.
In preliminary testing, Zhou’s team found that its new approach improved fused processor performance by an average of 21.4 percent.
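
If I'm reading that right, the trick is something like the sketch below. This is my toy analogy on a plain multicore, not the paper's code: on a real fused chip the consumer would be the GPU sharing cache with the CPU, and the CHUNK size is a number I made up.

```cpp
// Rough analogy only, not the paper's code: a helper thread runs ahead
// of the worker and pulls data into the shared cache, so the consumer
// stalls on memory less. On a fused chip the consumer would be the GPU;
// here it's just a second CPU thread standing in for it.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t N = 1 << 24;   // 16M floats, far bigger than cache
constexpr std::size_t CHUNK = 4096;  // elements per prefetch batch (made up)

std::vector<float> data(N, 1.0f);
std::atomic<std::size_t> ready{0};   // index the helper has prefetched up to

void helper_prefetch() {
    for (std::size_t i = 0; i < N; i += CHUNK) {
        for (std::size_t j = i; j < i + CHUNK; j += 16)
            __builtin_prefetch(&data[j]);  // GCC/Clang builtin; warms the cache
        ready.store(i + CHUNK, std::memory_order_release);
    }
}

float worker_sum() {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
        while (ready.load(std::memory_order_acquire) <= i) { /* spin */ }
        sum += data[i];  // ideally a cache hit by the time we get here
    }
    return sum;
}

int main() {
    std::thread helper(helper_prefetch);
    float s = worker_sum();  // overlaps with the helper's prefetching
    helper.join();
    std::printf("%f\n", s);
    return 0;
}
```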

Cerb

My take on it: TL;DR

There are two core problems, and one PR problem. The first is research people.

Researchers tend to love repetitive, easy-to-measure workloads. Most of what our CPUs eat power for is not that. This is why researchers keep pushing VLIW, which sucks for general-purpose CPUs, and why they keep pushing RISC ideas. Big loops that do the same things over and over again would be conducive to a GPU-like processor, I'll bet.

When more than 1 in 10 instructions is a conditional branch, that falls flat on its face. When only a few tens of instructions go by between procedure calls (JAL), which may themselves be conditional branches, and which may need to set up stack space and save registers, it comically rolls down the stairs. If it's a conditional indirect procedure call, even slapstick comedic allusions won't cut it. At that point, either the CPU has already loaded the right line(s) into cache, or all those potential MIPS and FLOPS don't mean squat, because you're stuck waiting on memory. In other words, there are large swaths of programs, even whole problem domains, for which this will mean absolutely nothing.
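
A contrived pair of loops (mine, not anything from the paper) to show the two shapes of code I mean:

```cpp
// Contrived illustration: the first loop is the kind of straight-line
// streaming work that prefetching and GPU-style hardware eat up; the
// second is branchy, pointer-chasing control flow where no amount of
// prefetch planning tells you what's needed next.
#include <cstddef>

struct Node { int value; Node* next[2]; };

// Streaming: the access pattern is known arbitrarily far in advance.
float stream_sum(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Branchy: each step's address depends on data we just loaded, and a
// conditional picks the path. Nothing can run usefully ahead of this.
int chase(const Node* p, int steps) {
    int s = 0;
    for (int i = 0; i < steps && p; ++i) {
        s += p->value;
        p = (p->value & 1) ? p->next[0] : p->next[1];  // data-dependent branch
    }
    return s;
}
```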


The second problem is programming. While C++ AMP is a good step in the right direction, other languages that can be extended to work directly on vectors at a high level need to get on it. If you need to do custom work in a niche language, with its own compiler/runtime/interface, the CPU and GPU will remain separate, no matter what technical innovations anyone makes. I'm not knocking OpenCL or any of the others, but long-term merging of the two paradigms (I hate having written that, but I can't think of a better way to say it) needs very low software overhead, and the best way to get that is to use the same language, on the same platform, so that CPU+GPU just becomes an added option for those loopy bits.
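
For reference, here's roughly what "the GPU is just an option for the loopy bits" looks like in C++ AMP. This is the stock array_view/parallel_for_each pattern (MSVC-only, amp.h ships with Visual Studio), nothing from the paper:

```cpp
// Same C++, same source file; the runtime decides where the lambda runs.
#include <amp.h>
#include <vector>
using namespace concurrency;

void saxpy(std::vector<float>& y, const std::vector<float>& x, float a) {
    array_view<float, 1> yv(static_cast<int>(y.size()), y);
    array_view<const float, 1> xv(static_cast<int>(x.size()), x);
    parallel_for_each(yv.extent, [=](index<1> i) restrict(amp) {
        yv[i] += a * xv[i];   // dispatched to the accelerator
    });
    yv.synchronize();          // copy results back into y
}
```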

It was, and probably still is, common for tacked-on vector processors and DSPs on embedded systems to be given work well in advance of the CPU reaching that point in the program, so the results are done by the time the CPU needs them, rather than everything running in sequence; also to soft-load cache lines so that the CPU and vector unit both have the data they need at the right time; and to transfer results to CPU registers or the CPU's cache, so the CPU can save them out to separate locations. What's special here appears to be that they've found a way to automate that at compile-time, by taking advantage of features our CPUs already have for doing their common all-over-the-place non-loopy work. Even so, mass-market commodity software is going to lag behind, so I doubt it will affect such hardware much.
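
In generic terms, that pattern is just double-buffering the offload. My sketch, with std::async standing in for whatever command queue the real DSP or vector unit would have:

```cpp
// Kick off chunk k+1 before consuming chunk k, so transfer and compute
// overlap instead of running in sequence. std::async stands in for the
// coprocessor's work queue; transform_chunk stands in for the offloaded job.
#include <future>
#include <vector>

std::vector<float> transform_chunk(std::vector<float> in) {
    for (float& v : in) v = v * 2.0f + 1.0f;  // stand-in for the real work
    return in;
}

void pipeline(const std::vector<std::vector<float>>& chunks,
              std::vector<std::vector<float>>& out) {
    if (chunks.empty()) return;
    auto pending = std::async(std::launch::async, transform_chunk, chunks[0]);
    for (std::size_t k = 1; k < chunks.size(); ++k) {
        auto next = std::async(std::launch::async, transform_chunk, chunks[k]);
        out.push_back(pending.get());   // consume chunk k-1 while k runs
        pending = std::move(next);
    }
    out.push_back(pending.get());
}
```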


The third is news reporting. "The paper, “CPU-Assisted GPGPU on Fused CPU-GPU Architectures,” will be presented Feb. 27 at the 18th International Symposium on High Performance Computer Architecture..."

That's far from normal work done by normal people with normal hardware. From the abstract, it looks to me like they're talking largely about how best to do the obvious. The idea is intuitively obvious; the implementation is anything but.

I would expect +20% to be unusually high for desktop applications. If it can be implemented in a hardware-agnostic fashion, though, it will be worth whatever it can give. Much like with hardware IPC, we are well into severely diminishing returns on more advanced compilers, and an average improvement of 5-10% for GPU-enabled applications would be worth the effort.


Fewer options for AIBs is going to happen anyway, and the problems are not going to be solved quickly and easily. Right now the big problem is simply that IGP performance isn't high enough, while the cards that can do the job eat too much power, need too much memory bandwidth, and sit far from the CPU in terms of latency. Today, the people who would have gotten an FX 5200 or 8400GS can just get a Llano or a Core i3-2xxx and be happy. Over time, it's going to move up the ladder, to the point where low-end AIBs will roughly be what we'd call mid-range now, primarily because IGP is getting good enough to replace low-end cards, much the same way IGP already replaced non-3D video cards.


Personally, I would expect work like this to further marginalize IBM Power and Fujitsu SPARC, in favor of x86 server/workstation hardware, more than it will influence consumers' choices.