With 32X PCI-Express, would a fully programmable math unit on a GPU make math on any current CPU largely unnecessary?

MadRat

Lifer
Oct 14, 1999
11,999
307
126
It sounds like a lot of the raw math functionality performed by a CPU could be more appropriately handled by a GPU, independent of the CPU. The 32X PCI-Express port can handle 8GB/sec of throughput, and AGP cards often offer anywhere from 128-bit to 256-bit memory controllers to high-speed DDR RAM. One advantage the PCI-Express port will offer over AGP is bi-directional communication, something handicapped in the current AGP specs. I figure 8GB/sec of throughput would be enough that the GPU would utilize raw memory bandwidth more efficiently than an FSB-strapped CPU. The GPU would also enjoy memory bandwidth several times that of the main memory bus, making parallel math instructions ridiculously faster than what can be done by a CPU. With some video cards approaching internal memory bandwidth of 20GB/sec, compared to roughly 6.4GB/sec for an 800MHz FSB, the GPU's internal memory bandwidth wins hands down.
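For rough numbers, here's a back-of-envelope sketch in C; the bus widths and clock rates used are assumptions and vary by platform and card:

```c
/* Back-of-envelope bandwidth figures. All widths and clocks below are
   assumptions for illustration; real platforms and cards vary. */
#include <stdio.h>

int main(void) {
    double fsb_gbs      = 800e6 * 8 / 1e9;   /* 800 MT/s x 64-bit FSB   ~ 6.4 GB/s  */
    double gpu_mem_gbs  = 700e6 * 32 / 1e9;  /* 700 MT/s x 256-bit GDDR ~ 22.4 GB/s */
    double pcie_x32_gbs = 0.25 * 32;         /* ~250 MB/s per lane, x32 ~ 8 GB/s each way */

    printf("FSB        : %.1f GB/s\n", fsb_gbs);
    printf("GPU memory : %.1f GB/s\n", gpu_mem_gbs);
    printf("PCIe x32   : %.1f GB/s per direction\n", pcie_x32_gbs);
    return 0;
}
```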

Does anyone else feel this functionality for a GPU is eventually on the horizon?

A nice discussion is taking place here: http://www.aceshardware.com/forum?read=105060450. One of the more interesting comments was the idea of putting a "streaming processor" in the northbridge, bypassing both the CPU and GPU altogether. I'd think the bandwidth limitations of main memory would make that approach futile.
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,736
156
106
The way I see it, the GPU has been doing a lot of work that had previously been done by the CPU, and this is a good thing, because now you have more CPU cycles free for other things, like AI calculations and game-engine needs in a gaming environment.
At least we know that with more bandwidth to the rest of the system, loading insanely large textures and levels into GPU memory will get faster.


 

Pudgygiant

Senior member
May 13, 2003
784
0
0
There have long been discussions of modding dnet to run on a GPU (especially from people with faster GPUs than CPUs), so I really hope that functionality increases. The reason no one has been able to do it yet is that the architecture not only varies between chips, it's completely unlike that of any CPU.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
As long as you don't need high precision, then I guess so. However, NVIDIA cards are limited to 32-bit floating-point precision and ATI cards are limited to 24-bit.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Pudgygiant
I don't mean to sound ignant, but what is precision?

In this context, the number of bits used to store values. 32-bit precision means there are about 4 billion possible numbers that can be represented, but not necessarily the integers 0-4,294,967,295. This looks like a decent explanation of IEEE floating-point numbers. One thing to note is that the numbers are not evenly spaced: there are many, many more floating-point numbers between 0 and 1 than there are between 4,000,000,000 and 4,000,000,001 (in fact, a 32-bit float can represent 4,000,000,000 exactly, but not 4,000,000,001).

General-purpose CPUs can work with integers (-2 billion to 2 billion signed, or 0 to 4 billion unsigned), 32-bit single-precision floats (± ~10^-44.85 to ~10^38.53), 64-bit double-precision floats (± ~10^-323.3 to ~10^308.3), and an x86-only 80-bit internal format (which is usually converted back to double or single precision after the calculations are finished).
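A minimal C sketch of that spacing issue (assuming standard IEEE single and double precision; the specific values are only for illustration):

```c
#include <stdio.h>

int main(void) {
    /* A 32-bit float has a 24-bit significand, so not every integer
       near 4 billion can be represented exactly. */
    float  f1 = 4000000000.0f, f2 = 4000000001.0f;
    double d1 = 4000000000.0,  d2 = 4000000001.0;

    printf("as float : %s\n", (f1 == f2) ? "both round to the same value" : "distinct values");
    printf("as double: %s\n", (d1 == d2) ? "both round to the same value" : "distinct values");
    return 0;
}
```

On a typical compiler the two floats compare equal, while the doubles stay distinct.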

edit: Good lectures here. I did read something a while ago about ATI and 96-bit precision, but I don't remember exactly what.

Originally posted by: MadRat
Does anyone else feel this functionality for a GPU is eventually on the horizon?
For highly repetitive or predictable operations (e.g. SETI, dnetc), a GPU may be good, but for jobs that involve a lot of branches (supposedly in normal mostly-serial integer code, something like 20% of the instructions are branches), you need a very, very good branch predictor, which GPUs don't have (since they're not meant for that type of computation), and the latency of sending the values from the CPU to the GPU for calculation would be too high to gain any performance. Basically, for code that isn't highly parallel, a general-purpose CPU is probably a better choice.
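To make that distinction concrete, here's a rough C sketch of the two kinds of code (the functions are purely illustrative):

```c
#include <stddef.h>

/* GPU-friendly: the same independent math applied to every element,
   with no data-dependent branches. */
void scale_add(float *out, const float *a, const float *b, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * k + b[i];
}

/* CPU-friendly: each step depends on the previous one and branches on
   the data (the 3n+1 iteration), so wide parallel hardware can't help. */
long collatz_steps(long x) {
    long steps = 0;
    while (x > 1) {
        x = (x % 2) ? 3 * x + 1 : x / 2;
        steps++;
    }
    return steps;
}
```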
 

Pudgygiant

Senior member
May 13, 2003
784
0
0
Ah. Thanks.

Now, for something like dnet, would it be possible to emulate a cluster with the GPU and CPU?
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
If you had some sort of driver or HAL that made the GPU look like a second CPU, I don't see why it's not doable. But it would likely require significant changes to the OS -- I don't know of anything that's built to support multiple asymmetric processors out of the box.
 

TerryMathews

Lifer
Oct 9, 1999
11,464
2
0
Originally posted by: Matthias99
If you had some sort of driver or HAL that made the GPU look like a second CPU, I don't see why it's not doable. But it would likely require significant changes to the OS -- I don't know of anything that's built to support multiple asymmetric processors out of the box.

Actually, all of the Windows NT-based OSes back through NT4 SP1 'support' AMP. The machine will at least boot up and is able to run multiple threads across both processors.

I verified this personally on a Tyan Tiger 100 with a P3 450 and a 500 back in the day.
 

tinyabs

Member
Mar 8, 2003
158
0
0
Originally posted by: MadRat
Does anyone else feel this functionality for a GPU is eventually on the horizon?

A nice discussion is taking place here: http://www.aceshardware.com/forum?read=105060450. One of the more interesting comments was the idea of putting a "streaming processor" in the northbridge, bypassing both the CPU and GPU altogether. I'd think the bandwidth limitations of main memory would make that approach futile.

When PCI was introduced with the Pentium, the Pentium was only 66MHz. So I think that along with PCI-X we will have much faster, multicore CPUs on the horizon. PCI-X is only the first step; other relevant technologies will be introduced gradually as needed. So don't worry: by the time you get a GPU/CPU SMP combo running, a dual-core 5GHz CPU will already be on the shelves. What I mean is that using a GPU, which is good at graphics, to drive your math instead is a novel but costly idea. Novel in terms of reusing the technology across successive generations of GPUs; costly in that there are cheaper solutions compared to this.

Another thing is that PCI-X is intended for I/O; if you use it for calculations that require a return trip to the CPU, the results will take a piece of the bandwidth on the way back. I/O buses are intended for one-way, long-latency operations.
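To put rough numbers on that return-trip cost, a back-of-envelope sketch (the bus and CPU figures are assumptions, not measurements):

```c
#include <stdio.h>

int main(void) {
    double n_values   = 1e6;   /* one million single-precision operands */
    double bytes_each = 4.0;
    double bus_gbs    = 8.0;   /* assumed x32 link, per direction       */
    double cpu_gflops = 3.0;   /* assumed host CPU doing ~3 GFLOP/s     */

    double transfer_s = 2.0 * n_values * bytes_each / (bus_gbs * 1e9); /* out and back */
    double compute_s  = n_values / (cpu_gflops * 1e9);                 /* one op each  */

    printf("round trip over the bus: %.0f us\n", transfer_s * 1e6);
    printf("just doing it on-CPU   : %.0f us\n", compute_s * 1e6);
    return 0;
}
```

Unless the GPU does a lot of work per byte shipped across, the trip over the bus eats the gain.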
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
Originally posted by: CTho9305
For highly repetitive or predictable operations (e.g. SETI, dnetc), a GPU may be good, but for jobs that involve a lot of branches (supposedly in normal mostly-serial integer code, something like 20% of the instructions are branches)

While that may be true if you include "branch always" instructions, which can't be mispredicted because they branch 100% of the time, and subroutine calls, which, while nominally more expensive than a generic branch instruction, cannot be mispredicted, I would have a real hard time believing that 20% of the instructions are conditional branches.

This idea is basically the point of Cg on the NV30 and later cores, and whatever the ATI equivalent is called on the R300 and later cores.

 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Would you guys please stop abbreviating PCI Express with PCI-X? The latter is an existing, and different, technology. Thanks and good night ;)
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
Originally posted by: MadRat
The 32X PCI-Express port can handle 8GB/sec of throughput, and AGP cards often offer anywhere from 128-bit to 256-bit memory controllers to high-speed DDR RAM. One advantage the PCI-Express port will offer over AGP is bi-directional communication, something handicapped in the current AGP specs.

I think from the outset we were talking PCI-Express; only one fella confused the two here... ;)
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: glugglug
Originally posted by: CTho9305
For highly repetitive or predictable operations (e.g. SETI, dnetc), a GPU may be good, but for jobs that involve a lot of branches (supposedly in normal mostly-serial integer code, something like 20% of the instructions are branches)

While that may be true if you include "branch always" instructions, which can't be mispredicted because they branch 100% of the time, and subroutine calls, which, while nominally more expensive than a generic branch instruction, cannot be mispredicted, I would have a real hard time believing that 20% of the instructions are conditional branches.

This idea is basically the point of Cg on the NV30 and later cores, and whatever the ATI equivalent is called on the R300 and later cores.

With any reasonably deep & wide pipeline, you have to "predict" even unconditional branches (obviously there isn't a mispredict penalty, but there are still penalties).
 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
you need a very, very good branch predictor, which GPUs don't have

Yet it is coming (well, sort of). Perhaps with the next generation of parts (NV40/R420), but almost certainly with the generation after that (NV50/R500), GPUs will start shipping with hardware to deal with branching (including conditional branches). Using a branch prediction unit isn't too likely, though; due to the nature of GPUs, they will spend the transistors and build the GPU to execute all possible branches. This is not speculative, by the way; all of this has been stated publicly by the respective IHVs (not to mention it will be a requirement before we see much more progress in terms of the upper limits of shaders). A branch miss on a GPU would be utterly catastrophic to performance (a drop in excess of 90% wouldn't be shocking), so they must handle it differently than CPUs do (reasonably speaking).
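"Execute all possible branches" is essentially predication. A minimal C sketch of the idea (purely illustrative; real shader hardware does this per lane with masks):

```c
#include <stddef.h>

/* Both sides of the "branch" are computed for every element and the
   result is chosen with a select, so there is nothing to mispredict --
   at the cost of always doing the extra work. */
void conditional_scale(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float taken     = in[i] * 2.0f;   /* "if" path, always evaluated   */
        float not_taken = in[i] * 0.5f;   /* "else" path, always evaluated */
        out[i] = (in[i] > 0.0f) ? taken : not_taken;  /* select, no jump   */
    }
}
```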

Expect GPUs within the next few years to take AI off the processor, and physics within a couple of years after that. The problem right now with moving general-purpose math-crunching apps over to the GPU is that there is nothing close to an easy way to do it. The IHVs do not give away exact specs on how machine-level functions work on their GPUs, so one of the IHVs will have to step up and create an API of sorts that can handle the task (this will almost certainly be a PR stunt, although one that could easily work with the upcoming parts).
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
If a manufacturer's API worked reasonably well to boost general math performance it could very well be a strong selling point.
 

Rainsford

Lifer
Apr 25, 2001
17,515
0
0
In a pipelined CPU, the "math" stage is just one part of the overall process. So for any instruction the CPU would have to perform many functions, but partway through it would have to turn control of the calculation stage over to the GPU, requiring another data transfer from the CPU to the GPU. No matter how fast the bus is, it would still be slower than simply transferring data around inside the processor. And you would incur that penalty with almost every instruction, since the vast majority of instructions require a calculation of some kind. You say "independent of the CPU", but that would mean the vast majority of instructions (which are "math") would run not on the CPU but on the GPU; some sort of synchronization would need to be put in place, and it would probably result in the CPU waiting around a lot of the time.
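As a rough illustration of that penalty (the latencies below are order-of-magnitude assumptions, not measurements):

```c
#include <stdio.h>

int main(void) {
    double bus_round_trip_ns = 1000.0;  /* assumed ~1 us for CPU -> GPU -> CPU  */
    double on_chip_alu_ns    = 0.5;     /* assumed one ALU op at a few GHz      */

    printf("offloading a single op costs roughly %.0fx the on-chip cost\n",
           bus_round_trip_ns / on_chip_alu_ns);
    return 0;
}
```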

It seems like a good idea, BUT it seems impractical when you get down to the nuts and bolts of the design. Well, maybe less impractical and more like it wouldn't help very much.

Unless I'm misunderstanding something about your idea, in which case, feel free to correct me.
 

aka1nas

Diamond Member
Aug 30, 2001
4,335
1
0
Why don't the IHVs come up with a common ISA for graphics cards and then let the individual IHVs do their own implementations and add instructions, as AMD and Intel do? I would think it would be even easier since they already have DirectX to keep their feature sets similar.