With 32X PCI-Express, would a fully programmable math unit on a GPU make math on any current CPU largely unnecessary?

MadRat

Lifer
Oct 14, 1999
11,999
307
126
It sounds like a lot of the raw math functionality performed by a CPU could be more appropriately handled by a GPU, independent of the CPU. The 32X PCI-Express port can handle 8GB/sec of throughput, and AGP cards often offer anywhere from 128-bit to 256-bit memory controllers to high-speed DDR RAM. One advantage the PCI-Express port will offer over AGP is bi-directional communication, something handicapped in the current AGP specs. I figure 8GB/sec of throughput would be enough that the GPU would utilize raw memory bandwidth more efficiently than an FSB-strapped CPU. The GPU would also enjoy memory bandwidth several times that of the main memory bus, making parallel math instructions ridiculously faster than what can be done by a CPU. With some video cards approaching internal memory bandwidth of 20GB/sec, compared to roughly 6.4GB/sec for an 800MHz FSB, the GPU's internal memory bandwidth wins hands down.
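For rough numbers, here's a back-of-envelope sketch in C; the bus widths and clock rates used are assumptions and vary by platform and card:

```c
/* Back-of-envelope bandwidth figures. All widths and clocks below are
   assumptions for illustration; real platforms and cards vary. */
#include <stdio.h>

int main(void) {
    double fsb_gbs      = 800e6 * 8 / 1e9;   /* 800 MT/s x 64-bit FSB   ~ 6.4 GB/s  */
    double gpu_mem_gbs  = 700e6 * 32 / 1e9;  /* 700 MT/s x 256-bit GDDR ~ 22.4 GB/s */
    double pcie_x32_gbs = 0.25 * 32;         /* ~250 MB/s per lane, x32 ~ 8 GB/s each way */

    printf("FSB        : %.1f GB/s\n", fsb_gbs);
    printf("GPU memory : %.1f GB/s\n", gpu_mem_gbs);
    printf("PCIe x32   : %.1f GB/s per direction\n", pcie_x32_gbs);
    return 0;
}
```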

Does anyone else feel this functionality for a GPU is eventually on the horizon?

A nice discussion is taking place here: http://www.aceshardware.com/forum?read=105060450. One of the more interesting comments was the idea of putting a "streaming processor" in the northbridge, bypassing both the CPU and GPU altogether. I'd think the bandwidth limitations of main memory would make that approach futile.
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,736
156
106
The way I see it, the GPU has been doing a lot of work that had previously been done by the CPU, and this is a good thing, because now you have more CPU cycles free for other things, like AI calculations and game-engine needs in a gaming environment.
At least we know that with more bandwidth to the rest of the system, loading insanely large textures and levels into GPU memory will get faster.


 

Pudgygiant

Senior member
May 13, 2003
784
0
0
There have long been discussions of modding dnet to run on a GPU (especially from people with faster GPUs than CPUs), so I really hope that functionality increases. The reason no one has been able to do it yet is that the architecture not only varies between chips, it's completely unlike that of any CPU.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
As long as you don't need high precision, then I guess so. However, NVIDIA cards are limited to 32-bit floating-point precision and ATI cards are limited to 24-bit.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Pudgygiant
I don't mean to sound ignant, but what is precision?

In this context, the number of bits used to store values. 32-bit precision means there are about 4 billion possible numbers that can be represented, but not necessarily the integers 0-4,294,967,295. This looks like a decent explanation of IEEE floating-point numbers. One thing to note is that the numbers are not evenly spaced: there are many, many more floating-point numbers between 0 and 1 than there are between 4,000,000,000 and 4,000,000,001 (in fact, a 32-bit float can represent 4,000,000,000 exactly, but not 4,000,000,001).

General-purpose CPUs can work with integers (-2 billion to 2 billion signed, or 0 to 4 billion unsigned), 32-bit single-precision floats (± ~10^-44.85 to ~10^38.53), 64-bit double-precision floats (± ~10^-323.3 to ~10^308.3), and an x86-only 80-bit internal format (which is usually converted back to double or single precision after the calculations are finished).
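A minimal C sketch of that spacing issue (assuming standard IEEE single and double precision; the specific values are only for illustration):

```c
#include <stdio.h>

int main(void) {
    /* A 32-bit float has a 24-bit significand, so not every integer
       near 4 billion can be represented exactly. */
    float  f1 = 4000000000.0f, f2 = 4000000001.0f;
    double d1 = 4000000000.0,  d2 = 4000000001.0;

    printf("as float : %s\n", (f1 == f2) ? "both round to the same value" : "distinct values");
    printf("as double: %s\n", (d1 == d2) ? "both round to the same value" : "distinct values");
    return 0;
}
```

On a typical compiler the two floats compare equal, while the doubles stay distinct.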

edit: Good lectures here. I did read something a while ago about ATI and 96-bit precision, but I don't remember exactly what.

Originally posted by: MadRat
Does anyone else feel this functionality for a GPU is eventually on the horizon?
For highly repetitive or predictable operations (e.g. SETI, dnetc), a GPU may be good, but for jobs that involve a lot of branches (supposedly in normal mostly-serial integer code, something like 20% of the instructions are branches), you need a very, very good branch predictor, which GPUs don't have (since they're not meant for that type of computation), and the latency of sending the values from the CPU to the GPU for calculation would be too high to gain any performance. Basically, for code that isn't highly parallel, a general-purpose CPU is probably a better choice.
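To make that distinction concrete, here's a rough C sketch of the two kinds of code (the functions are purely illustrative):

```c
#include <stddef.h>

/* GPU-friendly: the same independent math applied to every element,
   with no data-dependent branches. */
void scale_add(float *out, const float *a, const float *b, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * k + b[i];
}

/* CPU-friendly: each step depends on the previous one and branches on
   the data (the 3n+1 iteration), so wide parallel hardware can't help. */
long collatz_steps(long x) {
    long steps = 0;
    while (x > 1) {
        x = (x % 2) ? 3 * x + 1 : x / 2;
        steps++;
    }
    return steps;
}
```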
 

Pudgygiant

Senior member
May 13, 2003
784
0
0
Ah. Thanks.

Now, for something like dnet, would it be possible to emulate a cluster with the GPU and CPU?
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
If you had some sort of driver or HAL that made the GPU look like a second CPU, I don't see why it's not doable. But it would likely require significant changes to the OS -- I don't know of anything that's built to support multiple asymmetric processors out of the box.
 

TerryMathews

Lifer
Oct 9, 1999
11,464
2
0
Originally posted by: Matthias99
If you had some sort of driver or HAL that made the GPU look like a second CPU, I don't see why it's not doable. But it would likely require significant changes to the OS -- I don't know of anything that's built to support multiple asymmetric processors out of the box.

Actually, all of the Windows NT-based OSes back through NT4 SP1 'support' AMP. The machine will at least boot up and is able to run multiple threads across both processors.

I verified this personally on a Tyan Tiger 100 with a P3 450 and a 500 back in the day.
 

tinyabs

Member
Mar 8, 2003
158
0
0
Originally posted by: MadRat
Does anyone else feel this functionality for a GPU is eventually on the horizon?

A nice discussion is taking place here: http://www.aceshardware.com/forum?read=105060450. One of the more interesting comments was the idea of putting a "streaming processor" in the northbridge, bypassing both the CPU and GPU altogether. I'd think the bandwidth limitations of main memory would make that approach futile.

When PCI was introduced with the Pentium, the Pentium was only 66MHz. So I think that along with PCI-X we will have much faster, multicore CPUs on the horizon. PCI-X is only the first step; other relevant technologies will be introduced gradually as needed. So don't worry: by the time you get a GPU/CPU SMP combo running, a dual-core 5GHz CPU will already be on the shelves. What I mean is that using a GPU, which is good at graphics, to drive your math instead is a novel but costly idea. Novel in terms of reusing the technology across successive generations of GPUs; costly in that there are cheaper solutions compared to this.

Another thing is that PCI-X is intended for I/O; if you use it for calculations that require a return trip to the CPU, the results will take a piece of the bandwidth on the way back. I/O buses are intended for one-way, long-latency operations.
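To put rough numbers on that return-trip cost, a back-of-envelope sketch (the bus and CPU figures are assumptions, not measurements):

```c
#include <stdio.h>

int main(void) {
    double n_values   = 1e6;   /* one million single-precision operands */
    double bytes_each = 4.0;
    double bus_gbs    = 8.0;   /* assumed x32 link, per direction       */
    double cpu_gflops = 3.0;   /* assumed host CPU doing ~3 GFLOP/s     */

    double transfer_s = 2.0 * n_values * bytes_each / (bus_gbs * 1e9); /* out and back */
    double compute_s  = n_values / (cpu_gflops * 1e9);                 /* one op each  */

    printf("round trip over the bus: %.0f us\n", transfer_s * 1e6);
    printf("just doing it on-CPU   : %.0f us\n", compute_s * 1e6);
    return 0;
}
```

Unless the GPU does a lot of work per byte shipped across, the trip over the bus eats the gain.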
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
Originally posted by: CTho9305
For highly repetitive or predictable operations (e.g. SETI, dnetc), a GPU may be good, but for jobs that involve a lot of branches (supposedly in normal mostly-serial integer code, something like 20% of the instructions are branches)

While that may be true if you include "branch always" instructions, which can't be mispredicted because they branch 100% of the time, and subroutine calls, which, while nominally more expensive than a generic branch instruction, cannot be mispredicted, I would have a real hard time believing that 20% of the instructions are conditional branches.

This idea is basically the point of Cg on the NV30 and later cores, and whatever the ATI equivalent is called on the R300 and later cores.

 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Would you guys please stop abbreviating PCI Express with PCI-X? The latter is an existing, and different, technology. Thanks and good night ;)
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
Originally posted by: MadRat
The 32X PCI-Express port can handle 8GB/sec of throughput, and AGP cards often offer anywhere from 128-bit to 256-bit memory controllers to high-speed DDR RAM. One advantage the PCI-Express port will offer over AGP is bi-directional communication, something handicapped in the current AGP specs.

I think from the outset we were talking PCI-Express; only one fella confused the two here... ;)
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: glugglug
Originally posted by: CTho9305
For highly repetitive or predictable operations (e.g. SETI, dnetc), a GPU may be good, but for jobs that involve a lot of branches (supposedly in normal mostly-serial integer code, something like 20% of the instructions are branches)

While that may be true if you include "branch always" instructions, which can't be mispredicted because they branch 100% of the time, and subroutine calls, which, while nominally more expensive than a generic branch instruction, cannot be mispredicted, I would have a real hard time believing that 20% of the instructions are conditional branches.

This idea is basically the point of Cg on the NV30 and later cores, and whatever the ATI equivalent is called on the R300 and later cores.

With any reasonably deep & wide pipeline, you have to "predict" even unconditional branches (obviously there isn't a mispredict penalty, but there are still penalties).
 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
you need a very, very good branch predictor, which GPUs don't have

Yet it is coming (well, sort of). Perhaps with the next generation of parts (NV40/R420), but almost certainly with the generation after that (NV50/R500), GPUs will start shipping with hardware to deal with branching (including conditional branches). Using a branch prediction unit isn't too likely, though; due to the nature of GPUs, they will spend the transistors and build the GPU to execute all possible branches. This is not speculative, by the way; all of this has been stated publicly by the respective IHVs (not to mention it will be a requirement before we see much more progress in terms of the upper limits of shaders). A branch miss on a GPU would be utterly catastrophic to performance (a drop in excess of 90% wouldn't be shocking), so they must handle it differently than CPUs do (reasonably speaking).
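"Execute all possible branches" is essentially predication. A minimal C sketch of the idea (purely illustrative; real shader hardware does this per lane with masks):

```c
#include <stddef.h>

/* Both sides of the "branch" are computed for every element and the
   result is chosen with a select, so there is nothing to mispredict --
   at the cost of always doing the extra work. */
void conditional_scale(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float taken     = in[i] * 2.0f;   /* "if" path, always evaluated   */
        float not_taken = in[i] * 0.5f;   /* "else" path, always evaluated */
        out[i] = (in[i] > 0.0f) ? taken : not_taken;  /* select, no jump   */
    }
}
```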

Expect GPUs within the next few years to take AI off the processor, and physics within a couple of years after that. The problem right now with moving general-purpose math-crunching apps over to the GPU is that there is nothing close to an easy way to do it. The IHVs do not give away exact specs on how machine-level functions work on their GPUs, so one of the IHVs will have to step up and create an API of sorts that can handle the task (this will almost certainly be a PR stunt, although one that could easily work with the upcoming parts).
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
If a manufacturer's API worked reasonably well to boost general math performance it could very well be a strong selling point.
 

Rainsford

Lifer
Apr 25, 2001
17,515
0
0
In a pipelined CPU, the "math" stage is just one part of the overall process. So for any instruction the CPU would have to perform many functions, but partway through it would have to turn control of the calculation stage over to the GPU, requiring another data transfer from the CPU to the GPU. No matter how fast the bus is, it would still be slower than simply transferring data around inside the processor. And you would incur that penalty with almost every instruction, since the vast majority of instructions require a calculation of some kind. You say "independent of the CPU", but that would mean the vast majority of instructions (which are "math") would run not on the CPU but on the GPU; some sort of synchronization would need to be put in place, and it would probably result in the CPU waiting around a lot of the time.
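As a rough illustration of that penalty (the latencies below are order-of-magnitude assumptions, not measurements):

```c
#include <stdio.h>

int main(void) {
    double bus_round_trip_ns = 1000.0;  /* assumed ~1 us for CPU -> GPU -> CPU  */
    double on_chip_alu_ns    = 0.5;     /* assumed one ALU op at a few GHz      */

    printf("offloading a single op costs roughly %.0fx the on-chip cost\n",
           bus_round_trip_ns / on_chip_alu_ns);
    return 0;
}
```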

It seems like a good idea, BUT it seems impractical when you get down to the nuts and bolts of the design. Well, maybe less impractical and more like it wouldn't help very much.

Unless I'm misunderstanding something about your idea, in which case, feel free to correct me.
 

aka1nas

Diamond Member
Aug 30, 2001
4,335
1
0
Why don't the IHVs come up with a common ISA for graphics cards and then let the individual IHVs do their own implementations and add instructions, as AMD and Intel do? I would think it would be even easier since they already have DirectX to keep their feature sets similar.