How Do the CPU and GPU/Graphics Card Communicate?

chrstrbrts

Senior member
Aug 12, 2014
Hello,

OK. So, I've asked some questions lately about how the CPU sees I/O devices, and my current understanding is that the north bridge is now on-chip and the south bridge has become a big PCIe hub off of which the slower peripherals hang.

Further, peripheral registers are memory-mapped into the CPU's memory space.

But, it occurred to me that the graphics sub-system is different because it is so memory intensive.

The NIC, A/D converter, etc. probably only have a few control/data registers.

But, the GPU is a full-blown processor with dozens (right?) of registers.

Further, if a full-blown graphics card is involved, then you're talking about 2-4 GB of memory.

It's impossible for all of that memory to be memory-mapped, right?

That wouldn't leave any memory in the memory space for the CPU's actual RAM.

So, how do the CPU and graphics sub-system talk?

Thanks.
 

exdeath

Lifer
Jan 29, 2004
That's one of the reasons why we now have 64-bit and PAE. Both the physical address space and the linear memory map in the kernel's 2 GB half were quickly consumed by I/O devices, particularly video cards.

Even so, it's possible to map 100 GB through a 4 GB window, or even a 4 KB window. You don't have to map everything at once: you can map portions of it via paging, and you can map portions of physical memory with bank switching. Even VGA did this to map 256 KB and 512 KB as 4+ pages of 64 KB at a time in the original EGA reserved space (A0000-AFFFF) without requiring more address space. SVGA cards gave you the option of mapping any 64 KB portion of 1 MB or more of VRAM in real mode, or gave you a linear frame buffer address with all of it accessible in protected mode.
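
A rough C sketch of that bank-switching idea, just to illustrate. The bank register address and its semantics are made up here; real SVGA cards each had their own vendor-specific bank registers:

    #include <stdint.h>
    #include <string.h>

    /* Reach N * 64 KB of VRAM through one 64 KB window. The bank
     * register below is invented for this sketch, not a real card's. */
    #define BANK_REG  ((volatile uint8_t *)0xFEB00000u) /* selects 64 KB bank */
    #define WINDOW    ((uint8_t *)0x000A0000u)          /* classic VGA window */
    #define BANK_SIZE 0x10000u

    static void vram_write(uint32_t vram_addr, const uint8_t *src, uint32_t len)
    {
        while (len) {
            *BANK_REG = (uint8_t)(vram_addr / BANK_SIZE); /* pick the bank */
            uint32_t off   = vram_addr % BANK_SIZE;
            uint32_t chunk = BANK_SIZE - off;             /* stay in window */
            if (chunk > len)
                chunk = len;
            memcpy(WINDOW + off, src, chunk);
            vram_addr += chunk;
            src       += chunk;
            len       -= chunk;
        }
    }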

Also, the more complex a peripheral is, the fewer I/O and memory addresses it potentially needs for its host-accessible registers.

Even VGA had so many registers that it kept them internal and only presented a handful of index and data registers to the host. Its register set and memory address space are fairly small, as it was designed to slot into existing EGA/CGA reservations.

You could do anything with just 1 mapped register if you really wanted to:

1. Write the index of the register you want to set.
2. Write the data to set that register to.
3. Write the index of the register you want to read.
4. Read the data of that register.

All at the same address, as far as the host CPU is concerned.
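
In C it would look something like this. The address and the alternating index/data write behavior are invented for illustration:

    #include <stdint.h>

    /* Hypothetical device with a single memory-mapped register. */
    #define REG ((volatile uint32_t *)0xFEC00000u)

    static void dev_write(uint32_t index, uint32_t value)
    {
        *REG = index;   /* first write: select the internal register */
        *REG = value;   /* second write: store into it */
    }

    static uint32_t dev_read(uint32_t index)
    {
        *REG = index;   /* select the internal register */
        return *REG;    /* read returns its contents */
    }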

As I've stated before, most systems today use a single FIFO port that you write packets to. The packets may have hundreds of commands and sub-commands and direct internal register set/get options, with maybe an external status register to see how full the FIFO is, read fence flags, ISR status, etc.
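
Sketched in C, with made-up addresses and packet layout:

    #include <stdint.h>

    /* One port takes command packets; a status register reports
     * how many entries are free. Both are invented for this sketch. */
    #define FIFO_PORT ((volatile uint32_t *)0xFED00000u)
    #define FIFO_FREE ((volatile uint32_t *)0xFED00004u)

    static void fifo_submit(const uint32_t *pkt, uint32_t words)
    {
        while (*FIFO_FREE < words)
            ;  /* spin until the FIFO can take the whole packet */
        for (uint32_t i = 0; i < words; i++)
            *FIFO_PORT = pkt[i];
    }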

You could write to 4 GB of memory using only 2 registers if you designed it that way. All you'd need is a 32-bit auto-incrementing address register and a data register. You set your address in A, then read or write your data at B. Internally, the data you write at B is written to the address in A, then A is incremented. So you would just set A to 0 and then write 4 GB of data to B.

Not necessarily what a PC does, but it demonstrates that how you communicate with I/O devices is entirely up to the designers.
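
That two-register scheme in C, with invented addresses:

    #include <stdint.h>
    #include <stddef.h>

    /* A is an auto-incrementing address register, B is the data window. */
    #define REG_A ((volatile uint32_t *)0xFEE00000u)  /* address, auto-increments */
    #define REG_B ((volatile uint32_t *)0xFEE00004u)  /* data window */

    static void dev_block_write(uint32_t dest, const uint32_t *src, size_t words)
    {
        *REG_A = dest;            /* set the start address once */
        for (size_t i = 0; i < words; i++)
            *REG_B = src[i];      /* device bumps A after each write */
    }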

We've been over this again and again. All that's going on, no matter which of 100 ways you do it or what kind of device you are talking to, is address decoding logic driving data onto or reading data off the data bus when a particular device is selected, and disconnecting from the bus when it's not selected. This hasn't changed since like 1970. You can modernize buses with point-to-point serial links, high-speed differential transceivers, and addressing, data, and control encoded in-band as packets, but it's ultimately all the same concept; now you're just communicating with a bus bridge that translates between the parallel and serial signals.

For something as powerful as a GPU, a queued asynchronous FIFO is the only way to go. The host CPU just shoves commands there as fast as the device can accept them. Better yet, the host CPU can create and chain together command packet streams in its own RAM, anything from draw commands to state changes, double buffered, then program the GPU to initiate a DMA to stuff the FIFO itself and pull it from RAM all by itself. The host CPU can begin the next batch by starting a new command stream in another section of memory while the GPU's bus-master DMA is stuffing itself with the previous one. The DMA on the device can be smart enough to read tags in the command stream and jump to the next linked command stream when it's done, or to pull in and interleave multiple separate streams (scatter/gather operation) in numerous ways.
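
A minimal sketch of that, with made-up registers, and assuming identity-mapped physical addresses for simplicity (a real driver would translate virtual addresses to bus addresses):

    #include <stdint.h>

    #define CMD_WORDS 1024
    static uint32_t cmd_buf[2][CMD_WORDS];   /* two streams in host RAM */

    #define GPU_DMA_ADDR ((volatile uint32_t *)0xFEF00000u)  /* stream base   */
    #define GPU_DMA_SIZE ((volatile uint32_t *)0xFEF00004u)  /* stream length */
    #define GPU_DMA_KICK ((volatile uint32_t *)0xFEF00008u)  /* start pulling */

    static void submit(int which, uint32_t words)
    {
        *GPU_DMA_ADDR = (uint32_t)(uintptr_t)cmd_buf[which];
        *GPU_DMA_SIZE = words;
        *GPU_DMA_KICK = 1;   /* GPU bus-masters the stream out of RAM */
    }
    /* While the GPU drains cmd_buf[0], the CPU builds cmd_buf[1]. */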

You typically have a status register at a separate address that tells you the FIFO depth, busy status, DMA channel status, ISR status, drawing status, and command completion status. There's also user fence status: you place a special "let me know when you've gotten to this point in the command stream" fence command that flags a status register or raises an interrupt, so you can do things like synchronize texture and state changes asynchronously; the possibilities are really endless. Multiple independent FIFOs are not out of the question either. For example, one FIFO is DMAing in a display list of draw commands, while another FIFO is DMAing in texture data, packing/unpacking the RGBA format to something else, and writing it to an offscreen portion of GPU RAM in preparation for use in a future display list. Meanwhile the fence tag that you purposely placed at the end of your previous frame has been reached, the GPU is bugging you that it's done, and you reply with a packet that swaps the active display address to show the completed frame.
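
The fence end of that, again with an invented register: a command placed in the stream writes its sequence number here when the GPU reaches it, and the host checks it.

    #include <stdint.h>

    #define GPU_FENCE ((volatile uint32_t *)0xFEF0000Cu)

    static void wait_fence(uint32_t seq)
    {
        while (*GPU_FENCE < seq)
            ;  /* spin; a real driver sleeps on the fence interrupt instead */
    }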

Also, once your second FIFO has finished uploading the texture, at some point in the command stream you are building you can insert a tag to select the address where you sent that texture on one of the internal texture address registers, along with width, height, clamping parameters, sampling modes, etc., and use it.

Your goal as a GPU programmer at this point is to make sure those FIFOs never go idle and that you never have any dependency stalls, through careful scheduling and ordering of your commands. This means double or even triple buffering command streams, abusing DMA and fences/interrupts for concurrency, loading dependent objects ahead of time before they are needed, batching up state changes to minimize interrupting the GPU and causing it to reset/flush to change gears, etc. But you also don't want to keep it so busy that the host CPU gets too far ahead and is creating packets faster than the GPU can consume them. :D Pretty sure with modern GPUs being so ridiculous now that this is practically impossible anyway. The fastest CPU can't send data to the fastest GPU fast enough, and much of that is overhead imposed by the API, like Direct3D function calls and the excessive kernel syscalls and context switches required.
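
Tying the sketches above together, a frame loop might look like this: build one stream while the GPU consumes the other, and use the fence to stay at most one frame ahead. submit() and wait_fence() are the made-up helpers from above; build_commands() is a hypothetical packet builder.

    #include <stdint.h>

    #define CMD_WORDS 1024
    extern uint32_t cmd_buf[2][CMD_WORDS];
    extern uint32_t build_commands(uint32_t *buf, uint32_t max_words);
    extern void submit(int which, uint32_t words);
    extern void wait_fence(uint32_t seq);

    static void frame_loop(void)
    {
        for (uint32_t frame = 0; ; frame++) {
            int cur = (int)(frame & 1);
            uint32_t n = build_commands(cmd_buf[cur], CMD_WORDS);
            submit(cur, n);          /* stream ends with a fence = frame + 1 */
            if (frame > 0)
                wait_fence(frame);   /* previous buffer is now safe to reuse */
        }
    }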
 

piasabird

Lifer
Feb 6, 2002
Memory transfers can use DMA reads that go around the CPU bus: the CPU sends a command to read and then only gets an answer when the read has completed or failed. A video card is almost a limited-purpose computer unto itself. It has its own processor and RAM. Everything in a computer is so complex now that you almost have to be a hardware engineer to understand it.

Imagine 16 lanes with 16 registers all fetching data, reading, doing computations, and then writing the data to output. There is a prediction unit that tries to predict future instructions and loads them into registers ahead of time. There is an arithmetic logic unit (ALU) that does simple and complex mathematical computations. Some instructions are hardwired into circuits in the CPU and some are stored in a separate part of the CPU. Then there is also a memory controller that manages direct reads and writes as well as direct memory access (DMA).

How it all works is that everything is hooked into the CPU bus; it is like a racetrack with connections around the outside. The CPU does work every time a clock pulse arrives, and it is capable of taking turns doing multiple tasks. In fact, AMD developed a way to transfer on both the top and bottom of the clock pulse (a double-pumped bus). These pulses are controlled by a clock, which determines the speed at which operations are performed.

Everything is happening all at once, in different directions, twice for every clock pulse. It is like your heart: a multi-valved pump whose chambers take turns pumping in and out.
 