That's one of the reasons we now have 64-bit and PAE. The physical address space, as well as the linear memory map in the kernel's 2 GB half, was quickly being consumed by I/O devices, particularly video cards.
Even so, it's possible to map 100 GB through a 4 GB window, or even a 4 KB window. You don't have to map everything at once: you can map portions of it via paging, and you can map portions of physical memory with bank switching. Even VGA did this, mapping 256 KB or 512 KB as four or more banks of 64 KB at a time into the original EGA reserved space (A0000-AFFFF) without requiring more address space. SVGA cards gave you the option of mapping any 64 KB portion of 1 MB or more of VRAM in real mode, or gave you a linear frame buffer address with all of it accessible in protected mode.
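As a rough sketch of the bank-switching idea in C: assume a hypothetical SVGA-style card that exposes its VRAM through the 64 KB window at A0000 and selects which bank appears there via a single bank-select register. The register address here is invented for the example; real cards each used their own vendor port or a VESA BIOS bank-switch call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BANK_SIZE   (64u * 1024u)                     /* the CPU-visible window is 64 KB */
#define VRAM_WINDOW ((volatile uint8_t *)0xA0000u)    /* EGA/VGA hole at A0000           */

/* Hypothetical memory-mapped bank-select register -- address is invented;
 * real cards used a vendor I/O port or the VESA BIOS call instead. */
static void select_bank(uint8_t bank)
{
    *(volatile uint8_t *)0xC8000000u = bank;
}

/* Copy a buffer to an arbitrary VRAM offset, crossing banks as needed,
 * even though the CPU only ever sees 64 KB of the card's memory at a time. */
static void vram_write(uint32_t vram_offset, const uint8_t *src, size_t len)
{
    while (len > 0) {
        uint8_t  bank   = (uint8_t)(vram_offset / BANK_SIZE);
        uint32_t offset = vram_offset % BANK_SIZE;
        size_t   room   = BANK_SIZE - offset;
        size_t   chunk  = len < room ? len : room;

        select_bank(bank);
        memcpy((void *)(VRAM_WINDOW + offset), src, chunk);

        vram_offset += chunk;
        src         += chunk;
        len         -= chunk;
    }
}
```

The CPU never needs more than that single 64 KB slice of address space, yet the loop can walk through however much VRAM the card has.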
Also, the more complex a peripheral is, the fewer I/O and memory addresses it potentially needs for its host-accessible registers.
Even VGA had so many registers that it kept them internal and only presented a handful of index and data registers to the host. Its register set and memory address space are fairly small because it was designed to slot into the existing EGA/CGA reservations.
You can do anything with just one mapped register if you really want to:
write index of register you want to set
write data to set that register to
write index of register you want to read
read data of register
All at the same address to the host CPU.
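Here's what that looks like as a minimal C sketch for a hypothetical device at an invented address, keyed off an internal index/data flip-flop, much like the VGA attribute controller does at port 3C0h:

```c
#include <stdint.h>

/* Hypothetical device: one memory-mapped byte port.  An internal flip-flop
 * decides whether the next access is treated as an index or as data.
 * The address is invented for the example. */
#define DEV_PORT (*(volatile uint8_t *)0xFEED0000u)

static void reg_write(uint8_t index, uint8_t value)
{
    DEV_PORT = index;   /* 1) write index of the register you want to set  */
    DEV_PORT = value;   /* 2) write the data to set that register to       */
}

static uint8_t reg_read(uint8_t index)
{
    DEV_PORT = index;   /* 3) write index of the register you want to read */
    return DEV_PORT;    /* 4) read the data -- same address throughout     */
}
```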
As I've stated before, most systems today use a single FIFO port that you write packets to. The packets may carry hundreds of commands and sub-commands and direct internal register set/get options. Maybe there's an external status register to see how full the FIFO is, read fence flags, ISR status, etc.
You can write to 4 GB of memory using only 2 registers if you design it that way. All you'd need is a 32-bit auto-incrementing address register and a data register. You set your address in A, then read or write your data through B. Internally, the data you write at B is written to the address held in A, then A is incremented. So you would just set A to 0 and then write 4 GB of data to B.
Not necessarily what a PC does, but it demonstrates that how you communicate with I/O devices is entirely up to the designers.
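A minimal sketch of that two-register scheme, with invented register addresses:

```c
#include <stdint.h>

/* Two hypothetical memory-mapped 32-bit registers -- addresses invented. */
#define REG_ADDR (*(volatile uint32_t *)0xFEED0000u)  /* A: auto-incrementing address */
#define REG_DATA (*(volatile uint32_t *)0xFEED0004u)  /* B: data window               */

/* Fill the device's entire 4 GB space through just these two registers.
 * Each write to B lands at the address held in A, and A then increments
 * by 4 on its own, so the host never touches A again inside the loop. */
static void fill_device_memory(uint32_t pattern)
{
    REG_ADDR = 0;                                   /* set A to 0              */
    for (uint64_t i = 0; i < (1ull << 32); i += 4)  /* then stream 4 GB into B */
        REG_DATA = pattern;
}
```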
We've been over this again and again. All that's going on, no matter which of a hundred ways you do it or what kind of device you're talking to, is that address decoding logic drives data onto, or reads data off, the data bus when a particular device is selected, and the device is disconnected from the bus when it's not selected. This hasn't changed since about 1970. You can modernize buses with point-to-point serial links, high-speed differential transceivers, and addressing, data, and control encoded in-band as packets, but it's ultimately all the same concept; now you're just communicating with a bus bridge that translates between the parallel and serial signals.
For something as powerful as a GPU, a queued asynchronous FIFO is the only way to go. The host CPU just shoves commands in as fast as the device can accept them. Better yet, the host CPU can create and chain together command packet streams in its own RAM, anything from draw commands to state changes, double buffered, then program the GPU to stuff its own FIFO by DMAing the stream out of RAM all by itself. The host CPU can begin the next batch by starting a new command stream in another section of memory while the GPU's bus-master DMA is stuffing itself with the previous one. The DMA engine on the device can be smart enough to read tags in the command stream and jump to the next linked command stream when it's done, or to pull in and interleave multiple separate streams (scatter/gather) in numerous ways.
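As a sketch of what that looks like from the host side, assuming a made-up packet format and made-up doorbell registers (every real GPU defines its own):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command-stream layout; the opcodes and fields are invented,
 * but the shape of the idea is the same on real hardware. */
enum {
    CMD_SET_STATE = 1,   /* payload: register, value                        */
    CMD_DRAW      = 2,   /* payload: vertex buffer bus address, count       */
    CMD_LINK      = 3,   /* payload: bus address of the next command buffer */
};

struct cmd_buffer {
    uint32_t words[1024];
    size_t   used;
};

static void emit(struct cmd_buffer *cb, uint32_t w) { cb->words[cb->used++] = w; }

/* Build a batch in host RAM: some state changes, a draw, then a link tag so
 * the GPU's bus-master DMA hops to the next buffer on its own. */
static void build_batch(struct cmd_buffer *cb, uint32_t next_buffer_busaddr)
{
    emit(cb, CMD_SET_STATE); emit(cb, 0x10 /* reg */);       emit(cb, 0x1 /* value */);
    emit(cb, CMD_DRAW);      emit(cb, 0x800000 /* vb addr */); emit(cb, 300 /* count */);
    emit(cb, CMD_LINK);      emit(cb, next_buffer_busaddr);
}

/* Hypothetical doorbell registers that point the GPU's DMA engine at the
 * first buffer and tell it to start pulling the stream itself. */
#define GPU_DMA_BASE (*(volatile uint32_t *)0xFEEE0000u)
#define GPU_DMA_SIZE (*(volatile uint32_t *)0xFEEE0004u)
#define GPU_DMA_KICK (*(volatile uint32_t *)0xFEEE0008u)

static void kick_dma(uint32_t buffer_busaddr, uint32_t size_bytes)
{
    GPU_DMA_BASE = buffer_busaddr;
    GPU_DMA_SIZE = size_bytes;
    GPU_DMA_KICK = 1;   /* GPU fetches and executes the stream while the CPU builds the next one */
}
```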
You typically have a status register at a separate address that tells you the FIFO depth, busy status, DMA channel status, ISR status, drawing status, command completion status, and user fence status. A "fence" is a special "let me know when you've gotten to this point in the command stream" command that flags a status register or raises an interrupt, so you can do things like synchronize texture and state changes asynchronously; the possibilities are really endless. Multiple independent FIFOs are not out of the question either. For example, one FIFO is DMAing a display list of draw commands into itself, while another FIFO is DMAing in texture data, packing/unpacking the RGBA format into something else, and writing it to an offscreen portion of GPU RAM in preparation for use in a future display list. Meanwhile the fence tag you purposely placed at the end of your previous frame has been reached, the GPU is bugging you that it's done, and you reply with a packet to swap the active display address and show the completed frame.
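A sketch of the polling side, with an invented status/fence register layout (a real driver would more likely sleep on the fence interrupt than spin):

```c
#include <stdint.h>

/* Hypothetical status and fence registers -- addresses and bit layout invented. */
#define GPU_STATUS     (*(volatile uint32_t *)0xFEEE0010u)
#define GPU_LAST_FENCE (*(volatile uint32_t *)0xFEEE0014u)

#define FIFO_DEPTH(s)  ((s) & 0xFFFFu)    /* entries currently queued */
#define ENGINE_BUSY(s) ((s) & 0x10000u)   /* still drawing            */

/* Can we shove more commands in without blocking? */
static int fifo_has_room(uint32_t fifo_capacity)
{
    return FIFO_DEPTH(GPU_STATUS) < fifo_capacity;
}

/* Has the GPU passed the fence we planted at the end of the previous frame?
 * In this sketch the device writes each fence value it reaches into
 * GPU_LAST_FENCE; it could just as well raise an interrupt instead. */
static int frame_fence_reached(uint32_t fence_value)
{
    return (int32_t)(GPU_LAST_FENCE - fence_value) >= 0;
}
```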
Also, your second FIFO has completed uploading the texture, so at some point in the command stream you are currently building you can insert a tag to load the address where you sent that texture into one of the internal texture address registers, along with width, height, clamping parameters, sampling modes, etc., and use it.
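For illustration, emitting such a texture-select tag into the next command stream might look like this (the opcode and field packing are invented):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "bind texture" packet: a handful of words dropped into the
 * command stream ahead of the draws that use it. */
enum { CMD_SET_TEXTURE = 4 };

static size_t emit_set_texture(uint32_t *stream, size_t pos,
                               uint32_t vram_addr,        /* where FIFO #2's DMA put it */
                               uint16_t width, uint16_t height,
                               uint32_t wrap_and_filter_bits)
{
    stream[pos++] = CMD_SET_TEXTURE;
    stream[pos++] = vram_addr;
    stream[pos++] = ((uint32_t)width << 16) | height;
    stream[pos++] = wrap_and_filter_bits;   /* clamping, sampling mode, etc. */
    return pos;                             /* caller keeps appending draw commands */
}
```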
Your goal as a GPU programmer at this point is to make sure those FIFOs never go idle and that you never hit dependency stalls, through careful scheduling and ordering of your commands. That means double or even triple buffering command streams, abusing DMA and fences/interrupts for concurrency, loading dependent objects ahead of time before they are needed, and batching up state changes to minimize interrupting the GPU and forcing it to reset/flush to change gears. But you also don't want to keep it so busy that the host CPU gets too far ahead and creates packets faster than the GPU can consume them.
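A skeleton of that double-buffered submission loop, with the build/kick/fence helpers from the sketches above stubbed out as externs (the fence bookkeeping is the only real logic here):

```c
#include <stdint.h>

/* While the GPU chews on buffer A, the CPU fills buffer B, and vice versa. */
struct cmd_buffer { uint32_t words[4096]; uint32_t size; uint32_t fence; };

extern void build_frame(struct cmd_buffer *cb, uint32_t frame, uint32_t fence);
extern void kick_dma(const struct cmd_buffer *cb);
extern int  fence_reached(uint32_t fence);

void render_loop(void)
{
    static struct cmd_buffer buf[2];
    uint32_t fence = 0;

    for (uint32_t frame = 0; ; frame++) {
        struct cmd_buffer *cb = &buf[frame & 1];

        /* Don't overwrite a buffer the GPU may still be reading: wait until
         * the fence planted at the end of its previous use has been passed. */
        while (cb->fence && !fence_reached(cb->fence))
            ;   /* or sleep on the fence interrupt instead of spinning */

        cb->fence = ++fence;
        build_frame(cb, frame, cb->fence);  /* state changes, draws, fence, link tag */
        kick_dma(cb);                       /* GPU pulls this in while we start the next one */
    }
}
```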

Pretty sure with modern GPUs being so ridiculous now, that's practically impossible anyway. The fastest CPU can't feed the fastest GPU fast enough, and much of that is overhead imposed by the API, like Direct3D function calls and the excessive kernel syscalls and context switches required.