To understand what multi-threaded DX drivers are, you have to understand the underlying implementation. So let's take a look at a modern game engine, the DX implementation, and the underlying hardware.
To start, the underlying hardware is command-serial per context. As in, for each context, it is serial in how it processes the incoming commands. This command stream is what we normally refer to as the "command buffer" or "push buffer". The commands in these buffers are the raw binary machine code for the graphics card to consume. Now, the question is, how are these command codes generated?
This is where the driver comes in. The driver basically boils down to an implementation of the OpenGL/Direct3D interface that translates incoming function calls (with their data) into hardware-specific machine code. Of course, there are a lot of rules about what the OpenGL/Direct3D state must be when specific commands are called, and a lot of restrictions imposed by the underlying hardware. In essence, you can think of the driver as a compiler.
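To make the "driver as compiler" picture concrete, here's a deliberately made-up sketch in C++. The command buffer layout and opcodes here are pure invention (real command formats are vendor-specific and undocumented); the only point is that API calls get encoded as binary commands appended to a per-context buffer that the GPU consumes in order.

```cpp
#include <cstdint>
#include <vector>

// The per-context "command buffer" / "push buffer": raw words the GPU reads serially.
struct CommandBuffer {
    std::vector<uint32_t> words;
    void emit(uint32_t w) { words.push_back(w); }
};

// Hypothetical opcodes -- not any real GPU's instruction set.
enum : uint32_t { OP_SET_VERTEX_BUFFER = 0x01, OP_DRAW = 0x02 };

// The "compiler" step: the driver validates API state, then encodes hardware commands.
void DriverDraw(CommandBuffer& cb, uint32_t vertexBufferAddress, uint32_t vertexCount)
{
    cb.emit(OP_SET_VERTEX_BUFFER);
    cb.emit(vertexBufferAddress);
    cb.emit(OP_DRAW);
    cb.emit(vertexCount);
}
```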
On the other end of the spectrum is the 3D engine. The 3D engine calls into the OpenGL/Direct3D interface, which is essentially an abstraction of a virtual device behind a single interface. The 3D engine calls the functions exposed by OGL/D3D and assumes that the underlying device conforms to the OGL/D3D specs. In theory, this means that for any hardware that supports the interface, the 3D engine doesn't have to worry about the actual hardware implementation, since the driver will translate the OGL/D3D commands into hardware-specific commands.
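In practice, "calling the interface" looks something like the snippet below (Direct3D 11 flavor). The ID3D11DeviceContext calls are the real API; the surrounding function name and parameters are just illustrative, and resource/shader setup is omitted.

```cpp
#include <d3d11.h>

// The engine only ever talks to the virtual device; it never sees the machine
// code the driver emits underneath these calls.
void DrawMesh(ID3D11DeviceContext* context,
              ID3D11Buffer* vertexBuffer, UINT stride, UINT vertexCount)
{
    UINT offset = 0;
    context->IASetVertexBuffers(0, 1, &vertexBuffer, &stride, &offset);
    context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    context->Draw(vertexCount, 0);  // the driver turns this into hardware commands
}
```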
So, where does this all lead? Well, when the 3D engine calls into the driver, the driver takes up some processing resources from the engine. If the engine is fast enough, it might call into the driver often enough that the driver becomes the bottleneck instead of the engine itself. Obviously, one would then think, "hey, I've got multiple cores on my machine, I should be able to spread some of the driver load onto the other cores, right?"
Well, yes and no. The problem is that the command stream for the context is presumed to be serial (remember what I mentioned earlier?). So if you have multiple threads writing to the same command stream at the same time, the ordering is no longer guaranteed, right?
But then, some half-assed programmer will ask "why not just put locks on it then?". Well, the problems with that are:
1) the driver would need locks all over the place,
2) locking across multiple cores is pretty expensive, since it involves going out to the shared L2 or potentially L3 cache...and since we're already bound by driver calls, fighting over those locks on every call would make the problem even worse,
3) it introduces huge problems with deadlocks and whatnot.
So yeah, it's basically a no-go.
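Just to illustrate why, here's roughly what that naive approach looks like (entirely made-up types, just a sketch): every driver entry point has to grab the same lock protecting the context's single command stream, so the worker threads end up serialized on it anyway, and you pay the cross-core locking cost on every single call.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// A made-up, naively-locked driver context. Every API call contends for the
// same mutex, because there is only one serial command stream to append to.
struct LockedDriverContext {
    std::mutex lock;
    std::vector<uint32_t> commandStream;

    void Draw(uint32_t vertexCount)
    {
        std::lock_guard<std::mutex> guard(lock);  // cross-core contention on every call
        commandStream.push_back(0x02);            // hypothetical OP_DRAW
        commandStream.push_back(vertexCount);
    }
    // ...and the same lock in every other entry point, which is where the
    // deadlock and ordering headaches come from.
};
```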
Luckily, graphics tend to be serial (in terms of the incoming command streams). That, and we know ahead of time what we need to do anyway -- it's just that the process of doing it takes so long. Well, great! The DX11 spec team basically exploits this fact and says, "Well, let's do this. Since you're generating a command buffer anyway, why not break this command buffer up into multiple segments? There will be one main buffer, which is the one that your main thread keeps, and where all synchronization happens. We'll have many other (usually) smaller segments that the main buffer can jump to and back from. Then we'll let other threads fill up these smaller segments, and basically paste/link them into the main buffer at the points where they should get executed."
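In D3D11 terms, those "smaller segments" are command lists recorded on deferred contexts. The sketch below uses the real API for that (CreateDeferredContext / FinishCommandList / ExecuteCommandList); the function names and threading structure around it are my own illustration, not any particular engine's code, and error handling is omitted.

```cpp
#include <d3d11.h>
#include <thread>
#include <vector>

// One worker's job: record its chunk of the frame into its own deferred
// context, producing a command list (a "smaller segment").
static ID3D11CommandList* RecordSegment(ID3D11Device* device)
{
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);

    // ...issue state changes and draw calls on 'deferred' exactly as you
    // would on the immediate context...

    ID3D11CommandList* segment = nullptr;
    deferred->FinishCommandList(FALSE, &segment);  // close off the segment
    deferred->Release();
    return segment;
}

// Main thread: record segments in parallel, then stitch them into the one
// serial stream in a well-defined order. No locks on the driver needed.
void BuildFrame(ID3D11Device* device, ID3D11DeviceContext* immediate, int workerCount)
{
    std::vector<ID3D11CommandList*> segments(workerCount, nullptr);
    std::vector<std::thread> workers;

    for (int i = 0; i < workerCount; ++i)
        workers.emplace_back([&, i] { segments[i] = RecordSegment(device); });
    for (std::thread& w : workers)
        w.join();

    for (ID3D11CommandList* segment : segments)
    {
        immediate->ExecuteCommandList(segment, FALSE);  // "paste/link" into the main buffer
        segment->Release();
    }
}
```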
So, this is awesome. No locks, no synchronization pain. Multi-threaded up to as many sub-jobs as you can get (it can be a LOT). Perfect!
Oh wait, one small problem. The ATi/nVidia drivers don't quite support it yet. Either everything still just funnels back to that one main thread, or something along those lines. I don't know what the deal is.
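You can actually ask the runtime whether the driver does command lists natively. This query is real D3D11 (D3D11_FEATURE_THREADING); if DriverCommandLists comes back FALSE, the runtime emulates command lists by replaying them on the immediate context -- i.e., it all ends up back on that one thread.

```cpp
#include <d3d11.h>

// Returns TRUE if the installed driver supports native command lists,
// FALSE if the D3D11 runtime has to emulate them on the immediate context.
BOOL DriverSupportsCommandLists(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_THREADING threading = {};
    if (FAILED(device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                           &threading, sizeof(threading))))
        return FALSE;
    // DriverConcurrentCreates covers multi-threaded resource creation;
    // DriverCommandLists is the one that matters for deferred contexts.
    return threading.DriverCommandLists;
}
```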