And still, if the hardware supports 52-bit precision (i.e. standard IEEE double-precision floating point), you'll get that and nothing else. It's not as if the driver would just throw away a few bits for the gaming variants.
Quite possibly, yes. I can't know everything that goes on at that level, but I wouldn't be surprised to learn that some operations carried out in a different order are faster but lose a couple of extra bits here and there.
Also, the obvious source of any errors lies in the output stages. Any data that isn't going anywhere except the display is more tolerant of calculation errors, or even simple copy errors. The standards for consumer-grade parts are much lower. Workstation grade or better is generally more reliable in terms of data integrity. That's part of the reason for the higher price tag, along with the better support (supposedly).
It's the same concept as building ECC capability into memory controllers specifically for workstation/server parts, building it into a general design and sorting based on which sections are functional, or simply taking the same entire design and binning the good chips for the workstations.
Even if the better chips can probably run fine at much higher clocks (as overclockers are fond of pointing out), the reason for clocking them at equivalent or even lower speeds than the same chip binned for consumers is that errors are less likely to occur at lower speeds. Every processor eventually calculates an erroneous result; the only question is how long the average interval between errors is. Consumers are more tolerant of the occasional glitch or blue screen. With workstation/server-level hardware, the extra precautions significantly reduce the chance of errors, and some users are willing to pay for that reliability.
As far as I know, GPUs haven't followed server processors into adding failover into the design. There probably just isn't enough die space to add the necessary functionality without significantly crippling performance. Some server-specific CPUs executed the same instructions in duplicate to ensure integrity. Some would even shut down a specific pipeline on the fly if it was determined to be producing errors. Naturally, those types of features added so much cost that the designs didn't fare as well in the long run.
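Just to illustrate the concept rather than the actual hardware: done in software, duplicate execution boils down to something like the sketch below. The function names are made up, and real lockstep designs do this in silicon, not in application code.

```cpp
#include <cstdio>
#include <stdexcept>

// Stand-in for some "critical" computation; in real lockstep designs the
// duplication happens in hardware, not in application code like this.
static double compute(double x) {
    return x * x + 2.0 * x + 1.0;
}

// Run the same computation twice and only accept the result if both
// copies agree; a mismatch would signal a transient hardware fault.
double computeWithRedundancy(double x) {
    double a = compute(x);
    double b = compute(x);
    if (a != b) {
        throw std::runtime_error("redundant execution mismatch");
    }
    return a;
}

int main() {
    std::printf("%f\n", computeWithRedundancy(3.0));
    return 0;
}
```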
Sure, there are some algorithms for special cases where you can improve the overall accuracy, and that's something you'll only find in the workstation drivers (actually, I very much doubt that they do that stuff in the drivers; that's something the programmer should take care of, since it's at the algorithm level and has nothing to do with the HW), but the accuracy itself stays the same.
No idea what you mean by different execution paths. Do you really think that the HW has different execution paths for single- and double-precision floating point and so on? I very much doubt that. You can use different algorithms, but that's again something at the application level (how you compile it and against what libs you link) and not stuff for the drivers.
And you're sure they don't write those APIs in higher-level languages, write compilers for the different architectures, and optimize afterwards? After all, they need those in either case.
Yes, that's essentially how a graphics driver works these days. The driver is a just-in-time compiler: it takes API calls (and shader source) at runtime and translates them into machine code for the installed GPU. No software company compiles all the way down to GPU binary; they basically meet halfway and the GPU manufacturer takes it from there. That's the whole point of an API in the first place.
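OpenGL shaders are the clearest example of this: the application ships plain GLSL source text and the driver compiles it into GPU machine code when the program runs. A minimal sketch, assuming a valid GL context is already current (and, on most platforms, a function loader like GLEW or glad has been initialized):

```cpp
#include <GL/gl.h>   // core GL types; glCreateShader etc. usually come from a loader
#include <cstdio>

// Plain GLSL source text shipped with the application; the driver
// compiles this into GPU machine code at runtime.
static const char* kVertexSrc =
    "#version 120\n"
    "void main() { gl_Position = gl_Vertex; }\n";

GLuint compileVertexShader() {
    GLuint shader = glCreateShader(GL_VERTEX_SHADER);
    glShaderSource(shader, 1, &kVertexSrc, nullptr);
    glCompileShader(shader);            // this is where the driver's compiler runs

    GLint ok = GL_FALSE;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
    if (ok != GL_TRUE) {
        char log[1024];
        glGetShaderInfoLog(shader, sizeof(log), nullptr, log);
        std::fprintf(stderr, "driver rejected shader: %s\n", log);
    }
    return shader;
}
```

Whatever the driver emits for that source on a Radeon will look nothing like what it emits on a GeForce, and that's by design.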
As for different execution paths, that's due primarily to different GPU architectures. A GPU isn't quite the same as a CPU, not yet, so you can't treat, say, a Radeon HD5870 chip as 1600 really small but really fast independent processors. Each processing unit is probably better compared to an execution unit in a modern CPU, though that's really hard to say with the latest generations. At the same time, you can't call a Radeon HD5870 a 1600-issue CPU. The closest analogy is probably 20 really wide VLIW CPUs. I don't know for sure; I'm already getting into a topic where I know very little.
So while a single instruction more than likely passes through only one stream processor at a time, the result is not guaranteed to stay there. For all we know, data that requires multiple passes through one processor is simply handed off to another processor, assembly-line style. You're also not guaranteed to have each atomic step in your algorithm, or even your API call, translate into one GPU instruction. I'm certain each step down breaks into multiples.
One instruction that would normally go to one execution unit on a modern CPU may end up getting split among multiple stream processors or recycled on the same one. The order in which you execute operations can affect the end result for floating point. It may also affect performance, though the idea is to flood each processor so that order doesn't matter.
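A trivial example of why evaluation order matters: floating-point addition isn't associative, so regrouping the same three numbers changes the answer. Nothing GPU-specific here, just standard single-precision behaviour:

```cpp
#include <cstdio>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;

    // Same numbers, different grouping: (a + b) + c keeps the 1.0,
    // while a + (b + c) loses it, because -1e8 + 1 rounds back to
    // -1e8 in single precision.
    float left  = (a + b) + c;   // 1.0
    float right = a + (b + c);   // 0.0

    std::printf("(a + b) + c = %g\n", left);
    std::printf("a + (b + c) = %g\n", right);
    return 0;
}
```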
Anyway, since the driver is the one doing the actual talking, you have only the GPU vendor's assurance that your code will draw exactly what you asked for. There's a lot of data that gets sent off for crunching that you never ask back to check for integrity.
Optimization always involves moving, deleting, or even adding instructions. Compilers have options you can toggle that will vary the resulting binary code, changing the order of operations or the presence of error checks. The tightest optimizations result in less code and fewer checks, but you're almost guaranteed to get boundary or round-off errors. Usually, you find a middle ground that's reasonably stable, then work out the bugs as you find them.
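As a concrete illustration of the trade-off, the compensated (Kahan) summation below adds exactly the kind of extra bookkeeping that aggressive optimization likes to strip out. With GCC or Clang, building with -O2 versus -O2 -ffast-math may change the results, since fast-math permits reassociating the arithmetic (which can cancel the compensation term entirely). Treat it as an experiment; the exact outcome depends on the compiler and flags.

```cpp
#include <cstdio>

// Naive left-to-right summation: rounding error accumulates freely.
float naiveSum(const float* v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += v[i];
    return s;
}

// Kahan (compensated) summation: the extra bookkeeping recovers the
// low-order bits lost at each step. Under -ffast-math a compiler may
// legally simplify the compensation term away.
float kahanSum(const float* v, int n) {
    float s = 0.0f, c = 0.0f;
    for (int i = 0; i < n; ++i) {
        float y = v[i] - c;
        float t = s + y;
        c = (t - s) - y;   // the bits that didn't fit into t
        s = t;
    }
    return s;
}

int main() {
    const int n = 1000000;
    static float v[n];
    for (int i = 0; i < n; ++i) v[i] = 0.1f;

    // Try building with -O2 and with -O2 -ffast-math and compare.
    std::printf("naive: %f\n", naiveSum(v, n));
    std::printf("kahan: %f\n", kahanSum(v, n));
    return 0;
}
```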
Whether or not Nvidia or AMD really do purposely lower precision in the gaming card drivers, I don't know; I don't work for either of them. Still, it's not entirely outside the realm of possibility. It's also not entirely outside the realm of possibility that both drivers are essentially the same, with just a few checks here and there querying the GPU for a model string.
However, judging by the performance variations in certain applications and games with the same chip, one marked for gaming and one marked for workstation use, it's safe to assume the driver code is significantly different.
Lowering precision slightly is simply one possible method to speed up code for an application that probably doesn't care. Another possibility is that the tweaks meant to speed up specific mixes of API calls result in optimizations that can't be shared with a significantly different mix.
If precision were the primary difference between the two drivers, then a gaming card would outperform a workstation card in all applications, though it would produce errors in all of them as well. That is not the case, so it is more likely that manufacturers follow the common-sense approach of simply optimizing for performance in specific applications. For workstation cards, precision is a careful consideration while optimizing, whereas for gaming cards it's simply a reasonable necessity.
Sure, there will be more errors in the gaming drivers since they don't test them as thoroughly, and there are the obvious performance deltas, but it's not as if the gaming variants would somehow magically lose precision.
That's pretty much what's been reiterated in this thread. Gaming cards are not guaranteed for more than reasonable operation and very likely produce more errors than workstation cards. Top to bottom and everywhere in between, gaming cards are more tolerant of errors, even in key areas where accuracy really is necessary no matter what you're running. It's simply not a consideration unless it produces glaring problems.
Buying a workstation-class card, even though it uses the same chip design, essentially eliminates the possibility of purposeful imprecision and comes with the understanding that accidental imprecision beyond standard tolerances will be fixed rather than ignored because nobody really notices. Again, for some users, it's worth the extra cost for peace of mind and guaranteed results. Some applications will also query cards for the appropriate model type. For most people, it's probably not worth it.
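For what it's worth, that model-type query from the application side can be as simple as reading the vendor and renderer strings back from the driver. This is only a sketch, assuming a current OpenGL context; the strings themselves ("Quadro", "FirePro", etc.) are driver-dependent:

```cpp
#include <GL/gl.h>
#include <cstdio>
#include <cstring>

// Ask the driver which GPU and vendor it is exposing; applications (and,
// presumably, drivers internally) can branch on these strings.
// Assumes a current GL context.
void reportCardType() {
    const char* vendor   = reinterpret_cast<const char*>(glGetString(GL_VENDOR));
    const char* renderer = reinterpret_cast<const char*>(glGetString(GL_RENDERER));
    std::printf("vendor:   %s\n", vendor   ? vendor   : "(none)");
    std::printf("renderer: %s\n", renderer ? renderer : "(none)");

    // Purely illustrative check: workstation parts usually identify
    // themselves with a different product name (e.g. "Quadro", "FirePro").
    if (renderer && (std::strstr(renderer, "Quadro") || std::strstr(renderer, "FirePro"))) {
        std::printf("looks like a workstation card\n");
    }
}
```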