<<
Run it on a system that is twice as fast as your current one. 
Just kidding. I think O(n) is about as fast as such an algorithm can get (unless you can work out a way of having it reliably operate upon two or four characters at a time (in which case your for-loop will do fewer iterations, and you get close to O(n/2) or O(n/4). >>
You could probably use SIMD, MMX in particular, to accomplish this.
Just load the data into the registers, and add or subtract the difference between the upper- and lowercase ASCII codes (0x20) for the bytes that are actually letters.
At least that's my impression of how SIMD works ... I haven't actually gotten around to playing with it myself
<<
This would be tricky with null-terminated strings, since you don't want to uppercase the null-terminator, nor any characters after it (you can't be certain that you actually own those bytes of memory). The code needed within the loop to check for these conditions might very well erase any performance gains you get from processing the string in larger chunks. >>
This wouldn't be too bad. You know how long your string is (N), and you know how many characters you can process simultaneously (n). So you could do something like this:
N = strlen(str);
for (i = 0; i + n <= N; i += n)    /* full chunks of n characters only */
    do_simultaneous_alg(str + i);
for (j = i; j < N; j++)            /* leftover tail, fewer than n characters */
    do_single_alg(str + j);
You're setting up an extra loop, so it's a bit more overhead.
If your N >> n and the simultaneous algorithm is significantly faster than the single algorithm, then this should give you an overall boost.
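For what it's worth, here is a rough C sketch of that chunk-plus-tail idea. It is not MMX; I swapped in plain 64-bit integer tricks (eight characters per word) so it stays portable, and it assumes ASCII text. The names upper8 and upper_chunked are made up for illustration, and I haven't benchmarked it, so treat any speedup as unproven.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: uppercase 8 ASCII bytes packed into one 64-bit word.
   Bytes with the high bit set (non-ASCII) are left untouched. */
static uint64_t upper8(uint64_t x)
{
    const uint64_t ones = 0x0101010101010101ULL;
    uint64_t low7  = x & (0x7f * ones);            /* low 7 bits of each byte   */
    uint64_t ge_a  = low7 + (0x80 - 'a') * ones;   /* bit 7 set if byte >= 'a'  */
    uint64_t gt_z  = low7 + (0x7f - 'z') * ones;   /* bit 7 set if byte >  'z'  */
    uint64_t ascii = ~x & (0x80 * ones);           /* bit 7 set if byte < 0x80  */
    uint64_t lower = ascii & (ge_a ^ gt_z);        /* 0x80 in each a..z byte    */
    return x ^ (lower >> 2);                       /* flip 0x20 in those bytes  */
}

/* Uppercase a null-terminated ASCII string in place:
   full 8-byte chunks first, then the leftover tail one byte at a time. */
void upper_chunked(char *str)
{
    size_t len = strlen(str);
    size_t i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, str + i, 8);    /* memcpy avoids alignment problems */
        w = upper8(w);
        memcpy(str + i, &w, 8);
    }
    for (; i < len; i++) {
        if (str[i] >= 'a' && str[i] <= 'z')
            str[i] -= 'a' - 'A';
    }
}
```

Calling upper_chunked on a writable buffer (a char array, not a string literal) uppercases it in place. Because the chunk loop only runs while a full 8 bytes remain before the terminator, it never touches the null byte or anything past it.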
<<
It is usually not worthwhile trying to squeeze extra performance from an algorithm unless it is O(2n), O(n^2), or worse. >>
Yeah, I can't really see going to all this trouble just to switch case in a string faster.
