Okay, let's get this straight 🙂
The K6-x chips have pretty good integer cores, and when they're well fed, as they are in the K6-III series, they do quite well in integer apps and even improve quite a bit in some games. The K6 was optimized for the instructions that get used the most, which is often the case for newer CPUs, while still retaining true ISA support.
With current algorithms, RC5 uses the ROL (rotate left) instruction quite heavily. So heavily, in fact, that a higher latency (and therefore lower throughput) on that one instruction can make a HUGE performance difference. According to an RC5 core writer, ROL isn't used extensively AT ALL in most other programs. For that reason, the K6-x family has a high latency on that very instruction, and accordingly it does poorly in RC5 compared to other CPUs (per MHz, even Cyrix chips and Pentium MMXs are faster).
The Athlon is a different beast and, apparently, does the ROL instruction quite quickly. Accordingly, it scores well in RC5.
So, the K6-III does well in most integer apps because of its L2 cache. It does poorly in RC5 because it's slow on the ROL instruction.
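To give a feel for how rotate-heavy RC5 is, here's a rough Python sketch that counts the rotates needed to test one key. It assumes the RC5-32/12 variant with an 8-byte key (my assumption about the contest parameters, the post doesn't spell them out), and the function names (`rol`, `expand_key`, `encrypt_block`, `rotates_per_key`) are just made up for illustration. It's nothing like the hand-tuned assembly cores in the real client.

```python
MASK = 0xFFFFFFFF                      # keep everything in 32-bit words
P32, Q32 = 0xB7E15163, 0x9E3779B9      # magic constants from the RC5 definition

rol_calls = 0                          # running count of rotates executed


def rol(x, n):
    """Rotate the 32-bit word x left by n bits (only the low 5 bits of n count)."""
    global rol_calls
    rol_calls += 1
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def expand_key(key, rounds=12):
    """RC5 key setup: turn the key bytes into the round-key table S."""
    c = max(1, (len(key) + 3) // 4)    # number of 32-bit key words
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK
    t = 2 * (rounds + 1)               # size of the round-key table
    S = [P32]
    for _ in range(t - 1):
        S.append((S[-1] + Q32) & MASK)
    A = B = i = j = 0
    for _ in range(3 * max(t, c)):     # mixing loop: two rotates per pass
        A = S[i] = rol((S[i] + A + B) & MASK, 3)
        B = L[j] = rol((L[j] + A + B) & MASK, A + B)
        i = (i + 1) % t
        j = (j + 1) % c
    return S


def encrypt_block(S, a, b, rounds=12):
    """Encrypt one 64-bit block given as two 32-bit halves a and b."""
    a = (a + S[0]) & MASK
    b = (b + S[1]) & MASK
    for i in range(1, rounds + 1):     # two data-dependent rotates per round
        a = (rol(a ^ b, b) + S[2 * i]) & MASK
        b = (rol(b ^ a, a) + S[2 * i + 1]) & MASK
    return a, b


def rotates_per_key(key, rounds=12):
    """Count the rotates for one key setup plus one block encryption."""
    global rol_calls
    rol_calls = 0
    S = expand_key(key, rounds)
    encrypt_block(S, 0, 0, rounds)
    return rol_calls
```

With an 8-byte key and 12 rounds, `rotates_per_key(bytes(8))` comes out to 180: 156 rotates in the key setup and 24 in the rounds. Every one of those sits on the critical path, which is why a slow ROL hurts so much.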
RC5 is purely integer, and as such won't benefit from SSE or 3DNow!. But what about MMX? Well, yes, it's integer, but for the most part it's not a wide enough implementation to "bitslice" RC5. People have tried. People have also tried to get around the ROL instruction by using MMX, but it simply can't be done with the way MMX is now.
AltiVec, on the other hand, is a true 128-bit implementation of SIMD instructions (Ars Technica had a great article discussing SIMD). AltiVec CAN bitslice RC5. They bitsliced the RC5 core for the G4, and accordingly it scores phenomenally high per MHz. The G3, which has no AltiVec, doesn't.
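In case "bitslice" is unfamiliar, here's a generic Python illustration of the idea. To be clear, this is NOT the G4 core, just a toy sketch with made-up names (`to_bitplanes`, `from_bitplanes`, `rol_bitsliced`). The trick is to store bit i of many independent words together in one "plane", so a single logical operation works on all of them at once. A rotate by a fixed count then needs no rotate instruction at all, it's just a re-numbering of the planes. The sketch only covers a fixed rotate count.

```python
WORD_BITS = 32
MASK = 0xFFFFFFFF


def rol32(x, n):
    """Ordinary rotate-left of one 32-bit word, used as the reference."""
    n %= WORD_BITS
    return ((x << n) | (x >> (WORD_BITS - n))) & MASK


def to_bitplanes(words):
    """planes[i] collects bit i of every word; word number k sits in bit k."""
    planes = []
    for i in range(WORD_BITS):
        plane = 0
        for lane, w in enumerate(words):
            plane |= ((w >> i) & 1) << lane
        planes.append(plane)
    return planes


def from_bitplanes(planes, lanes):
    """Inverse of to_bitplanes: rebuild the individual words."""
    words = []
    for lane in range(lanes):
        w = 0
        for i in range(WORD_BITS):
            w |= ((planes[i] >> lane) & 1) << i
        words.append(w)
    return words


def rol_bitsliced(planes, n):
    """Rotate every word left by n: result bit i is input bit (i - n) mod 32."""
    n %= WORD_BITS
    return [planes[(i - n) % WORD_BITS] for i in range(WORD_BITS)]
```

For example, `from_bitplanes(rol_bitsliced(to_bitplanes(words), 3), len(words))` gives the same result as calling `rol32(w, 3)` on each word. The wider the register, the more words ride along in each plane, which is where a 128-bit unit has the edge over a narrower one.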
The speed page is a bit dated: the new clients that get 3.4 kkeys/MHz haven't been out long, and the page needs to be updated. It's on the list of things to do.
With RC5, a high keyrate comes down to these things: the number of computers you have running it, the architecture of the CPU, and the MHz of the CPU (assuming L1 is enabled). The client is small enough to fit into the L1 of any current CPU, and doesn't benefit from L2 speed AT ALL.
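As back-of-the-envelope arithmetic, that boils down to keys/sec = (keys per MHz for the architecture) x MHz, summed over your machines. Here's a tiny Python sketch. The function names and the example fleet are hypothetical; only the 3.4 kkeys/MHz figure comes from above, and the 1000 keys/MHz entry is a number invented purely to show a slower architecture.

```python
def machine_keyrate(keys_per_mhz, mhz):
    """Estimated keys/sec for one machine: per-MHz efficiency times clock."""
    return keys_per_mhz * mhz


def total_keyrate(machines):
    """Sum the estimate over a list of (keys_per_mhz, mhz) pairs."""
    return sum(machine_keyrate(k, m) for k, m in machines)


# Hypothetical fleet: two 500 MHz boxes at 3400 keys/MHz (3.4 kkeys/MHz)
# and one 400 MHz box on a slower architecture (made-up 1000 keys/MHz).
fleet = [(3400, 500), (3400, 500), (1000, 400)]
fleet_rate = total_keyrate(fleet)
```

Note that L2 size or speed isn't a parameter anywhere, which is the whole point.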