ARM has
already done an asynchronous CPU - it works fine in silicon. I think the bigger problem is that it's a very high-risk change, and the industry generally doesn't like to follow the bleeding edge: designing a chip takes many millions of dollars and a couple hundred engineers over a few years, and if the chip doesn't work, the financial impact is enormous. If more design and analysis tools begin to support asynchronous design, and they're proven in silicon, it could eventually happen.
edit: Oh, I meant to write more here. With asynchronous designs, you need to know when inputs are ready to use, and when outputs are available (for example, an adder circuit needs to know when the two input numbers are valid, and needs to tell downstream logic when the sum is ready). Normal logic gates don't know when their output is ready (and there's no easy way to tell). There are two things you can do to handle this:
1) Use "self-timing". Basically, next to each block of logic, you put a delay chain whose delay matches the delay through the logic (for example, if the longest path through your adder is 6 gates long, you could have a delay chain of 6 gates). There's a catch though: the delay chain has to match the
worst case delay through the logic. One of the nice things about asynchronous designs is that, in theory, you can run the circuit as fast as the actual delays rather than the worst case delays. With self-timing like this, you have to stick to worst case. But it gets worse: you have to add margin to the delay chain because of variability. Even though your 6 delay gates are designed to be the same speed as the path through the adder, once they're manufactured every gate will have a slightly different delay. You need to make sure that on 99% of the chips that come out of the factory, the delay chains always take longer than the logic, so you have to do your timing analysis assuming variation 3 standard deviations out from the target. You also need to make sure that this holds true at high temperatures, low temperatures, high voltages, and low voltages (in particular, if your real adder's delay is dominated by wires, and you add gates to the delay chain to compensate, at low temperatures / high voltages the delay chain will speed up more than the actual adder). This means that for most of the chips that come out of the fab, not only are you timed for the worst-case delay, you're also giving up speed to margin you didn't need (there's a rough numeric sketch of this after the list). Variation is getting worse, so while self-timing might have been feasible at 130nm, it's much harder at 45nm.
2) Use a logic style that "knows" when its output is ready. One family that's commonly used is "dual rail domino logic" - it's actually used because it's blazingly fast; the knowledge that the gate's output is ready is just a side effect of how it works. These gates take the true and complement versions of their inputs and produce true and complement outputs (both outputs reset to the same value; once they differ, you know the output is ready - there's a toy model of this encoding below). Unfortunately, since you need the complement of each signal, you now have twice as many wires, which makes routing very difficult. On top of that, the power consumption is very high, because every cycle one of the two wires making up each signal has to switch (with normal gates, if the inputs aren't changing, the output doesn't change, and almost no power is dissipated).
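To put some very rough numbers on point 1), here's a back-of-the-envelope sketch in Python. Every constant in it (the gate delay, the sigma, the 6-gate depth) is invented purely for illustration, not taken from any real process; the point is just how the 3-sigma margin stacks on top of the worst-case delay, and how much shorter a typical carry chain is than the worst-case one:

    import random
    import statistics

    # Toy numbers, purely illustrative; nothing here comes from a real process.
    GATE_DELAY_PS = 20.0   # nominal delay of one gate, in picoseconds (made up)
    SIGMA_PS = 1.5         # per-gate standard deviation from variation (made up)
    LOGIC_DEPTH = 6        # longest path through the toy adder, in gates

    # Worst-case delay of the logic path that the matched delay chain must cover.
    worst_case_ps = LOGIC_DEPTH * GATE_DELAY_PS

    # Both the logic path and the delay chain vary gate by gate, so the spread of
    # (chain delay - logic delay) is sqrt(2 * depth) * sigma; pad the chain by
    # three of those standard deviations so it stays slower on ~99% of chips.
    sigma_diff = (2 * LOGIC_DEPTH) ** 0.5 * SIGMA_PS
    delay_chain_ps = worst_case_ps + 3 * sigma_diff
    print(f"worst-case logic path:        {worst_case_ps:6.1f} ps")
    print(f"delay chain with 3-sigma pad: {delay_chain_ps:6.1f} ps")

    # Meanwhile a typical add finishes far sooner than the worst case, because
    # the longest carry-propagate run through the adder is usually short.
    def longest_carry_run(a: int, b: int, width: int = 32) -> int:
        """Longest run of bit positions that propagate a carry (a_i XOR b_i == 1)."""
        run = best = 0
        for i in range(width):
            if ((a >> i) & 1) ^ ((b >> i) & 1):   # propagate: the carry ripples on
                run += 1
                best = max(best, run)
            else:                                  # generate or kill: ripple stops here
                run = 0
        return best

    runs = [longest_carry_run(random.getrandbits(32), random.getrandbits(32))
            for _ in range(100_000)]
    print(f"median carry chain: {statistics.median(runs)} bits, "
          f"worst observed: {max(runs)} bits out of 32")

On random 32-bit operands the median longest carry chain comes out somewhere around 4 or 5 bits, while the true worst case is the full 32; that gap is exactly what a worst-case-margined design throws away.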
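And here's a toy software model of the dual-rail encoding from point 2). This is not real domino circuit design, just the encoding idea in Python (the class and function names are mine, made up for the example): every signal rides on two wires, both reset low, and the moment exactly one of them goes high you know the value has arrived.

    from typing import NamedTuple

    class DualRail(NamedTuple):
        """One logical signal carried on two wires (a true rail and a false rail)."""
        t: int  # goes high when the value is 1
        f: int  # goes high when the value is 0

    EMPTY = DualRail(0, 0)   # the reset / precharged state: no data here yet

    def is_ready(sig: DualRail) -> bool:
        # Exactly one rail high means the value has arrived; (0, 0) means "not yet".
        return sig.t != sig.f

    def dual_rail_and(a: DualRail, b: DualRail) -> DualRail:
        """AND gate in dual-rail form: it only speaks once both inputs have arrived."""
        if not (is_ready(a) and is_ready(b)):
            return EMPTY                      # keep downstream logic waiting
        value = a.t == 1 and b.t == 1
        return DualRail(int(value), int(not value))

    # One input still empty -> output still empty, so downstream logic keeps waiting.
    print(dual_rail_and(DualRail(1, 0), EMPTY))            # DualRail(t=0, f=0)
    # Both inputs ready -> output ready, and the value can be read off the rails.
    print(dual_rail_and(DualRail(1, 0), DualRail(0, 1)))   # DualRail(t=0, f=1), i.e. 0

You can also see the power problem right in the encoding: every evaluation, one of the two rails for every signal has to rise and then get reset, whether or not the logical value actually changed.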
Now, as I mentioned above, one of the potential advantages of asynchronous designs is that you can have them run based on actual delays rather than worst case delays (in an adder, for example, the delay might be 2-3 gates most of the time, but 12 gates occasionally). There's actually another approach that wins back some of these benefits:
Razor (I highly recommend reading that paper - if it's too hard, this one might be an easier / less technical read). It's still a clocked design, but it dynamically overclocks the chip and checks for errors in real time. What's clever is how the error checking is done: at the receiving end of each block of logic, alongside the normal flip flop, there's a second flip flop that samples the data a little bit after the clock ticks. If the clock speed gets too high, the normal flip flop will capture its value before the correct answer is ready, but by the time the second flip flop captures its value, the answer has arrived. When you're not hitting the worst-case paths on the chip, the first flip flop gets good data, and you can push the clock speed higher. If you do hit a bad path, the chip detects the error by comparing the values in the two flip flops, and re-executes using the value from the second flip flop.
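Here's a toy, cycle-level model of that shadow flip-flop trick in Python. All the numbers (the shadow sampling offset, the path delays, the 400 ps period) are invented for illustration, and the real paper has a lot more machinery for pipeline recovery and metastability; this is only the detect-and-recover idea:

    import random

    # Invented figure: the shadow flip-flop samples this long after the main one.
    SHADOW_DELAY_PS = 150.0

    def razor_stage(clock_period_ps: float, logic_delay_ps: float,
                    correct_value: int, stale_value: int) -> tuple[int, bool]:
        """One pipeline stage with a Razor-style shadow flip-flop.

        Returns (value the stage ends up using, whether an error was detected).
        """
        # Main flip-flop: captures whatever is on the wire at the clock edge.
        main_ff = correct_value if logic_delay_ps <= clock_period_ps else stale_value
        # Shadow flip-flop: samples the same wire a bit later, by which time the
        # slow path is assumed to have settled (sizing that window is the hard part).
        shadow_ff = (correct_value
                     if logic_delay_ps <= clock_period_ps + SHADOW_DELAY_PS
                     else stale_value)
        error = main_ff != shadow_ff
        # On a mismatch, the stage recovers using the shadow value instead of
        # permanently slowing the clock down for the worst case.
        return (shadow_ff if error else main_ff), error

    # Overclock: a 400 ps period even though the worst-case path takes ~500 ps.
    period_ps = 400.0
    errors = 0
    for _ in range(10_000):
        # Most cycles the logic finishes fast; occasionally a slow path shows up.
        delay = random.choice([250.0] * 95 + [500.0] * 5)
        _, err = razor_stage(period_ps, delay, correct_value=1, stale_value=0)
        errors += err
    print(f"re-execute rate at a {period_ps:.0f} ps clock: {errors / 10_000:.1%}")

With this made-up delay mix, roughly 5% of cycles trip the shadow flip-flop and have to be replayed; the bet Razor makes is that paying that occasional penalty beats clocking every cycle for the worst case.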
In practice, Razor has its own disadvantages, but this post is long enough for now. I can go into them later if you're interested.
I'm not saying asynchronous designs can't or won't happen, just that it'd be a very large undertaking. I've heard (I don't have a reliable source) that Intel built an experimental asynchronous Pentium back in 1997 that ran 3x as fast as a normal CPU at half the power, so there's got to be a good reason they aren't building asynchronous CPUs today.