I'm no pro at architecture, but as far as I remember, the theoretical 10-stage chip should be faster: along with the higher clock, it can also have ten instructions in the pipeline at once, one per stage (assuming the code is linear and not branchy). I don't think the article touched on having multiple instructions in the pipeline at the same time.
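
Here's a rough back-of-the-envelope sketch of what I mean, in Python. The numbers (10 ns of total logic per instruction, 1000 instructions) and the function name are made up for illustration, and it assumes an ideal pipeline: the work splits evenly across stages, there's no latch overhead, and nothing ever stalls or mispredicts a branch.

```python
def pipeline_time_ns(n_instructions, stages, total_logic_ns):
    """Time to push n_instructions through an idealized pipeline.

    Assumptions (not from the article): logic splits evenly across
    stages, no per-stage latch overhead, no stalls or branches.
    """
    cycle_ns = total_logic_ns / stages    # shorter stages allow a faster clock
    cycles = stages + n_instructions - 1  # fill the pipe once, then one result per cycle
    return cycles * cycle_ns


# Hypothetical numbers: 10 ns of logic per instruction, 1000 instructions.
print(pipeline_time_ns(1000, 1, 10))   # unpipelined: 10000.0
print(pipeline_time_ns(1000, 10, 10))  # 10 stages:   1009.0
print(pipeline_time_ns(1, 10, 10))     # one lone instruction still takes 10.0
```

So under those assumptions a single instruction isn't any quicker, but a long linear run finishes almost ten times sooner because ten instructions are in flight at once. Branchy code would eat into that, since a mispredicted branch means refilling the pipe.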
