Since nobody on here seems to want to answer the questions seriously, I'll give a brief overview of some basic processor architecture.
Before I start, however: die size has *NOTHING* to do with the speed of the chip.
First off, you need to understand the very basics of an instruction pipeline. For an EXTREMELY basic architecture (even more so than MIPS), you have a pipeline that resembles this:
Fetch, Decode, Execute, Store
As an instruction (take ADD, for instance) is fed through the pipeline, it passes through each of these stages in turn.
Now, the longer the pipeline, the higher the clock speed must be in order to achieve the same overall performance, because deeper pipelines tend to complete fewer instructions per clock (IPC). Intel, in going back to the Pentium Pro (P6) architecture starting with the Centrino platform's Pentium M, achieved a much higher IPC at a lower clock speed. The Core, Core 2, and subsequent iterations are based on this architecture.
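To make that trade-off concrete, here's a toy calculation. Throughput is roughly IPC times clock frequency, so two chips with very different clock speeds can do the same work per second. All the numbers here are invented for illustration, not real chip specs:

```python
# Toy model: instructions per second = IPC * clock frequency.
# A deeper pipeline usually clocks higher but completes fewer
# instructions per clock; a shallower one does the opposite.

def instructions_per_second(ipc, clock_hz):
    return ipc * clock_hz

# Hypothetical chips (made-up numbers):
deep_pipeline  = instructions_per_second(ipc=0.75, clock_hz=3.6e9)  # long pipeline, high clock
short_pipeline = instructions_per_second(ipc=1.50, clock_hz=1.8e9)  # short pipeline, half the clock

# Despite a 2x clock difference, throughput comes out identical:
print(deep_pipeline == short_pipeline)  # True
```

This is exactly why "clock speed" alone tells you almost nothing about how fast a chip is.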
Without getting too much more in-depth: although neither manufacturer has published its exact pipeline length, one would expect Intel's pipeline to be the shorter of the two.
-----------------
On a different note, Intel has always had excellent branch prediction logic.
To understand branch prediction, you have to understand the concept of pipelining. If you are interested in this sort of topic I suggest reading more about it, but to keep this post short: pipelining is essentially feeding in as many instructions as possible so that every stage of the pipeline mentioned above stays populated. In other words, we don't want an ADD to pass the FETCH stage and then have FETCH sit idle until the ADD has completely written back to memory.
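A quick way to visualize this is a cycle-by-cycle table of which stage each instruction occupies. This is a sketch of an idealized four-stage pipeline with no stalls (any real pipeline has hazards this ignores):

```python
# Idealized 4-stage pipeline: each cycle, every in-flight instruction
# advances one stage, so a new instruction enters FETCH every cycle.
STAGES = ["FETCH", "DECODE", "EXECUTE", "STORE"]

def schedule(instructions):
    """Return {cycle: {stage: instruction}} for a stall-free pipeline."""
    table = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s, {})[stage] = instr
    return table

prog = ["ADD", "SUB", "MUL"]
table = schedule(prog)
for cycle in sorted(table):
    print(cycle, table[cycle])
# On cycle 1, SUB is being fetched while ADD is already decoding --
# no stage sits around waiting for ADD to finish first.
```

Three instructions finish in 6 cycles instead of the 12 a non-pipelined machine would take, because the stages overlap.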
Consider this basic piece of assembly code (note: it is completely useless and does nothing except illustrate the value of branch prediction):
push  %ebp
movl  %esp, %ebp      # standard prologue (AT&T order is source, destination)
subl  $4, %esp        # immediates take a $ prefix in AT&T syntax
addl  %eax, %edx
cmpl  %eax, %edx      # "cmpe" is not an instruction; CMP sets the flags JNE tests
jne   .CODE_1
addl  %eax, %edx
imull $2, %eax        # plain MUL cannot take an immediate operand; IMUL can
With the concept of pipelining in mind, you can see that the JNE (Jump if Not Equal) will go through the instruction pipeline, but so will the ADDL and multiply instructions behind it. But what if the JNE actually evaluates as taken and the program has to jump to a new segment of code? The ADDL and multiply instructions are then invalid and must be flushed from the pipeline, wasting precious clock cycles. We call this a branch misprediction, and having to let the pipeline drain (effectively executing NOPs) before continuing is a costly mistake.
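A back-of-the-envelope cost model shows why this matters more on a deep pipeline. The flush penalty is roughly the pipeline depth, since everything fetched after the branch must be discarded and refetched. The penalty figures and counts here are invented for illustration:

```python
# Toy cost model for branch mispredictions. Assumes a flush costs
# roughly pipeline_depth cycles; all numbers are illustrative only.

def total_cycles(n_instructions, n_branches, mispredict_rate,
                 pipeline_depth, ipc=1.0):
    base = n_instructions / ipc
    flush_penalty = n_branches * mispredict_rate * pipeline_depth
    return base + flush_penalty

# Same workload (1000 instructions, 200 branches, 10% mispredicted)
# on a shallow vs. a deep pipeline:
shallow = total_cycles(1000, 200, 0.10, pipeline_depth=5)
deep    = total_cycles(1000, 200, 0.10, pipeline_depth=20)
print(shallow, deep)  # 1100.0 1400.0
```

The deeper pipeline pays four times as many cycles per misprediction, which is why good branch prediction was so critical on long-pipeline designs.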
With that in mind, a branch prediction unit will attempt to determine (based on any number of variables, depending on whether the predictor is static or dynamic) whether to assume the JNE is taken, or to keep loading instructions on the assumption that it falls through.
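As one example of a dynamic scheme, here is a sketch of the classic 2-bit saturating counter predictor. This is a textbook technique, not necessarily what any particular Intel or AMD chip implements:

```python
# 2-bit saturating counter: states 0-1 predict "not taken",
# states 2-3 predict "taken". A single wrong guess only nudges the
# counter; it takes two wrong guesses in a row to flip the
# prediction, so a loop branch that is taken 99 times and then
# falls through once barely mispredicts at all.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start in "weakly taken"

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
history = [True] * 99 + [False]   # loop branch: taken 99x, then exit
correct = sum(1 for taken in history
              if (p.predict() == taken, p.update(taken))[0])
print(correct)  # 99
```

Only the final loop-exit branch is mispredicted, so the expensive pipeline flush from the previous example happens once instead of constantly.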
-------------
So now we have two reasons that could explain Intel being faster on a clock-for-clock basis. While there are numerous other factors (cache policy, cache size, Hyper-Threading, µop fusion, compiler optimization, etc.), hopefully this gives you a small sample of microprocessor architecture and an answer to your question.
-Kevin