<<
So the instructions are not held in cache but are directly sent to MPU from software via an instruction to do so by OS/software yeah? >>
*scratches head* No, it doesn't work this way, there are no instructions to fetch instructions.

The von Neumann / stored-program model (which all computers since 1945 have used) dictates that data and program instructions are held in memory; instructions are fetched and executed serially, and the results are written back to memory (modern ILP microarchitectures still maintain this illusion). Load-store architectures, which superseded accumulator, stack, and register-memory architectures, are a subset of this....data is loaded from memory into registers before being operated on by an arithmetic instruction, and the contents of a register are written back to memory when an instruction says so. Instructions are located in the memory space (and thus the caches), and instruction fetch/decode is an automatic part of the general pipeline...the address of the next instruction to be fetched is dictated by the PC (program counter, or instruction pointer for x86 types) or the branch predictor.
<<
then the data will not be penalised as being 64 bit in that the full 64 bits of data (sum) will be stored in a register along with the memory location part of the 64 bit data. The only time the data will span over more than one register is if the data contains more than 64 bits and then it is spanned over 2x registers, but this is not a speed penalty as I thought most modern CPUs were able to fetch multiple registers in one go for the MPU to utilise, hence no multiple clock cycle penalty. >>
I think you have the wrong idea...there is no "penalty" for the register fetch or write-back (and memory addresses aren't stored along with data in the same register, they are separate entities). General purpose registers are explicit..."add r1, r2, r3" adds the contents of r2 and r3 and stores the result into r1, without regard to the register width (ignoring legacy modes, that's a different issue). You may want to do arithmetic on a 64-bit data type in a high-level language on a 32-bit MPU, but the compiler is going to compile that into multiple instructions...the 64-bit data type is invisible to the assembly code (though as I originally pointed out, this situation in the critical path is rare in most cases).
<<
What I was hoping is that the new breed of 64 bit processors will be able to do this spanning over what would appear to be 32 bit registers, but in reality are 2x 32 (simulated) bit registers that are in fact made up of one single 64 bit register?
This would reduce the amount of register space that would be unused and allow for a more efficient use of 64 bit registers when used for 32 bit data, yeah? >>
It's not feasible....general purpose registers are deterministic and defined by the instruction set. Mapping two logical 32-bit registers onto a single 64-bit register would require a new instruction set definition to use them, since the addresses of the input and output registers are explicitly encoded in each instruction. In extending a 32-bit ISA to 64-bit, you would then have three ISAs on your hands...the legacy 32-bit ISA that is compatible with old software, the new 64-bit ISA, and a bastard 32-bit ISA with 2X the number of registers that no one would ever use.

With x86, there is also a limit to the benefit of additional logical registers since it's a two-operand format. IA64 is an exception with 128 logical registers, because as an in-order VLIW ISA it's more susceptible to memory latency, and a larger number of registers are needed for the register stack windows.
Also, "wasted space" from sign-extending a 32-bit value to fill the contents of a 64-bit register isn't an issue....keep in mind that with register renaming, the number and width of the actual physical registers are greater than the logical registers defined by the ISA.
<<
Where's our FAQ boy? >>
Already taken care of.
