Vladimir Volkonskiy
The Elbrus-2000 (E2k) microprocessor architecture, described in Microprocessor Report, Vol.11, No.2, 1999, has several special features: 1) explicit instruction-level parallelism based on a wide instruction word (like VLIW or EPIC), 2) hardware support for full compatibility with IA-32 based on transparent dynamic binary translation, and 3) hardware support for secure implementation of any high-level language. Utilizing all these architectural features requires strong compiler technology. We present optimizing compilers developed for the E2k architecture.
The E2k optimizing compiler for high-level languages was developed along with the architecture. As a result, some architectural features of E2k are more suitable for compiler optimization than those of VLIW or EPIC architectures. We consider some of them, such as branch preparation instead of branch prediction, asynchronous array prefetch in addition to (and sometimes instead of) prefetching data into the cache, and some specific features of speculation. As we are in the process of retargeting the E2k optimizing compiler to Itanium 2, we present a preliminary comparative analysis of the strong and weak points of both architectures from the optimizing compiler's point of view. We also present some important algorithms implemented in the compiler, such as global interprocedural analysis and global scheduling.
The transparent dynamic binary translation system was also developed along with the E2k architecture. It was necessary for efficient execution of any IA-32 program, including any operating system. We present a hierarchical four-level binary translation system with a strong region-based optimizer at the highest level of the system. Some specific details of the optimizer, as well as the most important features of hardware support for both compatibility and optimization goals, are discussed.
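The abstract does not say how code moves between the four translation levels. A minimal sketch, assuming the common tiered approach in which a region is promoted to a higher (more heavily optimizing) level the more often it executes; the thresholds and the name level_for are hypothetical and not taken from the E2k system:

```python
# Hypothetical execution-count thresholds for promotion into levels 2, 3 and 4.
# The abstract only states that there are four levels; these numbers are invented.
PROMOTION_THRESHOLDS = (10, 100, 1000)


def level_for(exec_count: int) -> int:
    """Return the translation level (1..4) for a region executed exec_count times."""
    level = 1
    for threshold in PROMOTION_THRESHOLDS:
        if exec_count >= threshold:
            level += 1
    return level


if __name__ == "__main__":
    for count in (0, 10, 999, 1000):
        print(count, "->", "level", level_for(count))
```

The idea behind such a scheme is that rarely executed code is translated cheaply, while only the hottest regions pay for the expensive optimizer at the top level.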
The major distinctive feature of the E2k compiler technology is secure implementation of any high-level language. It is based on strong hardware checks of all operations on pointers. It also separates and strongly protects the private data of each module. We present secure C and C++ implemented in the compiler on the basis of a secure Linux kernel. The secure semantic mode is implemented almost without any language restrictions. It makes it possible to find very sophisticated bugs in a program.
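The abstract does not describe the format of E2k's hardware pointer checks. As a rough software analogy only, a pointer can carry a reference to the object it points into, so that every dereference is checked against that object's bounds. The class CheckedPointer below is a hypothetical illustration, not the E2k mechanism:

```python
class CheckedPointer:
    """Illustrative bounds-checked pointer: an underlying buffer plus an offset."""

    def __init__(self, buffer: list, offset: int = 0):
        self._buffer = buffer
        self._offset = offset

    def __add__(self, delta: int) -> "CheckedPointer":
        # Pointer arithmetic stays tied to the same underlying object.
        return CheckedPointer(self._buffer, self._offset + delta)

    def load(self):
        # Every dereference is checked against the object's bounds.
        if not 0 <= self._offset < len(self._buffer):
            raise IndexError(f"out-of-bounds access at offset {self._offset}")
        return self._buffer[self._offset]


if __name__ == "__main__":
    p = CheckedPointer([10, 20, 30])
    print((p + 2).load())      # last valid element
    try:
        (p + 3).load()         # one past the end: rejected
    except IndexError as err:
        print("caught:", err)
```

An off-by-one read that would silently return garbage in ordinary C is caught at the faulting access here, which is the kind of bug the secure mode is said to expose.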
Presenter's Biography:
Vladimir Volkonskiy is a division chief at the Russian company Elbrus-MCST. He received an M.S. degree in mathematics from Moscow State University in 1972 and a Ph.D. degree in computer science from the Moscow Institute of Precision Mechanics and Computer Equipment in 1980. His main research interests and professional activity include compilers, optimization algorithms, dynamic optimizing systems, secure implementation of programming languages in compilers, and computer architecture design supporting all these directions. Currently he manages all compiler projects for the Elbrus-2000 (E2k) computer architecture running at Elbrus-MCST, including optimizing compilers for C, C++ and Fortran in both secure and regular semantic modes, and an optimizing binary translation system from IA-32.
Now, this was speculation that turned out wrong concerning Merom. But Nehalem is coming, so let's hope.
next intel cpu?
--------------------------------------------------------------------------------
SAMSAMHA, 08-19-2005, 08:25 PM
I know it's all speculation but what's your thoughts?
AT NEXT WEEK'S Intel Developer Forum, the firm is due to announce a next generation x86 processor core. The current speculation is this new core is going to be based on one of the existing Pentium M cores. I think it's going to be something completely different.
If it was just a Pentium M variant I don't think there'd be such a fuss about it. Intel is portraying this as the biggest change since the original P4, yet there have been several new cores introduced since then including the Pentium M itself. No, this change is bigger.
The change is so big in fact, it's the reason for Apple's processor switch. Indeed the phrase given when Steve Jobs announced the switch, "performance per watt", is the very same phrase being used by Intel spokesmen.
All we know is it's going to be multi-core, it's also going to be 64 bit and support hyper threading. The problem is that trying to do all this at the same time isn't going to reduce power consumption; in fact, doing all this means power consumption is more likely to increase.
There are ways to decrease power consumption but many of these seem to have been already used in the Pentium M series. They can go further, but IBM has already gone beyond this in the Cell and Xbox 360's PowerPC cores. Perhaps Intel is planning something rather more radical.
The only hint is some comments from Intel apparently saying the processor will be "structurally different" but will have no problems running the same apps. When has Intel ever had to say this? It can normally be assumed a new core will run the same apps - unless of course, it's radically different.
So, what is Intel up to?
According to the Apple announcement, the reason it is switching is "performance per watt". Steve Jobs showed a graph with PowerPC projected at 15 computation units per watt and Intel's projected at 70 units per watt. Intel must have figured out a way to reduce power consumption more than fourfold. How? Can this even be done?
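Taking the two figures quoted from the graph at face value, the size of the gap can be worked out directly. The variable names below are only for illustration:

```python
# Figures quoted above from the Apple announcement graph.
powerpc_units_per_watt = 15
intel_units_per_watt = 70

# How many times more work per watt the Intel projection claims.
improvement = intel_units_per_watt / powerpc_units_per_watt

# Equivalently, the fraction of the power needed for the same amount of work.
relative_power = powerpc_units_per_watt / intel_units_per_watt

print(f"{improvement:.2f}x the work per watt")                     # 4.67x
print(f"{relative_power:.0%} of the power for the same work")      # 21%
```

So the projection amounts to a little under a fivefold gain in performance per watt, or roughly a fifth of the power for the same work.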
Yes, it can be done but it requires striking changes in the processor design. The forthcoming Cell processor's SPEs at 3.2 GHz use just two to three watts and yet are said to be just as fast as any desktop processor. I think we can safely assume a future Intel device will not use SPEs instead of x86 processors but it could use some of the same techniques to bring the power consumption down.
Modern microprocessors throw millions of transistors at producing increasingly small performance boosts. The SPEs' designers didn't do this; they only used transistors if they could be shown to produce a large performance boost. The result is in essence the antithesis of modern microprocessor design: the SPEs are very simple with a relatively short pipeline, strictly in-order execution and no branch prediction.
An extremely stripped back x86 design can be and has been done but performance doesn't so much suffer as get tortured to death. Out of order execution seems to be pretty critical to x86 performance, most likely due to the small number of architectural registers. Then there is the x86 instruction decoder which on simple processors takes up a significant amount of room and of course consumes power. Even the stripped back designs can't remove this.
However, there was one company which took a more radical approach and while its processor wasn't exactly blazing fast it was faster than those using the stripped back approach; what's more, it didn't include the x86 instruction decoder. That company was Transmeta and its line of processors weren't x86 at all; they were VLIW (Very Long Instruction Word) processors which used "code morphing" software to translate the x86 instructions into their own VLIW instruction set.
Transmeta, however, made mistakes. During execution, its code morphing software would have to keep jumping in to translate the x86 instructions into their VLIW instruction set. The translation code had to be loaded into the CPU from memory and this took up considerable processor time, lowering the CPU's potential performance. It could have solved this with additional cache or even a second core but keeping costs down was evidently more important. The important thing is Transmeta proved it could be done; the technique just needs perfecting.
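The cost being described is repeated translation work. A minimal sketch of the general idea of a translation cache, where each x86 block is translated once and the result is reused on every later execution; TranslationCache and toy_translate are hypothetical names, and this is not Transmeta's actual code morphing software:

```python
class TranslationCache:
    """Translate each x86 block once, keyed by its address, and reuse the result."""

    def __init__(self, translate):
        self._translate = translate   # function: x86 block -> native block
        self._cache = {}              # address -> translated block
        self.translations = 0         # how many times the translator actually ran

    def fetch(self, address, x86_block):
        if address not in self._cache:
            self._cache[address] = self._translate(x86_block)
            self.translations += 1
        return self._cache[address]


def toy_translate(x86_block):
    # Stand-in for a real translator: just tags each instruction as "native".
    return tuple(f"native:{insn}" for insn in x86_block)


if __name__ == "__main__":
    cache = TranslationCache(toy_translate)
    block = ("mov eax, 1", "add eax, 2")
    for _ in range(1000):              # a hot loop executing the same block
        cache.fetch(0x400000, block)
    print(cache.translations)          # the translator ran only once
```

The larger the cache that holds translated blocks (and the translator itself), the less often the slow path is taken, which is the argument the article makes for big on-chip caches.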
Intel on the other hand can and do build multicore processors and have no hesitation in throwing on huge dollops of cache. The Itanium line, also VLIW, includes processors with a whopping 9MB of cache. Intel can solve the performance problems Transmeta had because this new processor is designed to have multiple cores and while it may not have 9MB it certainly will have several megabytes of cache.
Intel likes to call its technique "EPIC" instead of VLIW but it's the same thing really.
Intel can make a VLIW processor with a large number of small, low power cores and devote one or more of these to translating x86 to the VLIW ISA; they will partly hold the translation software in the bigger cache so it'll rarely need to hit RAM. It could even do this with a dedicated thread per core but that'll need a big shared cache. Larrabee, anyone?
Intel has a lot of experience of VLIW processors from its Itanium project which has now been going on for more than a decade. Intel also now has HP's expertise on board as HP's entire Itanium design team was recently transferred to Intel.
Another technology Intel has access to is DEC's FX!32. This was written in the mid 1990s and allowed X86 software to run on Alpha RISC microprocessors. A lot of the Alpha people and technology were transferred to Intel and FX!32 most likely went with them; indeed Intel has already been developing similar technology to run X86 binaries on Itanium for quite some time now.
It gets better. Both the Itanium and the Transmeta designs were said to be inspired by VLIW designs built in Russia by a company called Elbrus. Intel did a deal with Elbrus in mid 2004 then went on to buy the company in August 2004. The exact nature of the deal is unclear, however, as another company continued and taped out the E2K processor earlier this year.
Most interesting, though, is the E2K compiler technology which allows it to run X86 software. This is exactly the sort of technology Intel needs and since last year it has had access to it and employs many of its designers.
So, Intel has access to VLIW technology from the Itanium and HP as well as the translation software from DEC. Most importantly it has the highly advanced technology from Elbrus which has been in development since the 1980s.
The New Architecture
To reduce power you need to reduce the number of transistors, especially ones which don't provide a large performance boost. Switching to VLIW means they can immediately cut out the hefty X86 decoders.
Out of order hardware will go with them, as it is huge, consumes masses of power and in VLIW designs is completely unnecessary. The branch predictors may also go on a diet or even get removed completely as the Elbrus compiler can handle even complex branches.
With the X86 baggage gone the hardware can be radically simplified - the limited architectural registers of the x86 will no longer be a limiting factor. Intel could use a design with a single large register file covering integer, floating point and even SSE, 128 x 64 bit registers sounds reasonable (SSE registers could map to 2 x 64 bit registers).
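To make the register mapping concrete, here is a minimal sketch of how a 128 bit SSE value could be held in two adjacent 64 bit slots of a unified 128 entry register file. The layout (SSE_BASE, pairs of consecutive slots) is a hypothetical choice for illustration; the article proposes no specific layout:

```python
MASK64 = (1 << 64) - 1
NUM_REGS = 128          # unified file of 64 bit registers, as suggested above
SSE_BASE = 96           # hypothetical: last 32 slots hold 16 SSE registers


def write_sse(regfile, xmm_index, value128):
    """Store a 128 bit value as two 64 bit halves in adjacent slots."""
    if not 0 <= value128 < (1 << 128):
        raise ValueError("value does not fit in 128 bits")
    slot = SSE_BASE + 2 * xmm_index
    regfile[slot] = value128 & MASK64      # low 64 bits
    regfile[slot + 1] = value128 >> 64     # high 64 bits


def read_sse(regfile, xmm_index):
    """Reassemble the 128 bit value from its two 64 bit halves."""
    slot = SSE_BASE + 2 * xmm_index
    return (regfile[slot + 1] << 64) | regfile[slot]


if __name__ == "__main__":
    regs = [0] * NUM_REGS
    value = 0x0123456789ABCDEF_FEDCBA9876543210
    write_sse(regs, 3, value)
    print(hex(regs[SSE_BASE + 6]), hex(regs[SSE_BASE + 7]))
    print(read_sse(regs, 3) == value)
```

With 16 SSE registers taking two slots each, 32 of the 128 entries would be used, leaving the rest for integer and floating point values under this assumed layout.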
Rumours suggesting the cores will be four issue wide sound perfectly reasonable for a VLIW processor. At least two (Hyper)threads will almost certainly be supported but more would require more registers not to mention giving them something of a naming problem - Ultra-hyper-threading?
You can of course expect all these cores to support 64 bit processing and SSE3; you can also expect there to be lots of them. Intel's current Dothan cores are already tiny but VLIW cores without out of order execution or the large, complex x86 decoders leave a very small, very low power core. Intel will be able to make processors stuffed to the gills with cores like this.
One interesting aspect of an architecture like this is it gives Intel the ability to learn from it and change it in a way X86 never could.
Changing the basic X86 design would lead to all sorts of difficulties with compatibility so instead, over the years more and more has been added and little if anything removed.
Intel will now be free to do as it pleases: with X86 decoding done in software, Intel can change the hardware at will. If the processor is weak in a specific area the next generation can be modified without worrying about backwards compatibility. Apart from the speedup nobody will notice the difference. It could even use different types of cores on the same chip for different types of problems.
One thing I do not expect is the new core to be an Itanium derivative; it was not designed for low power. Building a new ISA gives Intel a chance to learn the lessons of the sometimes erratic performance of the Itanium. Not that we'll see the new ISA; this will be hidden from developers underneath the software translation layer. A variant of this device could end up badged as an Itanium though; the software translation should have no trouble converting one VLIW variant to another.
How Fast Will It Be?
Like the Transmeta devices, software will not run at its full potential until it's been fully translated; you can pretty much bet Intel will make sure third party benchmarkers are made well aware of this. I suspect we may also see speculative translation running in the background so everything gets translated and saved as soon as possible. Once translated, the new binaries are saved to disc; they will run as native VLIW thereafter.
The forte of this processor will be multithreaded code and multitasking. If you are doing lots of things at once you'll be well happy; servers in particular will benefit from this approach. Multitasking will benefit because different cores will get different tasks; a user switching between them will not cause them to halt so responsiveness of systems with this processor will be very good.
Single threaded performance on the other hand could be relatively weak although that's not a given. I expect AMD will hold on to its crown in single threaded performance for now.
Conclusion
Based on the various comments and actions of Intel, as well as other companies, I think Intel is preparing to announce a completely new VLIW processor which uses software to decode x86 instructions and order their execution. It might be relatively weak on single threaded code but it'll more than make up for it in numbers; heavily multithreaded code should run very nicely indeed.
We'll see shortly if my speculation is correct; however, multiple processor vendors are already going in the same direction with a large number of simple cores. X86 hardware implementations don't lend themselves to the simplicity required for large multicore devices; a VLIW approach has already been shown to be workable whilst reducing both power consumption and size.
Historically, Intel has often used new techniques after they've been used by other vendors. Its real strength is taking those ideas, improving them, then mass manufacturing them.
I expect Intel will apply its full manufacturing skills to this device - this processor could have as many as 16 cores.
To date, Apple's CPU switch to Intel has prompted a lot of speculation about the real reason as, frankly, it didn't make much sense. But if this speculation turns out to be true, the reasons behind Apple's switch are obvious.