Nehalem

Page 5 - AnandTech Forums

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Well, it appears the L1 can. On instruction execution:
Each 128-bit instruction word contains three instructions, and the fetch mechanism can read up to two instruction words per clock from the L1 cache into the pipeline.
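Taking the quoted numbers at face value, the peak front-end bandwidth works out as follows (a back-of-the-envelope sketch; the only inputs are the bundle size, instructions per bundle and bundles per clock from the description above):

```python
# Peak fetch bandwidth implied by the quoted description:
# 128-bit instruction words, 3 instructions each, 2 words per clock.
BUNDLE_BITS = 128
INSNS_PER_BUNDLE = 3
BUNDLES_PER_CLOCK = 2

peak_insns_per_clock = INSNS_PER_BUNDLE * BUNDLES_PER_CLOCK  # 6 instructions/clock
fetch_bytes_per_clock = BUNDLE_BITS * BUNDLES_PER_CLOCK // 8 # 32 bytes/clock

print(peak_insns_per_clock, fetch_bytes_per_clock)  # 6 32
```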

I don't know if Conroe is the same here or not, but I will have a look-see.

The thing that I see, and it really sticks out in my mind, is the compiler. I know it's a broken record, but Intel bought Elbrus for a reason. I am convinced it was for the compiler.

I am a little lazy today, but the Elbrus compiler is amazing. If someone wishes to dig up the info on what it is capable of, you will see this was a diamond in the rough when Intel bought the company.

I have always been convinced that Nehalem would sport an offshoot of this compiler.

If you read about it, one thing becomes clear: this compiler and raytracing go hand in hand. Just look at its requirements. It's uncanny how Intel's hardware is matching up to this software.

If software coders have to code for x86-64 anyway, it would be smarter to just go EPIC and be done with it. But that won't happen, which brings me back to the Elbrus compiler. If a processor has enough cache (Intel's have), it will run x86 instructions without a performance penalty after the software has been run once; the first time is slower. And if you also have a multicore chip where you can remove a CPU core and replace it with a vertex core, this is where RTRT enters the arena. (I don't know the proper terminology for a vertex core.)
 

jones377

Senior member
May 2, 2004
463
64
91
Originally posted by: Idontcare
Originally posted by: Nemesis 1
I am at the AT forums. Someone has finally associated Itanium with Intel's desktop processors.

If it weren't for the drama we impart on ourselves here then we'd be bored to tears waiting for the industry to do something exciting to entertain us.

Originally posted by: Nemesis 1
Intel calls this "coarse multithreading" to distinguish it from "hyperthreading technology"

It would seem silly to me for Intel's decision makers to ignore the fruits of their investment in developing their "coarse multithreading" and have the Nehalem team proceed headstrong into reinventing the wheel.

I could see the Nehalem team maybe starting with a version of coarse multithreading and improving upon it even further still than what was implemented and released in Montecito.

What I am trying to get at is that there is every reason to expect Nehalem's "SMT" to perform no worse than Itanium's "coarse multithreading", considering the latter predates the former by nearly 2 years.

Is this a reasonable expectation? If it is, then we need to find some performance analyses on the effectiveness/efficiency of Itanium's "coarse multithreading".

Ultimately, CMT is not as powerful an implementation as SMT. It is not able to schedule instructions from different threads in the same clock cycle. Rather, it switches to the other thread when the first one suffers an event such as a cache miss that would otherwise cause a "bubble" in the pipeline. You could say that SMT does everything that CMT does, and more.
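The distinction can be made concrete with a toy pipeline model (my own illustrative sketch with made-up latencies, not modeled on any real core): the CMT core keeps one resident thread and switches only when it stalls, while the SMT core can fill its issue slots from both threads in the same cycle.

```python
MISS = 5        # assumed cache-miss penalty, in cycles
WIDTH = 2       # issue slots per cycle
PER_THREAD = 1  # assumed per-thread ILP limit: insns one thread can issue per cycle

def run(threads, smt):
    """Cycles to drain all streams. Ops: 'c' = 1-cycle op,
    'm' = op whose cache miss then blocks its thread for MISS cycles."""
    pc = [0] * len(threads)             # next instruction per thread
    blocked_until = [0] * len(threads)  # cycle at which each thread is ready again
    cur = 0                             # thread resident in the pipeline (CMT only)
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(len(threads))):
        ready = [t for t in range(len(threads))
                 if pc[t] < len(threads[t]) and blocked_until[t] <= cycle]
        if not smt:
            # CMT: only the resident thread may issue; switch when it stalls
            if cur not in ready and ready:
                cur = ready[0]
            ready = [t for t in ready if t == cur]
        slots = WIDTH
        for t in ready:                 # SMT: fill slots from every ready thread
            for _ in range(min(PER_THREAD, slots, len(threads[t]) - pc[t])):
                if threads[t][pc[t]] == 'm':
                    blocked_until[t] = cycle + 1 + MISS
                pc[t] += 1
                slots -= 1
        cycle += 1
    return cycle

streams = ['cccmcccc', 'cccccccc']      # thread 0 misses once; thread 1 never does
print(run(streams, smt=False), run(streams, smt=True))  # CMT: 16, SMT: 13
```

In the model both cores hide the miss behind the other thread, but only SMT co-issues from both threads each cycle, which is where its extra throughput comes from.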

Intel claimed that SMT in the P4 added up to about 20% extra throughput performance for about 5% bigger core die area. I believe the CMT in Montecito (Itanium) provides a similar speed boost. However, IBM has also implemented SMT in their Power processors and their claimed speedup is much larger than that. IBM's version of SMT also added more than 5% extra die area to the core, so don't think of that as some kind of universal constant for SMT. It's just a matter of how many resources on the CPU they wish to expand/duplicate before they run into die area limitations with regards to economics etc (you gotta manufacture these things for a reasonable cost after all!).

What I'm getting at is that I don't think Nehalem should be compared with Itanium since they take different approaches to multithreading. Rather IBM has shown that it *IS* possible to get a bigger speedup from SMT than Intel managed previously with the P4. We'll see whether Intel can do better than their previous attempt. It is what they are claiming so far, sort of.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Vladimir Volkonskiy

Elbrus-2000 (E2k) microprocessor architecture described in Microprocessor Report, Vol.11, No.2, 1999 has several special features: 1) explicit instruction level parallelism on the basis of wide instruction word (like VLIW or EPIC), 2) hardware support for full compatibility with IA-32 on the basis of transparent dynamic binary translation, and 3) hardware support for secure implementation of any high level language. To utilize all these architectural features strong compiler technology is needed. We present optimizing compilers developed for the E2k architecture.
The E2k optimizing compiler from high level languages was developed along with the architecture. As a result, some architectural features of E2k are more suitable for compiler optimization than VLIW or EPIC architectures. We consider some of them, such as branch preparation instead of branch prediction, asynchronous array prefetch in addition to (sometimes, instead of) prefetch data in cache, some specific features of speculation. Being in the process of retargeting E2k optimizing compiler on Itanium2, we present a preliminary comparative analysis of strong and weak points of both architectures in terms of the optimizing compiler. We also present some important algorithms implemented in the compiler, such as global interprocedural analysis and global scheduling.
The transparent dynamic binary translation system was also developed along with the E2k architecture. It was necessary for efficient execution of any IA-32 program including any operating system. We present a hierarchical four-level binary translation system with a strong region based optimizer on the highest level of the system. Some specific details of the optimizer as well as the most important features of hardware support for both compatibility and optimization goals are discussed.
The major distinctive feature of the E2k compiler technology is secure implementation of any high level language. It is based on strong hardware checks of all operations on pointers. It also separates and strongly protects private data of each module. We present secure C and C++ implemented in the compiler on the basis of secure Linux kernel. The secure semantic mode implementation is done almost without any language restrictions. It enables finding very sophisticated bugs in the program.
Presenter's Biography:
Vladimir Volkonskiy is a chief of division in the Russian company Elbrus-MCST. He received M.S. degree in mathematics from the Moscow State University in 1972, Ph.D. degree in computer science from Moscow Institute of Precision Mechanics and Computer Equipment in 1980. His main research interests and professional activity include compilers, optimization algorithms, dynamic optimizing systems, secure implementation of programming languages in compilers, and computer architecture design supporting all these directions. Currently he manages all compiler projects for Elbrus-2000 (E2k) computer architecture running at Elbrus-MCST including optimizing compilers from C, C++, Fortran, in both secure and regular semantic modes, and optimizing binary translation system from IA-32.



Now, this was speculation that turned out wrong concerning Merom. But Nehalem is coming, so let's hope.


next intel cpu?



SAMSAMHA - 08-19-2005, 08:25 PM

I know it's all speculation but what's your thoughts?


AT NEXT WEEK'S Intel developer forum, the firm is due to announce a next-generation x86 processor core. The current speculation is this new core is going to be based on one of the existing Pentium M cores. I think it's going to be something completely different.
If it was just a Pentium M variant I don't think there'd be such a fuss about it. Intel is portraying this as the biggest change since the original P4, yet there have been several new cores introduced since then, including the Pentium M itself. No, this change is bigger.

The change is so big, in fact, it's the reason for Apple's processor switch. Indeed the phrase given when Steve Jobs announced the switch, "performance per watt", is the very same phrase being used by Intel spokesmen.

All we know is it's going to be multi-core, it's also going to be 64-bit and support hyper-threading. The problem is trying to do all this at the same time isn't going to reduce power consumption; in fact doing all this means power consumption is more likely to increase.

There are ways to decrease power consumption, but many of these seem to have been already used in the Pentium M series. They can go further, but IBM has already gone beyond this in the Cell and Xbox 360's PowerPC cores. Perhaps Intel is planning something rather more radical.

The only hint is some comments from Intel apparently saying the processor will be "structurally different" but will have no problems running the same apps. When has Intel ever had to say this? It can normally be assumed a new core will run the same apps - unless, of course, it's radically different.

So, what is Intel up to?

According to the Apple announcement, the reason it is switching is "performance per watt". Steve Jobs showed a graph with PowerPC projected at 15 computation units per watt and Intel's projected at 70 units per watt. Intel must have figured out a way to reduce power consumption four-fold. How? Can this even be done?

Yes, it can be done, but it requires striking changes in the processor design. The forthcoming Cell processor's SPEs at 3.2GHz use just two to three watts and yet are said to be just as fast as any desktop processor. I think we can safely assume a future Intel device will not use SPEs instead of x86 processors, but it could use some of the same techniques to bring the power consumption down.

Modern microprocessors throw millions of transistors at producing increasingly small performance boosts. The SPEs' designers didn't do this; they only used transistors if they could be shown to produce a large performance boost. The result is in essence the antithesis of modern microprocessor design: the SPEs are very simple, with a relatively short pipeline, strictly in-order execution and no branch prediction.

An extremely stripped-back x86 design can be and has been done, but performance doesn't so much suffer as get tortured to death. Out-of-order execution seems to be pretty critical to x86 performance, most likely due to the small number of architectural registers. Then there is the x86 instruction decoder, which on simple processors takes up a significant amount of room and of course consumes power. Even the stripped-back designs can't remove this.

However, there was one company which took a more radical approach, and while its processor wasn't exactly blazing fast, it was faster than those using the stripped-back approach; what's more, it didn't include the x86 instruction decoder. That company was Transmeta, and its line of processors weren't x86 at all; they were VLIW (Very Long Instruction Word) processors which used "code morphing" software to translate the x86 instructions into their own VLIW instruction set.

Transmeta, however, made mistakes. During execution, its code morphing software would have to keep jumping in to translate the x86 instructions into its VLIW instruction set. The translation code had to be loaded into the CPU from memory, and this took up considerable processor time, lowering the CPU's potential performance. It could have solved this with additional cache or even a second core, but keeping costs down was evidently more important. The important thing is Transmeta proved it could be done; the technique just needs perfecting.
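The amortization the article is describing can be sketched in a few lines (a toy model; the costs, the `run_block` helper and the cached "native" strings are made-up illustrations, not real Transmeta figures):

```python
# Toy model of Transmeta-style "code morphing": translating an x86 block
# is expensive, but the translated (native VLIW) result is cached and
# reused, so hot code pays the translation cost only once.
TRANSLATE_COST = 50   # assumed cycles to translate one block (illustrative)
EXEC_COST = 2         # assumed cycles to run the translated block (illustrative)

translation_cache = {}

def run_block(block_addr):
    """Return cycles spent executing the block at block_addr once."""
    cycles = 0
    if block_addr not in translation_cache:
        translation_cache[block_addr] = f"native({block_addr:#x})"
        cycles += TRANSLATE_COST          # cold: translate first
    return cycles + EXEC_COST             # then run the cached native code

# A hot loop re-executes the same block; the translation cost amortizes away.
total = sum(run_block(0x401000) for _ in range(1000))
print(total)  # 50 + 1000 * 2 = 2050
```

More cache, or a second core doing the translation, attacks exactly the cold-path cost in this model, which is the fix the article says Transmeta declined to pay for.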

Intel, on the other hand, can and does build multicore processors, and has no hesitation in throwing on huge dollops of cache. The Itanium line, also VLIW, includes processors with a whopping 9MB of cache. Intel can solve the performance problems Transmeta had because this new processor is designed to have multiple cores, and while it may not have 9MB it certainly will have several megabytes of cache.


Intel likes to call its technique "EPIC" instead of VLIW, but it's the same thing really.

Intel can make a VLIW processor with a large number of small, low-power cores and devote one or more of these to translating x86 to the VLIW ISA; they will partly hold the translation software in the bigger cache so it'll rarely need to hit RAM. It could even do this with a dedicated thread per core, but that'll need a big shared cache. Larrabee, anyone?

Intel has a lot of experience of VLIW processors from its Itanium project, which has now been going on for more than a decade. Intel also now has HP's expertise on board, as HP's entire Itanium design team was recently transferred to Intel.

Another technology Intel has access to is DEC's FX!32. This was written in the mid-1990s and allowed x86 software to run on Alpha RISC microprocessors. A lot of the Alpha people and technology were transferred to Intel and FX!32 most likely went with them; indeed Intel has already been developing similar technology to run x86 binaries on Itanium for quite some time now.

It gets better. Both the Itanium and the Transmeta designs were said to be inspired by VLIW designs built in Russia by a company called Elbrus. Intel did a deal with Elbrus in mid-2004, then went on to buy the company in August 2004. The exact nature of the deal is unclear, however, as another company continued and taped out the E2K processor earlier this year.

Most interesting, though, is the E2K compiler technology which allows it to run x86 software. This is exactly the sort of technology Intel needs, and since last year it has had access to it and employs many of its designers.

So, Intel has access to VLIW technology from the Itanium and HP as well as the translation software from DEC. Most importantly it has the highly advanced technology from Elbrus which has been in development since the 1980s.


The New Architecture
To reduce power you need to reduce the number of transistors, especially ones which don't provide a large performance boost. Switching to VLIW means they can immediately cut out the hefty x86 decoders.

The out-of-order hardware will go with them, as it is huge, consumes masses of power and in VLIW designs is completely unnecessary. The branch predictors may also go on a diet, or even get removed completely, as the Elbrus compiler can handle even complex branches.

With the x86 baggage gone the hardware can be radically simplified - the limited architectural registers of x86 will no longer be a limiting factor. Intel could use a design with a single large register file covering integer, floating point and even SSE; 128 x 64-bit registers sounds reasonable (SSE registers could map to 2 x 64-bit registers).
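The proposed SSE mapping is just a bit-level split of one 128-bit value across two 64-bit registers; a minimal sketch of the idea (purely illustrative of the proposal, not any real register file):

```python
# Split one 128-bit SSE value across two 64-bit registers and rejoin it,
# as the proposed "SSE registers map to 2 x 64-bit registers" scheme implies.
MASK64 = (1 << 64) - 1

def split_sse(value128):
    """Return (low, high) 64-bit halves of a 128-bit value."""
    return value128 & MASK64, (value128 >> 64) & MASK64

def join_sse(lo, hi):
    """Reassemble the 128-bit value from its two 64-bit halves."""
    return (hi << 64) | lo

v = 0x0123456789ABCDEF_FEDCBA9876543210
lo, hi = split_sse(v)
assert join_sse(lo, hi) == v   # round-trips losslessly
```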


Rumours suggesting the cores will be four issue wide sound perfectly reasonable for a VLIW processor. At least two (hyper)threads will almost certainly be supported, but more would require more registers, not to mention giving them something of a naming problem - ultra-hyper-threading?

You can of course expect all these cores to support 64-bit processing and SSE3; you can also expect there to be lots of them. Intel's current Dothan cores are already tiny, but VLIW cores without out-of-order execution or the large, complex x86 decoders leave a very small, very low-power core. Intel will be able to make processors stuffed to the gills with cores like this.

One interesting aspect of an architecture like this is it gives Intel the ability to learn from it and change it in a way X86 never could.

Changing the basic X86 design would lead to all sorts of difficulties with compatibility so instead, over the years more and more has been added and little if anything removed.

With x86 decoding done in software, Intel will now be free to do as it pleases and can change the hardware at will. If the processor is weak in a specific area the next generation can be modified without worrying about backwards compatibility. Apart from the speedup, nobody will notice the difference. It could even use different types of cores on the same chip for different types of problems.

One thing I do not expect is the new core to be an Itanium derivative; it was not designed for low power. Building a new ISA gives Intel a chance to learn the lessons of the sometimes erratic performance of the Itanium. Not that we'll see the new ISA; this will be hidden from developers underneath the software translation layer. A variant of this device could end up badged as an Itanium though; the software translation should have no trouble converting one VLIW variant to another.

How Fast Will It Be?
Like the Transmeta devices, software will not run at its full potential until it's been fully translated; you can pretty much bet Intel will make sure third-party bench-markers are made well aware of this. I suspect we may also see speculative translation running in the background so everything gets translated and saved as soon as possible. Once translated, the new binaries are saved to disc; they will run as native VLIW thereafter.

The forte of this processor will be multithreaded code and multitasking. If you are doing lots of things at once you'll be well happy; servers in particular will benefit from this approach. Multitasking will benefit because different cores will get different tasks, and a user switching between them will not cause them to halt, so the responsiveness of systems with this processor will be very good.

Single-threaded performance, on the other hand, could be relatively weak, although that's not a given. I expect AMD will hold on to its crown in single-threaded performance for now.

Conclusion
Based on the various comments and actions of Intel, as well as other companies, I think Intel is preparing to announce a completely new VLIW processor which uses software to decode x86 instructions and order their execution. It might be relatively weak on single-threaded code, but it'll more than make up for it in numbers; heavily multithreaded code should run very nicely indeed.

We'll see shortly if my speculation is correct. However, multiple processor vendors are already going in the same direction with a large number of simple cores. x86 hardware implementations don't lend themselves to the simplicity required for large multicore devices; a VLIW approach has already been shown to be workable whilst reducing both power consumption and size.

Historically, Intel has often adopted new techniques after they've been used by other vendors. Its real strength is taking those ideas, improving them, then mass manufacturing them.

I expect Intel will apply its full manufacturing skills to this device - this processor could have as many as 16 cores.

To date, Apple's CPU switch to Intel has prompted a lot of speculation about the real reason, as frankly, it didn't make much sense. But if this speculation turns out to be true, the reasons behind Apple's switch are obvious.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Nemesis 1
Repost

:D You mean "double post" right? Or are you really meaning to call "repost" on your own thread? :p

I saw your post the other day on Elbrus and you did jog my memory about the infamous "E2K is an Itanium killer" sensationalism articles on TheINQ.

Needless to say we never really saw E2K slay the Itanium...but Intel did not buy them for no good reason. Your compiler theory is plausible.

One thing I read on something that came up in google was that the IA32 software compatibility package for Itanium (Montecito and beyond) was a direct result of the Elbrus acquisition.

Who knows, maybe Nehalem will sport an Itanium IA64 software emulator :Q
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I am not real sure that the Nehalem core will use this compiler. But that's not to say that the Nehalem generation of processors won't leverage this software. If you look at the info Intel has given us about Larrabee: simple in-order cores, with each core capable of 2 x86 threads. Intel, I believe, stated that 1 of the 16 cores could be a vertex engine. Intel did say this was their first tera-scale processor. So it kinda looks to me as if Intel is describing very simple cores - remove the x86 decoders and the other baggage that's not needed with this compiler.

Larrabee looks to fit the bill. This is interesting .
 

Keysplayr

Elite Member
Jan 16, 2003
21,219
55
91
Originally posted by: Nemesis 1
Ok, I see what you're saying. Really, it wouldn't make sense for Intel to use turbo mode to give us performance increases. You're basically saying that Nehalem is < Penryn if turbo mode is off. I can't buy into that. No improvements from the on-die memory controller. No improvements from multi-leveled shared cache, meaning more than L2 is shared.

No arch. gains as far as logic and SSE4.2. Now I am thinking Intel is going to sandbag right up till the release. No reason to do otherwise.

If Nehalem has a near-20-stage pipeline, you can sure as hell buy into that. Nehalem seems to be Intel's "2nd try" at Netburst. You know what happened when the very first P4 Willy rolled into retail, don't you?

It sucked. At least compared to the PIII. Let's hope Intel learned from its initial Netburst failures.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Yep, and then I saw the P4C. It's a bit different this time around: if Nehalem isn't what Intel wants - and by now they would know - we won't see it. Until AMD comes up with something, Intel has time to get things right this time around. But as you said, history could repeat itself. It's just not likely. If a company doesn't learn from past mistakes it's doomed to repeat them. Let's hope Intel learned.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Intel claimed that SMT in the P4 added up to about 20% extra throughput performance for about 5% bigger core die area. I believe the CMT in Montecito (Itanium) provides a similar speed boost. However, IBM has also implemented SMT in their Power processors and their claimed speedup is much larger than that. IBM's version of SMT also added more than 5% extra die area to the core, so don't think of that as some kind of universal constant for SMT. It's just a matter of how many resources on the CPU they wish to expand/duplicate before they run into die area limitations with regards to economics etc (you gotta manufacture these things for a reasonable cost after all!).

It said Hyperthreading added less than 5% to the die size.

http://www.research.ibm.com/jo...l/rd/494/sinharoy.html

IBM says SMT adds 24% to the core area.

CMT in Montecito takes less than 3% of the die. IBM's Power 5 added substantially more to improve the performance of SMT, while Intel went with the minimum feature approach.
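Running the quick arithmetic on the figures in this sub-thread (all approximate vendor claims as quoted above, not measurements):

```python
# P4 Hyper-Threading: ~20% throughput gain for ~5% extra core area.
p4_speedup, p4_area = 0.20, 0.05
# Power5 SMT: ~24% extra core area (its claimed speedup is larger).
power5_area = 0.24

p4_gain_per_area = p4_speedup / p4_area  # ~4% of speedup per 1% of area
# Speedup Power5 would need to match the P4's gain-per-area ratio:
breakeven = power5_area * p4_gain_per_area
print(p4_gain_per_area, breakeven)  # 4.0 0.96
```

So Power5's 24% of extra area would have to buy roughly a 96% speedup to match the P4's gain-per-area ratio; the point is simply that, as the post above says, the area cost of SMT is not a universal constant but a design-budget choice.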
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
The numbers from the P4 really don't help us a lot, as Intel has stated - and I am not going to doubt them - that SMT is much improved from Netburst to Nehalem. But that was a good read.
 

jones377

Senior member
May 2, 2004
463
64
91
Originally posted by: IntelUser2000
Intel claimed that SMT in the P4 added up to about 20% extra throughput performance for about 5% bigger core die area. I believe the CMT in Montecito (Itanium) provides a similar speed boost. However, IBM has also implemented SMT in their Power processors and their claimed speedup is much larger than that. IBM's version of SMT also added more than 5% extra die area to the core, so don't think of that as some kind of universal constant for SMT. It's just a matter of how many resources on the CPU they wish to expand/duplicate before they run into die area limitations with regards to economics etc (you gotta manufacture these things for a reasonable cost after all!).

It said Hyperthreading added less than 5% to the die size.

http://www.research.ibm.com/jo...l/rd/494/sinharoy.html

IBM says SMT adds 24% to the core area.

CMT in Montecito takes less than 3% of the die. IBM's Power 5 added substantially more to improve the performance of SMT, while Intel went with the minimum feature approach.

Thanks, I knew I had seen the numbers somewhere, I just didn't remember where :)
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Nemesis1, when quoting posts/web pages, could you use the quote tags? It'd make your posts a whole lot easier to read. It'd also be easier to read your posts if you didn't put periods in the middle of sentences (e.g. "Thats not to say.Skulltrail isn't a great product.")
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Originally posted by: keysplayr2003
If Nehalem has near 20 stage pipeline, you can sure as hell buy into that. Nehalem seems to be Intels "2nd Try" at Netburst.

why's that
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Ya Keys, no way is Intel heading in the direction of Netburst. Nehalem will be the last of its kind.

After that, it's simple in-order, multi-thread-capable cores, with a Nehalem-generation core on die running the whole show. They may keep going to higher-issue Nehalem cores, as the Elbrus compiler was designed for up to 10 issue pipes. But that would take some time.

Gesher, though, is more likely to be the master core in a tera-scale setup.

That's why I say Nehalem may leverage the tech through Larrabee (speaking of the compiler).

We know which team designed the Dothan core. We know the team that did Netburst. We know which team is doing Nehalem. We know who is doing Gesher.

Here's my thinking on this: the Russian is more than likely Jewish, so he is probably working on compilers for both tera-scale and Gesher. We already know about Itanium and improvements to present x86 compilers.

The Pro core being x86, it and its following generations will end with Nehalem. But Nehalem will leverage tera-scale processing through PCI-E. That's where I see the Elbrus compiler used first, as already announced. Problem is, we don't really know what was announced. But with the info on the net, I believe with Larrabee Intel is introducing an improved-Elbrus-compiler EPIC in-order x86 slave chip. That's going to turn us upside down. The Elbrus compiler has already influenced all recent Intel CPUs, just in bits and pieces (Itanium, as stated, already uses it). I believe Nehalem in combination with Larrabee is a preview of what Gesher is. Gesher is why AMD bought ATI, and why ATI wanted to sell: Intel and ATI engineers worked closely together and had licenses for tech. ATI knew, as do all the high-ups. They all knew high-k and metal gates were coming on 32nm; IBM said it couldn't be done at 45nm. The guys that run these companies, most of them are really smart. All the top brainiacs keep track of the other guys' work. Intel engineers have so much money to work with, they just get the top guys.
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Nemesis 1, while I applaud your enthusiasm in terms of trying to figure out what is coming next out of Intel, there really are no facts that can support these claims, and as such this is all speculation. Especially your comments concerning this "Elbrus" compiler: there has been no indication from Intel about creating an EPIC core for the mainstream market. It is important to note that what you are suggesting would destroy all backwards compatibility of existing code and require it to be recompiled for the new core. You must understand there is a HUGE difference between the work a compiler is doing and what the scheduling logic on the chip is doing. You cannot simply take the idea behind a compiler for an EPIC CPU, somehow put it into hardware, slap it on an x86 CPU and have it revolutionize computing. Statements such as "epic inorder x86 slave chip" just do not make sense; x86 is NOT EPIC. EPIC is an entirely different type of ISA from x86. It would be possible, of course, to implement an EPIC core with an x86 compatibility mode, but that would be a radical shift from current mainstream CPUs. More likely is your other suggestion, whereby one or more complete x86 cores would be combined with multiple RISC cores or simple x86 cores, but any such talk is only speculation.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Of course it's all speculation. But I must say, looking at the Larrabee core and the info Intel has given us:

1) Simple in-order cores (in-order cores is the giveaway)
2) 2 threads per core
3) Different types of cores can be used


Now these cores are due out end of '08 / beginning of '09.

There will be 16 cores on a chip. These must be really simple cores to run 16 cores and stay inside a thermal barrier.

So it's not unreasonable to assume that Intel is using an Elbrus-type compiler with Larrabee.

It makes sense for Intel to try tera-scale on a PCI-E slot before moving everything to the CPU socket.

Make no mistake about it: if it's a tera-scale core, the odds of it using a modified Elbrus compiler are better than 90%.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I liken these threads to the process of "brainstorming".

We bring up ideas left and right just to quickly assess what options are even possible for the future to hold, we aren't too worried just yet about rank sorting these ideas for plausibility.

Nemesis brought up the very valid point that for over a year before the Barcelona/Phenom release, folks partook in rampant and unbridled speculation on K10 performance and how AMD was going to get it. Those threads over the months were a lot of fun, certainly more fun than K10 itself ever turned out to be.

So there is no harm in giving Nehalem its dues and letting the time pass with some idle speculation on just how crazy the Nehalem engineers (where is TuxDave?) are going with this new architecture.

I agree that I personally would place the odds of Elbrus ever having anything to do with the desktop Nehalem as being fairly low. But then again I have zero insight into Intel's decision-making strategy, and those guys may be planning some brilliantly clever stuff that admittedly I will never even be able to speculate about, because I'm just not smart enough to do so.

I would rather operate assuming the Intel guys are more clever than me than do the opposite and assume I know more than they do about why they would or would not do something like moving the desktop onto an EPIC microarchitecture. All I know is that it is not impossible.

But if we can't let our imaginations run around with some sugar plums dancing in our heads, then this forum is going to get rather unexciting here for a while as we wait another 10 months for Nehalem to release.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: IntelUser2000
Intel claimed that SMT in the P4 added up to about 20% extra throughput performance for about 5% bigger core die area. I believe the CMT in Montecito (Itanium) provides a similar speed boost. However, IBM has also implemented SMT in their Power processors and their claimed speedup is much larger than that. IBM's version of SMT also added more than 5% extra die area to the core, so don't think of that as some kind of universal constant for SMT. It's just a matter of how many resources on the CPU they wish to expand/duplicate before they run into die area limitations with regards to economics etc (you gotta manufacture these things for a reasonable cost after all!).

It said Hyperthreading added less than 5% to the die size.

http://www.research.ibm.com/jo...l/rd/494/sinharoy.html

IBM says SMT adds 24% to the core area.

CMT in Montecito takes less than 3% of the die. IBM's Power 5 added substantially more to improve the performance of SMT, while Intel went with the minimum feature approach.

Excellent info! Thanks for the numbers!

Anyone know how this compares to SUN's Niagara and Niagara2 integer threading per core technique?

Niagara2 has 8 integer threads per core :Q and I don't think they did that by simply making the core 8X bigger than it would have otherwise been.
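As a rough back-of-the-envelope comparison of the figures quoted above, you can look at claimed throughput gain per percent of core area spent. Note these are vendor-claimed numbers, not measurements, and IBM's speedup is only described as "much larger" than Intel's 20%, so the POWER5 figure below is a placeholder assumption:

```python
# Throughput-per-area comparison using the figures quoted in this thread.
# These are vendor-claimed numbers, not independent measurements.
designs = {
    "P4 Hyper-Threading": {"speedup_pct": 20, "area_pct": 5},
    # IBM only says "much larger" speedup; 35% here is a placeholder guess.
    "POWER5 SMT":         {"speedup_pct": 35, "area_pct": 24},
}

for name, d in designs.items():
    efficiency = d["speedup_pct"] / d["area_pct"]  # % throughput per % core area
    print(f"{name}: {d['speedup_pct']}% gain for {d['area_pct']}% area "
          f"-> {efficiency:.1f}% throughput per 1% area")
```

By this crude metric the minimal-area approach looks more "efficient" per transistor, but the absolute gain is what IBM was after with its larger investment.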
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I was waiting for someone to ask about Sun. Elbrus and Sun did a deal a long time ago.

Sparc=Elbrus

In March 2004, however, Elbrus MCST said it had begun sampling the MCST R-500, a Sparc-compatible chip clocked at 450-500MHz. Intel already has a patent cross-license with Sun Microsystems Inc., however, the designer of the SPARC line.

 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
From Wikipedia: http://en.wikipedia.org/wiki/Hyperthreading

"Intel claims up to a 30% speed improvement compared against an otherwise identical, non-simultaneous multithreading Pentium 4. The performance improvement seen is very application-dependent, however, and some programs actually slow down slightly when Hyper Threading Technology is turned on. This is due to the replay system of the Pentium 4 tying up valuable execution resources, thereby starving the other thread. (The Pentium 4 Prescott core gained a replay queue, which reduces execution time needed for the replay system, but this is not enough to completely overcome the performance hit.) However, any performance degradation is unique to the Pentium 4 (due to various architectural nuances), and is not characteristic of simultaneous multithreading in general."

1. Nehalem!=Netburst
2. Theory is nice, but the goals have to be within the limits of the engineering teams and resources. Even Intel does not have unlimited resources. Look at how Netburst screwed up. Elbrus E2K etc. promised amazing performance, but Anandtech needs to benchmark it to believe it :). Same goes for Phenom, Prescott, etc...

 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Netburst != Elbrus. And if we look at the influence Elbrus tech has already had on Intel products, we would have to look at Macro-Fusion, EFI, and whatever else has been done to compilers already. Elbrus has already had a huge effect. I want to see the EPIC compiler version for desktop.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Nemesis 1
I was waiting for someone to ask about Sun. Elbrus and Sun did a deal a long time ago.

Sparc=Elbrus

In March 2004, however, Elbrus MCST said it had begun sampling the MCST R-500, a Sparc-compatible chip clocked at 450-500MHz. Intel already has a patent cross-license with Sun Microsystems Inc., however, the designer of the SPARC line.

Edit: Meant to include this - Niagara and Niagara 2 "source code" are public domain, meaning anyone, including Intel, can pick thru them and quite readily figure out what SUN did and how they did it to make these multi-threaded beasts. No reverse engineering necessary; SUN just basically gave it away in hopes of increasing the chances of people writing more software for the architecture.

In computer science we would actually re-order that as:

E2K(Elbrus) = Sparc

Your ordering implies that Sparc took its designs from Elbrus, but it was Elbrus who licensed tech from SUN. You likely meant to imply this, but just be aware the ordering (left to right) matters to folks who are programmers and computer scientists when they read your post.

I got to work on every Sparc processor, and Niagara, that TI fabbed for SUN. The Niagara2 was simply an amazing piece of silicon to behold and attempt to yield. No idea if the performance sucks balls or not, haven't seen anything performance related that wasn't generated by SUN (caveat emptor applies)

Originally posted by: Nemesis 1
I want to see the epic compiler version for desktop.

I hear you, but I'm still failing to understand how or why this would impact desktop users.

Is the compiler used to re-compile existing programs (say Povray or Crysis) or is it used to improve the logic design and layout of the Nehalem chip circuits themselves?

You probably posted it already, but what exactly are you thinking the Elbrus compiler would do for Nehalem? I am just not getting it yet.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I won't debate the Sparc arch. But many people are not sure who did what.

A series of Elbrus computers have been produced by Babaian's team. For example, the Elbrus-3 was built using an architecture that is called Explicitly Parallel Instruction Computing (EPIC).

From 1992 to 2004, Babaian held senior positions in the Moscow Center for SPARC Technology and Elbrus International. In these roles he led the development of Elbrus2000 (single-chip implementation of Elbrus-3) and Elbrus90micro (SPARC computer based on domestically developed microprocessor) projects.


Elbrus on Nehalem, I have no idea. You're missing what I said. We should see the EPIC compiler (Elbrus) used with terascale: simple in-order processors capable of 2 threads per core. Terascale = Larrabee.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
You asked what Intel would gain from going to the EPIC Elbrus compiler. I bolded the important stuff.

Most interesting, though, is the E2K compiler technology which allows it to run X86 software. This is exactly the sort of technology Intel needs, and since last year they have had access to it and employ many of its designers.

So, Intel has access to VLIW technology from the Itanium and HP as well as the translation software from DEC. Most importantly it has the highly advanced technology from Elbrus which has been in development since the 1980s.

The New Architecture
To reduce power you need to reduce the number of transistors, especially ones which don't provide a large performance boost. Switching to VLIW means they can immediately cut out the hefty X86 decoders.
Out-of-order hardware will go with it, as it is huge, consumes masses of power, and in VLIW designs is completely unnecessary. The branch predictors may also go on a diet or even get removed completely, as the Elbrus compiler can handle even complex branches. With the X86 baggage gone the hardware can be radically simplified - the limited architectural registers of the x86 will no longer be a limiting factor. Intel could use a design with a single large register file covering integer, floating point and even SSE; 128 x 64-bit registers sounds reasonable.
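A minimal sketch of what "the compiler does the scheduling" means in a VLIW design: instead of out-of-order hardware reordering instructions at runtime, the compiler packs independent operations into fixed-width instruction words at compile time. This toy greedy scheduler (purely illustrative, not Elbrus's or Itanium's actual algorithm) packs ops into 3-slot bundles, only issuing an op once everything it depends on sits in an earlier bundle:

```python
# Toy list scheduler: pack ops into 3-wide VLIW bundles, honoring dependencies.
# Illustration of static scheduling only, not any real Elbrus/Itanium algorithm.
def schedule(ops, deps, width=3):
    """ops: list of op names; deps: {op: set of producer ops it depends on}."""
    bundles, placed = [], set()
    remaining = list(ops)
    while remaining:
        bundle = []
        for op in list(remaining):
            # An op may issue only if all its producers completed in an
            # earlier bundle; ops within one bundle execute in parallel.
            if len(bundle) < width and deps.get(op, set()) <= placed:
                bundle.append(op)
                remaining.remove(op)
        placed |= set(bundle)  # results become visible to later bundles
        bundles.append(bundle)
    return bundles

ops = ["load a", "load b", "add c=a+b", "load d", "mul e=c*d", "store e"]
deps = {"add c=a+b": {"load a", "load b"},
        "mul e=c*d": {"add c=a+b", "load d"},
        "store e": {"mul e=c*d"}}
print(schedule(ops, deps))
```

The three independent loads land together in the first bundle, while the add, multiply, and store each wait for their producers. All of that reordering happened before the chip ever sees the code, which is why the out-of-order machinery can go.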

With the Elbrus compiler Intel is no longer limited by its hardware. Intel can change hardware as needed.