Nehalem

Page 4

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Well, I don't care. I saw you quoted Munky. So explain what you meant by that video and how it had anything to do with my family's Xmas.


JAG87, I call BS. Just go to the Skulltrail thread. Anything and everything you said has been proven wrong.
 

Comdrpopnfresh

Golden Member
Jul 25, 2006
1,202
2
81
Originally posted by: Idontcare
Originally posted by: Nemesis 1
If you open the link, second point, it says this: 1.1x-1.25x performance increase in single-threaded apps = 10%-25% increase. Of course it's at the same clock. How else would Intel do it?

You are missing the point I was attempting to communicate.

If Intel sells a 3GHz Nehalem and you run a single-threaded application then the chip is going to automatically "turbo mode" the core running the single-threaded app (if you believe the marketing hype) to something >3GHz.

So if you compare single-threaded performance of a "3GHz" Bloomfield chip (albeit running 1 core at 3.5GHz and the remaining 3 cores at 2GHz to fit into its power envelope) to a 3GHz Yorkfield, are you comparing "clock to clock"? No, you won't be.

So my question is how much of that 15-25% single-threaded performance boost is from the CPU up-clocking the loaded core by 15-25% versus how much is actually going to come from IPC improvements?

And if that is the case, what happens if I load a Bloomfield with 4 instances of a single-threaded application and turn them all on at once?

Because of TDP restrictions the chip won't operate any of the cores in turbo mode (as they are all fully loaded with single-threaded apps), so will I still see a 15-25% performance boost in my single-threaded apps on Bloomfield relative to loading a Yorkfield in similar fashion?

Have I clearly communicated my question now?
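As a rough sketch of the decomposition being asked about here, using only the hypothetical 3.0 GHz base / 3.5 GHz turbo figures from the example above (illustrative numbers, not anything Intel has published):

# How much of a single-threaded gain could come from the turbo clock alone,
# versus what is left over for IPC, using the example figures from the post.
base_clock = 3.0      # GHz, the advertised "3GHz" part (example number)
turbo_clock = 3.5     # GHz, the one loaded core after up-clocking (example number)
observed_gain = 1.20  # a mid-range value of the claimed 1.15x-1.25x boost

clock_gain = turbo_clock / base_clock           # ~1.17x from frequency alone
implied_ipc_gain = observed_gain / clock_gain   # what is left to attribute to IPC
print(f"clock: {clock_gain:.2f}x, implied IPC: {implied_ipc_gain:.2f}x")
# -> clock: 1.17x, implied IPC: 1.03x

On those assumed numbers, nearly all of the advertised single-threaded boost could be explained by the turbo clock, which is exactly the ambiguity the question raises.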

Originally posted by: Nemesis 1
Then you wrote this.

Because of hyperthreading you can get up to 2X more threads operating in parallel - so multi-threaded performance could potentially double on Bloomfield over Penryn simply because Bloomfield supports 8 threads and Penryn supports 4.

Exactly, and it scales from 1.2x-2x multi-threaded arch improvement. First point on slide 1 = 20%-100% improvement in multi-threaded arch.

If the multi-threaded performance increase is "at best" a linear extrapolation of the number of threads on the chip despite the chip architecture changing from Yorkfield to Bloomfield then that further suggests there is little to no IPC improvement per thread.

If they doubled the threads AND increased the IPC per thread AND integrated the memory controller then I would expect the upper end of the improvement range to be >2X and not just simply listed as "up to" 2X.
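The same arithmetic, in miniature, for the multi-threaded claim (again just assumed numbers to illustrate the reasoning, not Intel data):

# If the advertised ceiling is "up to 2x" and the thread count alone doubles (4 -> 8),
# the per-thread improvement implied at that ceiling is roughly:
peak_speedup = 2.0
thread_ratio = 8 / 4
print(peak_speedup / thread_ratio)   # 1.0 -> no per-thread gain implied at the ceiling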

I think you're disregarding the turbo-clocking of the single-threaded core. That's not an add-on; it is an architecturally ingrained feature to increase performance. So why shred it and look at the core itself at a steady speed against a Penryn? They're both the same process size, and I don't think the depth of the pipelines will have changed much, so parallel pipes would lead to better multi-threaded apps, and you should expect the same single-thread throughput. But the turbo clocking is part of the chip, so why want to strip it out? In that case, my Athlon X2 with one core disabled runs things almost as fast as my 750MHz Thunderbird overclocked to 1GHz. But that's senseless; the Thunderbird is much slower.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I think turbo mode is great. But if Intel is using that to increase performance at the same clock as a reference for the performance increase on Nehalem, it's a little deceiving, because turbo works on single threads only.

As far as shutting down the other cores and increasing single-threaded performance goes, that was intended for notebooks, to save the battery.


I just read the new AT article, and the Penryn mobile part has the feature: turbo mode.

Read it .
 

JAG87

Diamond Member
Jan 3, 2006
3,921
3
76
Originally posted by: Nemesis 1
Well, I don't care. I saw you quoted Munky. So explain what you meant by that video and how it had anything to do with my family's Xmas.


JAG87, I call BS. Just go to the Skulltrail thread. Anything and everything you said has been proven wrong.


Ya? What are you referring to, the fact that it supports SLI? Big deal, that's one thing I got wrong. Nvidia patched up the Tylersburg chipset so it can run SLI with a BR02, wawaweewa. Skulltrail is still an overpriced benchmark breaker and nothing more. Who would pay 600 dollars for a motherboard and 3000 bucks for a pair of processors, both of which will be obsolete in less than a year, both performance- and socket-wise? And what are you going to do with it after? Who are you going to sell it to? Who is going to be stupid enough to buy that once Nehalem is out, and how low will you have to price it to be able to sell it? How much money will you lose?

Let's not start this conversation again, please.

 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
It's too close to a server platform. Dual socket means less CPU overclocking, and FBDIMM = expensive trash. Plus I really doubt Nvidia will be making a chipset for that, so no SLI.

Frankly, if I needed Skulltrail I could buy it today. Just grab a Supermicro or Tyan board with dual sockets, a couple of Clovertown Xeons, 8 gigs of FBDIMM, and I'm ready to go. But do I need that? NO. Do I want it? NO. If I had a business that depended on multithreading and large memory amounts (such as graphic design), then of course. But I don't; I power my PC to surf the web, chat, watch audio/video, and play games.

I want the best of everything, and Skulltrail is definitely not it. Heck, AMD's QuadFather is a better solution; at least it uses the 680a chipset and supports SLI. But Nvidia is too strict with their chipsets and their platforms. Yes, you can make the argument that you can run Crossfire with Skulltrail, sure thing, but you're gonna spend all that money and still not have the best? Heck no. When they put 8 cores on 1 CPU and Nvidia makes an SLI chipset for it, I will buy. Or the folks in Santa Clara could be less strict about their SLI license...


Perhaps you did not understand what I said.

What is there in Skulltrail that we cannot do already? Pretty much nothing. You think that faster, overclockable memory is going to bring tangible speed increases?

We can already build dual-processor systems, and yes, even though they cannot be overclocked, believe me, Skulltrail will not overclock well either. You have no idea what it means to run 8 CORES at high speed. First of all, you will generate so much heat that simple air cooling inside a desktop tower will not suffice; if you think water cooling is a better idea, well, prepare to buy 2 CPU blocks and 2 rads. Second of all, let's not forget that ALL 8 CORES need to be stable at the given speed. Being in 2 different CPUs, the cores will have very significant temperature differences and very different voltage requirements to work at a certain speed. Yes, the chips are sold in pairs, but that doesn't guarantee they are identical. And if you decide to touch the FSB (despite the unlocked multiplier), you now have 2 sets of NB, SB and FSB voltages that need to be stable.

All I have to say is good luck. Multi-processor systems are not meant for the enthusiast; they are meant for the servers that run 24/7 at stock speeds with no problems. Whether Intel makes the RAM overclockable or unlocks the multiplier on the CPUs will make very little difference. You guys just don't see the troubles that will come with such a system. You just look at the potential.

- AMD 4x4 will be better
Actually it is already out on the market, and it supports SLI on the 680a chipset. And when you drop in two Agena FX chips in there, you never know what might happen. So I don't see how you can make an argument that a "future" dual Yorkfield machine is better than a dual K8 machine. Gee, I had to go to university to figure out that an architecture from 2005 wouldn't stand a chance against one from 2008.


JAG87, I took the liberty of bolding everything false you said, with the exception of the last one. 4x4 was a product, but AMD scrapped it after its miserable showing. And seeing the great performance of Skulltrail, there is more I could have bolded, but that would have been nitpicking.

 

JAG87

Diamond Member
Jan 3, 2006
3,921
3
76
Originally posted by: Nemesis 1
It's too close to a server platform. Dual socket means less CPU overclocking, and FBDIMM = expensive trash. Plus I really doubt Nvidia will be making a chipset for that, so no SLI.

Frankly, if I needed Skulltrail I could buy it today. Just grab a Supermicro or Tyan board with dual sockets, a couple of Clovertown Xeons, 8 gigs of FBDIMM, and I'm ready to go. But do I need that? NO. Do I want it? NO. If I had a business that depended on multithreading and large memory amounts (such as graphic design), then of course. But I don't; I power my PC to surf the web, chat, watch audio/video, and play games.

I want the best of everything, and Skulltrail is definitely not it. Heck, AMD's QuadFather is a better solution; at least it uses the 680a chipset and supports SLI. But Nvidia is too strict with their chipsets and their platforms. Yes, you can make the argument that you can run Crossfire with Skulltrail, sure thing, but you're gonna spend all that money and still not have the best? Heck no. When they put 8 cores on 1 CPU and Nvidia makes an SLI chipset for it, I will buy. Or the folks in Santa Clara could be less strict about their SLI license...


Perhaps you did not understand what I said.

What is there in Skulltrail that we cannot do already? Pretty much nothing. You think that faster, overclockable memory is going to bring tangible speed increases?

We can already build dual-processor systems, and yes, even though they cannot be overclocked, believe me, Skulltrail will not overclock well either. You have no idea what it means to run 8 CORES at high speed. First of all, you will generate so much heat that simple air cooling inside a desktop tower will not suffice; if you think water cooling is a better idea, well, prepare to buy 2 CPU blocks and 2 rads. Second of all, let's not forget that ALL 8 CORES need to be stable at the given speed. Being in 2 different CPUs, the cores will have very significant temperature differences and very different voltage requirements to work at a certain speed. Yes, the chips are sold in pairs, but that doesn't guarantee they are identical. And if you decide to touch the FSB (despite the unlocked multiplier), you now have 2 sets of NB, SB and FSB voltages that need to be stable.

All I have to say is good luck. Multi-processor systems are not meant for the enthusiast; they are meant for the servers that run 24/7 at stock speeds with no problems. Whether Intel makes the RAM overclockable or unlocks the multiplier on the CPUs will make very little difference. You guys just don't see the troubles that will come with such a system. You just look at the potential.

- AMD 4x4 will be better
Actually it is already out on the market, and it supports SLI on the 680a chipset. And when you drop in two Agena FX chips in there, you never know what might happen. So I don't see how you can make an argument that a "future" dual Yorkfield machine is better than a dual K8 machine. Gee, I had to go to university to figure out that an architecture from 2005 wouldn't stand a chance against one from 2008.


JAG87, I took the liberty of bolding everything false you said, with the exception of the last one. 4x4 was a product, but AMD scrapped it after its miserable showing. And seeing the great performance of Skulltrail, there is more I could have bolded, but that would have been nitpicking.


Everything you bolded was stated based on the assumption that SLI was not supported. Just read all the bold statements; it's clear that they were said based on the assumption that SLI was not going to be on Skulltrail.

The one thing you did bold which has nothing to do with SLI is overclockability, and I stand by my statement. Unless you are willing to water cool both processors, you can keep dreaming that they will clock above 4 GHz. And even then, I will be laughing in your face when 1 out of 8 cores craps out above 4 GHz. Everyone knows that not all cores are the same, and they all have different tolerances, and these chances double when you go from 4 cores to 8 cores.

So yeah, when you have a Skulltrail system running at or above 4 GHz, 24/7 stable, not just Pi-stable like the people on XS, then you can come back and tell me I was wrong about overclocking. And even then, I'll still laugh at you, because you paid 3000 dollars to get twice the performance of my 1000 dollar processor (and only in multi-threaded apps). In games you will have the same performance, since games don't even use 4 cores, never mind 8.

Nemesis, don't start this discussion with me again; you don't have the grounds at all to win this argument. You can argue that some people are rich and all they care about is having the best, and I'll tell you that's great, nobody cares. These machines are proof of concept, proof that something can be done and that we have the means to achieve higher performance, but they are neither practical nor useful in the long term.
 

jones377

Senior member
May 2, 2004
463
64
91
Not wanting to get involved in your private forum war (which you really should take to PM!) but I gotta ask you JAG87, is your system stable 24/7?
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Originally posted by: BrownTown
Well actually FWIW Skulltrail DOES have SLI, but it is still incredibly overpriced for only minimal performance gains

Does overclock decently though:

http://www.xtremesystems.org/f...howthread.php?t=169421

Ya, it does. It is over the top. But I can't think of one person that wouldn't shell out $2,000 for this if they could get it.

As far as performance, you don't know; it depends on a lot of things.

Back to performance. Here's what Movieman said, among many other things. You read the thread, so you know.


It's also the WR in Cinebench by SO much it's not even funny!
Maybe a good 10,000 points over any other known machine.
 

JAG87

Diamond Member
Jan 3, 2006
3,921
3
76
Originally posted by: jones377
Not wanting to get involved in your private forum war (which you really should take to PM!) but I gotta ask you JAG87, is your system stable 24/7?


I wouldn't make certain statements if I wasn't going to back them up. Of course it is; it's actually 24/7 stable at 4.2 GHz too, but I choose not to run 4.2 because it needs a bit too much voltage for my liking and the temperatures get a bit scary even on water cooling.


Originally posted by: Nemesis 1
Originally posted by: BrownTown
Well actually FWIW Skulltrail DOES have SLI, but it is still incredibly overpriced for only minimal performance gains

Does overclock decently though:

http://www.xtremesystems.org/f...howthread.php?t=169421

Ya, it does. It is over the top. But I can't think of one person that wouldn't shell out $2,000 for this if they could get it.

As far as performance, you don't know; it depends on a lot of things.

Back to performance. Here's what Movieman said, among many other things. You read the thread, so you know.


It's also the WR in Cinebench by SO much it's not even funny!
Maybe a good 10,000 points over any other known machine.


Wawaweewa, it's 10,000 points ahead in a multi-threaded benchmark. Wooooopie.

Proof of concept, nothing more. Unless you encode video 24/7. And if you do, why waste money on Skulltrail? You might as well get a dual Clovertown system or a future dual Xeon Yorkfield, which I'm sure will be cheaper than Skulltrail.
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Originally posted by: JAG87
Wawaweewa, it's 10,000 points ahead in a multi-threaded benchmark. Wooooopie.

Proof of concept, nothing more. Unless you encode video 24/7. And if you do, why waste money on Skulltrail? You might as well get a dual Clovertown system or a future dual Xeon Yorkfield, which I'm sure will be cheaper than Skulltrail.

Really, you can say the same thing for almost any benchmark you choose. No CPU is gonna bottleneck you posting on ATOT or writing in Word or anything like that, and any decent CPU isn't gonna be the bottleneck in gaming at a real resolution (the GPU is by far more important here). So unless you do huge amounts of rendering or encoding, there is no reason to buy even a single quad core; you would notice no difference over the $216 dual core unless you were running benchmarks anyway. Personally, I get along just fine with my one Banias core at 1.6GHz; it runs everything I need it to as fast as I need it run.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
JAG87, now that I know I can't ever work again, I had to adjust my thinking on buying a Skulltrail system. I can afford it, but I have to think about other people now. I need to think about my kids' and grandkids' future. But I still want an ass-kicking Intel machine to ride me out. So I will do the smart thing, as you said: get the Extreme Nehalem, load her up with Larrabees, and call it a day.

BUT! That's not to say Skulltrail isn't a great product. But it's a niche product.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
Originally posted by: Idontcare
Increasing threads/core to >1 without inducing a thread performance penalty is not new. Niagara processors do it, Power6 as well.

I would expect POV-Ray (the multi-threaded beta) performance to scale linearly with the number of available threads.

So, if IPC per thread is not improved in Nehalem versus Penryn, then I'd expect a Bloomfield to perform 2X as fast as a Yorkfield "clock-for-clock", unless the new and improved hyperthreading in Bloomfield is in fact really crappy and does turn out to introduce a performance penalty to the 2nd concurrent thread running on a given core...
I'm not sure how you can suggest that it will be able to run 2x the number of threads simultaneously without increasing the execution units 2x.

Multiple threads run on spare execution units. If one thread is running, then it is using at least 1 execution unit, possibly more. Thus, there are at best (total execution units - 1) available for processing the second thread, so there has to be at least some penalty for processing multiple threads at the same time. Execution units are limited and are not free.

I would expect at best a modest speedup due to SMT, perhaps on the order of 25%, much like NetBurst. Not something along the lines of 2x; that's impossible.
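A toy issue-slot model makes the shape of this argument concrete; the port count and per-thread ILP below are made-up illustrative values, not Nehalem's actual figures:

# Toy model of SMT: the core can issue at most `ports` micro-ops per cycle,
# and a single thread only finds `ilp` independent ops per cycle on its own.
def throughput(ports, ilp, threads):
    # Combined demand is threads * ilp, but it can never exceed the issue width.
    return min(ports, threads * ilp)

ports = 4    # assumed issue width (illustrative)
print(throughput(ports, 1.5, 2) / throughput(ports, 1.5, 1))  # 2.0: lots of idle slots, big SMT gain
print(throughput(ports, 3.5, 2) / throughput(ports, 3.5, 1))  # ~1.14: nearly saturated core, small gain

Whether SMT lands nearer 2x or nearer the ~25% NetBurst-like figure depends entirely on how many issue slots a single thread already keeps busy.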


 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
Originally posted by: Nemesis 1
I love what Swinburn wrote here.

Personally, after writing this I'm actually quite concerned about Intel's positioning - I'm worried that now Intel is the current preferred product over AMD, it'll use this leverage to try and suck out as much cash from enthusiasts as possible, and not have them overclock lower parts, like the E6300, Q6600 etc, to perform like £700 CPUs. At the same time, it's also potentially limiting the availability of multi-GPU by its competitors by forcing the separate north bridge, which offers better performance, to potentially only be available onto Bloomfield CPUs. It seems all the cards are in Intel's hands to deal precisely how it wishes.

I don't understand this kind of thinking. It looks to me like Intel is addressing every segment in a way that will be cost-effective for all sectors, top, middle and bottom, effectively locking AMD and NV out of Intel's low-end parts. So Intel is making sure this is an Intel-chipset-only platform. I love it. In the midrange, only one 16x PCI-E or two 8x PCI-E slots.

This is good for the midrange. Anyone see a problem here? I sure don't.

Then the high end. I can't wait. I hope Larrabee works like I am hoping as far as scaling goes. No way do I expect Larrabee to outperform NV or ATI top-end cards, but I am hoping that you can install 4 cards and get better scaling than either SLI or XF.

It is really going to be nice knowing that if you put out the $$$$ for the high-end desktop parts, the cheaper mid- and low-end parts won't be able to equal the performance. It's about time.

Really, what's it matter? As long as Intel's low end stomps on AMD's top end, it's all good.

I really don't see how you can be for limited market competition (== higher prices), and limited consumer choice (constrained by the features of the socket chosen), and limited overclocking potential (since the difference between low-end and high-end parts is more than just clockspeed now).

Do you hate everyone, or just love Intel so much that you can't help but praise them when they plan on screwing everyone else?

That's taking "fanboy" to a new level of sincerity and sickness.
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Originally posted by: VirtualLarry
I'm not sure how you can suggest that it will be able to run 2x the number of threads simultaneously without increasing the execution units 2x.

Multiple threads run on spare execution units. If one thread is running, then it is using at least 1 execution unit, possibly more. Thus, there are at best (total execution units - 1) available for processing the second thread, so there has to be at least some penalty for processing multiple threads at the same time. Execution units are limited and are not free.

I would expect at best a modest speedup due to SMT, perhaps on the order of 25%, much like NetBurst. Not something along the lines of 2x; that's impossible.

At the risk of being shot down by dmens again over my knowledge of computer architecture, I will try to address this point. Making the core wider is not necessarily the solution to making SMT work better, nor is it entirely correct to assume a wide core is very amenable to SMT. A very deeply pipelined processor can also benefit just as much from SMT, because there are openings to run instructions not just in parallel in a wide core, but also in series in a narrow but deeply pipelined core. To put that a better way: in a deep pipeline you are issuing instructions at a faster pace (this assumes the deeper pipeline is running at a higher clock speed, of course), but the instruction delays and memory delays are just as long, so in terms of the number of clock cycles the delays are longer than in a shorter pipeline. Because of this there is the ability to run two threads together even in a very narrow pipeline, but in a more serial approach instead of in parallel; this was of course the idea behind the P4's hyperthreading.

As for the expected speedup, there are several things that can slow you down considerably. One issue is the fact that the cache is now shared between twice as many threads as before, meaning each thread has half as much cache (on average). One problem I remember hearing about with the P4 (also a reason single-threaded performance was actually HURT when running HT) was that the two threads would not "play nice"; that is to say, a thread that was getting minimal usage might start kicking cache entries out that were being used by another thread, meaning more hits to memory. To my knowledge the Core microarchitecture, and almost certainly in turn Nehalem, was designed to allow dynamic allocation of the shared cache between threads/cores. Hopefully, if this is done intelligently in Nehalem, it can mean that the "main" thread in a game or other application with unbalanced thread usage will get more cache allocated to it than a minor thread running in the background, and therefore run better. I don't have any real clue how that would be accomplished, but with a cache shared between 8 threads (with often only 1-2 being used very much) there is a huge advantage to be gained by assigning the cache intelligently. Conversely, there is a huge penalty if a main thread continues to have its cache lines overwritten by a bunch of minor threads whose performance is of no real significance to the overall performance of the application.
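A tiny simulation of that "not playing nice" effect: one thread with a small, hot working set sharing an LRU cache with a streaming thread that touches each line only once (the cache size and access patterns are made up purely for illustration):

from collections import OrderedDict

def hot_thread_hit_rate(cache_lines, share_with_streamer):
    cache = OrderedDict()        # stands in for a shared cache with LRU replacement
    hot_set = list(range(48))    # thread A: small working set, reused constantly
    hits = misses = 0
    stream_addr = 100_000        # thread B: touches each line once and moves on

    def touch(addr):
        if addr in cache:
            cache.move_to_end(addr)
            return True
        cache[addr] = None
        if len(cache) > cache_lines:
            cache.popitem(last=False)   # evict the least-recently-used line
        return False

    for i in range(20_000):
        if touch(hot_set[i % len(hot_set)]):
            hits += 1
        else:
            misses += 1
        if share_with_streamer:
            stream_addr += 1
            touch(stream_addr)
    return hits / (hits + misses)

print(hot_thread_hit_rate(64, False))  # working set fits: hit rate ~1.0
print(hot_thread_hit_rate(64, True))   # streamer evicts the hot lines: hit rate ~0.0

Dynamic (rather than strictly halved) allocation of a shared cache is essentially a way of protecting the hot thread's lines from that kind of eviction.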

Personally, of all the things in Nehalem, the SMT and cache usage is what most intrigues me. The re-addition of SMT is something which is very interesting to me, because it promises increased performance for only a minimal increase in die size when compared to adding additional cores.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
You are correct, SMT is yet another attempt to utilize resources that would otherwise stay idle, but consider the possibility of false work on one thread preventing real work by the other thread. Also note that Nehalem has three levels of cache.
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Originally posted by: dmens
You are correct, SMT is yet another attempt to utilize resources that would otherwise stay idle, but consider the possibility of false work on one thread preventing real work by the other thread. Also note that Nehalem has three levels of cache.

As for the "false work preventing real work", that was another thing often cited for the reduction in single threaded performance when running HT, the P4 had a very "optimistic" scheduler that would often issue instructions before all requisite data was obtained, this would then activate the "replay system" stalling one of the execution units until the data required was obtained and keeping the other thread from using the execution unit.

Well anyways, hopefully Intel learned from the P4 and actually produces a useful SMT implementation this time around.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: VirtualLarry

Do you hate everyone, or just love Intel so much that you can't help but praise them when they plan on screwing everyone else?

That's taking "fanboy" to a new level of sincerity and sickness.

What's he got to lose that he hasn't already? I get the feeling his worst nightmare is that one of his grandkids will "come out" and tell the family he got a job working for AMD in Dresden...
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: VirtualLarry
I'm not sure how you can suggest that it will be able to run 2x the number of threads simultaneously without increasing the execution units 2x.

The point is that if you start adding caveats then you can argue yourself into winning anything.

Of course what you say is true; I am not saying anything different. What I am saying is that Intel has not said otherwise. And they are the ones who have the authority to add the caveats, which they haven't yet.

Ergo my question still stands: is the "up to 2X" increase in multi-threaded performance coming from IPC (per-thread) improvements or just simply from increasing the number of threads via SMT?

After all anytime you say "up to 2x" that could of course mean "expect 1.1X because our SMT implementation sucks more German balls than Cartman's mom".

My point is they aren't even leaving the door open for a >2X performance increase; they don't say "more than 2X". So with SMT and with IPC improvements and with clockspeed improvements, why are they only willing to guesstimate "up to 2X"? I am not getting too impressed yet with Nehalem.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Originally posted by: dmens
You are correct, SMT is yet another attempt to utilize resources that would otherwise stay idle, but consider the possibility of false work on one thread preventing real work by the other thread. Also note that Nehalem has three levels of cache.


I'm not sure how Intel's HT will work out, but I will make the mistake of assuming Intel has improved it considerably over NetBurst performance.

On the Nehalem three-level cache thing: from what I have seen and read, only the server parts will have L3. Could you expand on this? Also, if the server parts are the only ones with L3, does this statement stand up on the desktop: "Intel Nehalem will have multi-level shared cache"? Intel has done something with L1 if it is 0.5MB; I'm not sure L1 can be shared, but the cache size increase for L1 would lead one to believe something is going on. Either way, that's a healthy increase in transistors for L1.

 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Nemesis 1
Originally posted by: dmens
You are correct, SMT is yet another attempt to utilize resources that would otherwise stay idle, but consider the possibility of false work on one thread preventing real work by the other thread. Also note that Nehalem has three levels of cache.


I'm not sure how Intel's HT will work out, but I will make the mistake of assuming Intel has improved it considerably over NetBurst performance.

On the Nehalem three-level cache thing: from what I have seen and read, only the server parts will have L3. Could you expand on this? Also, if the server parts are the only ones with L3, does this statement stand up on the desktop: "Intel Nehalem will have multi-level shared cache"? Intel has done something with L1 if it is 0.5MB; I'm not sure L1 can be shared, but the cache size increase for L1 would lead one to believe something is going on. Either way, that's a healthy increase in transistors for L1.

I think you are doing the right thing in assuming Intel will have improved hyperthreading for Nehalem since the Pentium4 days.

The cache talk got me thinking though...how much of the Nehalem cache hierarchy could Intel have leveraged from their existing Itanium architecture?

If the answer is "Intel could leverage quite a bit of it, cache technology is portable between architectures with some, but acceptable, work" then maybe we should look to the latest and greatest Itanium core to see what they are using there.

Same for hyperthreading. Anyone know offhand (too lazy to google and sift at the moment) whether Itanium has hyperthreading?
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I am at AT forums. Someone has finally associated Itanium with Intel's desktop processors.

I was surprised the 4-issue core of C2D didn't make some see clearly the direction Intel is heading.

Here's a little info I copied. Let's have a peek.

Architecture

Intel has extensively documented the Itanium instruction set and microarchitecture,[22] and the technical press has provided overviews.[23][7] The architecture has been renamed several times during its history. HP called it EPIC and renamed it to PA-WideWord. Intel later called it IA-64, before settling on Intel Itanium Architecture, but it is still widely referred to as IA-64. It is a 64-bit register-rich explicitly parallel architecture. The base data word is 64 bits, byte-addressable. The logical address space is 2^64 bytes. The architecture implements predication, speculation, and branch prediction. It uses a hardware register renaming mechanism rather than simple register windowing for parameter passing. The same mechanism is also used to permit parallel execution of loops. Speculation, prediction, predication, and renaming are under control of the compiler: each instruction word includes extra bits for this. This approach is the distinguishing characteristic of the architecture.

The architecture implements 128 integer registers, 128 floating point registers, 64 one-bit predicates, and eight branch registers. The floating point registers are 82 bits long to preserve precision for intermediate results.


Instruction execution
Each 128-bit instruction word contains three instructions, and the fetch mechanism can read up to two instruction words per clock from the L1 cache into the pipeline. When the compiler can take maximum advantage of this, the processor can execute six instructions per clock cycle. The processor has thirty functional execution units in eleven groups. Each unit can execute a particular subset of the instruction set, and each unit executes at a rate of one instruction per cycle unless execution stalls waiting for data. While not all units in a group execute identical subsets of the instruction set, common instructions can be executed in multiple units. The groups are:

Six general-purpose ALUs, two integer units, one shift unit
Four data cache units
Six multimedia units, two parallel shift units, one parallel multiply, one population count
Two floating-point multiply-accumulate units, two "miscellaneous" floating-point units
Three branch units
Thus, the compiler can often group instructions into sets of six that can execute at the same time. Since the floating-point units implement a multiply-accumulate operation, a single floating point instruction can perform the work of two instructions when the application requires a multiply followed by an add: this is very common in scientific processing. When it occurs, the processor can execute four FLOPs per cycle. For example, the 800 MHz Itanium had a theoretical rating of 3.2 GFLOPS and the fastest Itanium 2, at 1.67 GHz, was rated at 6.67 GFLOPS.
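The peak-FLOPS figures quoted above follow directly from the two multiply-accumulate units; a quick check of the excerpt's own arithmetic:

# Peak GFLOPS = 2 FMA units * 2 floating-point ops per FMA * clock (GHz)
fma_units, flops_per_fma = 2, 2
print(fma_units * flops_per_fma * 0.8)   # 800 MHz Itanium    -> 3.2 GFLOPS
print(fma_units * flops_per_fma * 1.67)  # 1.67 GHz Itanium 2 -> 6.68, i.e. the quoted ~6.67 GFLOPS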


Memory architecture
From 2002 to 2006, Itanium 2 processors shared a common cache hierarchy. They had 16 KiB of Level 1 instruction cache and 16 KiB of Level 1 data cache. The L2 cache was unified (both instruction and data) and is 256 KiB. The Level 3 cache was also unified and varied in size from 1.5 MiB to 24 MiB. The 256 KiB L2 cache contains sufficient logic to handle semaphore operations without disturbing the main arithmetic logic unit (ALU).

Main memory is accessed through a bus to an off-chip chipset. The Itanium 2 bus was initially called the McKinley bus, but is now usually referred to as the Itanium bus. The speed of the bus has increased steadily with new processor releases. The bus transfers 2x128 bits per clock cycle, so the 200 MHz McKinley bus transferred 6.4 GB/s and the 533 MHz Montecito bus transfers 17.056 GB/s.[24]
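Likewise, the bus bandwidth figures are just two transfers of 128 bits per bus clock; checking the excerpt's numbers:

# Bus bandwidth = 2 transfers/clock * 128 bits (16 bytes) per transfer * bus clock
bytes_per_clock = 2 * 128 // 8                # 32 bytes per cycle
print(bytes_per_clock * 200e6 / 1e9)          # 200 MHz McKinley bus  -> 6.4 GB/s
print(bytes_per_clock * 533e6 / 1e9)          # 533 MHz Montecito bus -> 17.056 GB/s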


Architectural changes
Itaniums released prior to 2006 had hardware support for the IA-32 architecture to permit support for legacy server applications, but performance was much worse in comparison with native instruction performance and contemporaneous x86 processors. In 2005 Intel developed IA-32 EL, a software emulator that provided better performance. With Montecito, Intel removed IA-32 support from the hardware.

With Montecito, Intel made enhancements to the architecture in July 2006.[25] The architecture now includes hardware multithreading: each processor maintains context for two threads of execution. When one thread stalls due to a memory access the other thread gains control. Intel calls this "coarse multithreading" to distinguish it from "hyperthreading technology" that was used in some x86 and x86-64 microprocessors. Coarse multithreading is well matched to the Intel Itanium Architecture and results in an appreciable performance gain. Intel also added hardware support for virtualization. Virtualization allows a software "hypervisor" to run multiple operating system instances on the processor concurrently. Montecito also features a split L2 cache, adding a dedicated 1 MiB L2 cache for instructions and converting the original 256 KiB L2 cache to a dedicated data cache.



 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Nemesis 1
I am at AT forums. Someone has finally associated Itanium with Intel's desktop processors.

If it weren't for the drama we impart on ourselves here then we'd be bored to tears waiting for the industry to do something exciting to entertain us.

Originally posted by: Nemesis 1
Intel calls this "coarse multithreading" to distinguish it from "hyperthreading technology"

It would seem silly to me for Intel's decision makers to ignore the fruits of their investment in developing their "coarse multithreading" and have the Nehalem team proceed headstrong into reinventing the wheel.

I could see the Nehalem team maybe starting with a version of coarse multithreading and improving upon it even further still than what was implemented and released in Montecito.

What I am trying to get at is that there is every reason to expect that Nehalem's "SMT" should perform no worse than Itanium's "coarse multithreading", considering the latter predates the former by nearly 2 years.

Is this a reasonable expectation? If it is, then we need to find some performance analyses on the effectiveness/efficiency of Itanium's "coarse multithreading".