AMD vs Intel at the high end in the future


dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Having said that, I also realized upon revisiting AT's document that the last page of the article discusses Westmere's new instructions, and I had not considered the fact that Westmere's cores are going to contain some "bloat" as the ISA gets expanded a little bit. So 80mm^2 might be too small if one allows for the possibility of the cores themselves growing by a modest but reasonable 5mm^2 each.

Those are very minor changes.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Originally posted by: Idontcare

At any rate, Nehalem is 263mm^2 and has two QPI's that are not included in 2C westmere. However westmere will include on-die PCIe...so we aren't sure how much of the areal savings from ditching the QPI will be spent on PCIe. But let's assume what is likely a worst case and say PCIe requires the same die-space as one QPI, so the tradeout is a wash.

Sorry, I actually forgot to include SRAM in the calculations. As for the die size of Havendale, it won't have the PCI Express controller on the CPU die, because it's on the GMCH portion of the MCM. However, it'll feature one of the QPI links so the two can communicate with each other.

I was also assuming a die size estimate of 140mm2 as the low side.

http://www.chip-architect.com/...19_Various_Images.html

Take a look at the above again. The SRAM cell sizes for various Intel processes were as follows:

130nm: 2.45um2
90nm: 1.0um2
65nm: 0.57um2
45nm: 0.346um2

While 130nm to 90nm brought the greatest reduction in size for the test chips, it was actually 65nm to 45nm that brought the ideal 50% scaling. It remains to be seen whether 32nm will be optimized for area or power; from the Arrandale demo, I'd say more for the latter. Intel does seem to take advantage of the "ideal" advancement on the test vehicle, but they might use it for chips like Itanium, which needs smaller SRAM than desktop chips to keep the die reasonable.

Revised estimates: low-90s to mid-90s mm2
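For anyone who wants to reproduce the scaling math, here is a quick Python sketch (mine, not from the thread); the 32nm cell size is the 0.171um2 figure cited later in this discussion, and "ideal" scaling is 50% areal per node (a 0.70 linear shrink):

import math

# SRAM cell sizes (um^2) quoted in this thread; 0.171 for 32nm is the figure
# cited a few posts down.
cell = {130: 2.45, 90: 1.0, 65: 0.57, 45: 0.346, 32: 0.171}

nodes = sorted(cell, reverse=True)
for old, new in zip(nodes, nodes[1:]):
    areal = cell[new] / cell[old]   # new cell as a fraction of the old one
    linear = math.sqrt(areal)       # implied linear shrink factor
    print(f"{old}nm -> {new}nm: {areal:.1%} areal, {linear:.2f} linear")
# 65nm -> 45nm comes out to 60.7% areal (0.78 linear); 45nm -> 32nm to 49.4%
# areal (0.70 linear), i.e. only the last step hits the "ideal" 50% mark.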

Originally posted by: kuzi
Intel canceled Havendale (CPU+GPU) on the 45nm process but will release a 32nm version. Do you know if the GPU in those will be something new, or will it still use the same horrendous IGP stuff Intel uses in mobos?

Fundamentally, it'll be the same. That's actually GOOD news IMO, because a radically different architecture means radically new drivers. GPU-core wise, the advancement looks to be in a range similar to Extreme Graphics 2 to GMA900. They caught up to ATI/Nvidia back then; I don't see why they won't be able to do it now. Plus, the possibly faster communication from being next to the CPU could make it even better.

 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: IntelUser2000
http://www.chip-architect.com/...19_Various_Images.html

Take a look at the above again. The SRAM cell sizes for various Intel processes were as follows:

130nm: 2.45um2
90nm: 1.0um2
65nm: 0.57um2
45nm: 0.346um2

While 130nm to 90nm brought the greatest reduction in size for the test chips, it was actually 65nm to 45nm that brought the ideal 50% scaling. It remains to be seen whether 32nm will be optimized for area or power; from the Arrandale demo, I'd say more for the latter. Intel does seem to take advantage of the "ideal" advancement on the test vehicle, but they might use it for chips like Itanium, which needs smaller SRAM than desktop chips to keep the die reasonable.

Revised estimates: low-90s to mid-90s mm2

32nm sram is 0.171um^2, representing 49.4% areal scaling from 45nm sram. (a true 0.70 linear shrink factor)

65nm -> 45nm was 0.57um^2 -> 0.346um^2 which is only 60.7% areal scaling, reflecting the limitations of double-patterning with dry litho. (a paltry 0.78 linear shrink factor)

The scaling anomaly from 130nm to 90nm was brought about by Intel deferring their transition from aluminum to copper BEOL to the 90nm node (whereas all other logic IDMs made the transition at 130nm, excepting TI, which actually made the jump at 180nm in a very painful and needlessly risky move).

This delay required the metal pitch at 130nm to be considerably larger than it would otherwise have been if copper BEOL were used at 130nm. Once copper BEOL was deployed at 90nm they were able to take advantage of the technology to get their SRAM cell size back on track, which then artificially inflated the node-to-node SRAM scaling.

Intel uses the SRAM they build test vehicles for; it's never not been the case.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Originally posted by: Idontcare

Intel uses the SRAM they build test vehicles for; it's never not been the case.

I remember back in the Pentium 4 days analyzing the area taken up by the 1MB L2 cache on the 90nm Prescott CPU. The numbers came out to be approximately 16mm2.

In their 90nm test vehicle presentation, their 52Mbit (6.5MByte) SRAM test die took 109.1mm2 (10.1mm x 10.8mm), which represents a per-MB density of 16.78mm2, practically what Intel used on Prescott. So I am aware they do use test-vehicle SRAM in their products.

However, Dothan's L2 density was 19.5mm2/MB on the same 90nm process. Despite that departure from the SRAM test vehicle, cache scaled better over the 65nm-to-45nm transition, which was theoretically the worse one.
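A quick sanity check on those density figures (my arithmetic, using only numbers quoted in this post):

# 90nm SRAM test die: 52 Mbit on a 10.1mm x 10.8mm die
die_mm2 = 10.1 * 10.8          # 109.08 mm^2
megabytes = 52 / 8             # 52 Mbit = 6.5 MByte
print(f"test die: {die_mm2 / megabytes:.2f} mm^2/MB")  # ~16.78, vs ~16 on Prescott
print("Dothan L2: 19.5 mm^2/MB on the same process")   # figure quoted above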

65nm -> 45nm was 0.57um^2 -> 0.346um^2 which is only 60.7% areal scaling, reflecting the limitations of double-patterning with dry litho. (a paltry 0.78 linear shrink factor)

Not according to Merom --> Penryn, where it turned out to be 48% for SRAM. The precise reason nobody is able to estimate is that the cell size comes out different depending on whether the engineers optimize for area or for leakage. I'm not sure why it would change now, when from Banias to Penryn, over 5 generations of process technology, it was 70% for logic and 50-60% for SRAM; add to that the frantic focus on power conservation.

Fact: Nehalem-EX uses 0.384um2 6T SRAM cells, which are about 11% bigger than the usual 0.346um2 for their 45nm process. It's probably a variant of a low-power 45nm process.
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
1: Do we know if the PCI-E controller is on the 45nm die or the 32nm die for Clarkdale/Arrandale?
2: Do we expect Sandy to be drop-in compatible with Arrandale laptops? The way I'm reading this, until we move to a new memory standard Intel should be keeping the same 4 sockets (s989/s1156/s1366/s1567), like AMD has done with AM2.
3: Are we expecting Sandy (mobile) to use less power, be faster, or both? Or is it too early to tell? My laptop is my primary everyday computer; the desktop is just for gaming on. So power savings are always good.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
ilkhan, I think we have precious few details (or even rumors) on what Sandy Bridge is going to look like in terms of a transition from Westmere.

I hope someone here can chime in and provide you some answers so I too can learn some more about Sandy. About all I have heard/know about it is related to the ISA aspect and how AVX is getting worked into it all.

But sockets, memory configs, really any architecture details in general have been lacking so far.

If you think about it, a lot of these details have got to be locked in stone by now too; heck, Sandy needs to tape out in about 6 months if they plan to ship product in Nov 2010.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Originally posted by: ilkhan
Figured it was too early for sandy. How about 1?

Have you not seen the diagrams on the net? It looks like the entire GMCH (IGP/PCI Express controller/memory controller) is on a separate die in the package with Clarkdale/Havendale. It might not be concrete, but so far no one has suggested otherwise.

2: Do we expect Sandy to be drop-in compatible with Arrandale laptops? The way I'm reading this, until we move to a new memory standard Intel should be keeping the same 4 sockets (s989/s1156/s1366/s1567), like AMD has done with AM2.

We moved to DDR2 when DDR400 came out, and to DDR3 when DDR2-800 came out. With the Sandy Bridge generation possibly offering DDR3-1600, won't it be time for a new one, like DDR4?

If you think about it, a lot of these details have got to be locked in stone by now too; heck, Sandy needs to tape out in about 6 months if they plan to ship product in Nov 2010.

Yep. We'll probably know by then. :)

Although I expect the details to be pretty thin, considering how much we know now about the mainstream Nehalem family a mere 3-6 months away from launch.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: Idontcare
This AT article on Intel's stated 32nm plans contains a lot of nuggets.

I went to it to pull this up for your question regarding Clarkdale's IGP:

Thanks IDC. I had checked that article before, just wasn't really paying attention :)

So Clarkdale/Arrandale will use the same 45nm IGP that was supposed to be used in Havendale, which means pretty bad 3D performance.

Originally posted by: ilkhan
That quote is confusing. Sandy gets on-die GPU (Cougar Point), yes?

Seems to me like they just meant Sandy getting a 32nm IGP instead of a 45nm one. Most likely still an MCM design.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Kuzi
And now Windows 7 will get better Hyper-Threading support, which can make things even worse for AMD. The Bulldozer launch is so far off, and we're not even sure it will have some form of SMT built in.

We can't be sure that it will have SMT but we can do some homework to determine if we are justified in being sure that it won't have SMT.

Patent search. Specifically patent application search.

Unless AMD intends to license 100% of the necessary IP for implementing SMT (a non-public transaction which we wouldn't see), they would absolutely have patent apps in to the USPTO already. No way they'd risk having a critical patent app be refused only to then find out they have infringing IP incorporated into their BD architecture.

If BD is going to have SMT then it's already implemented, and the IP is either already licensed or patent apps have been filed.

If there are no pending patent apps for SMT by AMD then I'd personally feel 99% certain based on that information that BD will not have SMT of any form.

...I vaguely remembered coming across a forum thread about this topic over on aces...here it is: http://aceshardware.freeforums...-amd-sse5-t538-75.html

Posters Opteron, Dresdenboy, and Hans de Vries appear to be convinced, based on the patent sifting they have done, that Bulldozer will likely have at least SMT for integer processing. They are not convinced BD will support SMT for FP. (:confused: what would that look like to an OS?)

Take a look and see if you can make more sense of it than I have. My interpretation of their posts is likely to be in error, I am not an expert in architecture implementations by any means.

Originally posted by: Hans de Vries
Bulldozer's clustered multiprocessor architecture

I've always interpreted AMD's clustered multiprocessing, which they claimed as adding 80% performance for 50% extra transistors, as something like the following:

A 2-way superscalar processor can reach 80%-100% of the performance of a 3-way for lots of applications. Only a subset of programs really benefits from going to a 3-way. A still smaller subset benefits from going to a 4-way superscalar.

Now, if you still want to have the benefits of a 4-way core but also want the much higher efficiency of the 2-way cores, then you can do as follows:

Design a 4-way processor with a pipeline which can be split up into two independent 2-way pipes. In this case both threads have their own set of resources without interfering with each other. Part of the pipeline would not be split. Wide instruction decoding would alternate between both threads.

The split would be beneficial, however, for the integer units and the read/write access units to the L1 data cache. The total 4-way core could have more read/write ports, which should certainly improve IPC for a substantial subset.

The 128 bit SSE/FP units could be modified partly in connection with the read/write ports. There was some improvement but not that much when AMD almost doubled the SSE2/FP hardware going from 64 bit units in K8 to 128 bit units in the K10.

There is lots of efficiency to be gained by using two K8-like SSE/FP units which can operate independently in 2-way mode and together as a single 128-bit unit in 4-way mode. Other similar tricks can be beneficial as well.

Part of the higher IPC of Itanium is due to its multiple read/write ports to cache and its 64-bit FP units, which can work independently instead of in a "dumb" 2x64 way mode. The two independent FP units of the Itanium can be fed directly from cache thanks to all these read ports (and they can write directly to cache as well). Something like this is what you would gain in the 4-way mode, while the 2-way modes bring the efficiency in throughput computing.

Regards, Hans

Here is the link to Dresdenboy's patent search results into AMD MPU:
- clustered multithreading with 2 int clusters, each of them having:
  - 2 ALUs, 2 AGUs
  - one L1 data cache
  - scheduler, integer register file (IRF), ROB
(see 20080263373*, 20080209173, 7315935)
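If that reading of the filings is right, the module would be organized roughly as in the following sketch. This is purely illustrative Python; the names are invented, and only the unit counts come from the patent summary above:

from dataclasses import dataclass, field
from typing import List

# Illustrative layout of the clustered-multithreading scheme summarized above.
# The class/field names are mine; only the unit counts come from the filings.

@dataclass
class IntCluster:
    alus: int = 2          # two ALUs per cluster
    agus: int = 2          # two AGUs per cluster
    l1d: str = "private"   # each cluster has its own L1 data cache
    # each cluster also gets a private scheduler, integer register file and ROB

@dataclass
class Module:
    clusters: List[IntCluster] = field(
        default_factory=lambda: [IntCluster(), IntCluster()])
    shared_frontend: bool = True   # fetch/decode presumably shared by both threads

m = Module()
print(len(m.clusters), "int clusters,", sum(c.alus for c in m.clusters), "ALUs total")
# -> 2 int clusters, 4 ALUs total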
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
Originally posted by: IntelUser2000
Originally posted by: ilkhan
Figured it was too early for sandy. How about 1?

Have you not seen the diagrams on the net? It looks like the entire GMCH (IGP/PCI Express controller/memory controller) is on a separate die in the package with Clarkdale/Havendale. It might not be concrete, but so far no one has suggested otherwise.
The diagrams I've seen say MCH+GPU on the 45nm die and CPU on the 32nm die. While that suggests QPI between them and the PCI-E controller on the 45nm side, it's not definite. But as you say, it's probably on the 45nm side.

Originally posted by: Kuzi
Originally posted by: ilkhan
That quote is confusing. Sandy gets on-die GPU (Cougar Point), yes?

Seems to me like they just meant Sandy getting a 32nm IGP instead of a 45nm one. Most likely still an MCM design.
Yeah, that's what the quote seems like to me too. But an on-die GPU was supposed to be a feature of Sandy. Thus the confusion.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: ilkhan
Originally posted by: Kuzi
Originally posted by: ilkhan
That quote is confusing. Sandy gets on-die GPU (Cougar Point), yes?

Seems to me like they just meant Sandy getting a 32nm IGP instead of a 45nm one. Most likely still an MCM design.
Yeah, that's what the quote seems like to me too. But an on-die GPU was supposed to be a feature of Sandy. Thus the confusion.

That and "what" architecture/ISA will be the basis of the integrated 32nm GPU? Larrabee or not?

Here's what our Japanese friends contemplate: http://pc.watch.impress.co.jp/...html/kaigai02.jpg.html
 

Zstream

Diamond Member
Oct 24, 2005
3,395
277
136
I still think it is a bad idea to move the GPU onto the CPU right now (not that my opinion matters). Now if MCM is in the mix, you have a three-way potential for failure, with any of the failures being catastrophic to the bottom line.

Why do they not make another changeable small socket and slap it on the mobo? Obviously there are latency issues, but it still seems dangerous for a company to slap all of these onto one chip. The yields will have to be excellent.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: Idontcare
...I vaguely remembered coming across a forum thread about this topic over on aces...here it is: http://aceshardware.freeforums...-amd-sse5-t538-75.html

Sweet info IDC, I checked Dresdenboy's page with the patent info and the CPU diagram. The CPU diagram is really interesting, having two integer units with each unit having two ALUs. The K8/K10 architectures have one integer unit with three ALUs.

So the CPU has a 4-way decoder, like the Core 2/i7 architectures. And having two INT units, each with a dedicated L1 data cache, gives us a clue that it was very likely designed with the capability to run separate threads. Seems like a more brute-force approach than Intel's SMT.

Posters Opteron, Dresdenboy, and Hans de Vries appear to be convinced, based on the patent sifting they have done, that Bulldozer will likely have at least SMT for integer processing. They are not convinced BD will support SMT for FP. (:confused: what would that look like to an OS?)

These guys know what they are talking about; I agree with their assessment about Bulldozer getting SMT :D

For the FPU there has to be SMT support also, otherwise, as you say, the OS can't perceive an extra "complete" core. At least we know that the FPU in Bulldozer has to be 256 bits wide to support AVX, and if that is the case, the FPU can be designed in such a way as to run multiple 64bit or two 128bit instructions simultaneously. Just a thought here, so anyone with more knowledge correct me if I'm wrong.

Originally posted by: Hans de Vries
Bulldozer's clustered multiprocessor architecture
Design a 4-way processor with a pipeline which can be split up into two independent 2-way pipes. In this case both threads have their own set of resources without interfering with each other. Part of the pipeline would not be split. Wide instruction decoding would alternate between both threads.

Hans does a great job of hypothesizing how a Bulldozer core could run multiple threads; this method would work better than Intel's HT because each thread has its own "independent" INT unit. The concern here, though: if you have only a single thread that requires more than two instructions per clock, can both INT units be combined to work as one? Otherwise the single-core IPC of BD could be lower than K10's in certain situations.

The 128 bit SSE/FP units could be modified partly in connection with the read/write ports. There was some improvement but not that much when AMD almost doubled the SSE2/FP hardware going from 64 bit units in K8 to 128 bit units in the K10.

There is lots of efficiency to be gained by using two K8-like SSE/FP units which can operate independently in 2-way mode and together as a single 128-bit unit in 4-way mode. Other similar tricks can be beneficial as well.

As Hans mentioned here, split the FPU to run (smaller) independent instructions, or run one wide instruction (ex: 2x 128bit or 1x 256bit). Some circuitry would have to be added to support this of course. Very interesting stuff.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Kuzi, do you read Japanese?

If so then I bet you can make a lot more sense out of this than I can...google translate just pulls up a blank page, but the diagrams are universally interpretable so I like to think I understand what Goto-san is communicating.

For example I think this addresses your statement "split the FPU to run (smaller) independent instructions, or run one wide instruction (ex: 2x 128bit or 1x 256bit)".
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Originally posted by: Zstream
I still think it is a bad idea to move the GPU onto the CPU right now (not that my opinion matters). Now if MCM is in the mix, you have a three-way potential for failure, with any of the failures being catastrophic to the bottom line.

Why do they not make another changeable small socket and slap it on the mobo? Obviously there are latency issues, but it still seems dangerous for a company to slap all of these onto one chip. The yields will have to be excellent.

Why would that be true? With an MCM (multi-chip module), a failure affecting both the CPU portion and the GMCH portion is about as likely as it would be if they weren't on an MCM but in the traditional CPU+chipset arrangement. It's not like they pair the two chips together as soon as the design is complete for both; they are put together once both are known to work.
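The known-good-die argument can be made concrete with a simple yield model. The sketch below is my own illustration with assumed numbers, not anything from Intel:

import math

# Poisson defect-yield model: Y = exp(-D0 * A), with assumed numbers throughout.
D0 = 0.5                     # defects per cm^2 (assumption)
a_cpu, a_gmch = 0.8, 0.6     # die areas in cm^2 (assumptions)

y_cpu = math.exp(-D0 * a_cpu)
y_gmch = math.exp(-D0 * a_gmch)
y_mono = math.exp(-D0 * (a_cpu + a_gmch))  # one big die: every part must be good

print(f"CPU die {y_cpu:.1%}, GMCH die {y_gmch:.1%}, monolithic {y_mono:.1%}")
# With an MCM, each die is tested before packaging ("known good die"), so bad
# dies are screened out and the package failure rate tracks assembly yield,
# just like a discrete CPU + chipset pairing.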

As for Bulldozer and Sandy Bridge, I would reckon they will be using a more tried-and-true approach rather than radical ones that may or may not work well in practice.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: Idontcare
Kuzi, do you read Japanese?

If so then I bet you can make a lot more sense out of this than I can...google translate just pulls up a blank page, but the diagrams are universally interpretable so I like to think I understand what Goto-san is communicating.

Nope I can't read it IDC, but the translation seemed to work for me.

For example I think this addresses your statement "split the FPU to run (smaller) independent instructions, or run one wide instruction (ex: 2x 128bit or 1x 256bit)".

Yes, that is exactly how the BD pipeline might work. The two INT units can work as a single 4-way superscalar, or as two 2-way superscalars running two threads at a time (SMT).

Same thing with the FPU: run two SSE (128-bit) threads simultaneously by splitting the 256-bit-wide pipeline, or run one 256-bit instruction (which may take more than 1 cycle depending on the AVX implementation).
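To make the split concrete, here is a toy illustration (entirely hypothetical; the real Bulldozer FP datapath is unknown at this point) of a 256-bit op retired on 128-bit hardware:

# Toy model: one "256-bit" vector add retired as two 128-bit micro-ops.

def avx_add_256(a, b, ganged=True):
    """Add two 8-float (256-bit) vectors on 128-bit (4-float) hardware.

    ganged=True  -> both 128-bit units work in lockstep: 1 cycle
    ganged=False -> one unit runs both halves back-to-back: 2 cycles,
                    leaving the other unit free for a second thread
    """
    lo = [x + y for x, y in zip(a[:4], b[:4])]   # low 128-bit half
    hi = [x + y for x, y in zip(a[4:], b[4:])]   # high 128-bit half
    return lo + hi, (1 if ganged else 2)

result, cycles = avx_add_256([float(i) for i in range(8)], [1.0] * 8, ganged=False)
print(result, f"in {cycles} cycle(s)")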

I can't guess how much extra space this mechanism will need on the BD die; the second INT unit with its dedicated L1 data cache will take up extra space for sure. In i7, implementing SMT took up only 5% more die space. The BD method may have an advantage at the cost of more die space (10-20%? more).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Kuzi
Nope I can't read it IDC, but the translation seemed to work for me.

Argh! It must be this damn IE8 then. When I click your link I get the usual "translating" activity from Google, and then it just displays a blank (entirely 100% empty) page. I can google-translate the PC Watch Impress homepage, but none of the links from there. It all worked just fine before I let MS upgrade my IE7 to IE8.

(btw I get the same blank page after translation using yahoo babelfish and MS live translate as well)

Originally posted by: Kuzi
I can't guess how much extra space this mechanism will need on the BD die; the second INT unit with its dedicated L1 data cache will take up extra space for sure. In i7, implementing SMT took up only 5% more die space. The BD method may have an advantage at the cost of more die space (10-20%? more).

128KB of L1$ @ a 0.149um^2 bit cell is going to have a rather paltry impact on die size... if I did my math right, 128KB works out to 0.156mm^2... you'd be challenged to find that on a to-scale die map with the naked eye.
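The arithmetic checks out; as a one-liner, under the assumption of a raw 6T array with no tag, decoder, or sense-amp overhead:

# 128 KB of L1 data cache at a 0.149 um^2/bit cell, raw array only
bits = 128 * 1024 * 8              # 1,048,576 bits
area_mm2 = bits * 0.149 / 1e6      # um^2 -> mm^2
print(f"{area_mm2:.3f} mm^2")      # ~0.156 mm^2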

Since BD is supposedly a 100% new architecture, there is little we should assume about it by extrapolating from what we know of the Athlon (K7/8/10) architecture. I say that to invalidate my own comments, because I'd like to say "given how small the K10 core is relative to Nehalem, adding SMT and increasing the core's die size 10-20% should not present a manufacturing challenge"... but BD is not K10 + SMT + AVX + etc. BD is supposed to be something we have never seen before, and includes AVX + maybe SMT + etc.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: Idontcare
128KB of L1$ @ a 0.149um^2 bit cell is going to have a rather paltry impact on die size... if I did my math right, 128KB works out to 0.156mm^2... you'd be challenged to find that on a to-scale die map with the naked eye.

Yep, you are right about the L1 cache; I was mainly thinking of the extra Integer Unit in each core. You know K10 processors have only one Integer Unit; BD may have two.

I read today about GlobalFoundries using T-RAM, which is a denser alternative to SRAM cache.

"Thyristor-RAM will find its way into GlobalFoundries' 32nm and 22nm processes in both bulk-silicon and silicon-on-insulator flavors. If what AMD told us last November is still accurate, GlobalFoundries will start ramping its 32nm bulk process in the fourth quarter of this year, with the 32nm SOI variant to follow in the first quarter of 2010."

You can check out the news at Tech Report
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Here's T-Ram's technical presentation from HotChips 2007.

If GF has only just started working with T-RAM to support embedded T-RAM at 32nm, then it is waaaaaay too late in the game for it to be incorporated in anything AMD is designing for BD. About 2yrs too late.

It could still be relevant for whatever comes post-BD at 22nm, or for any design wins GF nabs at 32nm with really, really short design timeframes (even GPUs' design cycles might be too long to capture a late-stage 32nm emerging memory technology like this). Again, this is assuming AMD wasn't already counting on T-RAM being available at 32nm two years ago when they were well into BD design.

Having said that... AMD needs something at the process technology level to give them the kind of jump on Intel that they've led with in the past (first to copper, then SOI, then immersion litho). For a long time it was "zram"... now we get T-RAM to speculate about. T-RAM looks far more promising a technology than zram ever was; let's hope AMD has been working with them in the background since their 2007 HotChips presentation went public.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Originally posted by: Kuzi

Yep, you are right about the L1 cache; I was mainly thinking of the extra Integer Unit in each core. You know K10 processors have only one Integer Unit; BD may have two.

"The CPU diagram is really interesting having two integer units, with each unit having two ALUs. The K8/K10 architectures have one integer unit with three ALUs."

What do you mean by "Integer Units"? Usually those are ALUs, but you're talking about them as if they're something else. The number of ports? Or the things called "Integer Clusters" in the pic? The definition seems very vague.

I can't guess how much extra space this mechanism will need on the BD die; the second INT unit with its dedicated L1 data cache will take up extra space for sure. In i7, implementing SMT took up only 5% more die space. The BD method may have an advantage at the cost of more die space (10-20%? more).

In a presentation made by AMD dated 2007, clustered multi-threading is said to give a "50% area increase for 80% performance increase", which is also vague.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Have you guys seen this already?

The specINT and specFP numbers were added by someone after AMD presented the slide; it has been "doctored", so to speak, and the numbers are not from AMD. But the rest of the graphic is, including the performance-normalized scale on the y-axis.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: IntelUser2000
Originally posted by: Kuzi

Yep, you are right about the L1 cache; I was mainly thinking of the extra Integer Unit in each core. You know K10 processors have only one Integer Unit; BD may have two.

"The CPU diagram is really interesting having two integer units, with each unit having two ALUs. The K8/K10 architectures have one integer unit with three ALUs."

What do you mean by "Integer Units"? Usually those are ALUs, but you're talking about them as if they're something else. The number of ports? Or the things called "Integer Clusters" in the pic? The definition seems very vague.

Let's call the part of the CPU that does integer calculations an Integer Execution Unit. This is how it looks in K10.

Notice in the lower left of the diagram there are three ALUs; these are part of the Integer Execution Unit in K10 CPUs. It's 3-way superscalar, with the ability to issue 3 integer operations per clock cycle.

Notice on the Bulldozer diagram, there are "two" Integer Execution Units per core, called Clusters on the diagram. Now from the info IDC provided:

Here is the link to Dresdenboy's patent search results into AMD MPU:
- clustered multithreading with 2 int clusters, each of them having:
  - 2 ALUs, 2 AGUs
  - one L1 data cache
  - scheduler, integer register file (IRF), ROB
(see 20080263373*, 20080209173, 7315935)

Each of those INT clusters will have two ALUs. If you combine the two clusters into one unit (4 ALUs), you get the ability to issue 4 operations per clock (like Core 2/i7); or run each cluster separately and the CPU can run two threads at a time (SMT).

This is what we are assuming AMD might do to add SMT capability into Bulldozer.
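A toy throughput model of that combined-vs-split trade-off (purely illustrative; nothing here is from AMD):

# Two independent threads of 1000 single-cycle integer ops each.

def cycles(ops, width):
    """Cycles to issue `ops` independent ops at a given issue width."""
    return -(-ops // width)  # ceiling division

split = cycles(1000, 2)                       # each thread on its own 2-wide cluster,
                                              # running in parallel: 500 cycles
combined = cycles(1000, 4) + cycles(1000, 4)  # one thread at a time owns the full
                                              # 4-wide machine: 250 + 250 = 500 cycles
print(f"split: {split} cycles, combined: {combined} cycles")
# Peak throughput is identical; the practical win for two 2-wide clusters is that
# real code rarely sustains 4-wide issue, while combined mode is what would keep
# single-thread IPC up, the exact concern raised earlier in the thread.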