Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Cerb · May 13, 2011

HW2050Plus said:
So Bulldozer should fix two design issues:
a) CMT -> This is just not effective enough regarding die size.
b) High Frequency Design -> Obviously not effective regarding TDP (I mean AMD has even SOI and higher TDP).
c) High uncore die consumption

a) only known by people under NDA, right now. Definitely not the slam dunk for high performance on the desktop, but that hasn't been BD's goal from the start. If they can sell cheaper chips w/o losing money on them, approximating Nehalem performance on the desktop will be good.

Oh yes I know. CMT causes that 2 cores have to share a single decoder.

No, AMD chose that for an implementation of CMT. One hallmark of CMT is sharing high-throughput resources that are not timing-sensitive, and leaving timing-sensitive resources separate.

That is the issue. You invest a lot of silicon with extremly little gain. Just compare die size of Llano core (~9.6 mm&#178 with that of BD module (~18.9 mm&#178. As two Llano cores are faster than a BD module AMD even has a reduced performance vs. die size efficiency than they had with their current chips.

Sorry, but no, only people under AMD NDAs know how a Llano core compares to a BD core in performance, how 2 Llano cores compare to a BD module, and if there are situations where the decoder will be a bottleneck. AMD has been tight-lipped, and very careful, here.

b) again, unknown. Higher TDP than if they had designed BD to top out at 3GHz, instead of having 4GHz turbos for release models? Sure. But, there's no reason to believe that they did not tune the design for GloFo's 32nm, and have plenty of headroom.

c) probably. They do have a history. In the end, we'll just have to see, though. Maybe they've fixed that. We just plain do not know.

So my proposal for AMD for Bulldozer II:
to solve a):
BD brings a lot of fine things with it: A decoder capable of doing 4 ops decode / cycle. This is an advantage over Intel. So just add another one to feed the other integer core. That takes ~6 mm²/core. A lot but you gain a lot.

If it would only take 6mm^2, would not increase power consumption much, and would not reduce clock speeds, I imagine AMD would have done it. I was surprised to see that they only do 4 for two cores, rather than 5-6 over two cores, but 4 more? Even if it weren't x86, you'd need VLIW or very low speeds to make that worth it. Neither are the best choices for general purpose computing.

Now also widen up to 4 ALU + 2 AGLU. That costs very little ~2 mm²/core. Also the scheduler must be changed of course but that is no issue as the scheduler step back was related to b) and as we fix b) that should be no issue. The result could be a core that is equal or faster than SB core. On top of that utilizing the amazing decoder they can add SMT. So another SB feature included.

Now that's just insane. 2:2, while very odd, gives them the ability to have some of the 4-issue cake, I imagine to bottleneck the ALUs less than with their other options, while being able to maximize the effectiveness of those 2 ALUs.

b) Remove high speed design, fix latencies (esp. INT-SSE) etc. That will fix TDP issues

You mean those perfectly reasonable TDPs that AMD has stated and hinted at? Some latencies are fairly high, but I doubt making a slower chip would allow them to rival Intel on servers, which is the goal. x86, regardless of desktop or server, tends to like higher speeds, over higher IPC. Code that might get 2-3 average IPC on a slow RISC CPU, plain can't, using x86. But, try to get code that can't theoretically do better than 1 IPC to perform better on that slower RISC CPU, and...it won't, at least not by much. You end up better off with higher speeds and deeper pipelines (unless you go to Netburst-like extremes), even at the cost of CPI, and as the speeds get higher (relative to cache and main memory latencies), the real world advantages of wider issue get smaller and smaller (as do the differences between ISAs). Since so much actual code is not high-IPC, going with speed makes a lot of sense. Given the poor scaling of all resources related to managing more instructions at once, it is more efficient to have separate sets of them, than to maximize their use via SMT.

c) Optimize uncore: That will result in even smaller chip though we added more transistors with a)

How much would it cost to do, though? It could be a case where they might not be able to make the money back, if it isn't a major success. meanwhile, if it is a major success, they could make plenty of money selling more chips, and will have been better off by not delaying it any further.

APUs alone will not do it for AMD as you do not earn enough to finance the business.

Red herring. It will be awhile before BD becomes an APU.

In the end, what you're basically saying is that you think CMT sucks the big one, and they should have designed it more like Intel's CPUs. Meanwhile, cost- and power-efficient CMT is pretty much the whole point of BD, and if they have to stick with a design for awhile, better that they design for the future now, than design for the now and be stuck without enough R&D money, or time, to change pace (even Intel and AMD don't know what the x86 market will be like in 5-10 years, so designs that are forward-looking and flexible are a must). However, right now, we just don't know. BD is too different from any CPUs we are used to, at this time.

jimbo75 · May 13, 2011

http://itbbs.pconline.com.cn/diy/13128608.html

Dresdenboy pointed out that cpu-z seems to have intels QPI instead of AMD's HT-Link. Something to get excited about but don't go betting your house on it being real.

exar333 · May 13, 2011

Martimus said:
Good point. Touché

It is kind of funny how this thread has stayed alive for months, and thousands of posts with almost no actual content.

What else would we discuss about BD then? The scraps and bones of information we have that is confirmed is so small...

I cannot wait for actual benches! We have all waited for a LONG time!

RyanGreener · May 13, 2011

The only thing I wish AMD would do is talk about the motherboards more. There is so much conflicting evidence everywhere about whether the motherboards that are AM3 and 8xx cseries can support Bulldozer with a BIOS update, or if AM3+ sockets will need to be specifically used for all features etc.

Leaks would be great....in about a month its going to be revealed but I don't really know how long it takes AMD to do these things.

Soleron · May 13, 2011

azemsiB said:
The only thing I wish AMD would do is talk about the motherboards more. There is so much conflicting evidence everywhere about whether the motherboards that are AM3 and 8xx cseries can support Bulldozer with a BIOS update, or if AM3+ sockets will need to be specifically used for all features etc.

Leaks would be great....in about a month its going to be revealed but I don't really know how long it takes AMD to do these things.

AMD has already given their position. It is: We do not support Bulldozer in sockets other than AM3+, contact the motherboard makers for anything further.

And Asus/Gigabyte have stated specific current AM3 models will allow Bulldozer to be used with a BIOS update.

If you have more questions, only your motherboard vendor can answer them, not AMD. I don't see anything more AMD could do.

Tuna-Fish · May 13, 2011

HW2050Plus said:
Think again what an incredible success Intel had regarding by just adding Port 5 and how little it costed.

Core 2 did not get most of it's boost from port 5 -- it got it from better branch prediction and prefetchers. My point is that the register file & forwarding network is perhaps the single hardest thing to make in the processor, and it's size (and the operation delay) scales superlinearly when you add ports. Intel only just now added enough register read ports to serve all possible instruction mixes with SNB. Adding ports to the file is hard. And you'd want AMD to add 4 read ports and two write ports because of your misguided idea that adding ALU's is the single cheapest way to increase performance? Sorry, but just no. Adding two alus wouldn't add much more than 10% in normal x86 loads full of dependencies. And it has a cost, not in die size, but in clock speed.

I also named the die sizes of chips where it is done like that and how small they are. And Bulldozer where it was done vice versa and how bad the performance vs. die size gets.

Do you know why bulldozer cores look to be so bad in area vs phenom cores? Because there is so much more stuff in the frontend. What is it there for? Branch prediction. Which is the single cheapest way to increase performance in x86. And which is the single biggest reason why AMD loses to Intel so much. Which is why I find your "As two Llano cores are faster than a BD module" more than a little suspect. If BD claws back half difference in branch prediction accuracy between SNB and Stars, it will pay back for the loss of an ALU, the lower maximum decoding throughput, and have performance left to spare.

If I recall correctly, JFAMD, who has quite a bit more knowledge on BD performance than you, has stated "IPC increases" about fifty dozen times. Why do you keep saying things like "two Llano cores are faster than a BD module"?

Riek · May 13, 2011

HW2050Plus said:
You make the mistake to forget to substract the die size consumption of included graphics unit.

Oh what difference ...

Yes a 4C Bulldozer which competes with Dual Core Sandy Bridge regarding performance. I take two similar (as far as you can do that) performing chips and comparing the die sizes. And what is your point?

dual core SB (i3) is llano domain. Has nothing to do with bulldozer. BD Quad core will not compete with i3.

2 Bulldozer Cores will be a little bit faster overall than 1 Sandy Bridge Core though in some single benchmark results BD will also come out below that. So I claim that for sure at least in one benchmark 2 BD cores will be slower than a SB core (HT on).

based on? this was true for Thuban though... but BD is no thuban.

We have enough information to make an educated guess on performance of Bulldozer. Also we have the new die picture without at least apparent obfuscation. Also the die size fits with what we know from Deneb & Co (bad die size -> uncore).

Improving the die size might cost more developer cost than it does in wafer cost. And we do not have enough information to make an educated guess on performance. You are the living proof of that. AMD says IPC will be higher compared to thuban, you say it will not. So there is a difference in perception of IPC so we(you) cannot make an educated guess about performance.

Sure it will exist and it is on the roadmaps. The question is if Intel will release also a desktop 8C part. It was on their desktop roadmap but disappeared. Maybe but that is speculation because Bulldozer is too weak so a 6 Core part is just enough to turn out complete superior on desktop.

Who knows, We don't know how BD will perform. What i can tell is that on 32nm intel cannot make a 3,4Ghz 8core with SB. (witin 130W TDP)

They have now 95W TDP including a graphics core and they really don't get that far even with the graphics core. With AVX you impose - might be but as there is currently no code out I do not mind if the Turbo is lower then.

Wether AVX is supported by software is irrelevant. Intel supports those instructions so it should be able to handle to load and the thermal impact. So for intel it is important that AVX uses more power and it is relevant in how much they can scale SB within its TDP due to this. Btw if you look up how AVX works on SB it painly obvious to expect a huge power consumption increase.

They don't need 22 nm because SB-E will come on 32 nm and have 8 Cores and they will be faster than Interlagos 16 Core.
Your comparison of Gulftown/Westmere vs. Bulldozer is a comparison of the past. A comparison of SB-E vs. BD is what compete in 2012.

Now it even gets better, now you are talking about the performance of 2 cpu's which are unreleased and have no not yet released specs.
22nm WILL be important for the server and high end since intel is STUCK against TDP with their flagships (6core ~3,4GHz), 10core ~2,4GHz. Your statements about interlagos are dumb.
Gulftown is made on 32nm. SB does not deliver any power advantage compared to westmere. (Hence SB-E won't even reach 3,46Ghz like gulftown simply because it uses MORE power)(due to AVX).

Oh yes I know. CMT causes that 2 cores have to share a single decoder. That is the issue. You invest a lot of silicon with extremly little gain. Just compare die size of Llano core (~9.6 mm&#178 with that of BD module (~18.9 mm&#178. As two Llano cores are faster than a BD module AMD even has a reduced performance vs. die size efficiency than they had with their current chips. So in fact regarding performance vs. die size Bulldozer is even a step back for AMD increasing the distance to Intel. Yes they have some additional features like AVX, FO4 but that won't help AMD really.

This is the basis of your failure. you see the decoder as the issue which is incorrect. The decoder is on of the things that needs to be build for peaks but is underutilized most of the time. Then you get to the statement where 2 llano cores are faster then a BD module, which again is alreadt proven wrong by JF.. who said multiple times performance/ipc increases with BD!!! the rest of the sentence is ofcourse bs just because of that.

First the 4M/8C top part will come as a 125 W TDP part. Next the SB 2600K is consuming much less than 95W. You claim not in AVX - I claim who cares as no AVX code is out. And then particularly in AVX we have to see if this - claim by your side - high consumption in AVX does not come from exceptional performance.

4M/8c will also come for 95W TDP.
Again see previous comment, wether AVX is used or not is completely irrelevant. The cpu supports the instruction and thus has to be able to handle the increased power. E.g. The TDP has to be calculated on it.

One module more and what has Intel then?

SB-E on 32nm process and ivy bridge for mainstream.

For you: ~2-3 mm² more for that per module. Would add up to 12 mm² more for the whole chip on a 4M/8C. You should realize for what die space is consumed. It is not for interger. Roughly 30% are for decoders, about 35-40% for FPU/SSE and 30% for integer/pipelines/scheduler/L1 cache. Now let's take BD: 18.9 mm² for core -> ~6 mm² for integer/pipelines/scheduler/L1 cache - makes ~3 mm² for a single core in total. So for all of your integer stuff you have on your 280 mm² chip around 24 mm² for the integer performance. A single pipeline + Register file path + scheduler part + ALU will be around ~ 0.75 mm². That is the cheapest way of all to crank up performance.

Click to expand...

Just like with multiple cores, adding pipelines brings deminishing returns. SB uses HT to utilize some of that wasted space in 80% of the cases. The impact of adding pipelines is also in datapaths, in layout in sheduling, in oOo shuffling, in decoding in retiring. All the stuff around it has to be adjusted just to accept those pipelines.

AMD is the again the first who did the near impossible - a wide x86 decoder. Fantastic! But they split this power in two halves. What a waste!

Click to expand...

Again with the decoder crap. The decoder is fine.

Intel is very decoder limited so this could have become a major advantage. Intel is struggling and added a loop trace cache in Sandy Bridge to compensate for these limitations. AMD comes out with a full solution but just wastes it!

Click to expand...

The problem is not with the decoder, the problem is with ops that are decoded and are rendered invalid by branches and others. (branch fusion helps). In those moments, having a second core utilizing the cycles you are wasting in retrieving data is usefull. Intel does similar with HT, but HT is execution starved.

As I said. Bulldozer will bring a great decoder, a good FPU, finally also prefetchers and long L/S queues. But all these great things are completly annihilated by (integer) CMT and the high frequency design. If they remove those two points which just break Bulldozer (again slower than predecessor per die size) then everything could be really fine. If those are fixed AMD could become roughly on par with Intel regarding overall performance, performance per die size, performance per Watt. But with the first BD incarnation they will even lack more behind as they do now. This is only covered by the 32 nm transition and the ability to issue 8 cores.

Click to expand...

First what you call an high frequeny design is for most designs a design that tillts towards frecuancy. A high frequency design is williamette, northwood, prescott,... BD is fine, stop falling in your own mindset where decoding is an issue, where performance drops compared to K8 design, where additional latency in mostly legacy instructions are presenting doom.

Issuing a P4-like design is also not revolutionary. CMT is okay. But CMT is just stupid on x86 as x86 is so much decoder limited. CMT on FPU is clever because many applications don't use FPU and FPU costs really a lot of die space. But on integer? Double cores but then half the core it self? What is the logic inside? Yes there is a gain but you could achieve that with SMT as well at almost no cost.

Click to expand...

BD is not a P4 design. P4 was focussed on high frequencies, it had a double pumped ALU for SIMPLE calculations. P4 came before it's time.
The problem is doubling the cores can increase performance by 95%. add 50% (2ALU) more width in your integer unit and you gain 10%. Both factors give diminishin returns in the end, but alu much much sooner. Those 10% are easily recuperated with higher frequency or a more flexibel design. A flexible design which is only possible if you keep things simpler.
SMT also has a cost and is in alot of cases not benificial.

Click to expand...

piesquared · May 13, 2011

Bulldozer (as well as Fusion and APU's) looks very well suited for emerging market trends, while at the same time addressing current workloads.

And here's another good article on the architecture.

http://www.microsofttranslator.com/...833-1/dossier-architecture-amd-bulldozer.html

Idontcare · May 13, 2011

Riek said:
Again with the decoder crap. The decoder is fine.

piesquared said:
Bulldozer (as well as Fusion and APU's) looks very well suited for emerging market trends, while at the same time addressing current workloads.

You know what I love about all this anti-BD crap posting that goes on here? It reminds me of exactly the same sort of fanboy denial that was going on when C2D was about to launch around this same time 5yrs ago in 2006. Only then the shoe was very much on the other foot back then (it was AMD fanboys who were in total denial).

I take it (the denial at the behest of the Intel fanboys) as proof positive that AMD probably does have a monster on their hands and they are about to conroe the market themselves.

Only another month or so to go!

Phynaz · May 13, 2011

Idontcare said:
I take it (the denial at the behest of the Intel fanboys) as proof positive that AMD probably does have a monster on their hands and they are about to conroe the market themselves.

Man, that would be so cool...

piesquared · May 13, 2011

Idontcare said:
You know what I love about all this anti-BD crap posting that goes on here? It reminds me of exactly the same sort of fanboy denial that was going on when C2D was about to launch around this same time 5yrs ago in 2006. Only then the shoe was very much on the other foot back then (it was AMD fanboys who were in total denial).

I take it (the denial at the behest of the Intel fanboys) as proof positive that AMD probably does have a monster on their hands and they are about to conroe the market themselves.

Only another month or so to go!

IMO, the biggest success (financially) of the Bulldozer architecture will be in servers, HPC, the cloud and virtualization. AMD has been preparing for this shift for many years and one has to believe they took this trend heavily into consideration when designing Bulldozer. At the same time, not losing focus on workloads that don't require that technology. So while I think Bulldozer will excel in those emerging trends, it seems well balanced also for workstation, enterprise, enthusiast. As you say, we'll see...

Cogman · May 13, 2011

Idontcare said:
You know what I love about all this anti-BD crap posting that goes on here? It reminds me of exactly the same sort of fanboy denial that was going on when C2D was about to launch around this same time 5yrs ago in 2006. Only then the shoe was very much on the other foot back then (it was AMD fanboys who were in total denial).

I take it (the denial at the behest of the Intel fanboys) as proof positive that AMD probably does have a monster on their hands and they are about to conroe the market themselves.

Only another month or so to go!

You know, It wasn't too long ago that I remember someone getting upset that you were an "Intel fanboy". This post is funny to me just because of how optimistic it is towards AMD.

Myself, I would LOVE to see AMD come out with a monster, but I'm very skeptical. I really hope that I'm wrong, but I don't believe that they can really do it again.

Martimus · May 13, 2011

Idontcare said:
You know what I love about all this anti-BD crap posting that goes on here? It reminds me of exactly the same sort of fanboy denial that was going on when C2D was about to launch around this same time 5yrs ago in 2006. Only then the shoe was very much on the other foot back then (it was AMD fanboys who were in total denial).

I take it (the denial at the behest of the Intel fanboys) as proof positive that AMD probably does have a monster on their hands and they are about to conroe the market themselves.

Only another month or so to go!

The funny thing is that I was thinking the exact same thing, although I don't take the similarities of the responses as proof that AMD will "Conroe" Intel. I think it is more a psychological thing, where people are likely to be very sceptical of something that changes their preconceived notions for something, even when that notion is based on nothing but past experience and not logical reasoning, as was the case for Conroe, and the case here with Bulldozer.

In other words, I would not be surprised if BD really puts AMD ahead of Intel, or if it fails at doing so. But I think the correlation between the reaction we are seeing now and what we saw with Conroe are not logically linked to how they will actually perform.

Of course, I have long ago accepted that I am wrong a disproportionate amount of time when I have an opinion like this, so it probably isn't a good idea to listen to what I have to say

hamunaptra · May 13, 2011

piesquared said:
Bulldozer (as well as Fusion and APU's) looks very well suited for emerging market trends, while at the same time addressing current workloads.

And here's another good article on the architecture.

http://www.microsofttranslator.com/...833-1/dossier-architecture-amd-bulldozer.html

Whoa wait WTF!!?!?!?! I just read that the BD uses a replay mechanism?!?!?!?(according to that article) this is news to me!
I read an article about this being one of the reasons for the P4's horrid performance is because scheduling was so aggressive and the replay had to basically requeue things for execution. Ugh this is frustrating, I hope this doesnt make the BD perform like crap!

Idontcare · May 13, 2011

hamunaptra said:
Whoa wait WTF!!?!?!?! I just read that the BD uses a replay mechanism?!?!?!?(according to that article) this is news to me!
I read an article about this being one of the reasons for the P4's horrid performance is because scheduling was so lousy and the replay had to basically recalculate things. Ugh this is frustrating, I hope this doesnt make the BD perform like crap!

HT sucked balls on P4 too, not so much on Nehalem.

I think Intel attempted to disprove with the P4 just about every good idea in microarchitecture the computer science world had to offer at the time.

Trace cache, replay, HT, double-pumped ALU's, etc etc...the list goes on and on. Its not that these were bad ideas per se but they just got way to aggressive in incorporating them all at a time when the xtor budget just wasn't there and power-consumption was right at the threshold of practicality to begin with.

Making a 200W CPU in 2004 was just a non-starter. Nowadays it would be easy to deal with in the desktop format as the GPU guys show us with their dual-GPU cards.

Just because Intel tried and failed doesn't mean AMD will repeat the experiment verbatim IMO.

Cogman · May 13, 2011

Idontcare said:
HT sucked balls on P4 too, not so much on Nehalem.

I think Intel attempted to disprove with the P4 just about every good idea in microarchitecture the computer science world had to offer at the time.

Trace cache, replay, HT, double-pumped ALU's, etc etc...the list goes on and on. Its not that these were bad ideas per se but they just got way to aggressive in incorporating them all at a time when the xtor budget just wasn't there and power-consumption was right at the threshold of practicality to begin with.

Making a 200W CPU in 2004 was just a non-starter. Nowadays it would be easy to deal with in the desktop format as the GPU guys show us with their dual-GPU cards.

Just because Intel tried and failed doesn't mean AMD will repeat the experiment verbatim IMO.

Luckily, cooling in 2011 is MUCH better than it was in 2004. The advent of heatpipes has allowed for some pretty wicked cooling systems. Today, we could support 200W CPUs without TOO much difficulty.

The P4s problem really was one of feeding. They had this massive pipeline, high clock speeds, and a stupidly slow memory connection coupled with (relatively) small cache sizes. In some ways, things like hyperthreading only served to exacerbate the problem.

A netburst design today with a much larger L1 instruction cache might actually stand a fighting chance with todays designs in raw single threaded computational power.

Dresdenboy · May 13, 2011

piesquared said:
And here's another good article on the architecture.

http://www.microsofttranslator.com/...833-1/dossier-architecture-amd-bulldozer.html

This article blends known facts with older and partly obsolete speculation. My diagram was drawn before the first BD module slides were published by AMD. Particularly the speculated FPU contains the most errors. Also replay was a topic brought up by me because of BD-related patents describing such methods (at least one by Gary Lauterbach) - and then there is a difference between replay of exec ops and L/S ops.

alyarb · May 13, 2011

Cogman said:
Luckily, cooling in 2011 is MUCH better than it was in 2004. The advent of heatpipes has allowed for some pretty wicked cooling systems. Today, we could support 200W CPUs without TOO much difficulty.

The P4s problem really was one of feeding. They had this massive pipeline, high clock speeds, and a stupidly slow memory connection coupled with (relatively) small cache sizes. In some ways, things like hyperthreading only served to exacerbate the problem.

A netburst design today with a much larger L1 instruction cache might actually stand a fighting chance with todays designs in raw single threaded computational power.

I think that to say the problem with netburst was "one of feeding" implies that mega-long pipelines can always be fed (at least as well as shorter ones). I think there is a reasonable maximum and at almost every node, netburst stepped farther from that reasonable max. There is a point where cache misses and dependencies are too costly in such a pipeline because of how many independent pending threads are at risk of being halted by a screwup in a preceeding thread. I don't think more L1 or a wider RAM bus would make things better for netburst. Late-cycle smithfields had 128-bit DDR2 just like most systems today and it had no mitigating effects on the fundamentally inefficient pipeline.

The number of stages has to be balanced by the capability of the CPU's front end, and since intel is the leader there, and they have personally suffered the consequences of violating that balance, I think their decision to stay around 16 stages, such as with nehalem, is particularly meaningful.

I'm no architect, but I'm going to just surmise that increasing singlethreaded performance, since it is the most difficult and elusive endeavor in the business, can no longer be done by skewing the fundamental parameters of a CPU relative to one another. ALUs, cache, TLB, decode, prediction and prefetch hardware all share delicate algebraic relationships to one another that must be observed. The most gains are going to come as hardware amendments to the x86 convention like the radix-16 divider or uop cache.

Cogman · May 13, 2011

alyarb said:
I think that to say the problem with netburst was "one of feeding" implies that mega-long pipelines can always be fed (at least as well as shorter ones). I think there is a reasonable maximum and at almost every node, netburst stepped farther from that reasonable max. There is a point where cache misses and dependencies are too costly in such a pipeline because of how many independent pending threads are at risk of being halted by a screwup in a preceeding thread. I don't think more L1 or a wider RAM bus would make things better for netburst. Late-cycle smithfields had 128-bit DDR2 just like most systems today and it had no mitigating effects on the fundamentally inefficient pipeline.

Are you kidding? The smithfield did quite well for the time. Much better than the former Prescott did. Considering that it was two processors on one die essentially doubling the memory requirements.

Though, L1 Cache is what the Netburst architecture really needs for performance. Higher speed data links to memory help, but only to the extent that they ease the pains of a cache miss.

alyarb · May 13, 2011

Sure, netburst steadily got faster, but AMD's 128kb L1 has been around since K7. If the tradeoffs of a large pipeline could all be accounted for by L1, intel would have addressed it prior to the debut of netburst (rather than letting the mystery go unsolved and retreating back to P6 after a decade of backbreaking work). My point is that there is such a length beyond which nothing will alleviate the pains of a miss. Not higher clocks or wider RAM or cache.

http://www.anandtech.com/bench/Product/35?vs=93

OCGuy · May 13, 2011

Idontcare said:
You know what I love about all this anti-BD crap posting that goes on here? It reminds me of exactly the same sort of fanboy denial that was going on when C2D was about to launch around this same time 5yrs ago in 2006. Only then the shoe was very much on the other foot back then (it was AMD fanboys who were in total denial).

I take it (the denial at the behest of the Intel fanboys) as proof positive that AMD probably does have a monster on their hands and they are about to conroe the market themselves.

Only another month or so to go!

Except the problem is that they are so far behind, that even if they finally get competitive this summer (which we aren't even sure of yet,) Intel has Ivy and 22nm right around the corner.

drizek · May 13, 2011

Idontcare said:
HT sucked balls on P4 too, not so much on Nehalem.

I think Intel attempted to disprove with the P4 just about every good idea in microarchitecture the computer science world had to offer at the time.

Trace cache, replay, HT, double-pumped ALU's, etc etc...the list goes on and on. Its not that these were bad ideas per se but they just got way to aggressive in incorporating them all at a time when the xtor budget just wasn't there and power-consumption was right at the threshold of practicality to begin with.

Making a 200W CPU in 2004 was just a non-starter. Nowadays it would be easy to deal with in the desktop format as the GPU guys show us with their dual-GPU cards.

Just because Intel tried and failed doesn't mean AMD will repeat the experiment verbatim IMO.

Remember BTX?

Oh, and Rambus?

Cerb · May 13, 2011

Cogman said:
Luckily, cooling in 2011 is MUCH better than it was in 2004. The advent of heatpipes has allowed for some pretty wicked cooling systems. Today, we could support 200W CPUs without TOO much difficulty.

For a few niche markets. What's the performance of the <100W desktop version? What about the <50W laptop version? We could cool it all day long, today, but the market would rather leave that kind of thermal density to the likes of IBM.

The P4s problem really was one of feeding. They had this massive pipeline, high clock speeds, and a stupidly slow memory connection coupled with (relatively) small cache sizes. In some ways, things like hyperthreading only served to exacerbate the problem.

Even today, it can sometimes take longer (wall time) to get each thread's work done, than the increase in total throughput by HT (sometimes, this matters; sometimes, it does not; sometimes, it is not true), and can sometimes (with the P4 it was pretty much always) take more power than it offers in performance gain. Nowadays, there's little point in turning HT off, but Intel still can't make it a multithreading panacea (it is ideal for the Larrabee follow-ons, though)...because it simply isn't, for many situations. Luckily for Intel, while they keep pushing HT, they really don't need it, to stay on top, today.

In theory, with simulations of modifications made to known CPUs, sharing front-end units can not only decrease space and power, but increase multithreaded application performance by as much or even more than the die savings, thanks to radically reducing overhead of communication within units of the cluster, in turn freeing up communication resources between clusters and between sockets (not unlike the performance gains seen with AMD going dualcore). Now, how does a researcher simulating a non-existent variant of an Alpha CPU translate into performance for a real x86 CPU? Can it do half what it does on paper in the real world, from a company with real budget and time constraints? Even if it can, will that be enough to not be clobbered by Intel having better transistors than GloFo? That's part of what we get to find out.

Phynaz · May 13, 2011

drizek said:
Remember BTX?

Oh, and Rambus?

Remember DTX?

Idontcare · May 14, 2011

drizek said:
Remember BTX?

Oh, and Rambus?

lolz! Yeah I had forgotten all about the rambus saga D: That was just one too many overlords for the market to accept at the same time.

Did you know that the litigation is still dragging out to this day in appeals courts?

Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Elite Member

Senior member

Diamond Member

Senior member

Senior member

Golden Member

Senior member

Golden Member

Elite Member

Lifer

Golden Member

Lifer

Diamond Member

Senior member

Elite Member

Lifer

Golden Member

Platinum Member

Lifer

Platinum Member

Lifer

Golden Member

Elite Member

Lifer

Elite Member