What do you do that takes advantage of more than 4 cores?
Intel has a 6-core desktop CPU that is much faster than your current quad, and you didn't buy it.
$300 =/= $1000.
Price, performance, and watts are the only things that affect a consumer. In car terms, that would be... price, 0-60, and fuel mileage.
There won't be a difference between the two cores in a module, so it's not like one core will be different from the one next to it. And they're both "real" cores; they simply share some resources (which doesn't impact performance when you're only using one thread in a module, and when you're using all your threads... well, does Intel make 16-core/16-thread CPUs? Any program that can use that many threads will be extremely fast on these processors).
Basically this means Bulldozer will be really fast both in lightly threaded applications and in ones that can make use of many threads.
That's not true. Large die sizes mean greater manufacturing costs, which means greater environmental impact from building the chip. Consumers should penalize companies (and I think Nvidia is worse at this than Intel or AMD) for building excessively sized dies. An Intel example, though, would be the wasted die space on SB for the IGP and HT that they disable. For AMD it would be the die space for their disabled cores in Phenom II (although they make it easy enough to re-enable them).
So what happens when a 2nd or 3rd core is needed? Will the OS just automatically assign it to the next core in line (i.e., the 2nd core in the first module), or assign it to the first core in the next module? Will it be software dependent, OS dependent, etc.? I can see this working out really well if the threads can be prioritized properly, like this: 1-5, 2-6, 3-7, 4-8, instead of 1-2, 3-4, 5-6, 7-8. Going from JFAMD's comment that the 2nd core, when fully utilized, will only act as 80% of an entire theoretical single core, it seems that this software optimization could make or break BD.
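For what it's worth, the two placement policies being debated (1-5, 2-6, 3-7, 4-8 versus 1-2, 3-4, 5-6, 7-8) can be sketched in a few lines. This is purely illustrative: the module layout, core numbering, and function names are all hypothetical, and real core enumeration depends on the BIOS and OS scheduler.

```python
# Hypothetical mapping for a 4-module / 8-core Bulldozer, assuming logical
# CPUs (0,1) share module 0, (2,3) share module 1, and so on.
MODULES = [(0, 1), (2, 3), (4, 5), (6, 7)]

def spread_affinity(n_threads):
    """Place one thread per module first (cores 0,2,4,6), then fill
    the second core of each module (1,3,5,7)."""
    order = [m[0] for m in MODULES] + [m[1] for m in MODULES]
    return order[:n_threads]

def pack_affinity(n_threads):
    """Fill both cores of a module before moving to the next one,
    which lets the idle modules be power-gated."""
    order = [core for m in MODULES for core in m]
    return order[:n_threads]

print(spread_affinity(4))  # [0, 2, 4, 6] -> one thread per module
print(pack_affinity(4))    # [0, 1, 2, 3] -> two modules busy, two idle
```

On Linux you could actually pin threads this way with `os.sched_setaffinity`, but which policy wins depends on the power-gating behavior discussed later in the thread.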
If both cores in a module are being used equally, then it's more like each core is working at 90% of its 100%.
but if you re-enable them then you're using more power, thus causing more pollution in the atmosphere, leading to global warming and the destruction of the polar bear habitat. thus:
YOU ARE A POLAR BEAR KILLER!! /sarcasm
How can you quantify the amount of work each core is performing if both are being used? A core working at 90% of its theoretical max seems like a number pulled out of thin air. Don't get me wrong, I'm not trying to question your intelligence... just trying to clear things up.
That quote was from 2006, and Intel's policy since the P4 has always been at least a 1% performance gain per 1% more power usage. If it were inefficient, Atom would surely not use it.
And in certain cases the benefit can be way more than 20%. But it depends on the workload.
If BD does scale so well, I just hope it also has decent single-threaded performance, meaning a lot more than Phenom II. Phenom II is probably even slower than Core 2 Duo clock for clock, not to mention SB.
So basically... this design means the CPU will work really well both with lightly threaded stuff and with very highly threaded stuff. But that's not the only reason for it; another brilliant side effect of sharing is that they use much less power this way.
AMD's design assigns resources according to their estimate of what future workloads will be, allowing them to fit more relevant resources/performance in a smaller die.
The most obvious example is the integer-to-FPU ratio - instead of going 1 integer core + 1 256-bit FPU, AMD goes 2 integer cores + 1 2x128-bit FPU, since AMD expects 256-bit FP operations to be much less common than 128-bit ones and integer operations.
This will allow AMD to have 8 integer cores + 4 2x128-bit FPUs in a die size only slightly larger than a traditional 4 integer cores + 4 256-bit FPUs.
The downsides are:
a) AMD might be wrong in their estimates, and so the 4 extra integer cores will be useless since the CPU will be bottlenecked by the shared resources;
b) shared resources have a performance toll even if estimations are correct (180% instead of 200% performance, according to AMD);
c) on the rare workloads where the shared resources are the bottleneck, performance will suffer compared to those where they aren't (that is, instead of performing as an octo-core product it will perform as a quad-core or worse).
But in general this is not a black/white discussion of workloads with bottlenecked shared resources at only 100% of 1 core performance levels and workloads w/o bottleneck at 200%. There will be 150%, 170%, 186%, 192%, 160% cases etc. And each case itself will have phases with different levels. Just look at IPC diagrams of different codes.
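To put rough numbers on the scaling discussion, here's a toy throughput model. The 1.8x figure is AMD's stated ~180% claim from above; everything else (the 4-module chip, function names, the assumption that a lone thread on a module gets full single-core throughput) is illustrative, not measured.

```python
# Rough throughput model: a fully loaded module delivers ~180% of one
# core's throughput (each core effectively ~90%), per AMD's claim.
MODULE_SCALING = 1.8
N_MODULES = 4

def throughput(n_threads, packed):
    """Total throughput in 'single-core units' for n_threads on a
    hypothetical 4-module chip, depending on placement strategy."""
    if packed:
        # fill both cores of a module before moving to the next one
        full, rem = divmod(n_threads, 2)
        return full * MODULE_SCALING + rem * 1.0
    # spread: one thread per module until modules run out, then pair up
    singles = min(n_threads, N_MODULES)
    pairs = n_threads - singles
    return (singles - pairs) * 1.0 + pairs * MODULE_SCALING

print(throughput(4, packed=False))  # 4.0 - one thread per module
print(throughput(4, packed=True))   # 3.6 - two fully loaded modules
print(throughput(8, packed=True))   # 7.2 - all modules fully loaded
```

So spreading 4 threads buys about 11% more throughput than packing them - but as discussed later in the thread, packing lets idle modules power down, which can claw that back through higher clocks.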
Yeah, the key thing we don't know is at what clock speed they are comparing this performance increase.
I mean, that really doesn't matter all that much; if higher clocks are needed to pull off this performance increase, fine. The uarch is a high-speed design anyway, so if it can scale extremely well clock-wise, that won't be an issue.
Except for maybe marketing lol!
It's good to know its single-threaded performance is better than current offerings... let's hope he was also saying it's better clock for clock.
I believe, though, that it's probably not as good as Intel's offerings clock for clock, which is where the high-speed design of BD comes into play; in MT apps it should blow all the competition out of the water.
I'm not sure what you mean by initial numbers, since I haven't seen a single benchmark from the Bulldozer architecture. I do know from the things that were released during Hot Chips about the architecture that the vast majority of the architecture has been improved from K10.
The front end has been completely overhauled, including the branch prediction which probably is the most improved part of this architecture (although it was a weakness for the STARS architecture, so how improved this is will have a big impact on the Bulldozer performance since the new architecture has deeper pipelines.) The Branch target buffer now uses a two level hierarchy, just like Intel does on Nehalem and Sandybridge. Plus, now a mispredicted branch will no longer corrupt the entire stack, which means that the penalties for a misprediction are far less than in the STARS architecture. (Nehalem also has this feature, so it brings Bulldozer to parity with Nehalem wrt branch mispredictions)
Decoding has improved, but not nearly as much as the fetching on the processor. Bulldozer can now decode up to four (4) instructions per cycle (vs. 3 for Istanbul). This brings Bulldozer to parity with Nehalem, which can also decode four (4) instructions per cycle. Bulldozer also brings branch fusion to AMD, which is a feature that Intel introduced with C2D. This allows for some instructions to be decoded together, saving clock cycles. Again, this seems to bring Bulldozer into parity with Nehalem (although this is more cloudy, as there are restrictions for both architectures, and since Intel has more experience with this feature they are likely to have a more robust version of branch fusion.)
Bulldozer can now retire up to 4 Macro-ops per cycle, up from 3 in the STARS architecture. It is difficult for me to compare the out-of-order engine between STARS and Bulldozer, as they seem so dissimilar. I can say that it seems a lot more changed than just being able to retire 33% more instructions per cycle. Mostly the difference seems to be moving from dedicated lanes using dedicated ALUs and AGUs, to a shared approach.
Another major change is in the memory subsystem. AMD went away from the two-level load-store queue (where different functions were performed in each level), and adopted a simple 40-entry load queue with a 24-entry store queue. This actually increases the memory operations by 33% over STARS, but still keeps it ~20% less than Nehalem. The new memory subsystem also has an out-of-order pipeline, with a predictor that determines which loads can pass stores. (STARS had a *mostly* in-order memory pipeline.) This brings Bulldozer to parity with Nehalem, as Intel has used this technique since C2D. Another change is that L1 cache is now duplicated in L2 cache (which Intel has been doing as long as I can remember), although L3 cache is still exclusive.
Bulldozer now implements true power gating, although unlike Intel, who gate at each core, AMD power gates at the module level. This shouldn't really affect IPC, but it might affect the max frequency, so it is a point to bring up when discussing changes to performance. The ability to completely shut off modules should allow higher turbo frequencies than we saw in Thuban, but we won't know what they are until we see some reviews.
Well, those are the main differences that I know of. Add that to the fact that this processor was actually designed for a 32nm process, versus a 65nm process for STARS, and you should see additional efficiencies. I expect a good IPC improvement, along with a large clock speed boost, although I can't say how much, and I really am looking more for parity with Nehalem-based processors than with Sandy Bridge-based processors.
References:
Butler, Mike. "Bulldozer" A new approach to multithreaded compute performance. Hot Chips XXII, August 2010.
[URL]http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=1[/URL]
We have done enough profiling that you shouldn't see any bottlenecks. If I had to choose between implementations, sharing some front end (that is larger and wider) is a better bet than sharing execution pipelines.
The challenge on HT is that there is a limited amount of bandwidth available in the pipelines. Looking at some of the SPEC int rate numbers, it looks like ~14% increase from HT, which means that workload is ~85% efficient at best. (keep in mind there is some overhead...)
The real challenge is that software developers continue to try to squeeze more efficiency out of their products, so the benefit from HT drops. If you get to 90% efficient, your HT benefit drops to ~10% or less.
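A back-of-the-envelope way to see this: HT can only fill the issue slots the primary thread leaves idle, so its ceiling is roughly 1 minus the single-thread efficiency. A minimal sketch (the function name is made up, and this ignores the sharing overhead mentioned above, so real gains sit below these numbers):

```python
def max_ht_benefit(single_thread_efficiency):
    """Upper bound on HT speedup if a second thread could use every
    idle issue slot left by the first thread."""
    return 1.0 - single_thread_efficiency

# As code gets more efficient, the room HT has to work with shrinks.
for eff in (0.85, 0.90, 0.95):
    print(f"{eff:.0%} efficient -> at most {max_ht_benefit(eff):.0%} from HT")
```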
Now, you could easily boost the efficiency of HT by adding more pipelines. But that is no different than adding more cores. Would you rather have two cores with 4 pipelines, or one core with 8 pipelines and HT?
Some of this obviously becomes a math problem at that point, but having more cores is a better choice, because an 8-pipeline core is going to suck up a huge amount of power even when it is not busy, while with two 4-pipeline cores you could shut one down and cut the power in half during periods of low utilization.
Everything in life is a tradeoff.
Any comment on how additional threads will be prioritized with different modules, if at all? I'm not sure, but I seem to remember you mentioning last year that you guys were working with software companies to push threads 2/3/4 onto other modules. Is that plan working, is it moot b/c the computer will automatically do it anyway (like it does with HT), etc?
So if that works as planned, will running 4 simultaneous threads mean that you will have all 8 cores running at full voltage/frequency? So is there a performance/watt hit if you are running 1-4 threads compared to 5-8 threads?
One other thing you have to realize, is that since the processor is power gated at the module level, it may be considered preferable to have threads bunched together on the same module even with the minor penalties they get from sharing resources. The reason for this is that if they are grouped in the same module, then the other modules could be disabled, and the active module would get a higher turbo-core boost, which may make up for the penalties from sharing resources.
No comment on work we are doing with other companies, never allowed to comment on that.
Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.
Quote:
Originally Posted by bryanW1995
Any comment on how additional threads will be prioritized with different modules, if at all? I'm not sure, but I seem to remember you mentioning last year that you guys were working with software companies to push threads 2/3/4 onto other modules. Is that plan working, is it moot b/c the computer will automatically do it anyway (like it does with HT), etc?
No comment on work we are doing with other companies, never allowed to comment on that.
Quote:
Originally Posted by drizek
So if that works as planned, will running 4 simultaneous threads mean that you will have all 8 cores running at full voltage/frequency? So is there a performance/watt hit if you are running 1-4 threads compared to 5-8 threads?
see next comment.
Quote:
Originally Posted by Martimus
One other thing you have to realize, is that since the processor is power gated at the module level, it may be considered preferable to have threads bunched together on the same module even with the minor penalties they get from sharing resources. The reason for this is that if they are grouped in the same module, then the other modules could be disabled, and the active module would get a higher turbo-core boost, which may make up for the penalties from sharing resources.
Yes, people are starting to come around. Everyone was getting all wrapped up in "how do I spread my threads out across modules so that I have one thread per module?" Yes, you get a performance increase with that, but it is marginal. However, running threads on the same module allows for a) sharing of L2 cache for apps that are utilizing the same data set, and b) the other modules to be shut down, reducing power and increasing the ability to boost.
Lots of people don't get it. You have a maximum amount of power that the processor can consume. You may be better off concentrating that power on fewer modules to achieve higher clocks than trying to spread threads out to get 100% of the L2 resources.
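As a toy illustration of that power-budget point (not AMD's actual numbers): using the classic rule of thumb that dynamic power scales roughly with f x V^2, with voltage rising roughly with frequency, power goes as ~f^3, so halving the active modules under a fixed budget buys roughly a 26% clock bump. All values here (TDP, base clock, module count) are hypothetical.

```python
# Hypothetical chip: fixed power budget shared by the active modules.
TDP = 100.0          # total power budget in watts (made-up number)
BASE_MODULES = 4     # modules active at the base clock
BASE_CLOCK = 3.0     # base clock in GHz (made-up number)

def boosted_clock(active_modules):
    """Clock achievable when only `active_modules` share the full budget,
    assuming P ~ f^3 (so f scales with the cube root of per-module power)."""
    per_module_budget = TDP / active_modules
    base_per_module = TDP / BASE_MODULES
    return BASE_CLOCK * (per_module_budget / base_per_module) ** (1 / 3)

for n in (4, 2, 1):
    print(f"{n} module(s) active -> ~{boosted_clock(n):.2f} GHz")
```

Under this crude model, going from 4 active modules to 2 takes the hypothetical 3.0 GHz base to about 3.78 GHz - which is why packing threads and gating modules can beat spreading them out.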
Ultimately all of this becomes really academic, because threads start and finish at different times. Fire up a program and it might instantly utilize all of the threads, but once it starts running, each thread is going to start and stop at a different time. Take a look at an F1 race: every car starts out at the same place at the same time, yet some win by multiple laps and they never finish in order.
Too many people focus on the theoretical and orderly and not the reality of how things are processed.
1. Core i7 refers to Nehalem, Westmere, and Sandy Bridge, which vary in performance.
2. Core i7 is more than 50% faster than Phenom II... so how can it be 50% faster than both?
3. I'll believe it when I see it.
