The point of Bulldozer isn't to have super high single-threaded performance, or all-out multi-threaded performance at the expense of everything else. It's to provide reasonable performance in the former, with the focus on the latter. It's time to analyze the architecture.
"Single-thread": When we say one architecture has a performance advantage in "single-thread", what are we referring to? True single-thread performance, using applications that only use 1 thread, or per-core performance, which shows how the cores interact with each other?
Pure single-thread performance is irrelevant. The gains in pure single-thread performance from a new architecture nowadays are not worth it. If we put AMD's Deneb/Thuban core at 1.0, here's what the single-thread performance of modern CPU architectures looks like.
Deneb/Thuban=1.0
Core 2 Penryn=1.05
Core ix Nehalem/Westmere=1.10
Core ix Sandy Bridge=1.2
Even back in the days of Athlon vs the first Pentium IIIs, where the latter was thought to be legendary, the performance differences weren't much more than 10% on average. If it weren't for quarterly increases in clock speed, the performance gains wouldn't have been worth it. Multi-thread is the new clock speed.
Multi-Core performance:
If you look at recent benchmarks though, Sandy Bridge looks to have a far better than 20% gain in "single thread" performance over Deneb/Thuban. That's not because there's something magical going on; it's just that Sandy Bridge's multi-core capabilities are superior to Deneb/Thuban's too.
That's where the "per-core" performance comes in. Note again this is different from pure single thread performance.
On paper we expect 2x the clock to bring us 2x the performance, and the same with 2x the cores. But just as a car with an engine that is 2x more powerful doesn't get 2x the top speed, the same goes for microprocessors. You need the supporting components to keep up.
The multi-thread advantage of Sandy Bridge over Deneb/Thuban comes from a number of things:
-Two level BTBs
-Memory controller with better latency/bandwidth
-Higher bandwidth caches
-Better handling of the data in the LLC with multiple cores
-SMT(Hyperthreading)
That's how a 1.2x advantage in single-thread performance turns into a 1.5-1.6x advantage in per-core performance (or per-chip performance).
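To spell out the arithmetic behind that jump, here's a rough sketch in Python. It assumes, purely for illustration, that per-core performance is the single-thread ratio multiplied by a separate multi-core factor. The helper name `implied_scaling` is made up for this example, and the inputs are the estimates from this post, not measurements.

```python
# Rough sketch: what extra multi-core factor turns a 1.2x single-thread
# lead into a 1.5-1.6x per-core lead? Assumes the two effects simply
# multiply, which is a simplification made for this illustration.

SINGLE_THREAD = 1.2          # Sandy Bridge vs Deneb/Thuban (estimate from the post)
PER_CORE_RANGE = (1.5, 1.6)  # per-core advantage (estimate from the post)


def implied_scaling(per_core: float, single_thread: float) -> float:
    """Hypothetical helper: solve per_core = single_thread * scaling."""
    return per_core / single_thread


low, high = (implied_scaling(p, SINGLE_THREAD) for p in PER_CORE_RANGE)
print(f"implied multi-core factor: {low:.2f}x to {high:.2f}x")
```

Under that assumption, the BTBs, memory controller, caches, LLC handling and SMT listed above would together have to account for roughly a quarter to a third more performance on top of the single-thread lead.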
Bulldozer changes:
Look at the die!: http://forums.anandtech.com/showthread.php?t=2146715
I've compared Westmere/Sandy Bridge/Bulldozer/Llano cores to each other. In most cases, core size determines how many resources are in the architecture. Westmere/Sandy Bridge cores are both almost 2x larger than Llano's, hence the significant performance advantage.
Compare Bulldozer to Sandy Bridge with the former's extra integer "core" taken out and the latter's L2 cache taken out, and Sandy Bridge's core is only 5-10% larger, or in this case has 5-10% more resources. That tells me that while Sandy Bridge will still have an advantage in both single-thread and per-core performance, Bulldozer itself is a substantial improvement, and is probably on par with Westmere.
Now, being on par with Westmere isn't a bad thing at all, because as I said in the first few paragraphs, single-thread performance gains are hard to get and the rewards are small.
Bulldozer adds:
-2 level BTB
-Significantly better memory controller
-L/S units with some restrictions lifted
-Module-based multi-threading which is superior to SMT
Oh, and it looks like the pipeline depth will end up similar to the Nehalem/Sandy Bridge architecture, at ~15 stages (basically the only reason new processors have better branch prediction is to make up for having more mispredictions).
Bulldozer performance estimates
It's pretty much guaranteed that Bulldozer will be a decent amount faster than the 4-core Sandy Bridge in well multi-threaded apps. There's little doubt about it. Its best gains will come in multimedia apps (like video editing apps) that take advantage of the FMAC, and in those applications there will be an additional gain of up to 2x. Now, how many programs can take advantage of the FMAC right off the bat is an unknown. If code needs to be recompiled, it's probably easier than for AVX.
I mean, how many PC users need more CPU power in doing things other than video editing and 3D rendering anyway? The exact applications Bulldozer will shine at. Games? Give me a break. At the resolutions most people run, it'll be GPU bound. Sandy Bridge might turn out to be 5-10% faster in low-thread count games at mostly CPU bound resolutions.
In fact, I wouldn't be surprised if it performs like the 6-core Sandy Bridge in multi-threaded apps, which are the type of applications where you need 6 cores rather than 4.
If you've taken a look at the Sandy Bridge and Westmere cores, you'll notice that the L2 cache is not only small, but is very close to the CPU core. The space Intel uses for the L2 cache is not far off from the space Bulldozer requires for the extra integer "core".
-Small fast L2 cache in Westmere/Sandy Bridge = single thread focus
-Extra, small integer core in Bulldozer = multi-thread focus
Take 5-10% performance off Lynnfield and double the number of cores. That'd probably make it faster than the 980X. ~20% for max Turbo is not unexpected for Bulldozer, and we can probably see 10% gains for most applications. If Orochi comes at a 3.5GHz base clock, Intel had better have higher-clocked Sandy Bridge E chips to compete!
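Here's a back-of-the-envelope version of the "Lynnfield minus 5-10%, times two" estimate, as a Python sketch. The assumptions are mine and are crude: throughput scales linearly with core count, a 980X core is treated as equal to a Lynnfield core, and clock speed, Turbo and SMT are all ignored. `relative_throughput` is a made-up helper for illustration, not a real benchmark model.

```python
# Back-of-the-envelope model of the estimate above. Assumptions made here:
# linear scaling with core count, 980X core == Lynnfield core,
# and no accounting for clock speed, Turbo or SMT.

LYNNFIELD_CORES = 4
BULLDOZER_CORES = 2 * LYNNFIELD_CORES  # "double the number of cores"
I980X_CORES = 6
PER_CORE_PENALTY = (0.10, 0.05)        # "take 5-10% off Lynnfield"


def relative_throughput(cores: int, per_core: float) -> float:
    """Hypothetical model: total throughput in 'Lynnfield core' units."""
    return cores * per_core


i980x = relative_throughput(I980X_CORES, 1.0)
bulldozer = [relative_throughput(BULLDOZER_CORES, 1.0 - p) for p in PER_CORE_PENALTY]

for total in bulldozer:
    print(f"Bulldozer {total:.1f} vs 980X {i980x:.1f} -> {total / i980x:.2f}x")
```

Even with the per-core penalty, 8 slightly slower cores come out ahead of 6 in this simple model. Hyperthreading on the 980X would eat into that margin, which is why the clock speed Orochi launches at matters.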
"Single-thread": When we say one architecture has performance advantage in "single-thread", what are we referring to here? True single thread performance using applications that only use 1 thread, or per core performance, which shows how the cores interact with each other?