Some Bulldozer and Bobcat articles have sprung up

Ben90

Platinum Member
Jun 14, 2009
2,866
3
0
I'm really rooting for Bulldozer although I'm not getting my hopes up.
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
From Anand's article

While there are two integer schedulers in a single Bulldozer module (one for each thread), there’s only one FP scheduler. There’s some hardware duplication at the FP scheduler to allow two threads to share the execution resources behind it. While each integer core behaves like an independent core, the FP resources work as they would in a SMT (Hyper Threading) system.

Does that mean it will have a performance penalty when both threads need to execute an instruction in the FP execution unit simultaneously? Because only one instruction can be executed (in the FP) each cycle?
 

Eeqmcsq

Senior member
Jan 6, 2009
407
1
0
Tom's Hardware's article brings up an interesting point. Can the OS scheduler treat all Bulldozer "cores" the same? Is there a performance hit if the OS schedules 2 threads onto the same module? Or should the OS scheduler look to schedule the 2nd thread on a different module than the 1st thread?
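To make the question concrete, here is a minimal sketch of a "spread across modules first" placement policy. It is not how any real OS scheduler works, and whether sharing a module actually costs performance is exactly the open question. The names `place_threads` and `shared_modules` are made up for illustration, and the only hardware assumption is two integer cores per module.

```python
from collections import Counter

CORES_PER_MODULE = 2  # assumption: two integer cores share one module


def place_threads(num_threads, num_modules):
    """Assign threads to cores, using one core in every module before doubling up.

    Returns a list of (module, core_in_module) tuples, one per thread.
    """
    total_cores = num_modules * CORES_PER_MODULE
    if num_threads > total_cores:
        raise ValueError("more threads than cores")
    placements = []
    # First pass takes core 0 of every module, second pass takes core 1.
    for core in range(CORES_PER_MODULE):
        for module in range(num_modules):
            if len(placements) == num_threads:
                return placements
            placements.append((module, core))
    return placements


def shared_modules(placements):
    """Count modules that host more than one thread."""
    per_module = Counter(module for module, _ in placements)
    return sum(1 for count in per_module.values() if count > 1)


if __name__ == "__main__":
    for threads in (2, 4, 5, 8):
        layout = place_threads(threads, num_modules=4)
        print(threads, "threads ->", shared_modules(layout), "shared modules")
```

With 4 modules, up to 4 threads each get a module to themselves under this policy; only the 5th thread onward has to share.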
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
From Tom's article
http://www.tomshardware.com/reviews/bulldozer-bobcat-hot-chips,2724-2.html

I also asked John about the front-end’s instruction/cycle capabilities and the shared L2’s capacity configuration, but neither of those details is available yet. What he could tell me was that the 128-bit FP units are symmetrical, and that, on any cycle, either integer core can dispatch a 256-bit AVX instruction (assuming software compiled to support AVX). Or, both integer cores can dispatch a single 128-bit instruction at the same time.

So there will be a performance penalty only when we have a 256-bit AVX instruction?

With instructions of up to 128 bits, two of them can be dispatched simultaneously in the FP unit.
 

jones377

Senior member
May 2, 2004
450
47
91
Bulldozer looks like it was designed for servers primarily. Compared with Sandy Bridge, its FPU/SSE/AVX performance looks completely inadequate. So a BD module can do either 2*128 SSE ops per cycle or 1*256 AVX per cycle (possibly 2*256 every 2 cycles). For SSE this would be the same as a Phenom2/Core/Nehalem core but with symmetrical units instead (should improve performance). Sandy Bridge on the other hand can do 2*256 AVX per core per cycle, 2x that of a BD module and 4x of a BD core.

An 8-core BD could potentially be outperformed by an X6 in stuff like media encoding! And it looks like it will be completely smoked by Sandy Bridge. I hope the other improvements will make up for this deficit in FPU throughput or AMD will need software to be recompiled to use its FMAC units just to keep up with the older generation.
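The arithmetic behind those ratios, as a quick sketch. It uses only the per-cycle figures claimed in this post; the Sandy Bridge number in particular is the post's assumption, not a confirmed spec, and peak bits per cycle ignores clock speed and everything else that decides real performance.

```python
# Peak FP bits per cycle, taken from the figures claimed in the post.
BD_MODULE_BITS = 2 * 128             # two 128-bit ops, or one 256-bit AVX op
BD_CORE_BITS = BD_MODULE_BITS // 2   # two cores share one module's FP unit
SB_CORE_BITS = 2 * 256               # the post's assumption for Sandy Bridge
PHENOM2_CORE_BITS = 2 * 128          # "same as a Phenom2 core" for SSE


def chip_bits(units, bits_per_unit):
    """Peak FP bits per cycle for a whole chip."""
    return units * bits_per_unit


sb_vs_module = SB_CORE_BITS / BD_MODULE_BITS
sb_vs_core = SB_CORE_BITS / BD_CORE_BITS
bd_8core = chip_bits(4, BD_MODULE_BITS)   # 8 cores = 4 modules
x6 = chip_bits(6, PHENOM2_CORE_BITS)      # 6 Phenom II cores

print("SB core vs BD module:", sb_vs_module)
print("SB core vs BD core:", sb_vs_core)
print("8-core BD:", bd_8core, "bits/cycle, X6:", x6, "bits/cycle")
```

On these numbers alone the X6 has 6 FP units against 4, which is where the media encoding worry comes from.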
 

bunnyfubbles

Lifer
Sep 3, 2001
12,248
3
0
I'm really rooting for Bulldozer although I'm not getting my hopes up.

Same here, and in any case new tech can't get here soon enough for me. I'm ready to upgrade but cannot yet justify leaving my "venerable" but still very potent (thanks to overclocking :D) C2Q platform.

A Phenom X6 rig would do the trick if I could eventually drop a Bulldozer into it, but it looks like that will require AM3+ :\

At least my wallet will be happy for the time being.
 

Soleron

Senior member
May 10, 2009
337
0
71
Does that mean it will have a performance penalty when both threads need to execute an instruction in the FP execution unit simultaneously? Because only one instruction can be executed (in the FP) each cycle?

I believe John Fruehe said they can be executed in the same cycle.
 

khon

Golden Member
Jun 8, 2010
1,319
124
106
Seems to me there's very little new information here. Those same design schematics have been around for a while.

I was hoping AMD would start to provide some real numbers, but I'm not seeing any.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Seems to me there's very little new information here. Those same design schematics have been around for a while.

I was hoping AMD would start to provide some real numbers, but I'm not seeing any.

Yes, that's true, but there is some very important information that can be gleaned from the article.

1. What did 4-wide per core mean? There was a debate whether it was 4 ALUs or 2 AGUs and 2 ALUs, or a combination of both. We now know from the Anandtech article it's 2 ALUs and 2 AGUs.

2. Earlier rumors indicated an anemic 8KB L1 cache for Bulldozer. The new info says that the L1 I-cache is practically unchanged while the L1 D-cache is 16KB, but it's per core (from now on when I say "core" it means what the manufacturer of that particular CPU prefers to call it).

3. The "integer cores require 5 percent more die" debate. The new info says it's 12 percent per module, and 5 percent for the entire chip. The interesting thing is we might be able to make rough die size estimates for Bulldozer.
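As a rough sketch of the kind of estimate point 3 allows: if the same added area is 12 percent of the modules but only 5 percent of the whole chip, the modules' share of the die falls out as a ratio. This assumes both percentages are measured the same way and describe the same extra silicon, which the articles don't confirm, and it gives a proportion only, not an absolute die size. `module_share_of_die` is a made-up helper name.

```python
def module_share_of_die(per_module_overhead, per_chip_overhead):
    """Fraction of the die taken up by the modules.

    If extra = per_module_overhead * modules_area
    and extra = per_chip_overhead * die_area,
    then modules_area / die_area = per_chip_overhead / per_module_overhead.
    """
    return per_chip_overhead / per_module_overhead


share = module_share_of_die(0.12, 0.05)
print(f"Modules take roughly {share:.1%} of the die")
```

That would leave a bit under 60 percent of the die for everything outside the modules, if the assumptions hold.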
 

JFAMD

Senior member
May 16, 2009
565
0
0
From Anand's article



Does that mean it will have a performance penalty when both threads need to execute an instruction in the FP execution unit simultaneously? Because only one instruction can be executed (in the FP) each cycle?

No.

Each cycle can either execute a single 256-bit AVX dispatch OR two 128-bit FMAC dispatches.

So your FP choices are:

Core 1: 256-bit AVX
Core 2: none

Core 1: none
Core 2: 256-bit AVX

Core 1: 128-bit FMAC
Core 2: 128-bit FMAC

And that is per cycle. It is dynamic, so it can change on each cycle depending on demands.
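The choices above can be written down as a toy model. This is only an illustration of the rule as stated in this post, not a description of the actual scheduler hardware: it assumes each core dispatches at most one FP op per cycle and that the module's two 128-bit units give 256 bits of total width. `is_valid_cycle` is a made-up name.

```python
from itertools import product

FP_WIDTH_BITS = 256          # assumption: two 128-bit FMAC units per module
OP_WIDTHS = (0, 128, 256)    # 0 means the core dispatches no FP op this cycle


def is_valid_cycle(core1_bits, core2_bits):
    """True if both cores' FP dispatches fit in the shared FP unit this cycle."""
    if core1_bits not in OP_WIDTHS or core2_bits not in OP_WIDTHS:
        return False
    return core1_bits + core2_bits <= FP_WIDTH_BITS


# All combinations that fit, and the ones that use the full width.
valid = [c for c in product(OP_WIDTHS, repeat=2) if is_valid_cycle(*c)]
full = [c for c in valid if sum(c) == FP_WIDTH_BITS]

print("valid:", valid)
print("full width:", full)
```

The three full-width combinations are the three cases listed above; the choice is made again on every cycle.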
 

JFAMD

Senior member
May 16, 2009
565
0
0
Seems to me there's very little new information here. Those same design schematics have been around for a while.

I was hoping AMD would start to provide some real numbers, but I'm not seeing any.

That is the press deck, not the deck that we are showing at hot chips.
 

JFAMD

Senior member
May 16, 2009
565
0
0
Bulldozer looks like it was designed for servers primarily. Compared with Sandy Bridge, its FPU/SSE/AVX performance looks completely inadequate. So a BD module can do either 2*128 SSE ops per cycle or 1*256 AVX per cycle (possibly 2*256 every 2 cycles). For SSE this would be the same as a Phenom2/Core/Nehalem core but with symmetrical units instead (should improve performance). Sandy Bridge on the other hand can do 2*256 AVX per core per cycle, 2x that of a BD module and 4x of a BD core.

An 8-core BD could potentially be outperformed by an X6 in stuff like media encoding! And it looks like it will be completely smoked by Sandy Bridge. I hope the other improvements will make up for this deficit in FPU throughput or AMD will need software to be recompiled to use its FMAC units just to keep up with the older generation.

I wish Intel would be as forthcoming as AMD on future product data. As I have seen it discussed, SB is not 2x256 for AVX, but instead "double pumped" 128-bit (i.e. running at 2X the speed, it can put 2 128-bit instructions through the pipes per cycle).

The best evidence for this is the die shot that shows the SB FPU being the same size as the current 128-bit FPUs.

If this is the case (as the experts have claimed) then you have 2 problems. First is heat. The FPU is one of the hotter parts of the chip. Running it at twice the clock speed is going to make it disproportionately hot. Not good.

The second issue is that in non-AVX code (i.e. everything available now) you only have a single 128-bit FPU (you can't double pump legacy code.) So you end up with 8 x 128-bit, or half of what Bulldozer has.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
From Tom's article
http://www.tomshardware.com/reviews/bulldozer-bobcat-hot-chips,2724-2.html


So there will be a performance penalty only when we have a 256-bit AVX instruction?

With instructions of up to 128 bits, two of them can be dispatched simultaneously in the FP unit.

There should be no performance penalty on Bulldozer when running 256-bit instructions. The drop in performance is when running two 128-bit instructions, because of some shared FP resources per module.

AMD will be targeting each Bulldozer module (2 cores + 256-bit FPU) against one Sandy Bridge core (2 threads with HT + 256-bit FPU). So a 4-module Bulldozer can run 8 threads and 4x256-bit FP instructions, the same as a quad-core Sandy Bridge. The difference in performance is in the implementation and architecture.

I would actually worry more about Bulldozer's integer performance, since that's the area where AMD is really behind Intel now.
 

JFAMD

Senior member
May 16, 2009
565
0
0
From Tom's article

So there will be a performance penalty only when we have a 256-bit AVX instruction?

With instructions of up to 128 bits, two of them can be dispatched simultaneously in the FP unit.

A single 256-bit instruction can be dispatched in one cycle as well.

There is no performance penalty anywhere.
 

JFAMD

Senior member
May 16, 2009
565
0
0
Wow that article from Anand shows just how horribly late AMD is with BD.

Actually it was originally supposed to be earlier, but that was a different, 45nm design. That design was scrapped and we went to 32nm. You really can't compare the two.
 

Rifter

Lifer
Oct 9, 1999
11,522
751
126
I'm waiting until we have benchmarks before I get excited. Too early to tell how this is going to work out in real world performance. I just really hope that AMD catches Intel by surprise on this one like they did with the A64. It would cause price drops for Intel, and force them to put some real effort into their products. Something which AMD has not been able to force them to do since the A64.
 

Tsavo

Platinum Member
Sep 29, 2009
2,645
37
91
Actually it was originally supposed to be earlier, but that was a different, 45nm design. That design was scrapped and we went to 32nm. You really can't compare the two.

So, does it rock or not? Cough it up!
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
I'm waiting until we have benchmarks before I get excited. Too early to tell how this is going to work out in real world performance. I just really hope that AMD catches Intel by surprise on this one like they did with the A64. It would cause price drops for Intel, and force them to put some real effort into their products. Something which AMD has not been able to force them to do since the A64.

That's funny. Because from Pentium III to end of Pentium 4 we got $1500+ processors that got little FSB and cache bumps, and in 2006 Intel went with big advancements every year. What's the most expensive desktop CPU now? Oh right, the $999 CPU that isn't classified mainstream like the first Willamette Pentium 4's were.

To be fair, AMD got bigger advancements than before. Every year the demise of x86 and/or Moore's Law is predicted but every time they were proven wrong and the manufacturers do better than before. Yet complaints increase at the same time. Oh well, I guess the CPU sales are increasing so it doesn't really matter.
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
Interesting read. Thanks for the info JFAMD. With the changes to pipeline depth I wonder what clockspeeds will be like this time around, another Barcelona (new arch launched at lower than expected clockspeeds) would not be desirable. It looks like AMD is getting it done right this time. Also, Quad channel DDR3 perhaps? :)
 

Tsavo

Platinum Member
Sep 29, 2009
2,645
37
91
Interesting read. Thanks for the info JFAMD. With the changes to pipeline depth I wonder what clockspeeds will be like this time around, another Barcelona (new arch launched at lower than expected clockspeeds) would not be desirable. It looks like AMD is getting it done right this time. Also, Quad channel DDR3 perhaps? :)

Quad channel in the server space, double everywhere else.

I do believe.
 

brybir

Senior member
Jun 18, 2009
241
0
0
I am excited for Bulldozer just in a general sense of enjoying new technology that pushes the boundaries.

I also enjoy alternative (not meant in a bad way) development of solutions. Look at each Bulldozer module. It could have been 1 integer unit and one FPU, and AMD could perhaps have developed some sort of SMT or other type of hyper threading to make it work. Instead, they recognized that 80-90% of workloads are integer heavy, so they added a second integer unit which, if the integer unit and its components are good, should provide better performance in threaded applications than an SMT or hyper threading scenario (2 Bulldozer modules with 4 integer cores vs. a 2-core Sandy Bridge with 4 threads: if all else is equal, the 4 hardware integer units should have an advantage most of the time).


Definitely looking forward to the comparisons. Was just about to drop the $ for a 6 core Thuban, but now I may wait until next summer and see what my $ will get me.