Phenom II X4 945 or FX 4100?

Hi-Fi Man

Senior member
Oct 19, 2013
601
120
106
I've got a buddy of mine with a Phenom II X4 945 right now. I managed to salvage an FX 4100 out of a crap board and was wondering how exactly does the 4100 compare to the Phenom nowadays. I remember the Phenoms were usually faster back in the day but I'm curious if the newer instruction set on the FX has given it the edge over time.
 

NTMBK

Lifer
Nov 14, 2011
10,232
5,012
136
The 4100 at least supports AVX and SSE4.1/4.2 instructions, so he will get better compatibility with the 4100. There's a few games these days that won't run on Phenom II.
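
For anyone who wants to check this on their own box, here is a minimal sketch (Linux only, and only an illustration). It assumes the kernel's flag spellings in /proc/cpuinfo (`sse4_1`, `sse4_2`, `avx`); the helper names `parse_cpu_flags` and `missing_flags` are made up for this example.

```python
# Flag names as the Linux kernel spells them in /proc/cpuinfo.
REQUIRED_FLAGS = {"sse4_1", "sse4_2", "avx"}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags the CPU does not report, sorted by name."""
    return sorted(set(required) - parse_cpu_flags(cpuinfo_text))


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable
    print("missing:", missing_flags(text) or "none")
```

A game that hard-requires one of these extensions would be expected to fail on a CPU where the list is not empty, though what a given title actually checks for is up to its developers.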
 

LTC8K6

Lifer
Mar 10, 2004
28,520
1,575
126
The 4100 at least supports AVX and SSE4.1/4.2 instructions, so he will get better compatibility with the 4100. There's a few games these days that won't run on Phenom II.
Okay, that would be a decider.
 

Eug

Lifer
Mar 11, 2000
23,586
1,000
126
If you’re going to go through the trouble of swapping CPUs, perhaps you should consider spending some bux on a faster 95 W multi-core FX chip from eBay (if your mobo BIOS supports it, that is).

If I could have, I would have upgraded my system to a 95 W FX 6xxx or 8xxx, but I couldn’t, so I went with the 95 W Phenom II 1055T. (The 1065T was too expensive for its marginal improvement in performance.) I'm pleased with the Phenom 1055T, but was not entirely happy with my previous Athlon II X3 435, even for just regular daily non-gaming usage (e.g. surfing and Netflix). Your two chips are closer in speed to my Athlon tri-core than to my Phenom hex-core, whereas even a cheap FX 6100 would be significantly faster than my Phenom hex-core.

Just a thought, if you plan on keeping the machine and want to use it for more than word processing and email, etc.
 

MajinCry

Platinum Member
Jul 28, 2015
2,495
571
136
Bulldozer is a performance regression from Phenom II, and also has worse draw call performance. I'd say stick with the Phenom II. If it was a Piledriver CPU, then ditch the Phenom for the newer instruction sets.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Bulldozer has higher performance(wider core and wider I/O), lower nominal power (HKMG gate from 32nm), and has AVX and XOP(Alt-AVX2) which are supported in JIT/Runtime in DX11.x/DX12 (SM6.x) compilers. XOP is used as replacement to AVX2, when it does not exist.
 

ao_ika_red

Golden Member
Aug 11, 2016
1,679
715
136
Check the board's CPU compatibility and, if you can, get an FX-43xx CPU instead. That FX-41xx CPU is a dud. Also consider the FX-8300 and FX-8370E, because both fall in the same 95W TDP but you get 8 threads instead of just 4.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,795
136
Bulldozer has higher performance(wider core and wider I/O)

Nope.

lower nominal power (HKMG gate from 32nm)

Maybe, but it is still nothing impressive.

and has AVX

True.

and XOP(Alt-AVX2) which are supported in JIT/Runtime in DX11.x/DX12 (SM6.x) compilers. XOP is used as replacement to AVX2, when it does not exist.

Considering XOP is only in the Con cores and has already been ditched in Zen, I think it's safe to say it's not seeing much use.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
10h -> 72 macro-ops in flight and 24 macro-ops in execution; 3 macro-op execution blobs. Scheduler A can't execute Scheduler B or C micro-op. ALU0 gets ALU-op 1 and AGU0 gets AGU-op 8, but AGU0 can not get AGU-op 24 and ALU0 can not get ALU-op 9.
15h -> 128 macro-ops in flight and 40 macro-ops in execution; 4 macro-op execution blobs. There are no independent schedulers. So, EX0 can get any ALU ops from macro-op 1 to macro-op 40 and AGLU0 can get any AGU ops from macro-op 1 to macro-op 40. (AGLU0/AGLU1 can also do simple ALU tasks, which are used a lot, so 4 ALU ops can be given if macro-op 1 to macro-op 40 have the correct ops for all four ALUs.)

Each Bulldozer core allows for significantly more memory ops and computational ops. The I/O in Zambezi is also larger and faster than it is in Thuban, and thus also than it is in Deneb. So, nominal performance of any Bulldozer processor will always be faster than the nominal performance of any Greyhound+ processor.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
10h -> 72 macro-ops in flight and 24 macro-ops in execution [....]
15h -> 128 macro-ops in flight and 40 macro-ops in execution; 4 macro-op execution blobs.

Each Bulldozer core allows for significantly more memory ops and computational ops. While, the I/O in the Zambezi is also larger and faster than it is in Thuban, thus as well within Deneb. So, nominal performance of any Bulldozer processor will always be faster than the nominal performance of any Greyhound+ processor.

Micro ops? Those are interesting stats if true, but it still wouldn't represent the width of the pipeline, but more the width*depth of the pipeline. And we all know dozer has a pipeline on the long side.

On the FPU side, I think maybe yes, the FPU is wider (yet shared), so dozer might get higher FPU single thread.

But as far as integer width it's k10's 6 wide (3+3) vs dozer's 4 wide (2+2). The AGU can substitute sometimes as ALU, but I thought this was just the case for later generation dozers (I very well might be wrong on this guess).

Back to the OP, I think of BD more as a prototype dozer, and PD as the first real dozer. BD is fine for general purpose home computing, but I wouldn't use a quad or even hex core BD for gaming. (I haven't used phenom in quite a while, so I don't know how it compares; you might google gta fps videos on youtube; my guess is it's getting too old and won't fare better than 4100.) I second the suggestion that you might not bother swapping CPU unless you can find a PD generation, like 4350 or 6300 minimum (4300 is the only model FX-PD that has half cache disabled). Also, what model board is this?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Micro ops? Those are interesting stats if true, but it still wouldn't represent the width of the pipeline, but more the width*depth of the pipeline. And we all know dozer has a pipeline on the long side.
Micro-ops are either ALU or AGU, Macro-ops are both, macro-ops are the internal RISC interpretation of the CISC AMD64 ISA.
Width is 3-wide and height is 12-long for K8+(Greyhound 00f_10h).
Width is 4-wide and height is 15-long for K10(Bulldozer 15f_00h).
Width is 8-wide and height is 17-long for Zen(17f-00h).
On the FPU side, I think maybe yes, the FPU is wider (yet shared), so dozer might get higher FPU single thread.
10h -> 42 FPU instructions in execution. 14 entries per for FADD, FMUL, FMISC.
15h -> 64 FPU instructions in execution. Unified scheduler for the two FMACs and two FADD(+1FMISC)s.
But as far as integer width it's k10's 6 wide (3+3) vs dozer's 4 wide (2+2). The AGU can substitute sometimes as ALU, but I thought this was just the case for later generation dozers (I very well might be wrong on this guess).
Greyhound's LSU only could handle 2 AGUs, thus only 2 ALUs would have been active in generic workloads. Bulldozer LSU handles double the Greyhound LSU. So, load/store frames are halved with Bulldozer. Faster to work, quicker to finish.

Bulldozer requires no optimization, other than ISA for FMAC usage. Other than that Bulldozer optimized is not much faster than Bulldozer not-optimized.
Greyhound requires significant optimization, to get around load-store unit and dependencies, etc.

If one has the opportunity to go FX-4100 over Phenom II X4 945, they should take the FX-4100.

Priority of upgrade; Deneb to Zosma(Thuban-QC with Turbo Core 1.0) to FX-4100(Zambezi-QC with Turbo-core 2.0). The better upgrade is the farthest to the right.
 

Hi-Fi Man

Senior member
Oct 19, 2013
601
120
106
I probably should have clarified that my question was more out of curiosity. The board he has now is an ASRock K10N78 which will not support any FX chips. It's a decent board nonetheless.

The board that I got the FX from is even worse though because it's a Gigabyte M68MT-SP2 rev3.1 which is one of those nForce 630a leftover boards adapted to work with AM3+. The board doesn't even officially support Vishera and only supports PCIe 1.1. It also has a barebones 3+1 phase design which I don't really trust to run anything. Plus the retention tab broke as soon as I released the lever on the heatsink!
 

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
The 4100 at least supports AVX and SSE4.1/4.2 instructions, so he will get better compatibility with the 4100. There's a few games these days that won't run on Phenom II.

I had Phenom II until late last year, and there was nothing that would not run on it.

I mean, sure, a lot of the stuff would not run well, but everything ran on it. I never did any scientific computing though.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
I probably should have clarified that my question was more out of curiosity. The board he has now is an ASRock K10N78 which will not support any FX chips. It's a decent board nonetheless.

The board that I got the FX from is even worse though because it's a Gigabyte M68MT-SP2 rev3.1 which is one of those nForce 630a leftover boards adapted to work with AM3+. The board doesn't even officially support Vishera and only supports PCIe 1.1. It also has a barebones 3+1 phase design which I don't really trust to run anything. Plus the retention tab broke as soon as I released the lever on the heatsink!

Well, the bottom-end boards are well matched, so I wouldn't swap anything.

Micro-ops are either ALU or AGU, Macro-ops are both, macro-ops are the internal RISC interpretation of the CISC AMD64 ISA.
Width is 3-wide and height is 12-long for K8+(Greyhound 00f_10h).
Width is 4-wide and height is 15-long for K10(Bulldozer 15f_00h).
Width is 8-wide and height is 17-long for Zen(17f-00h).
10h -> 42 FPU instructions in execution. 14 entries per for FADD, FMUL, FMISC.

There are always a lot of estimates floating around for depth of pipelines. I've seen varying info. IMHO the guesses below are reasonable:

K10 (and K7, K8) may have been 6 wide (3+3), and the depth of K10 was roughly comparable with Bobcat and Jaguar (which are 12 and 13; see: https://www.realworldtech.com/jaguar/ )

As for dozer it's suspected the depth is a little upwards of 20. "The exact number is not known, but it's in the lower twenties." https://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/2

For Zen, supposedly a maximum of 192 instructions can be in flight (that's the retire queue). So my guess is that Zen is ~20 deep. The integer core is 6 wide, and there are two FPU units, each with two pipelines. So there are a total of 10 pipelines (counting the FPU). Retire width is 8. 192/10 ≈ 19.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
There are always a lot of estimates floating around for depth of pipelines. I've seen varying info. IMHO I think the below guesses are reasonable:
So my guess zen is ~20 deep.

Bobcat is 12-long (Shown by AMD; Bobcat Pipeline slide) (14-cycle mispredict length)
Jaguar is 13-long (Shown by AMD; Jaguar Pipeline slide) (15-cycle mispredict length)
Bulldozer is 15-long (Mispredict; L1 is 1 cycle and L2 is 4 cycle => 20 cycles minus 5 => 15-long pipeline)
Zen is 17-long (Mispredict; L1 is 1 cycle and L2 is 1 cycle => 19 cycles minus 2 => 17-long pipeline)
L0i does not have a calculation penalty as it is less than the pipeline. L0i hits on the control unit, loop, or macro-op caches is always less.
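
The subtraction used in the list above can be written out as a tiny sketch. This only reproduces the rule of thumb as stated in this post (mispredict length minus the L1 and L2 cycle figures); the function name is made up, and whether the rule is a valid way to get pipeline depth is an assumption, not an established method.

```python
def pipeline_length(mispredict_cycles, l1_cycles, l2_cycles):
    """Pipeline length per the post's rule: mispredict minus cache cycles."""
    return mispredict_cycles - (l1_cycles + l2_cycles)


# Figures quoted in the post above.
print("Bulldozer:", pipeline_length(20, 1, 4))  # 20 - 5
print("Zen:", pipeline_length(19, 1, 1))        # 19 - 2
```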
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Bobcat is 12-long (Shown by AMD; Bobcat Pipeline slide) (15-cycle mispredict length)
Jaguar is 13-long (Shown by AMD; Jaguar Pipeline slide) (16-cycle mispredict length)
Bulldozer is 15-long (Mispredict; L1 is 1 cycle and L2 is 4 cycle => 20 cycles minus 5 => 15-long pipeline)
Zen is 17-long (Mispredict; L1 is 1 cycle and L2 is 1 cycle => 19 cycles minus 2 => 17-long pipeline)
L0i does not have a calculation penalty as it is less than the pipeline. L0i hits on the loop or macro-op caches is always less.

I don't see what the cache cycle has to do with this. Also, that was minimum mispredict cycles. The real penalty is in the 20s typically.

The minimum branch prediction penalty of the Bulldozer chip is indeed in the same range as Pentium 4. However, the maximum penalty could be a horrifying 100 cycles or more on the P4, while it's a lot lower on Bulldozer. In most common scenarios, the Bulldozer's branch misprediction penalty will be below 30 cycles.

Secondly, the Pentium 4's pipeline was 28 ("Willamette") to 39 ("Prescott") cycles. Bulldozer's pipeline is deep, but it's not that deep. The exact number is not known, but it's in the lower twenties. Really, Bulldozer's pipeline length is not that much higher than Intel's Nehalem or Sandy Bridge architectures (around 16 to 19 stages).

The author states that BD has a depth in the lower 20s. This implies that the minimum penalty of 20 listed for BD in the table is less than the depth of the pipeline. (Wild guess: It may in some mispredicts just keep the pipeline flowing and disregard some results coming out.)

Also, if Zen is 10 pipes wide (counting FPU), and supposing it had 17 stages (meaning 170 max instructions in flight), why would it need a retire queue of 192? That's over 20 extra entries.

Table of minimum mispredict penalty:

Architecture                      | Branch Misprediction Penalty
AMD K10 (Barcelona, Magny-Cours)  | 12 cycles
AMD Bulldozer                     | 20 cycles
Pentium 4 (NetBurst)              | 20 cycles
Core 2 (Conroe, Penryn)           | 15 cycles
Nehalem                           | 17 cycles
Sandy Bridge                      | 14-17 cycles

If I revise down my ~20 depth closer to your estimates, say to 18, then:

18*6=108
And (192-108)=84.

Dividing 84 by 4 (width of FPU) gives our estimate for the FPU pipeline depth, 21.

So Zen might be 18 deep on the integer side and 21 for the FPU.

Otherwise, with the integer side 17 deep, the FPU would be ~22 deep: (192 - 17*6) / 4 = 90 / 4 ≈ 22.

Let's go with 18. These just count the number of stages after the dispatch. So one still needs to add some stages on the front end. So we're back to something like ~20 or 21.
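
To make the back-of-envelope math above easy to replay with other guesses, here is a rough sketch. It assumes, as the estimate does, that the 192-entry retire queue is fully used by a 6-wide integer side plus a 4-wide FPU; the function name and defaults are just for illustration.

```python
def fpu_depth_estimate(retire_queue, int_depth, int_width=6, fpu_width=4):
    """Entries left after the integer pipes, spread over the FPU pipes."""
    remaining = retire_queue - int_depth * int_width
    return remaining / fpu_width


for int_depth in (18, 17):
    print(int_depth, "deep integer ->", fpu_depth_estimate(192, int_depth), "deep FPU")
```

These are stages after dispatch only, so the front-end stages still have to be added on top.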
 
Jan 10, 2018
86
2
11
If we compare both processors on performance, then the FX 4100 is the clear winner. The newer FX series uses a small die, which makes it power efficient, and it supports newer memory with higher bandwidth, which makes it more efficient than older models.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,795
136
Never in the year 2018 did I expect anyone to be recommending a Bulldozer CPU. If it were a Piledriver 6/8 core, an argument could be made. More often than not though, it is better to save the money for something better IMO.

The PD CPUs were at least an honest alternative to Intel. And the low power Steamroller and Excavator APUs were actually surprisingly good. It's too bad the OEMs never bothered with the more desirable ones.

As for NostaSeronx, full disclosure, he is the biggest Con core cheerleader I have seen. He seems to get these ideas that we have not seen the last of them, or that parts of their technology will resurface at some time. Take a look at the first page of the Steamroller review below though to see why calling BD wider than K10 is rather disingenuous. Skip to the part about "Front End Improvements".

https://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture

Also, when you have to do some number manipulation to try to show that BD has a 15 stage pipeline, well, I don't know what to say.

Bobcat is 12-long (Shown by AMD; Bobcat Pipeline slide) (14-cycle mispredict length)
Jaguar is 13-long (Shown by AMD; Jaguar Pipeline slide) (15-cycle mispredict length)
Bulldozer is 15-long (Mispredict; L1 is 1 cycle and L2 is 4 cycle => 20 cycles minus 5 => 15-long pipeline)
Zen is 17-long (Mispredict; L1 is 1 cycle and L2 is 1 cycle => 19 cycles minus 2 => 17-long pipeline)
L0i does not have a calculation penalty as it is less than the pipeline. L0i hits on the control unit, loop, or macro-op caches is always less.

All that said, the FX 4100 would be better in certain areas. Video transcoding comes to mind. As transcoding is very forgiving of long pipelines (see: P4), plus the availability of AVX, the FX would be a much better fit there. Anything else that makes heavy use of AVX would also benefit a good bit. As for games, there may be some games that cannot run on K10 at all, but I seriously doubt an FX 4100 would run most of those acceptably anyway. If you are going to go for an FX, see if you could grab an FX-6300 somewhere instead. That would give you two extra cores, and Piledriver vs Bulldozer. PD really got power consumption under control compared to BD.
 

naukkis

Senior member
Jun 5, 2002
705
576
136
I have an A6-3670 @ 3GHz and an Athlon 750K @ 4.4GHz, both clocks that are easily achieved. The 750K is about twice as fast in normal desktop use and is still in use; the 3670 is so slow that it is a pain to use even for web browsing. Phenom 945 vs FX 4100 should be an almost identical situation, so yes, there's a massive difference in those chips' performance. Those Anandtech benches don't have any single-thread integer application for that difference to show up. The multithreaded performance difference is smaller because the 945 is a real 4-core CPU and the FX 4100 isn't, yet the FX is still faster in multithreaded applications too.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Take a look at the first page of the Steamroller review below though to see why calling BD wider than K10 is rather disingenuous. Skip to the part about "Front End Improvements".

https://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture

Well, that's a good link, I completely didn't realize K10 had such a narrow (3 instruction wide) front end. So Nosta had a good point, and in a sense, dozer was 33% wider than K10, with its 4 instruction decode. Though in another sense (number of pipelines) K10 was wider than dozer (3+3 vs 2+2). Here is a comparison of BD-era front ends: https://www.realworldtech.com/bulldozer/5/

I have a revised estimate for Zen depth. If we assume that Zen is similar to Bobcat with 8 stages until dispatch, and if we assume the retire queue, 192 entries, is split evenly between 4-wide FPU and 6-wide Integer core (i.e. 96+96 = 192 ) then the number of stages after dispatch is:
FPU 96/4=24
INT 96/6=16

So adding the estimated number of stages before dispatch (~8) to the number of stages after dispatch gives us an estimate of a 24 deep pipeline for Zen (and 32 deep for its FPU). This is really close to what some expect for dozer (maybe just 1 or 2 stages longer).
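
The revised estimate above boils down to the following sketch. Everything here is the guesswork from this post turned into code: an even split of the 192-entry retire queue, a 6-wide integer core, a 4-wide FPU, and roughly 8 front-end stages borrowed from Bobcat. The function name is made up.

```python
def zen_depth_even_split(retire_queue=192, int_width=6, fpu_width=4,
                         front_end_stages=8):
    """Return (integer depth, FPU depth) under the even-split assumption."""
    half = retire_queue / 2                 # 96 entries for each side
    int_after_dispatch = half / int_width   # stages after dispatch, integer
    fpu_after_dispatch = half / fpu_width   # stages after dispatch, FPU
    return (int_after_dispatch + front_end_stages,
            fpu_after_dispatch + front_end_stages)


int_depth, fpu_depth = zen_depth_even_split()
print("integer:", int_depth, "FPU:", fpu_depth)
```

Changing `front_end_stages` or the split shows how sensitive the result is to those guesses.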
 

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,795
136
Correction, estimates above all wrong. Also, the minimum mispredict penalty for dozer is actually 15 cycles (for unconditional direct branches). Thank you Nosta, for the source https://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf

So it seems like the length of dozer's pipeline has been greatly exaggerated. It could be as low as 15.

Except that it's not. Otherwise it would outperform Phenom by a noticeable amount. I don't get it. Look at the benchmarks.