Will AMD ever be able to compete with Intel again?

Torn Mind · May 30, 2014

Capital of the human variety is very important in the semiconductor industry, where having a larger quantity of expensive skilled labor gives the company potential to do more.

Strategic mistakes also hurt them, such as delaying their process node cadence way back when. That agreement with GloFo is absolutely killing them as well.

bononos · May 30, 2014

TreVader said:
.....
AMD is better at engineering. It's manufacturing and money that got Intel here, not intelligent design. They would have ridden the sinking ship of NetBurst straight to bankruptcy if they hadn't paid off Dell and others to buy their processors.

AMD is the superior processor engineering company. From a business perspective, they just don't have Intel's killer instinct. They aren't willing to bribe their way into devices, which is how stuff works now.

That's a terrible argument, considering that AMD's current Bulldozer family is a flop and Intel has AMD beat from the top all the way to the bottom, especially after the arrival of Haswell. Why even talk about NetBurst when that is already very old news?

raghu78 · May 30, 2014

AtenRa said:
As I have said, the first-gen product was not performing as it should. Second-gen products like the FX-8350, using the Bulldozer-derived mArchitecture (Vishera), are performing as they should. Third-gen (Kaveri) products have higher IPC and MT performance using the Bulldozer-derived architecture (Steamroller).
All four products, including next year's Carrizo, are using the Bulldozer-derived architecture. So I wouldn't judge the mArchitecture from a first-gen product only and dismiss the other three.

Sorry, but I have to disagree. Even now the same problems exist:

1. Weak single thread performance (which affects the vast majority of desktop apps)
2. Weak FPU performance
3. Very poor cache performance
4. Poor perf/sq mm and perf/watt.
5. Too much die area wasted on cache. The amount of cache on an FX-8350 is ridiculous. What's worse, the cache latency is horrible. Even with 16 MB of cache, the FX-8350 cannot compete with a Core i7 3770K / 4770K, which has only 9 MB of cache (1 MB L2 + 8 MB L3).
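
The cache totals in point 5 can be checked with a few lines of arithmetic. This is only a sketch: the Intel split (1 MB L2 + 8 MB L3) is from the post, while the FX-8350 split (four modules with 2 MB of L2 each, plus 8 MB of L3) is an assumption taken from the published specifications, and `total_cache_mb` is a hypothetical helper name. L1 caches are ignored, as in the post.

```python
def total_cache_mb(l2_units, l2_kb_each, l3_mb):
    """Sum L2 (given per unit, in KB) and L3 (in MB) into a total in MB."""
    return l2_units * l2_kb_each / 1024 + l3_mb

# Assumed layout: 4 modules x 2 MB L2, plus 8 MB shared L3.
fx_8350_mb = total_cache_mb(l2_units=4, l2_kb_each=2048, l3_mb=8)

# From the post: 1 MB L2 in total (4 cores x 256 KB), plus 8 MB L3.
core_i7_mb = total_cache_mb(l2_units=4, l2_kb_each=256, l3_mb=8)

print(f"FX-8350: {fx_8350_mb:.0f} MB, Core i7: {core_i7_mb:.0f} MB")
```

The totals say nothing about latency, which is the other half of the complaint.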

AMD's products have to be judged in respect to the competition. That means Sandy/Ivy/Haswell for Bulldozer/Piledriver/Steamroller. In contrast, the Cat cores have shown what an efficient design should be like. Jaguar/Puma are very competitive with Silvermont. Beema and Mullins are good examples of efficient and competitive products. Kaveri is poor in terms of perf/sq mm from a CPU point of view, though the GPU is very good. Still, AMD is handicapped by low bandwidth on the APUs and needs to quickly move to HBM to solve the problem.

AMD needs to design a high-end core which has the same die area as Broadwell/Skylake and is competitive in perf/watt and perf/sq mm. Cache performance too needs to be competitive with Intel's best. TSMC 16FF+ matches Intel 14nm in performance. So there is no reason why their high-end server and desktop FX cannot be manufactured there to match Intel's best in 2016. Samsung 14LPP also should be close in performance to TSMC 16FF+. It's definitely possible.

VirtualLarry · May 30, 2014

raghu78 said:
TSMC 16FF+ matches Intel 14nm in performance. So there is no reason why their high-end server and desktop FX cannot be manufactured there to match Intel's best in 2016. Samsung 14LPP also should be close in performance to TSMC 16FF+. It's definitely possible.

AFAIK, all AMD CPUs have to be mfg'ed at GF starting this year or next. That's why they've been back-porting Kabini to GF from TSMC. Likely the console chips too, although I'm less certain about that.

NostaSeronx · May 30, 2014

raghu78 said:
1. Weak single thread performance (which affects the vast majority of desktop apps)
2. Weak FPU performance
3. Very poor cache performance
4. Poor perf/sq mm and perf/watt.
5. Too much die area wasted on cache. The amount of cache on an FX-8350 is ridiculous. What's worse, the cache latency is horrible. Even with 16 MB of cache, the FX-8350 cannot compete with a Core i7 3770K / 4770K, which has only 9 MB of cache (1 MB L2 + 8 MB L3).

1. AMD Dozer/Driver has strong singlethreaded performance but weak multithreaded performance.
2. The weak FPU is the outcome of SMT and lower power consumption targets.
3. This is due to several weird things going on or bad measurements.
4. This is an issue with 45/32nm PDSOI, not 28nm Bulk.
5. The L3 is there for multithreading and multinode processing. It also uses an upgraded coherency protocol that supports Modified Unwritten, which is a form of Hardware Transactional Memory.

This is beneficial when one core writes a piece of data that multiple other cores want to read.

CHADBOGA · May 31, 2014

NostaSeronx said:
1. AMD Dozer/Driver has strong singlethreaded performance but weak multithreaded performance.

Where are the benchmarks which show this "strong singlethreaded performance"? 😵

Torn Mind · May 31, 2014

Anyone who says that Bulldozer or Piledriver has strong single-threaded performance is clearly not speaking relative to the architecture it was supposed to compete with, but rather in comparison to older CPUs from a bygone era such as Athlon XPs and Pentium 4s. A single Sandy Bridge core absolutely beats the Bulldozer counterpart it is supposed to compete with. Strong single-threaded performance just means a single core can perform operations (mathematical ones like add, subtract, divide, multiply, square root, etc. and logical ones like or, and, not, etc.) in a certain amount of time.

NostaSeronx · May 31, 2014

CHADBOGA said:
Where are the benchmarks which show this "strong singlethreaded performance"? 😵

I'm referring to the cores which are capable of 2 arithmetic ops and 2 memory ops or 4 logical ops. In comparison to 00h/10h which can only do 3 arithmetic or 3 logic ops or 3 memory ops. You don't need bentmarks to figure out single threaded performance. Single-threaded performance isn't determined by the accelerator but by the core.

00h/10h takes 2 cycles to finish 3 macro-ops; 3(LOAD+PROD+STORE)
15h 00h-3(x)h takes 2 cycles to finish 4 macro-ops; 4(LOAD+PROD+STORE)

15h has strong singlethreaded performance but once you go multithreaded you hit memory pipeline stalls.
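
The per-cycle arithmetic behind the two figures above can be sketched as follows. These are the peak rates claimed in the post, not measurements, and the names `CLAIMED_PEAK` and `ops_per_cycle` are made up for the illustration.

```python
from fractions import Fraction

# Claimed peak figures from the post: (macro-ops finished, cycles taken).
CLAIMED_PEAK = {
    "00h/10h": (3, 2),
    "15h": (4, 2),
}

def ops_per_cycle(ops, cycles):
    """Return the peak macro-op rate as an exact fraction."""
    return Fraction(ops, cycles)

rates = {family: ops_per_cycle(*pair) for family, pair in CLAIMED_PEAK.items()}
speedup = rates["15h"] / rates["00h/10h"]

for family, rate in rates.items():
    print(f"{family}: {rate} macro-ops per cycle")
print(f"claimed peak ratio, 15h over 00h/10h: {speedup}")
```

On these numbers the claimed peak advantage is 4/3. That is a ceiling under ideal conditions, and it does not account for cache latency, decode limits, or anything else that shows up in a benchmark.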

TreVader · May 31, 2014

bononos said:
That's a terrible argument, considering that AMD's current Bulldozer family is a flop and Intel has AMD beat from the top all the way to the bottom, especially after the arrival of Haswell. Why even talk about NetBurst when that is already very old news?

"Old News"

Fact is, Intel wouldn't be around if they weren't cutthroats about margins and using all ethical (and unethical) means to beat the competition. That mentality creates a market leader but does not necessarily create the best business to do the job.

They have something decent with the Core architecture and I'm not saying it's bad. I'm not saying AMD builds faster processors, I'm saying they could.

Intel to this day can just match Apple's first processor in IPC. They are not magic. They do sometimes make good products.

erunion · May 31, 2014

CHADBOGA said:
Where are the benchmarks which show this "strong singlethreaded performance"? 😵

He is talking theoretically, referring to the arch's shared resources causing multithreaded bottlenecks.

If we were talking in a practical sense we might say that BD has poor single threaded performance but worse multithreaded performance.

raghu78 · May 31, 2014

NostaSeronx said:
I'm referring to the cores which are capable of 2 arithmetic ops and 2 memory ops or 4 logical ops. In comparison to 00h/10h which can only do 3 arithmetic or 3 logic ops or 3 memory ops. You don't need bentmarks to figure out single threaded performance. Single-threaded performance isn't determined by the accelerator but by the core.

00h/10h takes 2 cycles to finish 3 macro-ops; 3(LOAD+PROD+STORE)
15h 00h-3(x)h takes 2 cycles to finish 4 macro-ops; 4(LOAD+PROD+STORE)

15h has strong singlethreaded performance but once you go multithreaded you hit memory pipeline stalls.

Whatever the reasons, the fact is Bulldozer's REAL WORLD single thread performance is bad. That's why you still had Bulldozer falling behind K10. 2 arithmetic ops against K10, which can do 3 arithmetic ops, is a regression. AMD should have gone for 3 - 4 ALUs and 3 Load/Store units per cluster. I think AMD will move to 4 ALUs for Excavator, which is how it should have been from the start. Also, the overall design started with shared decode and has now moved to separate decode for each integer cluster in a module. This points to a concession from AMD that the original design was having problems or bottlenecks.

NostaSeronx · May 31, 2014

raghu78 said:
Whatever the reasons, the fact is Bulldozer's REAL WORLD single thread performance is bad.

That is the floating point unit not the core.

raghu78 said:
That's why you still had Bulldozer falling behind K10. 2 arithmetic ops against K10, which can do 3 arithmetic ops, is a regression.

If you want to finish an op, 00h/10h actually had a throughput of 1.5 arithmetic ops, as 00h/10h could not do an arithmetic op and a memory op at the same time. 15h can do an arithmetic op and a memory op at the same time.

00h/10h -> 3 arithmetic micro-ops per 2 cycles
15h -> 4 arithmetic micro-ops per 2 cycles.

There was no regression. This is kind of silly because every architecture since P6 for Intel had the ability to do a Mem op and an ALU op. It took 16 years for AMD to release an architecture with the throughput of Intel.

raghu78 said:
AMD should have gone for 3 - 4 ALUs and 2 - 3 AGLUs per cluster. I think AMD will move to 4 ALUs for Excavator which is how it should have been from the start. Also the overall design started with shared decode and has now moved to separate decode for each integer cluster in a module. This points to a concession from AMD that the original design was having problems or bottlenecks.

Bulldozer/Piledriver;
1 EX(AL/MUL/BRANCH)
1 EX(AL/DIV/COUNT)
2 AGLU(L/MEM/MOV)

Steamroller/Excavator;
1 EX(AL/MUL/BRANCH)
1 EX(AL/DIV/COUNT)
2 EX(AL)
4 AGLU(L/MEM/MOV)

The decode isn't the only thing that went from shared to unshared. The Instruction Fetch Unit, Instruction Dispatch, and the FP Decode are all "unshared" per core with Steamroller.

Everything done with Steamroller was to feed the cores more and faster because it had more units.

raghu78 · May 31, 2014

NostaSeronx said:
That is the floating point unit not the core.

Sorry, but however many times you say it, Bulldozer had lower IPC and single thread performance than K10, in both integer and FP.

NostaSeronx said:
If you want to finish an op, 00h/10h actually had a throughput of 1.5 arithmetic ops, as 00h/10h could not do an arithmetic op and a memory op at the same time. 15h can do an arithmetic op and a memory op at the same time.

00h/10h -> 3 arithmetic micro-ops per 2 cycles
15h -> 4 arithmetic micro-ops per 2 cycles.

The problem is there are far too many cases when the code is utilizing those 3 arithmetic ops in parallel which means Bulldozer in those cases is a regression to K10.

NostaSeronx said:
There was no regression.

Bulldozer/Piledriver;
1 EX(AL/MUL/BRANCH)
1 EX(AL/DIV/COUNT)
2 AGLU(L/MEM/MOV)

Steamroller/Excavator;
1 EX(AL/MUL/BRANCH)
1 EX(AL/DIV/COUNT)
2 EX(AL)
4 AGLU(L/MEM/MOV)

Again you are wrong. Since we can only talk of launched products, Steamroller does not have 4 EX as you say. It has only 2 EX. The part about 2 EX being disabled in Steamroller, or Excavator having 4 EX, is speculation at most. You have been saying things without any proof. Since you are wrong about Steamroller, at least stop misleading others.

NostaSeronx said:
The decode isn't the only thing that went from shared to unshared. The Instruction Fetch Unit, Instruction Dispatch, and the FP Decode are all "unshared" per core with Steamroller.

Everything done with Steamroller was to feed the cores more and faster because it had more units.

Again you are wrong. Instruction fetch and dispatch are shared:

http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/3

"In Bulldozer and Piledriver, each integer core had its own independent scheduler but the two cores shared a single fetch and decode unit. Instructions would come in and decoded operations would be fed to each integer pipe on alternating clock cycles. In Steamroller the decode hardware is duplicated in each module, so now each integer core gets its own decode unit. The two decode units are shared by the one FP unit."

There is only one Instruction Fetch and Dispatch unit in a module. It feeds both the decoders. The integer clusters have separate Decode and scheduler units. The FP unit shares both the decode units and has a separate scheduler of its own.

NostaSeronx · May 31, 2014

raghu78 said:
Sorry but how many ever times you say Bulldozer had lesser IPC and single thread performance than K10 - both integer and FP.

General Purpose Integer is better than 10h/12h.
Accelerator Integer is better than 10h/12h.
Accelerator Floating Point is about the same as 10h/12h if FMA.

raghu78 said:
The problem is there are far too many cases when the code is utilizing those 3 arithmetic ops in parallel which means Bulldozer in those cases is a regression to K10.

Nope, it does not work that way. You can't have continuous arithmetic ops in 00h/10h/12h. There is always a delay for the memory ops.

raghu78 said:
Again you are wrong. Since we can only talk of launched products, Steamroller does not have 4 EX as you say. It has only 2 EX. The part about 2 EX being disabled in Steamroller, or Excavator having 4 EX, is speculation at most. You have been saying things without any proof. Since you are wrong about Steamroller, at least stop misleading others.

Proof: http://images.anandtech.com/doci/7677/DieShot%20-%20Kaveri%20Main.png

raghu78 said:
There is only one Instruction Fetch and Dispatch unit in a module. It feeds both the decoders. The integer clusters have separate Decode and scheduler units. The FP unit shares both the decode units and has a separate scheduler of its own.

There are two instruction fetch units and two dispatch units.

While previous AMD64 family 15h processors had a single 32-byte fetch window, AMD Family 15h, models 30h–4Fh processors have two 32-byte fetch windows...

Each decode has its own dispatch buffer which is the dispatch unit. Each FP decode is coupled to one of the dispatch/decode units.

raghu78 · May 31, 2014

NostaSeronx said:
General Purpose Integer is better than 10h/12h.
Accelerator Integer is better than 10h/12h.
Accelerator Floating Point is about the same as 10h/12h if FMA.

OK, explain this. Phenom II X4 980 (3.7 GHz) beats FX-8150 (4 GHz) by 10% in SunSpider, which is a primarily single-threaded integer benchmark.

http://techreport.com/review/21813/amd-fx-8150-bulldozer-processor/11

Cinebench R11.5 single thread performance is 1.12 on Phenom II X4 980 at 3.7 GHz compared to 1.03 for FX-8150 at 4 GHz.

http://techreport.com/review/21813/amd-fx-8150-bulldozer-processor/14

Actually normalized for clock the gap is even higher. But anyway I have provided proof that K10 was faster in both single thread integer and FP.
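
The clock normalization mentioned above works out as follows. This is a rough sketch that uses only the Cinebench R11.5 scores and clock speeds quoted in the post, and it assumes both chips actually ran at those clocks (turbo behaviour is ignored). The helper name `score_per_ghz` is hypothetical.

```python
# Cinebench R11.5 single-thread scores and clocks as quoted in the post.
phenom_ii_x4_980 = {"score": 1.12, "ghz": 3.7}
fx_8150 = {"score": 1.03, "ghz": 4.0}

def score_per_ghz(cpu):
    """Score divided by clock speed, a crude per-clock (IPC-like) figure."""
    return cpu["score"] / cpu["ghz"]

# Gap in raw score, then gap once each score is divided by its clock.
raw_gap = phenom_ii_x4_980["score"] / fx_8150["score"] - 1
per_clock_gap = score_per_ghz(phenom_ii_x4_980) / score_per_ghz(fx_8150) - 1

print(f"raw gap: {raw_gap:.1%}")              # raw gap: 8.7%
print(f"per-clock gap: {per_clock_gap:.1%}")  # per-clock gap: 17.6%
```

So on these figures the Phenom II leads by about 8.7% in raw score and by about 17.6% per clock.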

NostaSeronx said:
Proof: http://images.anandtech.com/doci/7677/DieShot%20-%20Kaveri%20Main.png
There are two instruction fetch units and two dispatch units. Each decode has its own dispatch buffer which is the dispatch unit. Each FP decode is coupled to one of the dispatch/decode units.

http://images.anandtech.com/doci/7677/08 - CPU Improvements.jpg

I show an official AMD presentation which shows Steamroller has a single instruction fetch unit per module and 4 execution pipes per integer cluster: 2 EX and 2 AGU. And you show me a die shot. What did you prove? Nothing. You keep spreading misinformation on the internet, from OCN to SemiAccurate to AnandTech. Give it a break.

NostaSeronx · May 31, 2014

Sunspider is a cache test not an integer/Vinteger/Vfloating point test. It mainly shows the issues with caches. Sunspider is first a browser benchmark then a cache benchmark.
Cinebench R11.5 is a FP test for SSE3 instructions for VFloating. No FMA so of course it isn't going to have the same score.

raghu78 said:
And you show me a die shot.

Well, the best way to show you is that it is physically there.

NTMBK · May 31, 2014

Seronx, please stop making stuff up.

raghu78 · May 31, 2014

NTMBK said:
Seronx, please stop making stuff up.

Well said :thumbsup: Seronx has been spreading the same misinformation at many other forums like OCN and SemiAccurate. Frankly, it's getting tiring. :biggrin:

ViRGE · May 31, 2014

If you guys want to discuss BD single-threaded performance in depth, that's best left to another thread. It doesn't belong here.

-ViRGE

Gikaseixas · May 31, 2014

AMD might catch Intel again, yes. It is possible, why not? They did it before, but the problem is that it won't happen anytime soon. It would take a great effort from them and GF to produce something worth comparing to the current Intel line-up.
As things are now, Intel can still relax, but they need to keep an eye open.

raghu78 · May 31, 2014

Gikaseixas said:
AMD might catch Intel again, yes. It is possible, why not? They did it before, but the problem is that it won't happen anytime soon. It would take a great effort from them and GF to produce something worth comparing to the current Intel line-up. As things are now, Intel can still relax, but they need to keep an eye open.

I agree. The earliest is Q1 or Q2 2016, when the new high-performance x86-64 core based Opterons and FX CPUs are out. The process will be the Samsung 14nm FinFET licensed by GlobalFoundries. So AMD has a tough road until they deliver a competitive high-end core in two years' time. Maybe by this time next year we will have a glimpse of the true successor to the original AMD Opteron and AMD FX.

What's even more exciting is high-end APUs which combine AMD's world-class GPU tech with the new x86-64 core. I am drooling at the prospect of 4 powerful cores and a 1024 SP GCN GPU on an SoC connected to HBM (high bandwidth memory) with 100+ GB/s. 2016 could not come fast enough :thumbsup:
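
To put the bandwidth figure in context (reading it as gigabytes per second), peak memory bandwidth is bus width times transfer rate, divided by eight to convert bits to bytes. The two configurations below are assumptions chosen for illustration and are not from the post: dual-channel DDR3-2133, as a typical APU setup of the time, and a single first-generation HBM stack with a 1024-bit interface at 1 Gb/s per pin. `peak_bandwidth_gbs` is a hypothetical helper name.

```python
def peak_bandwidth_gbs(bus_width_bits, gigatransfers_per_s):
    """Theoretical peak bandwidth in GB/s: bits per transfer x GT/s / 8."""
    return bus_width_bits * gigatransfers_per_s / 8

# Assumption: dual-channel DDR3-2133 (2 x 64-bit channels, 2.133 GT/s).
ddr3_dual_channel = peak_bandwidth_gbs(128, 2.133)

# Assumption: one first-generation HBM stack (1024-bit, 1 GT/s per pin).
hbm_single_stack = peak_bandwidth_gbs(1024, 1.0)

print(f"dual-channel DDR3-2133: {ddr3_dual_channel:.1f} GB/s")
print(f"single HBM stack:       {hbm_single_stack:.1f} GB/s")
```

Under these assumptions a single stack already clears the 100 GB/s mark, at roughly 3.75 times the dual-channel DDR3 figure. These are theoretical peaks, not sustained throughput.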

pw257008 · May 31, 2014

VirtualLarry said:
And how many of those AMD laptops, are "worthless" Kabini models (OK, the A4-5000 isn't bad), and how many are mobile Richland or Kaveri?

I see tons of Kabini models of desktops around here too, and AIOs. (Granted, I bought one.) But, man are they slow. I'm not even sure that the E1-2500 is faster than my E-350 CPU, even though it is several generations newer. (The E-350 is 1.6Ghz Brazos, the E1-2500 is 1.4Ghz Kabini.)

A6-5200s have sort of taken over the $300-$330 range, where Ivy Pentiums used to reign. I have to assume based on this that for some reason Haswell Pentiums are not as attractive to manufacturers as high end Kabinis (this may be manufacturer cost related [batteries or motherboards], or possibly Intel has increased the price on those now that they have Bay Trail notebook models available), while Kabinis offer high enough performance for this price range while maybe Bay Trail is a better price/performance fit for the 250-290 range.

parvadomus · May 31, 2014

NostaSeronx said:
Well, the best way to show you is that it is physically there.

Which part in that pic are the EXs?

NostaSeronx · May 31, 2014

parvadomus said:
Which part in that pic are the EXs?

This is about as far as I've gotten in labeling all the parts in Steamroller: https://i.imgur.com/86WVawn.jpg

I made a few mistakes in the strokes but it is near perfected to a 1.x mm² degree.

parvadomus · May 31, 2014

NostaSeronx said:
This is about as far as I've gotten in labeling all the parts in Steamroller: https://i.imgur.com/86WVawn.jpg

I made a few mistakes in the strokes but it is near perfected to a 1.x mm² degree.

Fair enough. I found a green/red die shot that seems to be steamroller, I think you saw it already, looks like it matches with AMD's official die shot.

Going back on topic, we can wait until June 4th for the next FX CPU, it might be an "unlocked SR" or something more, and see how far behind AMD really is (hope it's not only mobile Kaveri).
