[Techpowerup] AMD "Zen" CPU Prototypes Tested, "Meet all Expectations"


Where do you think this will land, performance-wise?

  • Intel i7 Haswell-E 8-core

  • Intel i7 Skylake

  • Intel i5 Skylake

  • Just another Bulldozer attempt



AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
That's just not true. Trinity's CPU part is 40-50% bigger than Sandy Bridge's 2C CPU part, and it also consumes more power for less performance.

Trinity has 2x the iGPU die area and graphics performance of the SB 2C part. But the two CPU modules are almost the same size as the two SB CPU cores.

It is also faster in MT loads than 2C/4T SB and Ivy.

A10-5800K (Trinity) vs Core i3 3225 (IvyBridge)

 
Last edited:

pepone1234

Member
Jun 20, 2014
36
8
81
We know from the published numbers that 14nm LPP is much better than Intel's 14nm at 2.4GHz. I posted the difference a few months ago; now, thanks to an SA member, we have information that GF's process is as good at 3GHz as at 2.4GHz, and hence much better than Intel's at 3GHz.

So the power figures you posted are relevant.

Globalfoundries better than someone at something?

I find that difficult to believe.
 

NTMBK

Lifer
Nov 14, 2011
10,520
6,035
136
We know from the published numbers that 14nm LPP is much better than Intel's 14nm at 2.4GHz. I posted the difference a few months ago; now, thanks to an SA member, we have information that GF's process is as good at 3GHz as at 2.4GHz, and hence much better than Intel's at 3GHz.

So the power figures you posted are relevant.

I'll believe it when we have a measurable chip in reviewers' hands. Until then, we don't "know" a damn thing.
 

Abwx

Lifer
Apr 2, 2011
11,997
4,954
136
I'll believe it when we have a measurable chip in reviewers' hands. Until then, we don't "know" a damn thing.

The 14nm LPP uplift over 28nm HPP has been published with the most useful data one can ask for; of course, its meaning is accessible only to people who have formal training in the matter.

I'll repost the numbers so you can draw your own conclusions and perhaps contradict me....

GlobalFoundries FinFET vs. 28nm (PPA result: LVT)

GF process                  28 SLP            28 HPP             14 LPP

Speed corner (worst)        SS, 0.90V,        SS, 0.765V,        SS, 0.72V,
                            125C/-40C         125C/-40C          125C/-40C
Fmax (GHz)                  0.97              1.17               2.41
Relative speed              1                 1.2                2.48

Power corner                FF, 1.10V, 125C   FF, 0.935V, 125C   FF, 0.88V, 125C
Total dynamic power (mW)    158               210                310
Relative dynamic power      1                 1.3                1.9
Total leakage power (mW)    70                119                18.6
Relative leakage power      1                 1.7                0.27
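As a sanity check on the relative columns above, a short script (my own, not from the thread) rederives the ratios from the absolute figures. The derived energy-per-cycle line mixes the SS speed corner with the FF power corner, so treat it as a rough indication only:

```python
# Sanity-check the relative ratios in the GF PPA table above.
fmax = {"28SLP": 0.97, "28HPP": 1.17, "14LPP": 2.41}       # Fmax, GHz (SS corner)
dyn_mw = {"28SLP": 158, "28HPP": 210, "14LPP": 310}        # dynamic power, mW (FF corner)
leak_mw = {"28SLP": 70, "28HPP": 119, "14LPP": 18.6}       # leakage power, mW (FF corner)

base = "28SLP"
for p in fmax:
    rel_speed = fmax[p] / fmax[base]
    rel_dyn = dyn_mw[p] / dyn_mw[base]
    rel_leak = leak_mw[p] / leak_mw[base]
    # Dynamic energy per cycle ~ power / frequency, normalized to 28SLP.
    rel_energy = rel_dyn / rel_speed
    print(f"{p}: speed x{rel_speed:.2f}, dyn power x{rel_dyn:.2f}, "
          f"leakage x{rel_leak:.2f}, dyn energy/cycle x{rel_energy:.2f}")
```

The interesting takeaway is that although 14LPP's absolute dynamic power is higher, it comes at 2.48x the clock, so the dynamic energy per cycle is about 20% lower than 28SLP's, and leakage drops by roughly 4x.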


Globalfoundries better than someone at something?

I find that difficult to believe.

The process is from Samsung...
 
Last edited:

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
We know from the published numbers that 14nm LPP is much better than Intel's 14nm at 2.4GHz. I posted the difference a few months ago; now, thanks to an SA member, we have information that GF's process is as good at 3GHz as at 2.4GHz, and hence much better than Intel's at 3GHz.
lmfao, so now you're 'quoting' unnamed posters from the new version of the amdzone forums as references? Seriously?
 

Abwx

Lifer
Apr 2, 2011
11,997
4,954
136
Haven't you heard? Intel sucks, and everything Global Foundries (and by extension, AMD) touches turns to gold.

I guess it's desperation on your part to board the ad hominem wagon...

The data posted above are clear, but since you understand nothing about EE, you're left arguing in bad faith.

I once posted comparisons with Intel's process. Rather than asking questions, you're getting into a defensive position, which says that you're not interested in a tech debate but rather in downplaying whatever is related to AMD while promoting your beloved Intel in the same breath.

Firms like Merrill Lynch pay engineers to compute this data and make comparisons like the one I posted; you can always ask them to provide you their data, or perhaps you know better than me what it is about?

Here's a link to help you with this homework; if you can master 10% of what is on this page, you'll reach the same conclusions I did.

https://en.wikipedia.org/wiki/MOSFET
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
It seems I skipped 10 pages since my last visit with my forum app. Did I miss any new information? ;)

Over at SA there are at least links to papers about SPEC, ISAs, etc. which are good for my game.

Intel certainly added a lot of cool features to the core, but why does SKL still have the same basic floorplan as Penryn? What are the tradeoffs here?
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
I skipped 10 pages since my seemingly last visit with my forum app. Did I miss any new information? ;)

Yes, you missed the amazing reveal about how awesome I am, then the thoroughly evil cover-up perpetuated by those-who-must-not-be-named.
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
edit: Never mind, I've changed my mind about this thread.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
23,180
13,266
136
That's just not true. Trinity's CPU part is 40-50% bigger than Sandy Bridge's 2C CPU part, and it also consumes more power for less performance. The bloated L3 and HT links just make things worse on the server side, but they are by no means the main problem.

A Vishera die shot:

http://www.guru3d.com/articles-pages/amd-fx-8350-processor-review,2.html

Take a look at how much space the L3 takes up . . . then take a look at how large the L2 cache blocks are. AMD clearly had problems with cache density, and I do not think the problem has simply vanished either.

Now check out Kaveri (and Trinity, and Llano):

http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/4

It certainly appears in the Trinity and Kaveri shots that the ratio of die space taken up by the modules vs. the accompanying L2 blocks has changed, with the L2 taking up a more modest amount of real estate relative to the modules themselves. So relative cache density improved a bit, maybe. The other thing to keep in mind is how much die space appears to be committed to the GPU. It certainly appears that AMD could have made a 4-module SR chip sans L3 that would be smaller than the current Kaveri die. There are those in the know who insist such a beast could never exist within a reasonable power envelope, and they probably know exactly why . . .

That is Core/Module Throughput, not what people mean by IPC (Single Thread Performance) here.

I guess the point I want to make is that defining Instructions Per Clock when only running one thread of software is completely useless, especially when dealing with AMD's CMT modules. If throughput is what you want to call it, then fine. In the end, that is the only number that really matters, as long as there is software that can scale to the full thread capacity of the chip.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
It certainly appears in the Trinity and Kaveri shots that the ratio of die space taken up by the modules vs. the accompanying L2 blocks has changed, with the L2 taking up a more-modest amount of real estate relative to the modules themselves. So relative cache density improved a bit, maybe.

AMD has been pursuing a balance between the CPU and iGPU parts since Llano, with the CPU and GPU each taking roughly 50% of the die. That means Trinity's CPU part, at roughly 120mm², can't keep up with Sandy Bridge's CPU part at roughly 80mm². Sure, cache density was a problem, mostly caused by GF's gate-first approach, but I don't think you could cut 30% of the die to reach parity with Intel's parts in die size by cutting cache alone.

But let's assume they could cut the CPU part by 40% by improving cache alone; would that solve AMD's problems? Not likely, as there would still be a power consumption gap and a performance gap. Sure, it might allow AMD to sell the CMT chips at extremely bad margins, an improvement over the utterly ruinous margins of today, but it wouldn't make CMT chips a viable product on the market.

In the end, I think you should look at a CPU through a prism of multiple performance variables, not through the monochromatic performance prism you are using now. It's not just about performance, and AMD's problem is not performance alone; the problem is what they have to sacrifice in order to reach those performance levels.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
It has been posted before, but it seems certain people here like to FUD:

One Bulldozer module at 32nm SOI (gate-first) is 19.42 mm² (without the 2MB L2).
One Sandy Bridge core at 32nm (gate-last) is 18.4 mm² (including 256KB of the L2).

[die-shot comparisons: Llano vs. Sandy Bridge vs. Westmere, plus an estimate of the 8-core Bulldozer die area]
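Taking the two area figures above at face value, a rough per-core comparison (my own arithmetic, and a simplification, since a module's two integer cores share the front end and FPU):

```python
# Per-core area from the 32nm figures quoted above.
# Note: a Bulldozer module's two integer cores share a front end
# and FPU, so "area per core" is a rough simplification.
bd_module_mm2 = 19.42   # one Bulldozer module, excluding its 2MB L2
sb_core_mm2 = 18.4      # one Sandy Bridge core, including 256KB L2

bd_per_core = bd_module_mm2 / 2
print(f"Bulldozer: {bd_per_core:.2f} mm^2 per integer core")   # 9.71
print(f"Sandy Bridge: {sb_core_mm2:.2f} mm^2 per core")        # 18.40
```

That is, per hardware thread, a module spends roughly half the silicon of a full Sandy Bridge core, which is the whole CMT bargain: more threads per mm² at the cost of per-thread resources.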
 

MrTeal

Diamond Member
Dec 7, 2003
3,919
2,708
136
A Vishera die shot:

http://www.guru3d.com/articles-pages/amd-fx-8350-processor-review,2.html

Take a look at how much the L3 takes up . . . then take a look at how large are the L2 cache blocks. AMD clearly had problems with cache density, and I do not think the problem has simply vanished either.

Now check out Kaveri (and Trinity, and LLano):

http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/4

It certainly appears in the Trinity and Kaveri shots that the ratio of die space taken up by the modules vs. the accompanying L2 blocks has changed, with the L2 taking up a more-modest amount of real estate relative to the modules themselves. So relative cache density improved a bit, maybe. But the other thing to keep in mind, is how much die space appears to be committed to the GPU. It certainly appears that AMD could have made a 4m SR chip sans L3 that would be smaller than the current Kaveri die. There are those in the know who insist such a beast could never exist within an reasonable power envelope, and they probably know exactly why . . .



I guess, the point I want to make, is that defining Instructions Per Clock when only running one thread of software is completely useless, especially when dealing with AMD's CMT modules. If throughput is what you want to call it, then fine. In the end, that is the only number that really matters, so long as there is software that can scale to the full thread capacity of the chip.

Unfortunately, there is still a large amount of software out there that is single-threaded, and so measuring the IPC of a single thread is an important metric. If a single XV module can run some hypothetical integer benchmark and score 50 points single-threaded and 100 points multi-threaded, while an Intel core scores 75 points ST and 100 points MT, the latter would be a more versatile processor even if the throughput is the same in that benchmark.

I really hope that a Zen core with SMT will perform 40% better per clock than an XV module, but I just don't see that being what AMD is claiming.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Unfortunately, there is still a large amount of software out there that is single-threaded, and so measuring the IPC of a single thread is an important metric. If a single XV module can run some hypothetical integer benchmark and score 50 points single-threaded and 100 points multi-threaded, while an Intel core scores 75 points ST and 100 points MT, the latter would be a more versatile processor even if the throughput is the same in that benchmark.

I really hope that a Zen core with SMT will perform 40% better per clock than an XV module, but I just don't see that being what AMD is claiming.

Yes, for that benchmark. For the data center, the cloud, and for VMs, you'd better have two cores at 50 each than one at 75 and a second at 25.

Also to point out here: if we want to be precise, IPC is also a throughput metric. What we are looking for is single-thread performance; or, if we are talking about the architecture, it is better to use CPI (Cycles Per Instruction).
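Since IPC, CPI, and throughput keep getting conflated in this thread, a tiny illustration of how the two metrics relate (the counts are made up):

```python
# IPC and CPI are reciprocals:
#   IPC = instructions / cycles, CPI = cycles / instructions.
def ipc(instructions, cycles):
    return instructions / cycles

def cpi(instructions, cycles):
    return cycles / instructions

# Hypothetical run: 8 billion instructions retired over 10 billion cycles.
print(ipc(8e9, 10e9))   # 0.8 instructions per clock
print(cpi(8e9, 10e9))   # 1.25 cycles per instruction
```

Single-thread performance is then IPC x clock frequency, which is why neither number alone settles the ST argument.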
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Yes, for that benchmark. For the data center, the cloud, and for VMs, you'd better have two cores at 50 each than one at 75 and a second at 25.

Also to point out here: if we want to be precise, IPC is also a throughput metric. What we are looking for is single-thread performance; or, if we are talking about the architecture, it is better to use CPI (Cycles Per Instruction).

Because it obviously would be impossible to move threads between cores.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Because it obviously would be impossible to move threads between cores.

It's not only that: with a less wide core, it's easier to fully utilize all the resources and extract 100% of the performance the core can give you, in more situations than with a very wide, high-single-thread-performance core architecture.

So, if your workload has many concurrent "easy" jobs to execute, it is better to have more smaller cores than fewer bigger cores.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
It's not only that: with a less wide core, it's easier to fully utilize all the resources and extract 100% of the performance the core can give you, in more situations than with a very wide, high-single-thread-performance core architecture.

So, if your workload has many concurrent "easy" jobs to execute, it is better to have more smaller cores than fewer bigger cores.

Of course it's easier to utilize a smaller core, but that is where HT comes in. If a single thread is latency-limited (or similarly limited), the second thread gets a proportionately bigger boost (something like 7-zip).

The larger the core, the worse the performance per unit of execution resources becomes. That is why HT tends to be used.

A perfect example would be Haswell vs. POWER8. POWER8 has vastly more execution resources but similar (slightly lower) ST performance, because the resources simply can't all be used (aside: POWER8 is also designed for throughput, not latency, which is why ST takes a hit). With SMT enabled though, POWER8 easily outperforms Haswell physical core for core.

The point is that 75 + 25 is better than 50 + 50. You have better ST performance and basically the same MT performance. Without locking threads to cores, each thread on the 75 + 25 pair will see performance of around 50.
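The arithmetic behind this point can be sketched with a toy model (my own illustration, not anyone's benchmark): two core pairs with the same total throughput, where the OS is free to migrate runnable threads between cores every timeslice.

```python
# Toy model: an asymmetric 75+25 core pair vs. a symmetric 50+50 pair.
# With perfect round-robin migration, runnable threads occupy the
# fastest available cores, so total throughput is the sum of the
# top-n core speeds.

def throughput(core_speeds, n_tasks):
    """Total work per unit time with n_tasks runnable threads,
    assuming ideal scheduling across the given cores."""
    return sum(sorted(core_speeds, reverse=True)[:n_tasks])

asym = [75, 25]   # one fast core, one slow core
sym = [50, 50]    # two equal cores

print(throughput(asym, 2), throughput(sym, 2))  # MT: 100 vs 100
print(throughput(asym, 1), throughput(sym, 1))  # ST: 75 vs 50
# With migration, each of two equal tasks on the asymmetric pair
# averages (75 + 25) / 2 = 50 per timeslice, matching the 50+50 pair.
```

Under these idealized assumptions the asymmetric pair never loses: equal MT throughput, strictly better ST. The counter-argument later in the thread is about cases where the two jobs are independent and latency-sensitive, where migration overhead and the slow core's 25 matter more than this model admits.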
 

MrTeal

Diamond Member
Dec 7, 2003
3,919
2,708
136
Yes, for that benchmark. For the data center, the cloud, and for VMs, you'd better have two cores at 50 each than one at 75 and a second at 25.

Also to point out here: if we want to be precise, IPC is also a throughput metric. What we are looking for is single-thread performance; or, if we are talking about the architecture, it is better to use CPI (Cycles Per Instruction).

It's not only that: with a less wide core, it's easier to fully utilize all the resources and extract 100% of the performance the core can give you, in more situations than with a very wide, high-single-thread-performance core architecture.

So, if your workload has many concurrent "easy" jobs to execute, it is better to have more smaller cores than fewer bigger cores.

Yes, it would be easier to utilize all the resources of that one core, but the challenge is utilizing the resources of eight of those cores on a chip if you don't have concurrent easy jobs to execute. This may be less of an issue in cloud servers or HPC, but in typical desktop workloads a processor that executes well-threaded code in the same amount of time as a competitor but is 50% faster in single-threaded code is (IMO) a more versatile processor.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Yes, it would be easier to utilize all the resources of that one core, but the challenge is utilizing the resources of eight of those cores on a chip if you don't have concurrent easy jobs to execute. This may be less of an issue in cloud servers or HPC, but in typical desktop workloads a processor that executes well-threaded code in the same amount of time as a competitor but is 50% faster in single-threaded code is (IMO) a more versatile processor.

And here lies the problem for Bulldozer and CMT: it was not efficient for 2010's desktop applications and workloads, and it still isn't. CMT is well suited to the server space but not to the desktop. Unfortunately, AMD couldn't use Bulldozer only for servers and create another architecture for the consumer segment.
Personally, this is the only problem I find with Bulldozer: AMD's decision to go all-in on servers first while they were behind in manufacturing nodes. CMT was great for the job it was designed for; it just wasn't what the desktop workload looked like at the time, and it still isn't there yet. Perhaps in 2016-17, but not in 2011.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Of course it's easier to utilize a smaller core, but that is where HT comes in. If a single thread is latency-limited (or similarly limited), the second thread gets a proportionately bigger boost (something like 7-zip).

The larger the core, the worse the performance per unit of execution resources becomes. That is why HT tends to be used.

A perfect example would be Haswell vs. POWER8. POWER8 has vastly more execution resources but similar (slightly lower) ST performance, because the resources simply can't all be used (aside: POWER8 is also designed for throughput, not latency, which is why ST takes a hit). With SMT enabled though, POWER8 easily outperforms Haswell physical core for core.

The point is that 75 + 25 is better than 50 + 50. You have better ST performance and basically the same MT performance. Without locking threads to cores, each thread on the 75 + 25 pair will see performance of around 50.

SMT is good for multithreading; it's not as efficient as CMT for multitasking.

When you have two concurrent loads that each need 50, having one core capable of 75 and a second capable of 25 is less efficient than having two cores capable of 50 each. So although you have the same total throughput, because your two jobs are concurrent (multitasking) rather than a single job split in two (multithreading), CMT is the better architecture.

And this is the problem for Bulldozer: on the desktop we have multithreading scenarios, while on servers we have multitasking scenarios.
 

NTMBK

Lifer
Nov 14, 2011
10,520
6,035
136
SMT is good for multithreading; it's not as efficient as CMT for multitasking.

When you have two concurrent loads that each need 50, having one core capable of 75 and a second capable of 25 is less efficient than having two cores capable of 50 each. So although you have the same total throughput, because your two jobs are concurrent (multitasking) rather than a single job split in two (multithreading), CMT is the better architecture.

And this is the problem for Bulldozer: on the desktop we have multithreading scenarios, while on servers we have multitasking scenarios.

SMT schedules multiple threads simultaneously... the clue is in the name...
 

Abwx

Lifer
Apr 2, 2011
11,997
4,954
136
but in typical desktop workloads a processor that executes well-threaded code in the same amount of time as a competitor but is 50% faster in single-threaded code is (IMO) a more versatile processor.

But the i3/i5, for instance, have good ST performance in isolation and disastrous ST performance while multitasking, as much as 60% lower.

If a CPU needs 5s to execute task A and 5s to execute task B, then processing the two tasks simultaneously is assumed to take 10s. That's completely wrong; in the case of the i3 it can be as long as 20-25s.

The example is WinRAR MT + Cinebench ST, which is just 5 threads:

http://www.computerbase.de/2015-10/...gramm-multitasking-test-cinebench-plus-winrar

The CB ST score of the i5-6600K is 165; launch WinRAR simultaneously and the CB ST score collapses to 77. With an i3 it's even worse. So the conclusion is clear: i3/i5 ST performance can't be sustained when two programs use 4 and 1 threads respectively.

So much for Intel's ST performance; it's completely rubbish in a multitasking environment.

You'll notice that the supposedly inferior ST performance of AMD doesn't collapse that much, and there are twice as many cores, so the FX sustains its throughput and ST performance much better than those ultra-hyped CPUs.
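To put a number on "collapse", here is the quick percentage behind the ComputerBase figures quoted above:

```python
# Drop in the i5-6600K's Cinebench ST score when WinRAR runs
# concurrently (scores from the ComputerBase test linked above).
alone = 165
with_winrar = 77
drop = (alone - with_winrar) / alone
print(f"ST score drops by {drop:.0%} under multitasking load")  # 53%
```

So the claimed "as much as 60% lower" is roughly in line with this one data point for the i5; per the post, the i3 fares worse still.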

SMT schedules multiple threads simultaneously... the clue is in the name...

The clue is in the link above...

It schedules multiple threads, but that's all it does; once scheduled, the threads land in a stalled pipeline.
 
Last edited: