The module design is 85-90% efficient when fully loaded; put another way, two threads sharing a module deliver roughly 1.7-1.8x the throughput of one thread running alone in it. With only one core active, that core gets full access to the front end, its own integer execution units, and the FPU. Steamroller and Excavator have improved this to some extent, with each Steamroller core getting its own 4-wide decoder (way more than it needs). Single-thread efficiency (per module) will not improve by removing CMT: if you took off the other integer core along with its share of the L1 instruction cache and decoder, you would see the same performance as running one thread in a module (your core simply is no longer CMT).
That is two-core performance, NOT single-thread IPC.
http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper
Nice explanation.
I didn't see any explanation in that link.
In any event, the module overhead is always there. The OS is always scheduling something (I've seen the code, no guesswork). That work includes halting the core, performing context and ring-mode switches, running the OS scheduler or kernel tasks, handling interrupts, and so on. Only "parked" cores get any reprieve from this background onslaught.
For the module, this means there is always something being addressed to each core, and there are extra pipeline stages to deal with it, so that added latency is always present. You can't get away from it no matter what, even if a core is parked and receiving nothing other than C6-state commands (which keep the core off, but ready).
This overhead is more meaningful than you might think. If one core is in a C-state, though, the overhead is only around 3%, IIRC. At best-idle it is closer to 5-6%, but then there's thread scheduling from the OS to consider, which moves the overhead right up to about 15% on the original Bulldozer - though I think the nominal front-end overhead was somewhere around 10% in this scenario, which is pretty low when you really think about what it takes to make that happen.
That means you should be able to set thread affinity and get about an 8-9% improvement on an otherwise unladen system by utilizing only one thread per module (roughly the gap between the ~15% scheduled case and the ~5-6% best-idle case). This overhead is smaller for SIMD-heavy workloads, though, as the front end is less stressed and the burden shifts more to the caches, execution units, and schedulers.
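For what it's worth, here is a minimal sketch (my own illustration, not from any of the linked articles) of pinning a process to one core per module on Linux via sched_setaffinity(), assuming the sibling cores of each module are enumerated as adjacent logical CPUs (0/1, 2/3, 4/5, 6/7), as on a stock FX-8150:

/* Sketch only: pin the calling process/thread to logical CPUs 0, 2, 4, 6,
 * i.e. one core per module, assuming sibling cores are enumerated in
 * adjacent pairs (0/1, 2/3, 4/5, 6/7). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu += 2)
        CPU_SET(cpu, &set);                 /* CPUs 0, 2, 4, 6 */

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    puts("affinity set: one core per module");
    return 0;
}

Compile with gcc and run it ahead of whatever work you want kept to one core per module; the same idea works per-thread with pthread_setaffinity_np().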
Of course, the more you have going on for that second thread, the worse the overhead gets: not only is the front end slowing things down now, but you also have the shared caches and the dispatch controller to consider at that point (mostly the slow-arse L2, even with the help of the WCC, the write-coalescing cache).
Not all of these costs come together to hurt performance at the same time, of course, but when they do, that's around a 20% drop in IPC on Bulldozer due to the module, with about half of that being a full-time expense.
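If it helps, the back-of-envelope version of those figures looks like this (my rough estimates from above, not measurements; the baseline is simply the same core with zero module overhead):

/* Back-of-envelope sketch using the rough figures above: roughly 10% of IPC
 * as a full-time module cost, plus roughly another 10% only when the sibling
 * core is also loaded. Numbers are estimates, not measured. */
#include <stdio.h>

int main(void)
{
    const double full_time_cost = 0.10;   /* always present            */
    const double sharing_cost   = 0.10;   /* only with both cores busy */

    double one_thread_per_module = 1.0 - full_time_cost;
    double both_cores_loaded     = 1.0 - full_time_cost - sharing_cost;

    printf("one thread per module: %.0f%% of zero-overhead IPC\n",
           one_thread_per_module * 100.0);
    printf("both cores loaded:     %.0f%% (the ~20%% worst case)\n",
           both_cores_loaded * 100.0);
    return 0;
}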
Please feel free to refute any of my math; this is all from memory of core documentation I read even before the first FX-8150 entered production.
--
EDIT:
I got to thinking about corroborating evidence for my post and did a quick search for some benchmarks to show the scaling costs:
http://techreport.com/review/21865/a-quick-look-at-bulldozer-thread-scheduling/2
Here, they ran two threads using affinity masking.
0x55 is 01010101 in binary, i.e. cores 0, 2, 4, and 6, so one core per module is scheduled for the task. The results seem to be pretty much in line with my statements above (seems my memory isn't as bad as I thought :-D).
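For anyone who wants to reproduce that kind of masking themselves, here is a small sketch of applying the same 0x55 mask programmatically on Windows (my own example; the TechReport testing may just as well have used Task Manager or start /affinity):

/* Sketch: restrict the current process to the 0x55 affinity mask on Windows.
 * 0x55 = binary 01010101 = logical CPUs 0, 2, 4 and 6, i.e. one core out of
 * each module on an FX-8150. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 0x55;   /* 01010101 -> one core per module */

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("process limited to CPUs 0, 2, 4, 6\n");
    return 0;
}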
Of course, these numbers still can't tell us how much IPC is uniformly lost to the module in the first place; they only show the scaling difference.