ZEN ES Benchmark from french hardware Magazine

bjt2 · Dec 27, 2016

Dresdenboy said:
I saw that, too (having the print edition). I think, those 8 uops (actually "instr") per cycle are wrong and should be 6 as shown at Hot Chips. There should also be 32B/cycle going from L2$ to L1I$.

Actually it was not specified: https://youtu.be/Ln9WKPEHm4w?t=1h16m15s , so they may be right.

The 32b/cycle link is present on the scanned image, on the right:

bjt2 · Dec 27, 2016

Anyway i am very pleased with the uop rom pointers in the uop cache, as answered to a question in the q&a: https://youtu.be/Ln9WKPEHm4w?t=1h29m50s

Dresdenboy · Dec 27, 2016

bjt2 said:
Actually it was not specified: https://youtu.be/Ln9WKPEHm4w?t=1h16m15s , so they may be right.

The 32b/cycle link is present on the scanned image, on the right:

Just nitpicking here, as "b" means "bit".

bjt2 · Dec 27, 2016

Dresdenboy said:
Just nitpicking here, as "b" means "bit".

Yes...

Obviously is Byte...

DrMrLordX · Dec 27, 2016

itsmydamnation said:
you know you could team up with Juanrga ........

Please no! If he does that, we are all finished. FINISHED!

ksec said:
AMD decide to attack the higher end simply because that is the only way to enter the market.

The high-end has better margins and better potential for growth. It also segues nicely into the workstation/server market which is another area with growth/revenue potential far in excess of AMD's recent stomping grounds. So for AMD, it was go big or go home.

qookap · Dec 27, 2016

Pokemon Processor

A processor without gaming...

DrMrLordX · Dec 27, 2016

No love for Geodude.

HexiumVII · Dec 27, 2016

we really need the good old days where amd destroyed intels p4. Intel survived fine, while AMD had the performance crown. anything less we might not see amd for much longer.

railroadmaster · Dec 28, 2016

HexiumVII said:
we really need the good old days where amd destroyed intels p4. Intel survived fine, while AMD had the performance crown. anything less we might not see amd for much longer.

Nah not gonna happen. Single threaded processor performance isn't going to increase substantially in the near term. Substantially higher single threaded performance would require greater heat, greater cost, larger die size, lower production yields and higher power consumption all of which the market conditions doesn't allow for.

Until we get transistor stacking, new materials, a better means of moving electricity between transistors or other technological innovations performance is going to stagnate for the near term. You can't crank out frequency increases and ipc increases year after forever, it's just not possible. Transistor shrinking is nearing the end due to the size of transistors being to small to manage due to near atomic size and due to increases in heat and electrical leakages.

Until Moore's law can be resumed with new technology neither AMD or Intel will produce substantially faster processors in the near term. Essentially AMD and Intel will compete with who has the most processor cores. AMD produces a 8 core processor, Intel responds with a 10 core processor, then AMD responds with a 12 core and so on.

William Gaatjes · Dec 28, 2016

krumme said:
Hmm i need this cache explained.

2. And why can you just increase associativity without increasing latency? I mean then why not go 1MB 32way? Bwe 256 is 8 way and skl 256 is 4 way...

Interesting question :
https://en.wikipedia.org/wiki/CPU_cache#Associativity
According to wikipedia :

The replacement policy decides where in the cache a copy of a particular entry of main memory will go. If the replacement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is direct mapped. Many caches implement a compromise in which each entry in main memory can go to any one of N places in the cache, and are described as N-way set associative.[9] For example, the level-1 data cache in an AMD Athlon is two-way set associative, which means that any particular location in main memory can be cached in either of two locations in the level-1 data cache.

Choosing the right value of associativity involves a trade-off. If there are ten places to which the replacement policy could have mapped a memory location, then to check if that location is in the cache, ten cache entries must be searched. Checking more places takes more power and chip area, and potentially more time. On the other hand, caches with more associativity suffer fewer misses (see conflict misses, below), so that the CPU wastes less time reading from the slow main memory. The general guideline is that doubling the associativity, from direct mapped to two-way, or from two-way to four-way, has about the same effect on raising the hit rate as doubling the cache size. However, increasing associativity more than four does not improve hit rate as much,[10] and are generally done for other reasons (see virtual aliasing, below). Some CPUs can dynamically reduce the associativity of their caches in low-power states, which acts as a power-saving measure.[11]

If i understand it correctly, increasing associativity, takes more time and consumes more power. That is kind of a contradiction.
I guess with more associativity, for example 32 way, you would need to check 32 cache blocks.
I assume this would be done in parallel , all 32 blocks at once are read and compared. That would maybe not increase latency in a big way, but i have to be honest no idea.

William Gaatjes · Dec 28, 2016

railroadmaster said:
Nah not gonna happen. Single threaded processor performance isn't going to increase substantially in the near term. Substantially higher single threaded performance would require greater heat, greater cost, larger die size, lower production yields and higher power consumption all of which the market conditions doesn't allow for.

Until we get transistor stacking, new materials, a better means of moving electricity between transistors or other technological innovations performance is going to stagnate for the near term. You can't crank out frequency increases and ipc increases year after forever, it's just not possible. Transistor shrinking is nearing the end due to the size of transistors being to small to manage due to near atomic size and due to increases in heat and electrical leakages.

Until Moore's law can be resumed with new technology neither AMD or Intel will produce substantially faster processors in the near term. Essentially AMD and Intel will compete with who has the most processor cores. AMD produces a 8 core processor, Intel responds with a 10 core processor, then AMD responds with a 12 core and so on.

Yeah, it will be interesting to see if gallium nitride can take the place of bulk silicon.
There are already gallium nitride mosfets and some integrated circuits like drivers.
But for as far as i know there are no gallium nitride microcontrollers or processors. Maybe in design labs ?
Everybody is putting money on smaller silicon geometries.
I wonder what the smallest possible process is that can be used with gallium nitride and what the gate voltages are and transconductance at a scale for example 14nm.
Maybe gallium nitride is only useful for power components like for an smps and not really suited for small scale logic like a cpu.
But gallium nitride (discrete power)mosfets have higher efficiency compared to silicon (discrete power) mosfets.
Also, to build a chip, there are (CAD) libraries that are used to create the masks. These masks are used to transfer the geometry on the mask onto the silicon chip. Building a chip layer by layer. I wonder if the libraries for gallium nitride are still in development.
Also the use of gallium nitride is patented, so that is another cost factor.

Much to wonder about :

krumme · Dec 28, 2016

William Gaatjes said:
Interesting question :
https://en.wikipedia.org/wiki/CPU_cache#Associativity
According to wikipedia :

If i understand it correctly, increasing associativity, takes more time and consumes more power. That is kind of a contradiction.
I guess with more associativity, for example 32 way, you would need to check 32 cache blocks.
I assume this would be done in parallel , all 32 blocks at once are read and compared. That would maybe not increase latency in a big way, but i have to be honest no idea.

Thanx. You want to
1. Lower miss rate
2. Lower latency

Associativity lowers miss rate as i understand thereby improving perf. But at the same time i simply have a problem accepting it doesnt lead to more latency.

Take eg a direct mapped cache. Speculation here is much easier leading to a faster read. On the opposite side a 8 way implementation must be more difficult in that regard? Or what?

StrangerGuy · Dec 28, 2016

frozentundra123456 said:
Yea, lost in all this enthusiasm, is the fact that 8 core Zen with no IGP is only going to compete in a very small portion of the consumer market. So I dont really see much threat to the mainstream cpu market until Zen APUs come out in around a year or so. Like I said earlier in this thread, the major threat right now will be to the 6900k, which is clearly overpriced, and maybe to the 6700k and 6800k, depending on pricing and final performance, including overclocking.

Meh, both camps can price their HEDT to the moon for all I care. Mainstream Zen has to go really low to get me remotely interested, something like a 6600K/6700K equivalent for ~$100/200, and I'm saying that even if I'm still using my 6 year old OC 2500K. This is not remotely close to a C2D moment where 3GHz Conroe slaughtered everything prior in every single app back in 2006.

Tuna-Fish · Dec 28, 2016

krumme said:
Thanx. You want to
1. Lower miss rate
2. Lower latency

Associativity lowers miss rate as i understand thereby improving perf. But at the same time i simply have a problem accepting it doesnt lead to more latency.

Take eg a direct mapped cache. Speculation here is much easier leading to a faster read. On the opposite side a 8 way implementation must be more difficult in that regard? Or what?

In a set associative cache, the request is typically sent to all sets in parallel. The only latency added is a single mux which picks the correct result in the end. The cost is power, as twice as many transistors switch. If the cache size is kept as the same, increasing associativity generally does not have a latency cost. However, if the cache size is doubled by doubling the associativity, you get a latency cost because the cache takes more room on die, and this means that the distance to the farthest corner is increased, which increases wire delay.

krumme · Dec 28, 2016

Tuna-Fish said:
In a set associative cache, the request is typically sent to all sets in parallel. The only latency added is a single mux which picks the correct result in the end. The cost is power, as twice as many transistors switch. If the cache size is kept as the same, increasing associativity generally does not have a latency cost. However, if the cache size is doubled by doubling the associativity, you get a latency cost because the cache takes more room on die, and this means that the distance to the farthest corner is increased, which increases wire delay.

Ok i thought so.
But as zen have double l2 size as skl and double associativity but at same latency doesnt that then look better on paper?
And if - does that typically come at a cost of freq headroom then? (Obviously it seems very process dependant - and we are back at FO4?)

KTE · Dec 28, 2016

bjt2 said:
uop cache: 2k uops. AFAIK INTEL have 1.5K uops and in Zen anyway for microcoded instructions there should be only the pointer, unlike INTEL.

No, Intel has the the same pointers for microcoded instructions since Sandy Bridge but yes, 1.5K, as in 32*6*8, as each line is 6 uops. 6 uops/cycle from L0->L0 Queue too.

And it's 8-way, able to deliver 4 uops/cycle with a cache hit on a complete hit (which can be 32B) foregoing the traditional front-end.
So if Zen has 2K L0 then it would make sense to be 8 uops per line, 8 way associative and each delivery at 8 uops/cycle.

uop cache throughput: 8uops/cycle. This was not specified in hot chips slides (I checked). The diagram is simplified (lacks ucode rom and stack memfile). INTEL has 1.5K uops and 6uops/cycle throughput.

I think there's a possibility of them being right without knowing so, because it's possible they were confused with uops retire, which is 8 uops/cycle. For instance, Hiroshige-sensei also has it at 6 uops/cycle.

That would make sense when queue->dispatch is 6 uops/cycle.

bjt2 · Dec 28, 2016

krumme said:
Thanx. You want to
1. Lower miss rate
2. Lower latency

Associativity lowers miss rate as i understand thereby improving perf. But at the same time i simply have a problem accepting it doesnt lead to more latency.

Take eg a direct mapped cache. Speculation here is much easier leading to a faster read. On the opposite side a 8 way implementation must be more difficult in that regard? Or what?

With 4 way associativity, you have four comparators, whose result goes in a demuxer 4 way to 1, complexity "2" (2 bits). An 8 way associativity has double the comparators, so double the power, but has an 8 to 1 demuxer, that has complexity "3" (3 bits), or 1.5 times the 4 way... The FO4 increase, but not so much. Indeed skylake 8 way has only 1 cicle more than haswell 4 way. Probabily in preparation of 512-1024KB cache for skylake x.

bjt2 · Dec 28, 2016

KTE said:
No, Intel has the the same pointers for microcoded instructions since Sandy Bridge but yes, 1.5K, as in 32*6*8, as each line is 6 uops. 6 uops/cycle from L0->L0 Queue too.

Ok, good to know. I find it strange though, since the decoding capability is 4-1-1-1(-1), so it seems as if the microcoded uops were generated by the decoder...

KTE said:
And it's 8-way, able to deliver 4 uops/cycle with a cache hit on a complete hit (which can be 32B) foregoing the traditional front-end.
So if Zen has 2K L0 then it would make sense to be 8 uops per line, 8 way associative and each delivery at 8 uops/cycle.

So the line is 6 uops but it can deliver only 4 uops cycle? It seems strange to me...

KTE said:
I think there's a possibility of them being right without knowing so, because it's possible they were confused with uops retire, which is 8 uops/cycle. For instance, Hiroshige-sensei also has it at 6 uops/cycle.

That would make sense when queue->dispatch is 6 uops/cycle.

Also the microcode rom is said to output 6 uops/cicle, after getting the pointer from the microcode queue... But anyway >6 uops/cycle is useful only for the peak. Or if the dispatch is 6+4 and not max 6 int and max 4 fp and max 6 total...

Rickyyy369 · Dec 28, 2016

http://library.madeinpresse.fr/samples/MPqY2Vg2I71P-f

The binary along the top of the image on the last page translates to "ZenOC@Air=5G".

krumme · Dec 28, 2016

Rickyyy369 said:
http://library.madeinpresse.fr/samples/MPqY2Vg2I71P-f

The binary along the top of the image on the last page translates to "ZenOC@Air=5G".

Yep.cool. !! lol BAM

krumme · Dec 28, 2016

Confirmed binary taken from article:
010110100110010101101110010011110100001101000000010000010110100101110010001111010011010101000111

is
ZenOC@Air=5G

Edit: confirmed the string is from the print

.vodka · Dec 28, 2016

Oh, cool! I wonder what else these guys are telling us in plain sight. I like this more and more. Zen having similar OC capabilites to Kabylake would do wonders!

Now, is the 8c16t version capable of doing 5GHz, or is it the 4c8t version? Should it be full 8c16t... well..

However it may be, the 4c8t part with this high of a clock ceiling WILL sell. 6700k/7700k won't have it that easy. As it was seen in the numbers in the publication, the main difference between 8c16t SR and the competition in gaming and other tasks was clocks, putting the Intel parts ahead. The raw performance is there, and the clocks seem to be there too. Maybe not out of the box, but through overclocking. I like this.

rvborgh · Dec 28, 2016

So that is what Fottemberg meant about frequency surprise

krumme said:
Confirmed:
010110100110010101101110010011110100001101000000010000010110100101110010001111010011010101000111

is
ZenOC@Air=5G

.vodka · Dec 28, 2016

Amazing. High IPC, without a decrease in clock speed relative to CON cores?

I understand the uarch is also quite important in being able to reach 5GHz, but the process plays its part too. Is this confirmed to be glofo, samsung, or even TSMC?

krumme · Dec 28, 2016

rvborgh said:
So that is what Fottemberg meant about frequency surprise

Yeaa. He was right about bwe ipc june 2015 and also this. This is a freakshow. Damn...

ZEN ES Benchmark from french hardware Magazine

Senior member

Senior member

Golden Member

Senior member

Lifer

Member

Lifer

Senior member

Member

Lifer

Lifer

Diamond Member

Diamond Member

Golden Member

Diamond Member

Senior member

Senior member

Senior member

Member

Diamond Member

Diamond Member

Golden Member

Member

Golden Member

Diamond Member