Info 64MB V-Cache on 5XXX Zen3 Average +15% in Games


Kedas

Senior member
Dec 6, 2018
355
339
136
Well, now we know how they will bridge the long wait for Zen 4 on AM5 in Q4 2022.
Production start for V-Cache is at the end of this year, which is too early for Zen 4, so this is certainly coming to AM4.
The +15%, Lisa said, is "like an entire architectural generation".
 
Jul 27, 2020
16,279
10,316
106
Because logic transistors such as ALUs, FPUs, branch predictors, and decoders use a lot of power per transistor, while caches are very power efficient. Power is pretty much the biggest limiter on performance nowadays.

A bit off topic, but how come CPU designers haven't thought of using a persistent cache for storing decoded instructions? Something connected directly to the CPU with the lowest possible latency, so the CPU decoders don't have to work as hard. Wouldn't that save a lot of CPU time and increase instruction throughput? Most consumer workloads are repetitive in nature: boot the PC, load the OS, launch frequently used software, use frequently used functions of said software.
 

Makaveli

Diamond Member
Feb 8, 2002
4,718
1,054
136
Any link please? All the Alder Lake engineering sample performance leaks show worse latency. At least the ones that I've seen.


This is your Alder Lake sample here


And this is the text below the numbers

CL40 latency (40-40-40-77) may raise some concern, but it's worth remembering that this is only the number of cycles needed to receive data from a memory cell. Higher bandwidth requires higher clock speeds for the RAM chips, but at the same time the time per cycle decreases. That's why, despite increasing CL values in successive generations of DDR memory, the actual access latency remains at a similar level.
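Put as arithmetic: the absolute latency is CL multiplied by the cycle time. Here is a minimal sketch of the conversion, assuming the leaked CL40 kit is DDR5-4800 (the DDR4-3200 line is just a familiar comparison point, not from the article):

```python
def cas_latency_ns(mt_per_s: int, cl: int) -> float:
    """CL counts clock cycles; DDR's I/O clock runs at half the transfer
    rate, so one cycle lasts 2000 / (MT/s) nanoseconds."""
    return cl * 2000 / mt_per_s

print(cas_latency_ns(4800, 40))  # ~16.7 ns for DDR5-4800 CL40 (assumed kit speed)
print(cas_latency_ns(3200, 16))  # 10.0 ns for a common DDR4-3200 CL16 kit
```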
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
By the way, another reason V-Cache is faster than Intel's eDRAM is that SRAM is faster than DRAM.

A bit off topic, but how come CPU designers haven't thought of using a persistent cache for storing decoded instructions?

What do you mean by a persistent cache? Persistent as in no data is lost when power is off?

There's no technology that's fast enough to be used in CPUs while being non-volatile. Unless you mean something else.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
And why would V-cache not require tags? I assume the stacks will have tags in them. It sounds like it's just an extension of the existing L3.

Of course they will have tags, but in the case of the eDRAM L4, the tags were kept on the BDW chip itself, replacing part of the actual L3 to save on latency. Otherwise you would have to go out to the actual L4 each time (might as well convert it to a system-side cache and call it a day at that point). AMD's solution is simply more of the good old L3: they don't even need to replicate the L2 shadow tags in the stacked chip, as that is already handled by the L3 cache complex in the original CCD. It's just SRAM for the actual L3 lines and tags, with whatever redundancy they need to achieve almost 100% yield.
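For anyone wondering what the tags actually do, here is a hedged toy model of a set-associative lookup; the line size, set count, and associativity are illustrative, not Zen 3's real L3 parameters:

```python
# Toy set-associative lookup, showing why every cached line needs a tag.
LINE_BYTES = 64     # bytes per cache line
NUM_SETS = 8192     # sets in the cache
NUM_WAYS = 16       # lines per set (associativity)

def split_address(addr: int) -> tuple[int, int, int]:
    """Split an address into tag, set index, and byte offset."""
    offset = addr % LINE_BYTES
    set_idx = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_idx, offset

# Each set holds up to NUM_WAYS resident tags.
sets: list[list[int]] = [[] for _ in range(NUM_SETS)]

def lookup(addr: int) -> bool:
    """Hit if the address's tag matches any way in its set."""
    tag, set_idx, _ = split_address(addr)
    if tag in sets[set_idx]:
        return True
    if len(sets[set_idx]) >= NUM_WAYS:   # crude eviction on a full set
        sets[set_idx].pop(0)
    sets[set_idx].append(tag)            # fill on miss
    return False

print(lookup(0x1234_5678))  # False: first touch is a miss
print(lookup(0x1234_5678))  # True: the tag is now resident
```

The point of the post above is where those tag arrays physically sit: Broadwell carved its L4 tags out of the on-die L3, while per this description the V-Cache stack brings its own SRAM tags as a plain extension of the existing L3.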
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Of course they will have tags, but in the case of the eDRAM L4, the tags were kept on the BDW chip itself, replacing part of the actual L3 to save on latency.

Yes, I hadn't thought of that. Good point!

They changed that with Skylake. I wonder how much of a performance impact there was from changing the scheme? I thought the poor performance of the GT4e version might have had something to do with that.
 

Hougy

Member
Jan 13, 2021
77
60
61
I know what you mean.

The square root law in microprocessors says that if you multiply the number of transistors by X, the performance you get scales with the square root of X. So if you quadruple the number of transistors, you get twice the performance. And it's not just die area that's quadrupled: the power use is quadrupled as well, since there are 4x the transistors.

You can use clever engineering and ideas to overcome that somewhat but new ideas are much harder to come by.

Intel's Cypress Cove (the 14nm version of Ice Lake's core) uses 37% more transistors for 18% more performance. It follows the law almost exactly. :)

But this is a generalized statement, because logic transistors such as ALUs, FPUs, branch predictors, and decoders use a lot of power per transistor, while caches are very power efficient. Power is pretty much the biggest limiter on performance nowadays.

Caches, done the way AMD does it, will add die area and cost, but they are a very power-efficient way to increase performance.
Thanks again for the informative reply.

What I was thinking initially is that instead of spending 36 mm² of silicon (44.4% more) on cache to get 15% more performance, they could spend it on a mixture of logic and cache to get 20% more performance according to the square root law. But that approach would probably incur many more additional costs, since the Zen 3 die would have to be redone.

If I understood correctly, what you're saying now is that because cache uses less power, X% more cache could result in more than sqrt(X) of a performance increase. That was not the case here, so it looks like the Zen 3 cores are not in huge need of more cache.
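For what it's worth, the numbers in this exchange can be checked in a line each; a quick sketch using the plain square-root relation, with nothing beyond the figures already quoted:

```python
import math

# Cypress Cove: 37% more transistors vs the quoted ~18% gain
print(math.sqrt(1.37))   # ~1.17, i.e. ~17% more performance predicted

# V-Cache: 44.4% more silicon -> ~20% predicted vs the ~15% observed in games
print(math.sqrt(1.444))  # ~1.20
```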
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
If I understood correctly, what you're saying now is that because cache uses less power, X% more cache could result in more than sqrt(X) of a performance increase. That was not the case here, so it looks like the Zen 3 cores are not in huge need of more cache.

In terms of power use, it'll add something like less than 5% on desktop chips. And you'll definitely get 5%+ performance gains on average (that's what we got with Broadwell!), and in scenarios like games and other applications with larger datasets, 10-20%, which is absolutely huge.

Never mind the square root law. That's superlinear scaling!
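To make "superlinear" concrete: if performance also scaled as the square root of power (which follows if power tracks transistor count, as argued earlier in the thread), less than 5% more power should buy only a couple of percent. A back-of-the-envelope check:

```python
import math

power_ratio = 1.05             # <5% extra power for the stacked cache
print(math.sqrt(power_ratio))  # ~1.025: the square-root relation predicts ~2.5%
print(1.15)                    # vs the ~15% claimed in games: superlinear
```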
 
Jul 27, 2020
16,279
10,316
106
This is your Alder Lake sample here


I'm confused. Isn't that link showing 97% increased latency? Elsewhere on the internet, I've read comments that DDR5's latency will be on par with DDR4's only past 8000 MT/s.
 

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136
Curious here. Does anyone have an idea of the memory size needed for storing, say, 90% of the most-used x86 instructions converted to uops?
 

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136
I'm confused. Isn't that link showing 97% increased latency? Elsewhere on the internet, I've read comments that DDR5's latency will be on par with DDR4's only past 8000 MT/s.
It's in the linked article.

CL - CAS Latency

CL40 latency (40-40-40-77) may raise some concern, but it's worth remembering that this is only the number of cycles needed to receive data from a memory cell. Higher bandwidth requires higher clock speeds for the RAM chips, but at the same time the time per cycle decreases. That's why, despite increasing CL values in successive generations of DDR memory, the actual access latency remains at a similar level.
 
Jul 27, 2020
16,279
10,316
106
That's why, despite increasing CL values in successive generations of DDR memory, the actual access latency remains at a similar level.

The reduction in latency is particularly impressive when you jump up to DDR3-1600: it only takes 33.5 ns to access main memory.

To show the memory latency of the Rocket Lake CPU, the Chiphell user first tested a DDR4-4000 kit with 18-20-20-40 1T timings. The 1:1 threshold led to 61.3 ns latency in AIDA64, while a DDR4-3600 14-14-14-34 2T kit with the uncore clock set to 4,100 MHz got 50.2 ns latency.

Nehalem running DDR3-1600 needed 33.5 ns to access main memory.
Rocket Lake running DDR4-3600 needs 50.2 ns.

Maybe I'm misunderstanding something? Yes, Rocket Lake fetches tons more data than Nehalem in a given time period, but the actual RAM access latency has increased, no? DDR5's leaked benchmarks are showing that it will go over 100 ns. How is that better?
 
Jul 27, 2020
16,279
10,316
106
Curious here. Does anyone have an idea of the memory size needed for storing, say, 90% of the most-used x86 instructions converted to uops?
Due to the sheer number of instructions CPUs can process per second, this is likely going to be in gigabytes. For the average PC user it could be much less, if they cycle between just a few frequently used applications/games.


You have this expensive logic block chunking away. Now we just stuff those micro-ops into the op-cache, all the decoding done, and the hit-rate there is really high [Clark: up to 90% on a lot of workloads], so that means we’re only doing that heavy-weight decode 10% of the time.

AMD is doing this on-chip, just not in a persistent non-volatile manner. This seems to suggest that the opcache size might be in megabytes at most.
 

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136




Nehalem running DDR3-1600 needed 33.5 ns to access main memory.
Rocket Lake running DDR4-3600 needs 50.2 ns.

Maybe I'm misunderstanding something? Yes, Rocket Lake fetches tons more data than Nehalem in a given time period, but the actual RAM access latency has increased, no? DDR5's leaked benchmarks are showing that it will go over 100 ns. How is that better?
Check out the table on this page for the actual ram module latencies.

DDR3-1600: 13.75 ns (CAS 11) down to 7.5 ns (CAS 6)
DDR4-3600: 10.56 ns (CAS 19) down to 8.33 ns (CAS 15)


 

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136
Due to the sheer number of instructions CPUs can process per second, this is likely going to be in gigabytes. For the average PC user it could be much less, if they cycle between just a few frequently used applications/games.




AMD is doing this on-chip, just not in a persistent non-volatile manner. This seems to suggest that the opcache size might be in megabytes at most.
I think I asked this question incorrectly. The present micro-op cache is only a few KB in size. All I wanted was how big it would have to be for most of the x86 instruction set.

As an example, Haswell was ~4 B/micro-op, so how many bytes for decoding most of the x86 instruction set?

edit: Corrected size of data
 

Rigg

Senior member
May 6, 2020
471
972
106
Will the X570 socket support this new cpu?
I think it would be a massive blunder not to support X570 and B550 (at least), assuming this ends up being released on the AM4 socket. It's hard to believe many people would invest in a new dead-end AM4 chipset when Zen 4 and AM5 are likely to be released within a year of this being available. I think they'll probably release a new chipset, but it would be a pretty limited market if these chips were exclusive to it. The most likely scenario, I think, is that we see these CPUs as an XT refresh in Q1 2022 on AM4, with desktop Zen 4 coming 9-12 months later on AM5. That's the best-case scenario for a clean transition to DDR5.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I think I asked this question incorrectly. The present micro-op cache is only a few KB in size. All I wanted was how big it would have to be for most of the x86 instruction set.

As an example, Haswell was ~4 B/micro-op, so how many bytes for decoding most of the x86 instruction set?

edit: Corrected size of data
It sounds like you are trying to reinvent microcode. For common instructions, the decoding is built directly into the hardware. The instruction may contain references to registers or addresses, so the decoded operation sent to the back end is a much wider thing that contains all of the necessary information. This is not a RISC op; it may carry register renaming, dependency information, and all sorts of things required by the back-end out-of-order execution engine. More complex instructions, and all of the old legacy x86 instructions that do not resolve to a few micro-ops, are probably handled in microcode, which is used to break a possibly complex instruction down into whatever stream of micro-ops is required.
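A loose conceptual model of that split, with made-up instruction names and micro-op strings purely for illustration (real decoders are vastly more complex, and these mappings are not any actual CPU's):

```python
# Common instructions hit hardwired decoders and expand to a few micro-ops;
# rare/legacy ones fall through to a microcode ROM sequence.
FAST_DECODE = {
    "add":  ["uop_add"],
    "mov":  ["uop_mov"],
    "load": ["uop_agen", "uop_load"],   # address generation + load
}

MICROCODE_ROM = {
    "rep_movsb": ["uop_load", "uop_store", "uop_inc", "uop_dec", "uop_branch"],
}

def decode(instruction: str) -> list[str]:
    """Dispatch to the hardwired decoders if possible, else to microcode."""
    if instruction in FAST_DECODE:
        return FAST_DECODE[instruction]
    return MICROCODE_ROM[instruction]   # microcode sequencer expands it

print(decode("add"))         # ['uop_add']
print(decode("rep_movsb"))   # the long microcoded expansion
```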
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
1Mb-16Mb-Serial-HP-MRAM.pdf (avalanche-technology.com)

Could this not be repurposed, if only just to store the decoded instructions between reboots?

MRAM is not being used for a reason. For CPU caches it's not proven/fast enough to run at 3 GHz+ frequencies. And if you then consider the additional production cost needed to have it as a separate package, you end up with little gain but many disadvantages.

The uop cache is in the range of KB. AMD's 4K-instruction uop cache is probably equal to 16-32 KB in size.
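That estimate is just the ~4 B/micro-op figure quoted above applied to a 4K-entry structure, with 8 B as an assumed upper bound for wider modern uops:

```python
entries = 4096                     # 4K-entry op cache
for bytes_per_uop in (4, 8):       # ~4 B (Haswell-era) to ~8 B (assumed)
    print(entries * bytes_per_uop // 1024, "KB")   # 16 KB and 32 KB
```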
 

naukkis

Senior member
Jun 5, 2002
706
578
136
I think I asked this question incorrectly. The present micro-op cache is only a few KB in size. All I wanted was how big it would have to be for most of the x86 instruction set.

As an example, Haswell was ~4 B/micro-op, so how many bytes for decoding most of the x86 instruction set?

edit: Corrected size of data

It seems like you're trying to make the micro-op cache convert the x86 ISA into something else? That's not how a mop cache works; an x86 instruction doesn't decode to something that can be reused for some other, not-yet-decoded x86 instruction. One instruction can be decoded into an almost unlimited number of different internal instructions. And the most important thing about a mop cache is that it isn't designed to hold individual decoded instructions but instruction sequences: a predecoded, stored sequence that the code keeps repeating, so that when looping, the whole instruction flow is available to the CPU's execution engine without any need to decode that part of the code again. RISC designs can and do use mop caches too, as it's a more efficient way to handle loops even when instruction decoding is easy.
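A minimal sketch of that "stored sequence" idea, using a toy fetch loop; the structure and names are invented for illustration, not any real front end:

```python
# Toy front end: the op cache is keyed by fetch address and stores whole
# predecoded uop sequences, so a hot loop body bypasses the decoders.
op_cache: dict[int, list[str]] = {}

def decode_block(addr: int) -> list[str]:
    """Stand-in for the heavyweight x86 decoders."""
    return [f"uop@{addr:#x}.{i}" for i in range(4)]

def fetch(addr: int) -> list[str]:
    if addr in op_cache:          # hit: stream the stored sequence directly
        return op_cache[addr]
    uops = decode_block(addr)     # miss: decode once, then fill the op cache
    op_cache[addr] = uops
    return uops

# A loop re-executing the same block only pays the decode cost once.
for _ in range(1000):
    fetch(0x401000)
print(op_cache[0x401000])
```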
 

Timmah!

Golden Member
Jul 24, 2010
1,418
630
136
It's called 3D V-Cache for a reason. Treating the problem as though you're working with a traditional two-dimensional chip with a larger area doesn't make sense in this context. They're just building on top of existing real estate, much like we add additional floors to buildings because trying to spread the same amount of office space out over a single floor at ground level would be too expensive from a real-estate perspective.

Regarding the building analogy, do we know how the data travels to this additional "floor"? Is it accessible at a single place, like a floor would be via a stairway/elevator shaft, or is it connected at many points all over the surface of the chip?