Question Zen4c vs E core Die area.

itsmydamnation · Jun 14, 2023

Exist50 said:
. Especially if you have per thread QoS requirements. Many Bergamo deployments will probably have SMT disabled for that reason.

What you mean like all those other cloud x86 servers....... wait ... what ?

BorisTheBlade82 · Jun 14, 2023

Abwx said:
... and SMT changes nothing in the equation, because power scales proportionally with the added throughput if frequency is kept constant.

I tried to get to the bottom of this a while back. My observation was, that when comparing two identical CPUs in the same TDP budget, where one has SMT and the other has not, there is almost no power tax for SMT.
In my specific case it was a Renoir 4700U vs. 4800U. The latter had around 25% more throughput at almost exactly the same consumption in CB23.
I can only assume, that power gating is not as finely grained as to disable every single pipeline stage for any fraction of time not being in use.

itsmydamnation · Jun 15, 2023

BorisTheBlade82 said:
I tried to get to the bottom of this a while back. My observation was, that when comparing two identical CPUs in the same TDP budget, where one has SMT and the other has not, there is almost no power tax for SMT.
In my specific case it was a Renoir 4700U vs. 4800U. The latter had around 25% more throughput at almost exactly the same consumption in CB23.
I can only assume, that power gating is not as finely grained as to disable every single pipeline stage for any fraction of time not being in use.

Also, the bigger the total core count, interconnect and IO become, the less impact a single core not power/clock gating for 100us has on total power consumption.

Exist50 · Jun 15, 2023

Markfw said:
There is no product from Intel out there with nothing but e-cores, so how can we compare ?

There's ADL-N, if we want to get technical, but the entire context of this conversation is theoretical abstractions based on what limited information we have available. Even Bergamo isn't out yet.

Abwx said:
Anyone who wants to understand how frequency and power relate should first understand what is explained here:

You should start with your own link. Dynamic power scales proportionally to Cdyn * Frequency * Voltage^2. You're focusing only on frequency, ignoring both the Cdyn and Voltage terms. Zen 4c isn't half the area with the same VF curve.
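A minimal sketch of that proportionality, for anyone who wants to plug in numbers. The function name and the example ratios are illustrative assumptions, not measurements of Zen 4c or any other core.

```python
def relative_dynamic_power(cdyn_ratio, freq_ratio, voltage_ratio):
    """Dynamic power relative to a baseline, using P ~ Cdyn * f * V^2.

    Each argument is a ratio versus the baseline design (1.0 = unchanged).
    """
    return cdyn_ratio * freq_ratio * voltage_ratio ** 2


# Hypothetical case A: frequency cut by 20%, nothing else changes.
freq_only = relative_dynamic_power(1.0, 0.8, 1.0)

# Hypothetical case B: same frequency cut, plus 15% less switched
# capacitance and 10% lower voltage (made-up values for illustration).
all_terms = relative_dynamic_power(0.85, 0.8, 0.90)

print(f"frequency term only: {freq_only:.2f}x baseline power")
print(f"all three terms:     {all_terms:.2f}x baseline power")
```

Looking only at the frequency term understates the change whenever Cdyn or voltage also moves.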

Abwx said:
Also, it's almost certain that Zen 4c uses more power-constrained libraries, so it should be a little more efficient than Zen 4 at the same low frequencies.

It actually doesn't. What's both interesting and impressive about Zen 4c is that they actually changed very little beyond targeting a lower frequency point. If anything, now that its benefits are proven in the wild, we'll likely see more divergence in future gens.

Abwx said:
I'm taking a best-case figure for the E-cores; even at only a 20% IPC difference there's roughly 50% more power needed to get the same ST perf, and SMT changes nothing in the equation, because power scales proportionally with the added throughput if frequency is kept constant.

If you care primarily about ST, then neither Bergamo nor Intel's Forest line makes sense. The big factor that you need to include is that one E-core is substantially smaller than one Zen 4c core. So from a product level, we'd see something more like 128 Zen 4c/5c vs 256c Crestmont. Makes things more interesting.

Markfw said:
And there are places where I use it, and I am sure even in the cloud it can come in handy.

The workloads targeted by these chips, by and large, do not make heavy use of vector instructions. It's certainly nice to have, but the heaviest vector workloads are stuff like AI, and often latency bound, hence running on the bigger cores.

Really the best reference would be to compare to Graviton, as that's why AMD and Intel created these products to begin with.

Exist50 · Jun 15, 2023

repoman27 said:
Isn't Bergamo on AMD's flavor of N5HPC though, not N4P?

Yes, think that's correct. Either way, will certainly be better than Intel 4. Intel 3, we'll see.

Exist50 · Jun 15, 2023

itsmydamnation said:
What you mean like all those other cloud x86 servers....... wait ... what ?

A similar tradeoff does exist with current products. For example, core counts for the frequency optimized SKUs, trading throughput for stronger individual cores/threads.

Many cloud workloads demand a certain level of performance per thread. Say, for example, a web server, which is expected to respond in a certain amount of time. This is also important for AWS/Azure/GCP pricing tiers. So looking at the problem here, if you actually take advantage of SMT and the throughput benefits it provides, you end up sacrificing a substantial amount of per thread performance. If your baseline is too low (say, equivalent to a 1.5GHz Zen 4 core), then you simply can't just grab the max core count SKU and run it with SMT.

But you pay the hardware vendor based on the core count, not how many threads you run. This can create interesting pricing niches when you think about it, and you can find some fun examples floating around.

Zen 4c is particularly attractive in this regard because you can fit ~twice the cores per area (i.e. per dollar), but each individual core in 1T mode provides more performance than each thread on Zen 4 with SMT. I brought this up in the old "future of SMT" thread, but this pricing dynamic has significantly reduced the importance of SMT. It's still useful for some cases (flexibility on one machine, max throughput regardless of perf per thread, etc), but no longer irreplaceable.
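A rough sketch of the per-thread trade-off described above. The 25% SMT uplift is the Cinebench figure quoted earlier in the thread for Renoir; the even split of throughput between the two threads and the function name are assumptions made for illustration.

```python
def per_thread_perf(core_perf_1t, smt_uplift, threads_per_core):
    """Average per-thread performance with every hardware thread busy.

    core_perf_1t: performance of the core running a single thread.
    smt_uplift:   extra core throughput from SMT (0.25 = +25%).
    Assumes the core's throughput is shared evenly between threads.
    """
    if threads_per_core == 1:
        return core_perf_1t
    return core_perf_1t * (1.0 + smt_uplift) / threads_per_core


one_thread = per_thread_perf(1.0, 0.25, 1)   # SMT off
two_threads = per_thread_perf(1.0, 0.25, 2)  # SMT on, both threads loaded

print(f"SMT off: {one_thread:.3f} per thread")
print(f"SMT on:  {two_threads:.3f} per thread")
```

Under those assumptions each SMT thread gets well under the full 1T performance of its core, which is why a per-thread QoS floor can rule out SMT even though total throughput goes up.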

itsmydamnation · Jun 15, 2023

You're completely ignoring IO. If you're making heavy DB queries etc., then the more virtual threads the better.

H433x0n · Jun 15, 2023

Exist50 said:
Though N4P is still the better node.

Where do you see that? I don’t see any public data on Intel 3 node characteristics. The available data on Intel 4 HP has it pretty far ahead of N4P.

I’ve seen you say this a few times and it seems like you’ve got a good pulse on this particular topic so I’m genuinely curious where that sentiment comes from.

BorisTheBlade82 · Jun 15, 2023

itsmydamnation said:
Also, the bigger the total core count, interconnect and IO become, the less impact a single core not power/clock gating for 100us has on total power consumption.

Generally yes. But specifically Renoir does not have that much uncore overhead.
I ran it with different cTDPs in order to find its maximum energy efficiency. The result was 12 W for 8 cores; only below that point did the uncore start to eat too much into the power budget.

desrever · Jun 15, 2023

Exist50 said:
Zen 4c is particularly attractive in this regard because you can fit ~twice the cores per area (i.e. per dollar), but each individual core in 1T mode provides more performance than each thread on Zen 4 with SMT. I brought this up in the old "future of SMT" thread, but this pricing dynamic has significantly reduced the importance of SMT. It's still useful for some cases (flexibility on one machine, max throughput regardless of perf per thread, etc), but no longer irreplaceable.

Each Zen 4c thread as a vCPU is probably stronger than what is needed in a lot of cloud use cases, so disabling SMT would be a waste. Some workloads wouldn't want SMT enabled, but for the general use case it is more than capable. With all IO being equal, they can sell an SMT core as 2 vCPUs versus just 1 vCPU if it is disabled, which makes it far more attractive. Considering SMT basically increases throughput by >20% for "free" and they can price these threads directly as vCPUs, why wouldn't they want this? Obviously there are limits, but this has been the case for most Intel/AMD servers in the cloud for a long time now. One Zen 4c thread is probably more powerful than one thread of Ice Lake or whatever they would replace.

Some workloads require more performance than 1 thread can offer, which is why they have the SKUs without SMT, but it's not like SMT isn't useful.

coercitiv · Jun 15, 2023

BorisTheBlade82 said:
I tried to get to the bottom of this a while back. My observation was, that when comparing two identical CPUs in the same TDP budget, where one has SMT and the other has not, there is almost no power tax for SMT.
In my specific case it was a Renoir 4700U vs. 4800U. The latter had around 25% more throughput at almost exactly the same consumption in CB23.

That is the exact opposite of what I and others have observed: SMT proportionally increases power usage based on the increase in throughput it is able to provide for a given workload.

I would also caution you not to measure the energy impact of SMT on two different CPU dies, let alone two different bins. We do this when we have no other choice, but in the case of SMT we can use the same CPU die and just disable/enable SMT, removing die variance and binning variance. Testing with a TDP cap that is easily reached by the die even with SMT disabled can also be tricky, as enabling SMT may drop clocks, keeping the die in a more efficient operating point. This can interfere with measurements depending on what one needs to evaluate. Combine a better binned die with a relatively low TDP cap and the efficiency double-dip can make SMT look like free performance. (which isn't necessarily false if all one wants is to improve efficiency)

Exist50 · Jun 15, 2023

desrever said:
Each Zen 4c thread as a vCPU is probably stronger than what is needed in a lot of cloud use cases, so disabling SMT would be a waste. Some workloads wouldn't want SMT enabled, but for the general use case it is more than capable. With all IO being equal, they can sell an SMT core as 2 vCPUs versus just 1 vCPU if it is disabled, which makes it far more attractive. Considering SMT basically increases throughput by >20% for "free" and they can price these threads directly as vCPUs, why wouldn't they want this? Obviously there are limits, but this has been the case for most Intel/AMD servers in the cloud for a long time now. One Zen 4c thread is probably more powerful than one thread of Ice Lake or whatever they would replace.

Some workloads require more performance than 1 thread can offer, which is why they have the SKUs without SMT, but it's not like SMT isn't useful.

For general use cases, they're positioning Genoa as the default. Bergamo is more of a targeted option. And of course, in a vacuum, CSPs would love to have more threads to offer and to net the "free" throughput SMT provides, but ultimately, customers demand more than just raw throughput. There will certainly be deployments of Bergamo with SMT, but don't be surprised if many companies disable it. I expect that will be particularly common with web-heavy deployments (e.g. Google, Meta, etc).

Though on the topic, AWS now defines 1 vCPU as 1 core, not one SMT thread. I think that might also help them avoid any side channel concerns with SMT. Will be interesting to see if Microsoft and Google follow.

Shmee · Jun 15, 2023

What exactly is Zen 4c? Is this a Zen 4 refresh?

Timorous · Jun 15, 2023

Shmee said:
What exactly is Zen 4c? Is this a Zen 4 refresh?

Density optimised Zen 4. The chiplet itself is less than 10% larger and it has double the core count vs the standard Zen 4 CCD. That is split into 2 CCXs each with 16MB of L3 so per core L3 is halved.
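The per-core L3 arithmetic, as a quick sketch. The Zen 4c numbers come from the post above; the standard Zen 4 CCD layout (one 8-core CCX with 32MB of L3) is not stated in the post and is filled in here as an assumption, as is the function name.

```python
def l3_per_core_mb(ccx_count, l3_per_ccx_mb, cores_per_ccd):
    """L3 cache available per core on one chiplet, in MB."""
    return ccx_count * l3_per_ccx_mb / cores_per_ccd


# Assumed standard Zen 4 CCD: 1 CCX, 32MB L3, 8 cores.
zen4_ccd = l3_per_core_mb(1, 32, 8)

# Zen 4c CCD as described above: 2 CCXs of 16MB each, 16 cores.
zen4c_ccd = l3_per_core_mb(2, 16, 16)

print(f"Zen 4:  {zen4_ccd:.1f} MB L3 per core")
print(f"Zen 4c: {zen4c_ccd:.1f} MB L3 per core")
```

Total L3 per chiplet is the same in both cases; it is simply shared by twice as many cores, and each core can only reach the 16MB in its own CCX.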

moinmoin · Jun 15, 2023

Shmee said:
What exactly is Zen 4c? Is this a Zen 4 refresh?

In AMD pictures:

Abwx · Jun 15, 2023

Exist50 said:
You should start with your own link. Dynamic power scales proportionally to Cdyn * Frequency * Voltage^2. You're focusing only on frequency, ignoring both the Cdyn and Voltage terms. Zen 4c isn't half the area with the same VF curve.

This shows that you don't really understand the subject...

Including both the capacitance and the frequency is, in a way, using the same parameter twice.

The current through an ideal MOSFET increases as the square of the voltage.

To increase the current by a factor X, and hence the frequency by the same factor X, you'll have to increase the voltage by sqrt(X).

For instance, to increase frequency by a factor of 2, voltage must be increased by 1.414x.
Power increases by 2x if we account only for this factor, but since frequency is also increased by a factor of 2, total power increases by a factor of 4.

So we can write P(f) = f^2 without normalizing the equation. For instance, if a CPU uses 100W at 5GHz, the normalized relation would be:

P(f) = 4 * f^2, with frequency in GHz.

That is, power increases quadratically with frequency, but keep in mind that this is a theoretical best case and that real MOSFETs do not exhibit that good a power/frequency slope; generally the exponent is between 2.2 and 2.8 depending on the process.

As for the capacitance, it is not needed in this relation because it is assumed to be at its maximum value, since we are talking about a CPU working at full throughput.

Now, Intel takes great care to linearize its process as much as possible, and it generally has a better slope than TSMC, which seems more concerned about time to market. If we look at ADL, for instance, they manage a 2.2 exponent, while TSMC's 7nm process hovers at 2.6-2.8 depending on the exact process iteration. But that's only part of the story, because TSMC has lower capacitance to begin with, so at low power/low frequency their process has better perf/watt at an equivalent node.
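The normalization in this post can be reproduced with a short sketch. It only encodes the simplified model stated above (power as a pure power law of frequency, anchored at 100W at 5GHz); the function names are hypothetical, and the 2.5GHz evaluation point is just an example.

```python
def fit_power_law(ref_power_w, ref_freq_ghz, exponent):
    """Return k so that P(f) = k * f**exponent passes through the reference point."""
    return ref_power_w / ref_freq_ghz ** exponent


def power_at(freq_ghz, k, exponent):
    """Evaluate the power-law model P(f) = k * f**exponent (watts, GHz)."""
    return k * freq_ghz ** exponent


# Ideal case from the post: exponent 2, 100W at 5GHz, so k = 100 / 25.
k_ideal = fit_power_law(100.0, 5.0, 2.0)

# Compare the ideal exponent with the 2.2 and 2.8 figures quoted in the post,
# each re-anchored to the same 100W at 5GHz reference point.
for n in (2.0, 2.2, 2.8):
    k = fit_power_law(100.0, 5.0, n)
    print(f"exponent {n}: k = {k:.3f}, P(2.5GHz) = {power_at(2.5, k, n):.1f} W")
```

A steeper exponent means power falls off faster as frequency drops from the reference point. The model says nothing about differences in capacitance between designs, since it folds them into the constant k.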

igor_kavinski · Jun 15, 2023

BorisTheBlade82 said:
My observation was, that when comparing two identical CPUs in the same TDP budget, where one has SMT and the other has not, there is almost no power tax for SMT.
In my specific case it was a Renoir 4700U vs. 4800U.

Your observation may not hold true for Intel architectures. Intel engineers haven't seemed to figure out "free" SMT.

Exist50 · Jun 15, 2023

Abwx said:
Including both the capacitance and the frequency is, in a way, using the same parameter twice.

All this tells me is that you have no idea what that term is. To give the simplest possible example, if you have two identical transistors instead of one, that term would ~double. So no, it is not in any way redundant, nor scales with frequency. It's a constant for a given design and workload.

BorisTheBlade82 · Jun 15, 2023

coercitiv said:
I would also caution you not to measure the energy impact of SMT on two different CPU dies, let alone two different bins. We do this when we have no other choice, but in the case of SMT we can use the same CPU die and just disable/enable SMT, removing die variance and binning variance.

Yep, I gladly and wholeheartedly agree with you. The trouble is, that I do not have a machine at my disposal that would allow me that kind of thorough testing.

Markfw · Jun 15, 2023

BorisTheBlade82 said:
Yep, I gladly and wholeheartedly agree with you. The trouble is, that I do not have a machine at my disposal that would allow me that kind of thorough testing.

The only thing about SMT that I have noticed, is when you have an application that is very sensitive to L3 cache, either disabling it, or running 50% of the normal jobs helps a lot. Other than that, I see no penalty. If I thought it was worth testing to prove to some people that it makes no change, I would test it for you. But some people are "always right" and can not be convinced.

igor_kavinski · Jun 15, 2023

Markfw said:
The only thing about SMT that I have noticed, is when you have an application that is very sensitive to L3 cache, either disabling it, or running 50% of the normal jobs helps a lot.

Yeah. The HT threads vie with the normal threads for resources, putting pressure on the limited cache. AMD V-cache is the solution.

A/// · Jun 15, 2023

Unrelated, but that new Linus video has him using a Chinese mini PC with a 12490F or whatever, with a slightly larger cache. I went to bed six minutes in because I couldn't take any more of his whiny voice, but I wonder if that particular China-only CPU was a kind of test bed for future CPUs with larger-than-standard L3 cache or anything else.

I'd post more on this, but I feel awful having spent a few hours in the rain the other day repairing my shed.

igor_kavinski · Jun 15, 2023

A/// said:
I'd post more on this, but I feel awful having spent a few hours in the rain the other day repairing my shed.

Could have just put a tarp on the shed (or the part of it that needed repair) and run for cover.

igor_kavinski · Jun 15, 2023

A/// said:
I wonder if that particular China-only CPU was a kind of test bed for future CPUs with larger-than-standard L3 cache or anything else.

Could just be rejected 12600K dies with the E-core cluster disabled.

igor_kavinski · Jun 15, 2023

igor_kavinski said:
Could just be rejected 12600K dies with the E-core cluster disabled.

What do you know? It actually might be: https://www.tomshardware.com/news/i...ew-chinas-exclusive-black-edition-gaming-chip

C0 stepping chips like our 12490F actually have a total of eight P-cores and eight E-cores, but Intel disables the extra cores to trim it down to a 6+0 design.
