Question Zen 6 Speculation Thread

soresu · Sunday at 4:09 PM

poke01 said:
and even by then who knows what those Cambridge folks cook up to counter it

Cambridge as in ARM in general or the Cambridge CPU design team?

Because the latter aint done much worth writing home about since A53 going by lackluster showings of A5xx µArch lineage, including C1 Nano which is poor as hell for the timespan since A510.

It would be interesting if ARM went down the road of pursuing these slice out of order µArch's for future Cx Nano IP, though I strongly doubt that will happen.

DrMrLordX · Sunday at 10:06 PM

Hulk said:
Anyway while doing this my CPU load is about 70% while the GPU is about 90%. Thing is the 9950X still isn't tapped out for cores. I know this is anecdotal because I'm looking at one of my use cases, but what I really need is more ST performance, not moar cores. Same thing when I'm running Studio One.

That's perfectly normal. 16c is already overkill for most users. It's up to you to figure out what (if anything) to do with all that extra compute capacity.

Moving the halo desktop product from 16c->24c let's them shift 16c SKUs to lower price tiers (maybe, and that doesn't mean 16c is necessarily going to be cheaper) and lets them increase the price target for their new halo product. 16 P cores is already more than enough for a significant number of power users.

gdansk · Sunday at 10:08 PM

DrMrLordX said:
maybe

They may kill 16c SKUs. Too choppy.

DrMrLordX · Sunday at 10:12 PM

gdansk said:
They may kill 16c SKUs. Too choppy.

Maybe? It wouldn't be a huge loss (per se), but it does make the argument for dual CCD SKUs a little weird when you can get 12c on one CCD.

Tuna-Fish · Monday at 5:21 AM

gdansk said:
They may kill 16c SKUs. Too choppy.

Isn't there supposed to be a 8c low-cost CCD alongside the 12c main CCD?

I'd assume that the SKUs are 6, 8, 12 (+X3D), 16, 20 and 24 (+X3D).

With the 16c SKU very close in price to the 12c one, but actually worse in games, because it has less L3 per CCD.

luro · Monday at 5:40 AM

Tuna-Fish said:
Isn't there supposed to be a 8c low-cost CCD

I think that’s Zen 7

Gideon · Monday at 6:55 AM

Hulk said:
The more I analyze the core usage during my work flow the more I'm wondering how much improvement from from 16 to 24 cores will provide outside of Cinebench and other benches? For example, last night after shooting a video of my daughters musical I'm home splitting the instrumental and vocals out of the audio track using UVR on pretty high settings while rendering a video using Vegas Pro using the Voukoder encoder to H.265. It's tough on the processor as it "lights up" transistors in a dense way that creates hot spots.

Anyway while doing this my CPU load is about 70% while the GPU is about 90%. Thing is the 9950X still isn't tapped out for cores. I know this is anecdotal because I'm looking at one of my use cases, but what I really need is more ST performance, not moar cores. Same thing when I'm running Studio One.

I know this is impossible from a "building the stack" point-of-view for AMD as well as the Cinebench competition, but what would be nice if they built a processor for me would be 1 CCD devoted soley to ST performance through architecture and frequency with perhaps 8 or 10 cores, and the second CCD for MT with 12 or 14 cores.

While I doubt you yourself would be interested at all in this, I really wish someone would deliver you a M5 Pro Mac mini (once it's released this spring) and you'd run a similar workloads on it for comparison.

Studio One and UVR are natively supported. Vegas Pro is not (which might be a dealbreaker), but DaVinci Resolve IS supported along with Voukoder (beta).

An alterantive is to setup a test-case on an open source video others could replicaet, but this doesn't really doable here (since it's not a canned benchmark but actual complex workflow). The other thing is that the base 16GB probably isn't enough for running these in parallel, or is it a non-issue?

All in all given the large caches, the unified memory and PCIE5 equivalent SSD (and all the other things mentioned in this interview of the Anand Lal Shimpi: https://forums.anandtech.com/threads/apple-silicon-soc-thread.2587205/post-41585933, I'd assume it would run very well.

Tangopiper · Monday at 7:37 AM

Tuna-Fish said:
Isn't there supposed to be a 8c low-cost CCD alongside the 12c main CCD?

Zen 6 = 12 core and 32 core CCDs

Zen 7 = 8 core, 16 core and 30-something core CCDs.

jdubs03 · Monday at 7:52 AM

Gideon said:
While I doubt you yourself would be interested at all in this, I really wish someone would deliver you a M5 Pro Mac mini (once it's released this spring) and you'd run a similar workloads on it for comparison.

I’ll take a delivered M5 Pro Mac Mini 😉
If someone is being charitable and all…

Hulk · Monday at 11:38 AM

Gideon said:
While I doubt you yourself would be interested at all in this, I really wish someone would deliver you a M5 Pro Mac mini (once it's released this spring) and you'd run a similar workloads on it for comparison.

Studio One and UVR are natively supported. Vegas Pro is not (which might be a dealbreaker), but DaVinci Resolve IS supported along with Voukoder (beta).

An alterantive is to setup a test-case on an open source video others could replicaet, but this doesn't really doable here (since it's not a canned benchmark but actual complex workflow). The other thing is that the base 16GB probably isn't enough for running these in parallel, or is it a non-issue?

All in all given the large caches, the unified memory and PCIE5 equivalent SSD (and all the other things mentioned in this interview of the Anand Lal Shimpi: https://forums.anandtech.com/threads/apple-silicon-soc-thread.2587205/post-41585933, I'd assume it would run very well.

I hear you but I'm 50 years into x86.

coercitiv · Monday at 1:20 PM

Hulk said:
I hear you but I'm 50 years into x86.

Damn, I didn't even have to caption this one.

OneEng2 · Monday at 11:51 PM

Hulk said:
From what I remember the biggest flaw with Bulldozer was CMT. Basically two logical processors sharing execution resources. It was an attempt to maximize compute/die area. Kaveri fixed a lot of the front end starvation issues by adding per core decoders and better branch prediction but by this time it was too late because Skylake was "born" about this same time and the Bulldozer era cores couldn't compete. A new direction was needed and AMD realized that and Zen was born.

It's been a hot minute, but I thought it a combination of sharing the decode and fetch, lack of bandwidth and of course, very poor cache design.

It's been an awful long time though and you could be right.

Hulk said:
The more I analyze the core usage during my work flow the more I'm wondering how much improvement from from 16 to 24 cores will provide outside of Cinebench and other benches? For example, last night after shooting a video of my daughters musical I'm home splitting the instrumental and vocals out of the audio track using UVR on pretty high settings while rendering a video using Vegas Pro using the Voukoder encoder to H.265. It's tough on the processor as it "lights up" transistors in a dense way that creates hot spots.

Anyway while doing this my CPU load is about 70% while the GPU is about 90%. Thing is the 9950X still isn't tapped out for cores. I know this is anecdotal because I'm looking at one of my use cases, but what I really need is more ST performance, not moar cores. Same thing when I'm running Studio One.

I know this is impossible from a "building the stack" point-of-view for AMD as well as the Cinebench competition, but what would be nice if they built a processor for me would be 1 CCD devoted soley to ST performance through architecture and frequency with perhaps 8 or 10 cores, and the second CCD for MT with 12 or 14 cores.

I don't think AMD is designing for desktop 😉.

poke01 said:
Everyone needs more ST, not everyone needs more cores.

Not DC. They want as many cores as they can get and all the bandwidth to feed it. AMD has said many times that Zen 6 was "Server First" design.

I do like Hulks idea of the one "super core" 😉 to add as one of the CCD's. I do wonder if this gets them into scheduling trouble though.

basix · Tuesday at 2:46 AM

There was a post or news related to DC and ST, maybe even here in this forum. It basically said, that per thread performance gains much upwind because memory cost is so high these days. Allegedly even turning off SMT to get higher ST performance. Many applications have fixed memory pools per thread / instance. The higher your ST performance, the better the perf / cost ratio.

coercitiv · Tuesday at 3:03 AM

basix said:
There was a post or news related to DC and ST, maybe even here in this forum. It basically said, that per thread performance gains much upwind because memory cost is so high these days. Allegedly even turning off SMT to get higher ST performance. Many applications have fixed memory pools per thread / instance. The higher your ST performance, the better the perf / cost ratio.

It was a post on the side-effects of memory shortage.

StefanR5R · Tuesday at 9:50 AM

Also remember that there are silly software licensing schemes in some not so narrow niches, which call for high per-core performance in servers.

Hulk · Tuesday at 10:00 AM

poke01 said:
No ISA is inherently better. There’s plenty of ARM designs with worse efficiency than
x86 designs.

Im looking forward to nova lake with APX but I believe it won’t be till Unified core that Intel can totally beat ARM and even by then who knows what those Cambridge folks cook up to counter it.

So Intel's IBoT gets to my optimization point exactly. Intel is basically intercepting code on the fly and reordering it to operate more efficiently on the Arrow Lake platform. For example, making the branch predictor miss less often or making sure the right threads go to the right cores. This can bring massive performance increases because a generic compiler is used that just works for all cpus. It would be a nightmare to have to compile, validate and test for both AMD and Intel cpus, nevermind various generations.

Apple does not have this restriction. Software only needs to be compiled for one basic architecture and that is a HUGE advantage for the Apple CPU's. It is convenient that IBoT arrived perfectly timed to support my supposition. I will admit that!

Hulk · Tuesday at 10:02 AM

adroc_thurston said:
No.
The biggest issue with Bulldozer was awful caches.

Zen was started in 2012, long-long before SKL reached the market.

"Born" is a metaphor referring to "brought into the world."
Of course there was a gestation period before the actual birth of Zen.

Nothingness · Tuesday at 12:43 PM

Hulk said:
So Intel's IBoT gets to my optimization point exactly. Intel is basically intercepting code on the fly and reordering it to operate more efficiently on the Arrow Lake platform. For example, making the branch predictor miss less often or making sure the right threads go to the right cores. This can bring massive performance increases because a generic compiler is used that just works for all cpus. It would be a nightmare to have to compile, validate and test for both AMD and Intel cpus, nevermind various generations.

Apple does not have this restriction. Software only needs to be compiled for one basic architecture and that is a HUGE advantage for the Apple CPU's. It is convenient that IBoT arrived perfectly timed to support my supposition. I will admit that!

The Apple advantage is that their uarch is good at anything you throw at it. That should be the case of AMD and Intel uarch too. If anything similar is made for Apple, you'll see similar gains. The thing is that Intel needs that tool to be competitive.

coercitiv · Tuesday at 1:07 PM

Nothingness said:
The Apple advantage is that their uarch is good at anything you throw at it. That should be the case of AMD and Intel uarch too. If anything similar is made for Apple, you'll see similar gains. The thing is that Intel needs that tool to be competitive.

I think the presumed advantage for Apple CPUs is that this kind of optimization is already baked in at compilation time, anything intended for MacOS is targeting a single vendor and a relatively small family of cores.

On the other side, we have a very weird situation with iBOT, there are measurable benefits to binary optimization but no practical way to leverage such optimization. All we get for now is a completely tainted Geekbench score database, where Arrow Lake and Panther Lake CPUs will feature scores indicative of what they could achieve but likely never will.

Nothingness · Tuesday at 2:34 PM

coercitiv said:
I think the presumed advantage for Apple CPUs is that this kind of optimization is already baked in at compilation time, anything intended for MacOS is targeting a single vendor and a relatively small family of cores.

If Intel tool limits itself to code layout, then Apple compiler has no advantage here. Also Apple has been extending the ISA and changed the uarch several times (bpred changes, number of units, etc.) so they are facing similar issues as Intel and AMD.
I agree with you they have an overall advantage, but I'm not sure it's at the compiler level (I don't think anyone proved their compiler does things specific, and IIRC Andrei checked that himself back then).

On the other side, we have a very weird situation with iBOT, there are measurable benefits to binary optimization but no practical way to leverage such optimization. All we get for now is a completely tainted Geekbench score database, where Arrow Lake and Panther Lake CPUs will feature scores indicative of what they could achieve but likely never will.

Agreed. But if they reached the improvement without cheating, I'd blame PrimateLabs for not pushing their benchmark enough 🙂

Jan Olšan · Tuesday at 2:35 PM

I don't see the scores as particularly tainted. Geekbench is such a mess anyway (the ST Mk.2 a.k.a. MT score for example) and has a huge variance anyway.

It would be nice if the iBOT thing got people's attention and it lead to such binary optimizers becoming a more maisntream thing, which could be a useful thing, so Intel providing a nudge would be welcome.

I'm just not totally sure if it is viable wrt risk of miscompilation (transcompilation?) bugs. Kinda sucks if you can get your game or program 8% faster BUT you don't know it's working properly.

Or, if this could nudge compiler devs and software devs to up their game, to use more thorough performance tuning, runtime SIMD detection and selection and other performance boosting tricks by default. Perhaps it would turn out tools like iBOT are not so sorely needed and the mission would be accomplished this way.

Nothingness · Tuesday at 2:46 PM

Jan Olšan said:
Or, if this could nudge compiler devs and software devs to up their game, to use more thorough performance tuning, runtime SIMD detection and selection and other performance boosting tricks by default. Perhaps it would turn out tools like iBOT are not so sorely needed and the mission would be accomplished this way.

I don't expect compiler devs to become lazy. OTOH the risk of other devs to care even less about optimization is high. They are already facing the issue of having to target different uarch and ISA (how long did it take for AVX2 to become mainstream? 10 years?), and given their tight schedules to release SW, and the existence of a tool that will take care of optimizing their work on each client platform, why spend time on that?

Hulk · Tuesday at 2:53 PM

coercitiv said:
I think the presumed advantage for Apple CPUs is that this kind of optimization is already baked in at compilation time, anything intended for MacOS is targeting a single vendor and a relatively small family of cores.

On the other side, we have a very weird situation with iBOT, there are measurable benefits to binary optimization but no practical way to leverage such optimization. All we get for now is a completely tainted Geekbench score database, where Arrow Lake and Panther Lake CPUs will feature scores indicative of what they could achieve but likely never will.

Exactly. Apple doesn't need the tool because software is ONLY compiled and optimized for Apple architectures. AMD currently is the one who doesn't need the tool because they are competing on the same instruction set as Intel.

Apple has great architecture, a tightly controlled platform, curated software, limited hardware variation, and they don't need compatibilies with ISA's that have been changing over the last 50 years.

If Apple were to compete in the x86 market today with a CPU it would be competitive (at best) with AMD and Intel. Apple engineers aren't any better or worse than the AMD/Intel engineers, they are working with the inherent advantages mentioned above.

This is not a knock on Apple, AMD, or Intel. It's simply the reality of the situation.

adroc_thurston · Tuesday at 2:54 PM

Hulk said:
Exactly. Apple doesn't need the tool because software is ONLY compiled and optimized for Apple architectures. AMD currently is the one who doesn't need the tool because they are competing on the same instruction set as Intel.

Apple has great architecture, a tightly controlled platform, curated software, limited hardware variation, and they don't need compatibilies with ISA's that have been changing over the last 50 years.

If Apple were to compete in the x86 market today with a CPU it would be competitive (at best) with AMD and Intel. Apple engineers aren't any better or worse than the AMD/Intel engineers, they are working with the inherent advantages mentioned above.

This is not a knock on Apple, AMD, or Intel. It's simply the reality of the situation.

No dawg they just make better cores.
That's it. Their CPU and SOC teams are just very very good.

Hulk · Tuesday at 2:59 PM

Jan Olšan said:
I don't see the scores as particularly tainted. Geekbench is such a mess anyway (the ST Mk.2 a.k.a. MT score for example) and has a huge variance anyway.

It would be nice if the iBOT thing got people's attention and it lead to such binary optimizers becoming a more maisntream thing, which could be a useful thing, so Intel providing a nudge would be welcome.

I'm just not totally sure if it is viable wrt risk of miscompilation (transcompilation?) bugs. Kinda sucks if you can get your game or program 8% faster BUT you don't know it's working properly.

Or, if this could nudge compiler devs and software devs to up their game, to use more thorough performance tuning, runtime SIMD detection and selection and other performance boosting tricks by default. Perhaps it would turn out tools like iBOT are not so sorely needed and the mission would be accomplished this way.

I agree. Scores are only "tainted" if you are compiling a benchmark in favor of one architecture without doing the same for the other. The reason for this is that you would be tricking people into believe one architecturer is better based on a score of a "suped up" benchmark for the architecture and the performance advantage would not be indicative of other appplications outside of that benchmark.

Now if Intel for example optimized Handbrake for Arrow Lake and did this transparently by telling everyone they developed an Arrow Lake version and it had 20% better performance, then I have no problem with that because it IS the actual application people will be using, not a performance reference point. Again, transparency is the key here.

In many ways this is similar to Intel, AMD, and nVidia having their own graphics software for their architecture, which all run the same games. I don't think we're going there because of the nightmare of software validation but that's the analogy. This "on the fly" fine tuning by Intel is clever. They just need to be totally transparent about how and when it works.

Question Zen 6 Speculation Thread

Diamond Member

Lifer

Diamond Member

Lifer

Golden Member

Member

Platinum Member

Member

Golden Member

Diamond Member

Diamond Member

Golden Member

Senior member

Diamond Member

Elite Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Diamond Member

Diamond Member