Discussion: Intel Nova Lake in H2-2026


dullard

Elite Member
May 21, 2001
Wow, this is an aggressive path for Intel, releasing basically 4 new core architectures inside of 12 months if all goes well. ;) I wish them luck.
As others have said, Nova Lake will have Coyote Cove (which is basically just Panther Cove with a few things missing and, as 511 pointed out, not to be confused with Panther Lake). The information we have so far is that its focus is on larger IPC gains, efficiency, and APX. https://www.tomshardware.com/pc-com...th-big-ipc-improvements-support-for-intel-apx
APX will require software developers to recompile, so don't expect instant gains from APX -- especially in initial reviews on older software. But over time, as software is updated, you'll see improvements.
 
  • Like
Reactions: Hulk

Thunder 57

Diamond Member
Aug 19, 2007
As others have said, Nova Lake will have Coyote Cove (which is basically just Panther Cove with a few things missing and, as 511 pointed out, not to be confused with Panther Lake). The information we have so far is that its focus is on larger IPC gains, efficiency, and APX. https://www.tomshardware.com/pc-com...th-big-ipc-improvements-support-for-intel-apx
APX will require software developers to recompile, so don't expect instant gains from APX -- especially in initial reviews on older software. But over time, as software is updated, you'll see improvements.

We've seen this movie before with 64-bit. 64-bit alone didn't do much in most cases, but software still needed to be recompiled to know those extra GPRs were there.
 

511

Diamond Member
Jul 12, 2024
There are a few things in APX that let your code have fewer branches IIRC, which is nice
 

dullard

Elite Member
May 21, 2001
We've seen this movie before with 64-bit. 64-bit alone didn't do much in most cases, but software still needed to be recompiled to know those extra GPRs were there.
True. Any CPU with significant new features often does better in hindsight than in the first reviews. This happens as software is recompiled for, or better yet optimized for, the new CPU.
 
  • Like
Reactions: Thunder 57

dullard

Elite Member
May 21, 2001
How does that work? Hopefully not like Branchless Doom :D .
The concept has been around for decades.

Standard Method
If-Then-Else statements, when compiled into machine language, produce a lot of code and a lot of jumps, and all of those jumps need to be predicted by the branch predictor for optimum speed. That prediction just doesn't work as well as you'd want once things get even remotely complex.
  1. The code needs to evaluate if something is true or false.
  2. If it is false, then run the false code.
    1. Then jump to the end of the true code.
  3. Otherwise run the true code.
  4. Then join the two paths back together.
Predicated If-Conversion Method
Instead of If-Then-Else statements, you can write just a couple of lines of code and have only the necessary code run: far fewer lines of machine language, no jumps, and nothing to predict.
  1. Run True code if necessary
  2. Run False code if necessary
Half as many pseudocode lines. No branching. Nothing to predict. https://en.wikipedia.org/wiki/Predication_(computer_architecture)#Overview
Now the compiler can schedule #1 and #2 in whatever order it determines is optimal, or the out-of-order hardware can execute both sides at once and keep only the result it needs.
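Here is a rough sketch in C of what those two shapes look like (the functions and names are made up for illustration; whether the compiler actually emits a branch or a conditional move depends on the compiler and optimization flags):

Code:
/* Standard method: compiles naturally into a compare, a conditional jump
   over one path, the two code paths, and a join point -- the jump is what
   the branch predictor has to guess. */
int clamp_branchy(int x, int limit)
{
    if (x > limit)        /* 1. evaluate true/false */
        return limit;     /* the "true" code        */
    else
        return x;         /* the "false" code       */
}

/* Predicated / if-converted method: compute both candidates and let the
   condition select one. Compilers commonly turn this into a conditional
   move (cmov) -- no jump, nothing to predict. */
int clamp_branchless(int x, int limit)
{
    return (x > limit) ? limit : x;   /* select, typically via cmov */
}

GCC and Clang already do this kind of if-conversion for small cases like the one above; the point of APX is to make it worthwhile for larger code regions.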

But the drawback is that if the code is complex, the predicated method just can't be done with such a limited number of registers. APX's extra registers give the compiler far more opportunities to remove the if-statements entirely.
 
  • Like
Reactions: Elfear and 511

Thunder 57

Diamond Member
Aug 19, 2007
The concept has been around for decades.

Standard Method
If-Then-Else statements, when compiled into machine language, produce a lot of code and a lot of jumps, and all of those jumps need to be predicted by the branch predictor for optimum speed. That prediction just doesn't work as well as you'd want once things get even remotely complex.
  1. The code needs to evaluate if something is true or false.
  2. If it is false, then run the false code.
    1. Then jump to the end of the true code.
  3. Otherwise run the true code.
  4. Then join the two paths back together.
Predicated If-Conversion Method
Instead of If-Then-Else statements, you can write just a couple of lines of code and have only the necessary code run: far fewer lines of machine language, no jumps, and nothing to predict.
  1. Run True code if necessary
  2. Run False code if necessary
Half as many pseudocode lines. No branching. Nothing to predict. https://en.wikipedia.org/wiki/Predication_(computer_architecture)#Overview

But the drawback is that if the code is complex, the predicated method just can't be done with such a limited number of registers. APX's extra registers give the compiler far more opportunities to remove the if-statements entirely.

If-Then-Else-Switch all take time to evaluate, but they've stuck around because they work. Branch prediction is extremely accurate. It's an interesting conversation, but hasn't the jump to 16 GPRs and register renaming already given us a lot of those gains? I'd like to see x86 match ARM with APX, but I believe the returns will be limited.

I appreciate the link though.
 

dullard

Elite Member
May 21, 2001
If-Then-Else-Switch all take time to evaluate, but they've stuck around because they work. Branch prediction is extremely accurate. It's an interesting conversation, but hasn't the jump to 16 GPRs and register renaming already given us a lot of those gains? I'd like to see x86 match ARM with APX, but I believe the returns will be limited.

I appreciate the link though.
Even with perfect prediction, you still have a jump and a join to execute. And any branch prediction miss is a major delay, so even missing 1% of the time can slow things down quite a bit.
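As a rough back-of-the-envelope illustration (made-up but plausible numbers): assume a mispredict costs ~15 cycles of thrown-away work and a correctly predicted branch is essentially free. A branch that misses 1% of the time then adds an average of 0.01 × 15 = 0.15 cycles per execution; in a loop iteration that otherwise does about 5 cycles of real work, that single branch makes the loop roughly 3% slower, and hot code rarely has just one branch.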

The x86 vs ARM compiler difference discussion is actually well past my knowledge. And it is best left for another thread.
 

MS_AT

Senior member
Jul 15, 2024
The concept has been around for decades.
I think you have gone a bit too far with respect to APX. It simply introduces more conditional instructions that operate based on the status flags modified by preceding instructions. The benefit is that you save branch predictor buffer entries; the negative is that you introduce an explicit dependency that cannot be reordered around. In other words, if your condition is very predictable, stick to branches; if your condition is on the more random side, use conditional instructions.
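A rough C sketch of that trade-off (plain C, not APX syntax -- just the general principle compilers already apply with cmov today):

Code:
/* Binary search: the chosen half feeds the NEXT iteration's memory access.
   If the select compiles to a conditional move, each iteration must wait
   for the load and compare to resolve before the next load can start (that
   is the explicit dependency). If it compiles to a branch, the predictor
   guesses and the core speculates ahead -- great when the guesses are right.
   Searching random keys is ~50/50, so the cmov form tends to win there;
   a very predictable condition favors the branch. */
int lower_bound(const int *a, int n, int key)
{
    int lo = 0, hi = n;                 /* search window [lo, hi)        */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < key)               /* may become a branch or a cmov */
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                          /* first index with a[lo] >= key */
}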
 

dullard

Elite Member
May 21, 2001
I think you have gone a bit too far with respect to APX. It simply introduces more conditional instructions that operate based on the status flags modified by preceding instructions. The benefit is that you save branch predictor buffer entries; the negative is that you introduce an explicit dependency that cannot be reordered around. In other words, if your condition is very predictable, stick to branches; if your condition is on the more random side, use conditional instructions.
That is how I interpret statements like this:
APX also adds to the x86 ISA’s predicated-execution capabilities, which should help compilers eliminate performance-sapping, hard-to-predict branches.
https://www.techinsights.com/blog/apx-biggest-x86-addition-64-bits

and
These enhancements expand the applicability of if-conversion to much larger code regions, cutting down on the number of branches that may incur misprediction penalties.
https://www.intel.com/content/www/u...ical/advanced-performance-extensions-apx.html

So, I explained predicated-execution / if-conversion.
 

Thunder 57

Diamond Member
Aug 19, 2007
The x86 vs ARM compiler difference discussion is actually well past my knowledge. And it is best left for another thread.

Agreed. All I'll say is EPIC certainly didn't end the compiler problem. They ended EPIC, and OoOE lives on.
 

MS_AT

Senior member
Jul 15, 2024
So, I explained predicated-execution / if-conversion.
I do not dispute this, but I just found it a bit too complex in relation to what APX provides in reality. That's all;)

It's an interesting conversation, but hasn't the jump to 16 GPRs and register renaming already given us a lot of those gains? I'd like to see x86 match ARM with APX, but I believe the returns will be limited.
It's hard to say really how willing people will be to recompile. Actually I am looking forward to APX as I need additional GPRs for my... AVX512 code ;) [memory addresses and loop control are held in GPRs in case somebody is wondering].
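For anyone curious, a minimal sketch (made-up kernel; assumes an AVX-512 capable CPU and compiler flags along the lines of -mavx512f): the vector math lives in ZMM registers, but every base pointer, index, and loop bound sits in a GPR, and with a few more streams, strides, and counters in flight, 16 GPRs run out fast.

Code:
#include <immintrin.h>
#include <stddef.h>

/* Made-up example kernel: out[i] = k * a[i] + b[i].
   ZMM registers hold the data; GPRs hold a, b, out, i, and n -- add a few
   more arrays, strides, or counters and the compiler starts spilling GPRs
   to the stack. */
void scale_add(const float *a, const float *b, float *out, size_t n, float k)
{
    __m512 vk = _mm512_set1_ps(k);                  /* broadcast k           */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {                  /* 16 floats per step    */
        __m512 va = _mm512_loadu_ps(a + i);         /* address a+i uses GPRs */
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_fmadd_ps(vk, va, vb));
    }
    for (; i < n; i++)                              /* scalar tail           */
        out[i] = k * a[i] + b[i];
}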
 

511

Diamond Member
Jul 12, 2024
I do not dispute this, but I just found it a bit too complex in relation to what APX provides in reality. That's all;)


It's hard to say really how willing people will be to recompile. Actually I am looking forward to APX as I need additional GPRs for my... AVX512 code ;) [memory addresses and loop control are held in GPRs in case somebody is wondering].
NVL Buyer spotted 😛
 

Cardyak

Member
Sep 12, 2018
Not really. Arctic Wolf, for example, will likely be the 12-wide x86 core that Jim Keller was working on. 12-wide issue by itself will only bring a few single-digit % gains, so that's just an enabler. Other than that, we don't know anything about it. Ticks like Darkmont we can speculate about much more easily. How much of Skymont could we have guessed from Gracemont and Crestmont? Nothing, really.
The wording "12-wide" is a little ambiguous. Indeed, early rumours indicate that Arctic Wolf will be 12-wide at decode, but that doesn't guarantee it will be 12-wide at rename stage.

If you look at Skymont, it's 9-wide at decode but only 8-wide at rename. This isn't necessarily a waste as it allows the decode to "overfill" and ensure the 8-wide machine is well utilized, but it still limits the overall design to being an 8-wide core.

I strongly suspect Arctic Wolf will be a similar affair, I'd bet good money that it will be 12-wide at decode stage but 10-wide for rename/allocate.
 

511

Diamond Member
Jul 12, 2024
I strongly suspect Arctic Wolf will be a similar affair, I'd bet good money that it will be 12-wide at decode stage but 10-wide for rename/allocate
Both Coyote and Arctic Wolf are 12-wide decode, and both use clustered decode.
Retirement is 16-wide in Skymont IIRC, which is massive IMO
 

LightningZ71

Platinum Member
Mar 10, 2017
It can easily be crazy wide after decode without ballooning the XTOR budget if most of the ways are quite simple.
 

511

Diamond Member
Jul 12, 2024
@MS_AT Intel has thrown everything at the problem with NVL: packaging / extra cache / extra cores / extra PCIe / integrated TB5, and they have fixed the shortcomings of ARL as well
 
  • Haha
Reactions: Thunder 57

DavidC1

Platinum Member
Dec 29, 2023
It can easily be crazy wide after decode without ballooning the XTOR budget if most of the ways are quite simple.
Skymont is ~30% larger than Crestmont iso-process if we exclude the FP increases, meaning the area increase is roughly 1:1 with the performance increase it got, which is very good. Based on their history I'm confident they'll get linear gains again, but it won't happen without innovation, which is what we cannot guess.

Going wide is a waste, especially without better branch prediction, so that's basically the ceiling on how wide they can go before they quickly hit diminishing returns.
A lot? Isn't it just more of the same? Or are you able to name at least 3 distinct features which are not about making something bigger? (Clustered decode was there, it just got bigger; distributed schedulers were there, they got bigger; more execution units, a bigger BTB and reorder buffer.) Actually, from memory, I think only Nanocode stands out as something that is new and not just bigger. Of course, I might be wrong.
Would we have been able to guess they'd have a 16-wide retire, which they said was an attempt to efficiently increase resources? Or that they'd double the ALUs, all simple ones, because it was "cheap to add"? What about having more stores than loads, which is also contrary to established expectations? Those are quite distinct. That's why I focus on the E-core team doing things efficiently, not just expanding without thinking.

The wording "12-wide" is a little ambiguous. Indeed, early rumours indicate that Arctic Wolf will be 12-wide at decode, but that doesn't guarantee it will be 12-wide at rename stage.
If you look at Skymont, it's 9-wide at decode but only 8-wide at rename. This isn't necessarily a waste as it allows the decode to "overfill" and ensure the 8-wide machine is well utilized, but it still limits the overall design to being an 8-wide core.

I strongly suspect Arctic Wolf will be a similar affair, I'd bet good money that it will be 12-wide at decode stage but 10-wide for rename/allocate.
The typical use of "12-wide" refers to the decode side. If they go 10-wide for rename/allocate, it's still an overall similar result, because it's still a substantial 25% increase over the predecessor. Oh, and Gracemont was 5-wide while Crestmont was 6-wide at rename/allocate. We got maybe 3% out of that, and Crestmont had a few more small changes too. In reality, decode is just an enabler, and performance-wise just one out of maybe a dozen high-level features that dictate performance.

We don't even have the full performance picture of the simple Tick+ cores in Panther Lake. There's no way we can guess what's going on in Arctic Wolf.
 

Fjodor2001

Diamond Member
Feb 6, 2010
  • Like
Reactions: lightmanek