Discussion Intel Binary Optimization Tool thread

511 · Mar 26, 2026

RPL clocks are sky high and the uncore is really fast in Raptor Lake, not to mention lower latency than any other modern processor.

poke01 · Mar 26, 2026

igor_kavinski said:
I'm pretty excited about Intel violating the trust of other prominent software developers.

As long as they don’t touch benchmarking software, I don’t care what Intel does to get themselves a win.

But it’s a messy way of gaining performance.

gdansk · Mar 26, 2026

511 said:
RPL clocks are sky high and the uncore is really fast in Raptor Lake, not to mention lower latency than any other modern processor.

Sure and in that manner quite unlike Zen 2. Yet it chews through code that Intel marketing says is "console optimized". A curious line they spin.

511 · Mar 26, 2026

gdansk said:
Sure and in that manner quite unlike Zen 2. Yet it chews through code that Intel marketing says is "console optimized". A curious line they spin.

That's only true for a handful of games 🤣 Better not to take marketing at face value. Also, from a uArch perspective Lion Cove is rather different from Skylake: the ports have changed in LNC (they are no longer unified), not to mention we now have an L0/L1/L2/L3 cache hierarchy. Golden Cove is not much changed in terms of cache hierarchy and ports (I don't mean the capacity, obviously).

MS_AT · Mar 27, 2026

poke01 said:
But it’s a messy way of gaining performance.

What I don't like is that they have probably made LLVM's BOLT (OSS) work on Windows but do not want to share the results with the rest of the community. (I am guessing, since their marketing does not want to give us real answers, so take it with a grain of salt.) Their current compiler is downstream of LLVM, which just makes it even more likely.

While I understand some bits of HWPGO might not work for others (the profile gathering), the other part of the implementation should be universal. I mean, it would then be AMD's/Qualcomm's/NVIDIA's problem to gather profiles for their CPUs. But that would also empower normal people to run it on their own, for their own software. The gains could be weak in that case, but at least this pseudo-secrecy black magic vibe from marketing would be gone.

Once again, I am not saying this is what they are exactly doing, just guessing based on what I understand from marketing messages. I can be wrong.
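To make the "universal part" concrete, here is a toy sketch of profile-guided code layout, the general kind of optimization BOLT is known for. It is not how iBOT or BOLT is actually implemented. The function names, sizes, and sample counts are made up for illustration, and the 4 KiB page size is an assumption.

```python
from collections import namedtuple

PAGE = 4096  # assumption: 4 KiB code pages

# Hypothetical profile: function name, size in bytes, sample count.
Func = namedtuple("Func", "name size samples")

def hot_pages(layout):
    """Count pages that contain at least one byte of a sampled (hot) function."""
    pages = set()
    offset = 0
    for f in layout:
        if f.samples > 0:
            first = offset // PAGE
            last = (offset + f.size - 1) // PAGE
            pages.update(range(first, last + 1))
        offset += f.size
    return len(pages)

def hot_first(layout):
    """Reorder so the most-sampled functions sit next to each other."""
    return sorted(layout, key=lambda f: f.samples, reverse=True)

original = [
    Func("init", 3000, 0),
    Func("update", 500, 900),
    Func("load_level", 6000, 0),
    Func("render", 700, 800),
    Func("shutdown", 2500, 0),
    Func("physics", 600, 700),
]

print(hot_pages(original), hot_pages(hot_first(original)))
```

With this made-up profile the hot code goes from being spread over three pages to sitting in one. Where the samples come from (the vendor-specific part) is independent of the reordering step.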

igor_kavinski · Mar 27, 2026

https://www.tomshardware.com/pc-components/cpus/intels-binary-optimization-tool-tested-and-explained-how-the-ibot-translation-delivers-up-to-18-percent-faster-gaming-performance-8-percent-on-average

Decent investigation with lots of data.

My takeaways:

250K is generally seeing more improvement from iBOT supported games than 270K. I think the 250K suffers due to only 6 P-cores and iBOT either makes those P-cores more performant or it makes the Skymonts performant enough to make up for the absence of two additional P-cores vs. 270K. This could've been tested by disabling E-cores but unfortunately, it's Tom's Hardware. Anand Lal Shimpi would definitely have delved deeper.

Also,

See the massive improvement 250K is experiencing with iBOT? I think that's the Skymont IPC improvement working its magic. Again, this should've been investigated.

The temps and power use also increase significantly in some cases, meaning the cores are working harder and spending less time idling. That doesn't always translate to a proportional increase in performance though. The clocks are also dropping a bit in a few cases. This could be due to rising temperatures and/or AVX2 codepaths causing localized hotspots which then need to be mitigated with downclocking. Just a guess.

Also, NOTICE the temps for 250K rising up to 5C in Far Cry and Cyberpunk. We know that the Skymonts can be harder to cool than the P-cores due to being denser silicon. So this lends credence to my theory that their IPC is getting boosted by iBOT trickery.

Nothingness · Mar 27, 2026

Too bad it's missing results of subtests of Geekbench.

511 · Mar 27, 2026

Nothingness said:
Too bad it's missing results of subtests of Geekbench.

here have this

Gigabyte Technology Co., Ltd. Z890M AORUS ELITE WIFI7 ICE vs Gigabyte Technology Co., Ltd. Z890M AORUS ELITE WIFI7 ICE - Geekbench

browser.geekbench.com

Nothingness · Mar 27, 2026

511 said:
here have this

Gigabyte Technology Co., Ltd. Z890M AORUS ELITE WIFI7 ICE vs Gigabyte Technology Co., Ltd. Z890M AORUS ELITE WIFI7 ICE - Geekbench

browser.geekbench.com

Thanks a lot! So it's not only large code bases that benefit, interesting.

Hulk · Mar 27, 2026

Let's look at one possible action IBOT may be performing and the ramifications.

Let's say IBOT determines that certain data is currently being evicted from cache and that it would be more beneficial for this data to remain in cache for a particular piece of software.

Now it would seem logical that this specific optimization would pay larger dividends in systems with a smaller cache, because that data would be less likely to be evicted in systems with a larger cache.

Furthermore, this would imply that these optimizations would be more beneficial to Lion Cove than to AMD's X3D parts with V-Cache.

Is this basic logic reasonable?
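One way to sanity-check that logic is a toy LRU cache model. This is only a sketch under stated assumptions: a fully associative LRU cache, a workload that alternates between a reused "hot" set and one-shot "cold" streaming data, and a hypothetical optimization (`keep_hot`) that keeps the cold data from displacing the hot set. Real caches and whatever IBOT actually does are more complicated.

```python
from collections import OrderedDict

def hot_hit_rate(capacity, hot_lines=64, cold_lines=64, rounds=10, keep_hot=False):
    """Fraction of hot-line accesses that hit in a toy LRU cache.

    keep_hot=True models the hypothetical optimization: cold streaming
    accesses bypass the cache instead of evicting hot lines.
    """
    cache = OrderedDict()
    hot_hits = 0
    next_cold = 0

    def touch(line):
        hit = line in cache
        if hit:
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
        return hit

    for _ in range(rounds):
        for i in range(hot_lines):
            hot_hits += touch(("hot", i))
        for _ in range(cold_lines):
            if not keep_hot:
                touch(("cold", next_cold))
            next_cold += 1                 # cold data is never reused
    return hot_hits / (hot_lines * rounds)

for capacity in (96, 256):
    base = hot_hit_rate(capacity)
    tuned = hot_hit_rate(capacity, keep_hot=True)
    print(f"capacity={capacity}: baseline={base:.2f} optimized={tuned:.2f}")
```

In this toy setup the 96-line cache goes from a 0% to a 90% hot hit rate, while the 256-line cache sits at 90% either way, which is consistent with the reasoning above. It says nothing about how large the effect is on real hardware.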

LightningZ71 · Mar 27, 2026

In general, in many games, there are things that the E-cores can be doing that don't require extreme low latency response. Offloading to those cores can reduce the context switches needed on the P-cores, making them more efficient for throughput. The performance difference in the above charts is almost exactly in line with the P-core max boost frequency difference between the two SKUs, meaning that having only 6 P-cores isn't really a holdup if there are sufficient E-cores to carry the background load. Games typically only have 1-2 threads/processes that are latency critical with respect to performance. They are starting to get more secondary threads that have a lot of general work to do, so they need a good core to complete that in a timely manner so their output is available to the latency critical threads, so there is a need for performant secondary cores as well. Then, there are the various housekeeping and pre-work threads that can all be done efficiently by the E-cores. It looks like the software is doing its job with respect to getting threads where they are supposed to be.

I have to wonder if the uncore fixes/improvements/improved timings are playing a crucial part in getting this software to work right? If the uncore was much slower, moving those other threads and their data around would probably be more painful.

dttprofessor · Mar 27, 2026

LightningZ71 said:
In general, in many games, there are things that the E-cores can be doing that don't require extreme low latency response. Offloading to those cores can reduce the context switches needed on the P-cores, making them more efficient for throughput. The performance difference in the above charts is almost exactly in line with the P-core max boost frequency difference between the two SKUs, meaning that having only 6 P-cores isn't really a holdup if there are sufficient E-cores to carry the background load. Games typically only have 1-2 threads/processes that are latency critical with respect to performance. They are starting to get more secondary threads that have a lot of general work to do, so they need a good core to complete that in a timely manner so their output is available to the latency critical threads, so there is a need for performant secondary cores as well. Then, there are the various housekeeping and pre-work threads that can all be done efficiently by the E-cores. It looks like the software is doing its job with respect to getting threads where they are supposed to be.

I have to wonder if the uncore fixes/improvements/improved timings are playing a crucial part in getting this software to work right? If the uncore was much slower, moving those other threads and their data around would probably be more painful.

It's the work of APO

igor_kavinski · Mar 27, 2026

LightningZ71 said:
Offloading to those cores can reduce the context switches needed on the Pcores, making them more efficient for throughput.

That's a good point. It could be tested by launching a game with higher than normal process priority on the 250K and seeing whether that improves performance, since the higher priority will minimize context switches on the P-cores.
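For anyone who wants to try that, a minimal sketch of launching a process with raised priority from Python. The `launch` and `high_priority_kwargs` helpers are hypothetical names for illustration. `subprocess.HIGH_PRIORITY_CLASS` is a Windows-only constant, so on other platforms this sketch simply falls back to the default priority.

```python
import subprocess
import sys

def high_priority_kwargs():
    """Extra Popen arguments that request above-default scheduling priority."""
    if sys.platform == "win32":
        # HIGH_PRIORITY_CLASS only exists in the subprocess module on Windows.
        return {"creationflags": subprocess.HIGH_PRIORITY_CLASS}
    # Elsewhere, raising priority usually needs elevated rights; keep the default.
    return {}

def launch(cmd, high_priority=False):
    """Start cmd, optionally with raised priority, and return the Popen handle."""
    extra = high_priority_kwargs() if high_priority else {}
    return subprocess.Popen(cmd, **extra)

if __name__ == "__main__":
    # Placeholder command; substitute the path to the game executable.
    proc = launch([sys.executable, "-c", "print('running')"], high_priority=True)
    proc.wait()
```

Running the same benchmark pass with and without `high_priority=True` would give the A/B comparison described above, assuming the game is started directly rather than through a launcher that spawns its own child process.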

igor_kavinski · Mar 27, 2026

Found some juicy papers.

https://cseweb.ucsd.edu//~voelker/pubs/etch-ntws97.pdf

https://ece.northeastern.edu/courses/ece3391/papers/ETCH.pdf

https://csslab-ustc.github.io/publications/2025/binary-opt.pdf

Schmide · Mar 27, 2026

igor_kavinski said:
Found some juicy papers.

https://csslab-ustc.github.io/publications/2025/binary-opt.pdf

This one is good. The others are a little dated. IMO

Momoka_ · Mar 27, 2026

I’m a bit curious whether iBOT utilizes Intel’s own ICPX.

CouncilorIrissa · Tuesday at 10:32 AM

Analyzing Geekbench 6 under Intel's BOT - Geekbench Blog

www.geekbench.com

gdansk · Tuesday at 10:37 AM

CouncilorIrissa said:
Analyzing Geekbench 6 under Intel's BOT - Geekbench Blog

www.geekbench.com

Wow, I didn't think they'd include autovectorization.
If AMD follows along in these shenanigans autovec may not even be a net win for Intel.

511 · Tuesday at 10:48 AM

This makes me question why they don't just submit their results to the official benchmark, since they can do so.

dttprofessor · Tuesday at 11:02 AM

APX ready!

regen1 · Tuesday at 11:15 AM

They tested this on a PTL SKU (U9 386H).
Flagging BOT results is appropriate, at least for now.

MS_AT · Tuesday at 11:21 AM

gdansk said:
Wow, I didn't think they'd include autovectorization.
If AMD follows along in these shenanigans autovec may not even be a net win for Intel.

It would be interesting to know whether they hand-tune the code or have some sort of algorithmic solution to the problem. Usually autovec leaves a lot of perf on the table.

CouncilorIrissa said:
Analyzing Geekbench 6 under Intel's BOT - Geekbench Blog

www.geekbench.com

A pity they focused only on the most shocking example. I was really curious what is behind the Clang improvements 😉

LightningZ71 · Tuesday at 11:35 AM

AutoVec is TYPICALLY better than no vectorization at all. For CPUs with a full rate implementation of AVX-512, even those suboptimal gains could be substantial.

I haven't dug into the details yet. I wonder if it's even taking older pre-AVX2 code and vectorizing as much of that as it can to AVX2?

gdansk · Tuesday at 11:39 AM

LightningZ71 said:
I haven't dug into the details yet. I wonder if it's even taking older pre-AVX2 code and vectorizing as much of that as it can to AVX2?

They suggest it's converting scalar code into vector code. I assume Intel didn't have someone do that by hand (what a waste that would be).
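For readers unfamiliar with what "converting scalar code into vector code" means structurally, here is a rough model in plain Python. It only imitates the shape of the transformation (a vector body, per-lane accumulators, a scalar tail). It is not real SIMD and not what BOT emits, and the lane count of 8 is an assumption standing in for eight 32-bit lanes of a 256-bit register.

```python
LANES = 8  # assumption: eight 32-bit lanes, as in a 256-bit AVX2 register

def scalar_sum(values):
    """The original loop: one element per iteration."""
    total = 0
    for v in values:
        total += v
    return total

def strip_mined_sum(values):
    """The same reduction in the shape a vectorizer would produce."""
    acc = [0] * LANES                      # one partial sum per lane
    full = len(values) - len(values) % LANES
    for base in range(0, full, LANES):
        for lane in range(LANES):          # stands in for one vector add
            acc[lane] += values[base + lane]
    total = 0
    for lane_total in acc:                 # horizontal reduction of the lanes
        total += lane_total
    for v in values[full:]:                # scalar tail for leftover elements
        total += v
    return total

data = list(range(1003))
print(scalar_sum(data), strip_mined_sum(data))
```

For integers the two versions always agree. For floating point the per-lane accumulation changes the order of additions and can change rounding, which is one reason compilers are conservative about vectorizing such loops on their own.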

igor_kavinski · Tuesday at 11:46 AM

They've been doing that since GB 6.3, so they've been at it for almost two years. I suppose that's when they got ARL Refresh silicon back from the fab or even NVL silicon!
