
Discussion Intel Binary Optimization Tool thread

Previous discussion that started in other threads:




Machine code layout optimization explanation: https://easyperf.net/blog/2019/03/27/Machine-code-layout-optimizatoins

iBOT probably inspired by LLVM's BOLT: https://github.com/llvm/llvm-project/tree/main/bolt
 

You have been offering the Application Optimization tool (APO) to improve the performance of apps and games since the 14th Gen Core. How does iBOT differ from APO?
Hallock: "First, let's talk about APO: that is a technology that works at the OS level. It controls how many threads a game uses and what priority they are given. Many games determine this themselves via a so-called affinity mask, but the underlying assumptions are not always correct. For example, a CPU with eight physical cores and eight virtual threads is then seen as a CPU with sixteen physical cores.
Because of that incorrect assumption, a lot of unnecessary work is started that the circuits in that CPU cannot process efficiently at all. Managing the overhead of all those threads subsequently causes the game to actually slow down. APO enables us to identify such cases and make the CPU respond better to the game.
However, APO cannot fix errors in the code at the architecture level. You need a different tool for that. Our Binary Optimization Tool really works at the execution level of the microarchitecture."
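To make the thread-count point concrete, here is a minimal sketch of the mistaken assumption Hallock describes. It is not how APO itself works; `pick_worker_count` and its parameters are hypothetical names used only for illustration.

```python
def pick_worker_count(logical_cpus: int, threads_per_core: int,
                      treat_logical_as_physical: bool) -> int:
    """Return how many worker threads a hypothetical game engine would spawn."""
    if logical_cpus <= 0 or threads_per_core <= 0:
        raise ValueError("CPU counts must be positive")
    if treat_logical_as_physical:
        # The mistaken assumption: every logical CPU is a full physical core.
        return logical_cpus
    # One worker per physical core instead.
    return max(1, logical_cpus // threads_per_core)


# Eight physical cores exposing sixteen logical CPUs, as in the interview example.
naive = pick_worker_count(16, 2, treat_logical_as_physical=True)
adjusted = pick_worker_count(16, 2, treat_logical_as_physical=False)
print(naive, adjusted)
```

With the naive assumption the engine starts sixteen workers on eight cores, which is the kind of oversubscription the interview says APO tries to identify.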
Intel Application Optimization was introduced alongside the 14th Gen Core processors and is enabled by default today.
Chynoweth: "To add something to what Rob said: I always say that APO optimizes hardware resources for the software and iBOT optimizes the software for the hardware. And the interesting thing is that a kind of gray area arises between those two, which, to be honest, we hadn't entirely foreseen when we started this. All sorts of interesting interactions emerge in that intermediate area."
Hallock: "So, APO and iBOT are different tools for different problems, on different layers of the software stack. Without those tools, gamers would end up in situations where somewhere between 5 and 25 percent of potential performance is locked behind an optimization choice that might not be right for the CPU the user actually has. We want to give that performance back to the user with APO and iBOT."
So how does Binary Optimization Technology optimize the software for the hardware?
Hallock: "The basic idea is that developers have to make choices when optimizing their software. This often happens based on a specific architecture or processor series; only well-funded developers can look at a lot of different CPUs in their lab. The architecture on which a program or game is optimized, which can sometimes even be console hardware or an Arm CPU, influences what kind of code is used and which instructions are used. Even if code is optimized for, say, a different x86 architecture, we can still run it, but that does have a major impact on performance.
With iBOT, Intel now has a tool to 'open' a compiled workload. We cannot extract anything from it, but we can streamline or reorder things so that it aligns better with the execution pipeline of modern Intel CPUs, for example, so that the machine code fills the cache better or utilizes the branch predictor better. With iBOT, we look at features within it that run slower than they could, to replace them on-the-fly with code that works better on Intel's x86 architecture."
Perhaps it would be good to briefly define exactly what 'machine code' is.
Hallock: "Machine code is an instruction at a very deep level. You are no longer looking at a programming language with context and logic, but at a kind of manual that tells the CPU what to do. For example: jump to memory address 0x00010 to retrieve a piece of data there."
Can you give a practical example of how iBOT can optimize that machine code?
Hallock: "Let's take a hypothetical example of a program that caches something. On the left (in the image below, ed.) you see a simple 'if-then' construct. If this small piece of data has a value lower than 10, the next line of code goes to an error handler. And if that is not the case, you do the normal work instead. The 'then' is therefore the exception.
The only problem is that many algorithms for caching and branch prediction simply take the next line of code—that then-statement—as a starting point. But if that code is used in only 1 percent of cases, or even in 0.01 percent of cases, then the CPU caches something and jumps to something that is almost never used. That is a waste of valuable cache space and precious processor time."
Hallock: "An example of how we can reorganize that with iBOT is to put the then statement in cold storage. You can move that to the L3 cache or even main memory, so that the actual working path is taken by default in the machine code.
So, the following happens: instead of first running the error handler, which fills the cache with things you probably won't use, you fill the cache with useful data that you *do* need. You then no longer discard important cache lines for a very exceptional scenario.
There are all kinds of examples like this in the field of caching, code hotspots, and branch prediction. But the idea is that we can restructure where and when the least used path ends up in the machine code, so that the processor can make better decisions as it runs through the code."
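The if-then example can be sketched as a toy layout model. This is only an illustration under stated assumptions: the block names, the byte sizes and the 64-byte line size are made up, and nothing here reflects how iBOT is actually implemented. It counts how many cache lines the common path touches before and after the rarely used error handler is moved out of the way.

```python
CACHE_LINE = 64  # assumed line size in bytes


def lines_touched(layout, sizes, path):
    """Count distinct cache lines fetched when executing `path` under `layout`."""
    addr = {}
    offset = 0
    for name in layout:          # place blocks back to back in layout order
        addr[name] = offset
        offset += sizes[name]
    lines = set()
    for name in path:            # collect every line each executed block spans
        start = addr[name]
        end = start + sizes[name] - 1
        lines.update(range(start // CACHE_LINE, end // CACHE_LINE + 1))
    return len(lines)


# Hypothetical basic-block sizes in bytes.
sizes = {"check": 16, "error_handler": 96, "normal_work": 48}
hot_path = ["check", "normal_work"]          # what runs almost every time

original = ["check", "error_handler", "normal_work"]   # 'then' sits inline
reordered = ["check", "normal_work", "error_handler"]  # 'then' moved to the end

print(lines_touched(original, sizes, hot_path))   # 3
print(lines_touched(reordered, sizes, hot_path))  # 1
```

In this toy case the hot path shrinks from three cache lines to one, which is the effect the interview describes: the cache holds code that is actually executed instead of a handler that almost never runs.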
If you look purely at the machine code, you simply don't know whether a branch is taken almost always or almost never in practice. So, you have to run and analyze the code to know what is happening, it seems to me.
Chynoweth: "Yes, there is a lot involved. We record every jump taken in the code (a branch in jargon, ed.) in what we call the last branch records. There we see which branches are taken, where they jump from and to, and whether each branch was accurately predicted by the branch predictor."
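As a rough illustration of what can be done with such records, the sketch below aggregates them per branch edge. The `(from_address, to_address, mispredicted)` tuple is a simplified, assumed shape; the interview does not describe the real record format or Intel's tooling, and `summarize_branches` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_branches(records):
    """Aggregate (from_address, to_address, mispredicted) tuples per branch edge."""
    stats = defaultdict(lambda: {"taken": 0, "mispredicted": 0})
    for src, dst, mispredicted in records:
        edge = stats[(src, dst)]
        edge["taken"] += 1
        if mispredicted:
            edge["mispredicted"] += 1
    # Add a mispredict rate so hot, badly predicted edges stand out.
    return {
        edge: {**counts, "mispredict_rate": counts["mispredicted"] / counts["taken"]}
        for edge, counts in stats.items()
    }


# Made-up sample: one edge taken four times (one mispredict), one taken once.
sample = (
    [(0x401000, 0x401040, False)] * 3
    + [(0x401000, 0x401040, True)]
    + [(0x401010, 0x402000, True)]
)
summary = summarize_branches(sample)
for (src, dst), counts in sorted(summary.items()):
    print(hex(src), "->", hex(dst), counts)
```

Edges that are taken often tell you which path is hot; edges with a high mispredict rate are candidates for the kind of restructuring discussed earlier in the interview.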
So, do you run a program on a sort of simulated processor, or on a real chip like the ones consumers have in their PCs?
Hallock: "That is exactly where you hit our advantage: we use a real processor with a real workload. Everyone does code profiling, but generic tools for that don't give you specific details about what happens *inside* the processor. For certain aspects of the microarchitecture, you are simply blind.
The in-house developed tool we use, Intel Hwpgo, uses information from a performance monitoring unit in the hardware. It provides information about specific events and gives, for example, the address information and those last branch records. This allows us to see exactly where the bottleneck is in the processor. We originally devised this method for monitoring performance to help developers debug, but with iBOT, we have also found a great application for it for users."
Chynoweth: "What is important is that with this hardware-based profiling, you can be absolutely certain that your analysis is correct. Previously, developers often used an instrumented binary with all sorts of extra logging. However, that slows down execution by at least a factor of three; I have even seen a factor of ten. How do you then know whether the results of your profiling are caused by your own debug features or by real inefficiencies in the code?
Our measurements have less than one percent overhead. Hwpgo is without a doubt a superior method of profiling. It also costs hardly any performance while you are doing it, and you can apply it to real binaries in a production environment."
What kinds of optimizations can you do with all that information?
Hallock: "The technology consists of various modules, which are best viewed as a family of tools that each intervene in architectural aspects of software in their own way – think of branching, caching, or code hotspots.
These modules are remarkably generic, because the problems they address recur throughout the code. Which module is deployed depends heavily on the context: which compiler was used, with which compiler flags, how long ago the code was built, what it was intended for at the time, and even which processor the developer had on their desk back then.
Exactly which modules we use differs per workload, but they all come from the same toolbox and have the same goal: identifying and resolving architectural bottlenecks in the code."
I noticed that while the provisional list of supported games includes some recent titles, such as Cyberpunk 2077, it still consists mainly of older games. That is apparently where the most potential lies. Is that because developers have started programming better, or because compilers are improving and have started using newer x86 features?
Chynoweth: "A bit of everything, actually. For iBOT, we primarily look at games that are popular but run inefficiently. We also apply it mainly to titles where we do not have the opportunity to collaborate with the developers. For Cyberpunk 2077, we previously worked with the developers on a major patch that yielded a performance gain of about 30 percent. But take Shadow of the Tomb Raider, a game that is no longer actively being developed. Yet, many people still play that game, and we know there are specific bottlenecks that stand in the way of better performance. In such cases, we deploy the Intel Binary Optimization Tool to improve that efficiency and achieve better frame times."
Furthermore, the focus so far has been almost exclusively on games. Is there also potential for iBOT in productivity software?
Hallock: "In theory, every category of software is eligible for binary optimization. But that immediately raises some more fundamental questions: is it really worth the effort? Is the performance gain significant enough? Is this a sensible use of engineering time? And how does the community view this – do they see it as something useful and positive, or not? Those are more philosophical considerations.
In practice, it doesn't matter that much whether it concerns creative applications, productivity software, or games: wherever there is room for optimization, we could apply binary optimization. That is ultimately our goal as well. At the same time, we realize that non-game workloads likely have a higher threshold, certainly given their reputation, both among users and in the media. That is why we are deliberately rolling out this technology cautiously and methodically, and participation remains opt-in.
We know how quickly the perception can arise of: 'Look, Intel turns on a feature and gets a higher benchmark score as a result.' We want to avoid that at all costs. We want to show that it is possible, and then work step by step towards a broader application. There is still enormous potential in this technology, but we are deliberately choosing a cautious approach so that the industry has time to understand what we are doing and it can develop gradually."
So, for example, you don't want to do Cinebench optimization, because people might see that as cheating?
Chynoweth: "It has to have a positive effect on the end user. Cinebench is a rendering benchmark, so let me take rendering as an example. We recently implemented Hwpgo together with Chaos Group, the developer of the popular renderer V-Ray. With optimizations similar to what we do with iBOT, we achieved an average performance gain of more than 15 percent. V-Ray customers, major film studios for example, benefited directly from that."
Maiorino: "That also shows that our engineers have built trust with partners like Chaos Group. Implementing a compiler update is no small feat; a lot can go wrong. Mike (Chynoweth, ed.) and his team have done an excellent job earning that trust and demonstrating that we want to unlock extra performance in close collaboration with our partners. That way, everyone benefits from that great performance gain."
You would think that heavy CPU workloads, such as rendering and video encoding, have already been optimized so extremely that there is little potential left. But you do see a future role for iBOT in those kinds of tasks, then?
Hallock: "If we take video encoders as an example, a large part of the optimization time in such projects is spent on making the encoder itself faster, such as during scene analysis and coding. In this regard, the focus lies primarily on the algorithmic optimization of the codec one intends to implement.
However, this does not necessarily mean that optimization has been specifically aimed at a particular CPU architecture. Developers want the encoder to perform well, but may not consider optimizations specifically targeted at a single architecture—and in open-source projects, they often do not feel a direct obligation to do so.
In such cases, therefore, there is certainly still room for improvement. A workload may already be intensively generically optimized, while there is still much to be gained from architecture-specific optimization."
Chynoweth: "It is precisely with codecs that we sometimes see our biggest performance gains, which makes this a particularly interesting domain for us."
In an earlier presentation, you said that iBOT is now starting with the Core Ultra 200 Plus CPUs, but will become a fixed pillar for performance improvements in future CPU generations from now on.
Hallock: "The first litmus test is actually the question: for which type of product is binary optimization useful? That will always be a product that is not bottlenecked by the GPU. For example, we do not apply it to Panther Lake 12 Xe, the configuration with a large integrated GPU. For an iGPU, it is very fast, but in many cases, it will be utilized to its maximum potential. In that case, there is little more to be gained from binary optimization; even if we make the CPU even faster, the GPU cannot suddenly produce more frames.
Panther Lake 4 Xe, with a less powerful iGPU, is specifically intended to be combined with a powerful discrete graphics card. Consequently, games in these systems will be bottlenecked by the GPU much less quickly, making it a more suitable scenario for iBOT. However, we do look at what makes sense per processor, which means the list of supported games can differ between, for example, Panther Lake for laptops and the Core Ultra 200 Plus CPUs for desktops.
Looking to the future, the rule of thumb is therefore: if the platform is designed to work with a dedicated graphics card, we will implement iBOT, in both laptops and desktops."
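The GPU-bottleneck argument can be put into a few lines of arithmetic. This is a deliberately simplified model, assuming the slower of CPU and GPU sets the frame time; the millisecond figures are invented for illustration and are not Intel's numbers.

```python
def fps(cpu_ms: float, gpu_ms: float) -> float:
    """Frames per second when the slower of CPU and GPU sets the frame time."""
    return 1000.0 / max(cpu_ms, gpu_ms)


# GPU-bound case (e.g. a fully loaded iGPU): a faster CPU changes nothing.
print(fps(cpu_ms=8.0, gpu_ms=12.0), fps(cpu_ms=7.0, gpu_ms=12.0))

# CPU-bound case (fast discrete GPU): the same CPU speedup shows up as extra fps.
print(fps(cpu_ms=8.0, gpu_ms=5.0), fps(cpu_ms=7.0, gpu_ms=5.0))
```

In the first case both results are identical, which matches the statement that the GPU "cannot suddenly produce more frames"; in the second case the CPU-side gain goes straight into the frame rate.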
But such a graphics bottleneck only occurs in games. How does that apply to optimizations for pure CPU workloads?
Hallock: "That is correct; for pure CPU applications, any product can in principle benefit from binary optimization. But we are under no illusions; people buy these CPUs primarily for gaming, and that is why the ratio of games to other software in the support list is currently about 15 to 1."
You already mentioned that iBOT is not enabled by default; for now, it is opt-in. What makes you hesitant?
Hallock: "History is full of examples of 'turn on a feature and get more performance'. But that same history also shows that this did not always end well.
We understand very well that binary optimization might look like those earlier attempts at first glance. That is precisely why we felt it was essential to introduce this technology as conservatively as possible. We know that iBOT differs fundamentally from earlier attempts that may have looked superficially similar, but did something completely different in substance. Those earlier examples entailed risks for consumers and for public trust. That is why we are choosing to keep iBOT opt-in for the time being, so that people can get used to it gradually and fully understand what it does. We might enable it by default in the long run, but first we want to see how it is received. We felt that was the right, ethical approach.
You could, of course, say: this is great for gaming, so why isn't it just on by default? But we want to play it safe. In our opinion, that is better for ourselves, for the public, and for the industry as a whole."
Is that also due to the possibility that there are still side effects that you did not see in your tests, but which users might encounter? Or is it primarily a matter of perception and acceptance?
Hallock: "Especially that second point. Let me be clear: we don't collect telemetry, so we don't know who turns on iBOT. What we *do* do is pay very close attention to the sentiment: to your reporting, reactions on your site, comments on YouTube, Reddit, X, Bluesky, and so on. We monitor all of that closely to gauge what people think about it. Based on that, we can potentially adjust our choices or, conversely, confirm them. We have yet to see that.
Personally, I expect users will receive it well, by the way. After all, it's free extra performance. But at the same time, we understand that this is very different from how CPU optimization has been common practice over the past fifty years. Now we suddenly have an extra roadmap. We already had a hardware roadmap that leads to increasingly better gaming performance, but now we also have a software roadmap that improves gaming performance, at the same pace or perhaps even faster.
That is quite remarkable. There is nothing else like this. But when you introduce a major change, you can't change the world from one day to the next. That's not how it works. So we want to do this step by step."
Finally, what should our readers remember about iBOT?
Chynoweth: "I would say that iBOT is truly a game changer in a sense, because it is going to change how software and hardware interact. The software now runs more harmoniously with the hardware. And with APO, the hardware is in turn optimized for the software. Looking ahead, we are even already exploring additional capabilities within the CPU itself.
We are working directly with the CPU designers to see what information the chip can return to us in real time, on a nanosecond scale, and how we can use that not only to increase the average frame rate but also to address frame consistency and other issues affecting the user experience. We are going to focus on that even more aggressively in the future."
Maiorino: "From my side, I mainly want to emphasize that this is a real fps gain. These aren't fake frames, no weird AI things happening off-screen. These are real extra fps, and we know that really counts for gamers."
Hallock: "What I mainly want to convey to the readers of Tweakers is that we are not removing a single piece of work that the binary is supposed to execute. Everything that was in the source code is still happening. Nothing is being skipped. But we can achieve a lot of performance gain by ensuring that the code reaches the processor more compactly and efficiently.
iBOT is therefore a very pure form of performance optimization, because you keep everything as it was intended, but simply execute it better. That is what we are really proud of: that we can make a workload run faster, while showing respect for the original, by keeping all original requests intact."
I can imagine that it is sometimes very tempting to simply delete a piece of code if you see that something is being calculated unnecessarily or could easily have been skipped.
Hallock: "Yes, but it just stays in, because it was simply in the original. That is the right thing to do."
 
So, my first question is: Would this provide any benefit for an air-gapped computer with no internet connection? There are quite a few special purpose machines out there that are kept off the internet until specific updates need to happen. It can't download replacement binaries in that situation. Would there be any help at all from it still being installed?
 
Would there be any help at all from it still being installed?
Probably in future with Nova Lake when/if they expand support to include APX optimizations. But if they do it on a per application basis instead of globally, that will be a bummer. Ideally, they should allow the user to auto-profile their own EXE and apply some optimizations automatically, even if it results in no more than 5% speedup. Better than nothing.
 
So I've done some thinking on this. Here is the scenario I glean.

intel's cache relationship to amd has morphed into

Code:
intel        amd
L0 (48 KB)   L1 (48 KB)
L1 (192 KB)
             L2 (1 MB)
L2 (2-3 MB)
             L3 (32 MB)
L3 (36 MB)
memory       memory

In this relationship amd's L2 is a lot closer to intel's L1. intel's L2 sits half way between amd's L2 and L3. Intel's L3 goes over fabric and is relegated to fast ram like status.

So if code that is optimized for amd's large L2 and L3 caches runs on intel's caches, it spills fast to the lower and slower L2 and L3. Additionally, if you provision data for intel's L1 and amd's L2 respectively, the intel 192 KB spills out a lot faster than amd's 1 MB. These two caches also have similar 12-16 way associativity, so intel's narrower way window adds to this cache thrashing.
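A quick way to play with the capacities from the table above is a lookup of the first level a working set fits into. This is a sketch under loud assumptions: it uses the poster's figures (taking 3 MB for the "2-3 MB" Intel L2) and ignores associativity, latency, sharing and inclusion policy entirely.

```python
# Capacities in KB, taken from the table in this post; Intel L2 assumed at 3 MB.
INTEL = [("L0", 48), ("L1", 192), ("L2", 3 * 1024), ("L3", 36 * 1024)]
AMD = [("L1", 48), ("L2", 1024), ("L3", 32 * 1024)]


def first_level_that_fits(working_set_kb, hierarchy):
    """Return the name of the smallest cache level that can hold the working set."""
    for name, capacity_kb in hierarchy:
        if working_set_kb <= capacity_kb:
            return name
    return "memory"


for size_kb in (32, 150, 512, 2048):
    print(size_kb, first_level_that_fits(size_kb, INTEL), first_level_that_fits(size_kb, AMD))
```

A 512 KB working set, for example, has already left intel's 192 KB L1 and sits in the larger L2, while on amd it still fits in the 1 MB L2, which is the spill pattern described above.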

So what can intel do?

After doing some research, intel has some tools to flag how processes use the ways in the cache.

Intel RDT (Cache Allocation Technology) has bits where you can set a Class of Service (CLOS) bitmask. This allows you to let one thread use part of the ways (cache slices) while another can use the others.

In a system where the system may be streaming data, it would naturally evict older data down the hierarchy. If the RDT sets the right flags, it can keep some data resident that may be called less frequently but incurs a higher penalty when evicted.
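To show what such a way bitmask looks like, here is a small sketch that builds two non-overlapping masks for a hypothetical 16-way cache. The 16-way figure, the 4/12 split and the helper names are assumptions for illustration; the code only computes masks and does not program any hardware. Intel's capacity bitmasks are required to be a contiguous run of set bits, so that is checked as well.

```python
def way_mask(first_way: int, num_ways: int, total_ways: int = 16) -> int:
    """Build a bitmask with `num_ways` contiguous bits starting at `first_way`."""
    if num_ways < 1 or first_way < 0 or first_way + num_ways > total_ways:
        raise ValueError("requested ways do not fit in the cache")
    return ((1 << num_ways) - 1) << first_way


def is_contiguous(mask: int) -> bool:
    """True if the set bits of `mask` form a single unbroken run."""
    if mask == 0:
        return False
    lowest_bit = mask & -mask
    shifted = mask // lowest_bit           # drop trailing zero bits
    return (shifted & (shifted + 1)) == 0  # all ones iff adding 1 clears every bit


# Hypothetical split: 4 ways for a streaming thread, 12 ways kept for resident data.
streaming = way_mask(0, 4)
resident = way_mask(4, 12)
print(hex(streaming), hex(resident))
```

Because the two masks do not overlap, data brought in by the streaming class could not evict lines held in the ways reserved for the other class, which is the "keep some data resident" idea from the paragraph above.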

This is all just musing.
 
Why would Zen be better at running Skylake-era code than Arrow Lake? Seems like a self-own from Intel more than anything. It doesn't even have a lake lineage to its name but AMD seemingly found hardware solutions to problems Intel can only fix with "optimizing" binaries used in benchmarks.

Very odd.
 
Why would Zen be better at running Skylake-era code than Arrow Lake? Seems like a self-own from Intel more than anything. It doesn't even have a lake lineage to its name but AMD seemingly found hardware solutions to problems Intel can only fix with "optimizing" binaries used in benchmarks.

Very odd.
I get the point for games, many games if not all are optimised for AMD because of console. But content creation apps like the Adobe suite and Resolve. Intel already does better in those apps than AMD.

My take, and it is a stretch, is that Intel knows Zen6 is going to be hard to beat without resorting to software and is preparing this for Nova Lake.
 
I get the point for games, many games if not all are optimised for AMD because of console.
No they're not lol. They use the same damn compilers and they don't even bother to use a preset like znver. Most target Skylake because that's the actual common subset. The only console that runs Zen and ships in volume doesn't even support full rate AVX2 like Skylake.

What pray tell causes Raptor Lake to outperform Arrow Lake on Skylake like code too? Is it the evil game consoles?
 
Oh and when they do use this for NVL as intended it may have the side effect of causing even fewer developers to target APX. Because why bother? Intel will do it if they need it.

Really I think this may end up encrustifying Skylake as the ultimate, final x64 feature set in the Windows ghetto. Why target anything past Skylake? It works on everything even Qualcomm.
 
What is the meaning of full rate? AFAIK zen2 has full 256b load store and does not split instructions.
Full rate may be the wrong phrase. How about less parallel? Not all Zen 2 FPUs are equal
https://substack-post-media.s3.amazonaws.com/public/images/4aaee399-f14a-4f83-9ef5-eeacdf229d0c_619x238.png
 
but the xbox series X/S is full
No one on earth optimizes for Xbox. How many exclusives? Not even those poor companies being violated by Microsoft Game Studio ownership focus on that.

The whole console game optimization is a red herring. Why does Raptor Lake compete well with the 9950X non 3D? Just think for a bit.
 
Honestly it’s only cause they pump Raptor with so much power.
Because there isn't optimization for Zen. Software overwhelmingly targets the common denominator (Skylake or even Haswell). Arrow Lake's arrow to the knee was entirely its own cache, worse than its predecessor in games. Not a function of game developers targeting Zen. And Zen's "path to victory" was bolting a huge cache on top (or bottom). Arrow Lake's "path to Raptor Lake-like performance" is rewriting parts of games used in benchmarks. And World of Warcraft (which is good, we'll allow that).

Admittedly, it is funny to say RPL is pumped with power when it's a 10/7nm part that ends up competitive with 3nm. That's the only way you'll get high performance out of antiquated technology. But the same software techniques used here would apply to RPL as well and yet it wasn't necessary.
 
In the same vernacular, I don't think a flag is going to make much of a difference. You basically get sse/avx2/vex avx2 128. You could argue for instruction decoding bottlenecks but zen2 has a 4k op cache.

I highly doubt any of these flags make much of an actual cpu difference. Though my money would be on using 128b paths keeps the cpu side power efficient allowing for more power budget to go to the GPU.
 
Admittedly, it is funny to say RPL is pumped with power when it's a 10/7nm part that ends up competitive with 3nm. That's the only way you'll get high performance out of antiquated technology. But the same software techniques used here would apply to RPL as well and yet it wasn't necessary.
Node advantages don’t really matter if you are behind in design. I know, what a shocking thing to say, but if your uncore is crap, who cares if you used N2 or A14.
 
Node advantages don’t really matter if you are behind in design. I know, what a shocking thing to say, but if your uncore is crap, who cares if you used N2 or A14.
The node disadvantage manifested as a performance advantage though. I'm not sure what your point is. Arrow Lake is now getting software tomfoolery that Raptor Lake didn't, yet they end up within 1% of each other in game performance summaries after this. Zen optimizations don't enter into that comparison. It's all Intel baby. Whatever they did to Arrow Lake was detrimental to games and that's entirely on Intel, not developers suddenly targeting Zen.
 
The node disadvantage manifested as a performance advantage though. I'm not sure what your point is. Arrow Lake is now getting software tomfoolery that Raptor Lake didn't, yet they end up within 1% of each other in game performance summaries after this. Zen optimizations don't enter into that comparison. It's all Intel baby. Whatever they did to Arrow Lake was detrimental to games and that's entirely on Intel, not developers suddenly targeting Zen.
okay then why is Intel saying games are optimised for console x86.

 
okay then why is Intel saying games are optimised for console x86.
Because they are lying. It's what marketing does. Raptor Lake didn't beat Arrow Lake by being more like Zen. Rather by being less like Zen than Arrow Lake is. They made a chip worse at running Skylake code and have to talk in circles to sell it at a measly $300.
Again, games don't perform well on Zen 5. Why do you even give an ounce of credence to Intel marketing crap? Zen 5 only does well when AMD bolts a big cache to it.

Go look at CnC's Zen 5 vs ARL game summary where ARL regularly achieves higher IPC (mind you in the game code world this is typically only ~1 IPC or so) in good loops. The problems occur when it has to touch L3.
 
Because they are lying. It's what marketing does. Raptor Lake didn't beat Arrow Lake by being more like Zen. Rather by being less like Zen than Arrow Lake is. They made a chip worse at running Skylake code and have to talk in circles to sell it at a measly $300.
Again, games don't perform well on Zen 5. Why do you even give an ounce of credence to Intel marketing crap? They only do well when AMD bolts a big cache to it.
I don’t. Just to get other people’s thoughts.

So they essentially privatised BOLT and made software optimisations only work on select Intel CPUs now and in the future on Nova lake etc. good sounds like a big old Nvidia move to me.
 
So they essentially privatised BOLT and make software optimisations only work on select Intel CPUs now and in the future on Nova lake etc. good sounds like a big old Nvidia move to me.
I'm not sure. It was most likely conceived to solve the APX software problem of old code spilling registers to the stack when the CPU has registers to spare. And it seems more general than BOLT, it apparently doesn't need relocation sections intact based on them distributing a GB6 profile without consent. But for now, when the instruction sets are not different, it may be similar to BOLT in practice.
 