Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

jamescox · Jul 6, 2021

LightningZ71 said:
Or...

They make one larger CCD that has two x eight core CCX with their L3 caches aligned such that they have a common long, central axis. The VIAs can be placed in the middle like with Zen3, and a single VCache die can be constructed to align with that axis. That would allow a single cache die to stack on the CCD and connect to both CCX units.

I actually suggest that they could design these high density CCDs with half the L3 per CCX, at 16MB. Then, a four stack of VCache can be placed on it with four layers of 32MB of cache over each ccx, giving 144MB of cache for each eight core CCD. That's still plenty.

Both of these mess up the one CCD and one cache die for everything that allows for good reuse and modularity. We don’t even know if the CCDs will be direct mounted on the package and connected with IFOP or if they will use some stacking tech (interposer or LSI). Making a 16 core die seems very reasonable considering that Zen 2 has the shared IFOP connection between the 2 CCX and Zen 3 moves it to one side. It may actually still have 2 ports even in the zen 3 design. Some version of Zen 4 or 5 might place another CCX on the die. Rearranging things to get the caches all on one side of the CCX to allow a single larger cache die seems unnecessary and a lot of work. I don’t know why they wouldn’t be able to just place 2 cache die stacks and appropriate silicon spacers.

LightningZ71 · Jul 6, 2021

I agree, it does mess up the "one CCD to rule them all" strategy that they have been using... Except that they aren't doing that, are they? They currently produce Dali, the Zen2 CCD (long support life obligations), the Zen3 cc'd, Lucienne, Cezanne, and, on a related family, the Xbox and Playstation SOCs. I think that I'm also forgetting at least one or two chips for integrated systems in there.

That's a LOT of dies for a company that was at death's door a few years ago.

We know, for a fact, that the physical interconnect on EPYC is a pain for AMD, and that it's pushing things to get what they have now. Do we REALLY believe that AMD is gooing to try to connect 16 CCDs to an IOD to achieve a 128 core EPYC SKU in the next 18 months? I doubt that. I also doubt that they will go with four interconnected IODs with four CCDs connected to each. It makes a TON more sense to just design a high density CCD with 16 cores that can neatly fit a four stack of L3 cache die on top of it over the reduced L3 on the CCD. Then, for less dense, but higher clocked solutions, they can use a smaller and less expensive 8 core CCD for those products. That's two different CCDs that both maximize wafer production and ASP per mm^2.

moinmoin · Jul 7, 2021

Underfox dumped a long thread of AMD patents yesterday:

https://twitter.com/x/status/1412502324644298756

Some notable ones I saw among these:

Automatic CPU usage optimization, not preferring only cores based of off the maximum frequency during a worst case workload as is the case currently. https://www.freepatentsonline.com/20210191778.pdf
Automatic parts testing, essentially adapting to degradation as it happens. To me this seems to be a consequential expansion of the SCF that already monitors many properties. https://www.freepatentsonline.com/20210182163.pdf
Automatic memory overclocking, would be the logical next step after the inclusion of the more flexible memory controller introduced in Renoir. Automatic RAM OC would be an application well suited to desktops. https://www.freepatentsonline.com/20210200456.pdf

Mopetar · Jul 7, 2021

The memory overclocking certainly seems interesting. I wonder if it could eventually be broadened to a more general automatic tuning to eke out that last little bit of performance since ramping up clock speeds doesn't always yield the best results.

DisEnchantment · Jul 7, 2021

moinmoin said:
Underfox dumped a long thread of AMD patents yesterday:

But he missed what could arguably be one of the most significant ideas since virtual addressing

ENHANCED PAGE INFORMATION CO-PROCESSOR

Abstract
A processing system includes a primary processor and a co-processor. The primary processor is couplable to a memory subsystem having at least one memory and operating to execute system software employing memory address translations based on one or more page tables stored in the memory subsystem. The co-processor is likewise couplable to the memory subsystem and operates to perform iterations of a page table walk through one or more page tables maintained for the memory subsystem and to perform one or more page management operations on behalf of the system software based the iterations of the page table walk. The page management operations performed by the co-processor include analytic data aggregation, free list management and page allocation, page migration management, page table error detection, and the like.

https://www.freepatentsonline.com/y2021/0182206.html

What it does is to offload page table walking to a fixed function unit. This is very important with the advent of 57 bit virtual addressing and CXL.
Your pages are very small (for historical reasons) but the system memory can go to peta bytes due to CXL memory pooling for example, this means there are tons of PTEs to scan.
This thing steps in and takes away the job from the OS (i.e. CPU core doing this) and perform the look up of the page tables and load the page table into the TLBs
Basically eliminating the OS job of handling the TLB miss and going through the the PTEs

In my eyes it should be transparent to the OS. The OS manages the PTEs but the TLB miss will not be seen by the OS instead the PTE will be loaded into the TLBs directly by this fixed function unit.
Super critical for applications which address large amounts of memory, hence the 57bit addressing. Another useful Data center feature.

Cool move AMD.

Just a sidenote, Intel is demoing the RAR feature for SPR for aiding TLB shootdowns when OS need to invalidate virtual to physical address mapping in multi core systems.

Intel Sapphire Rapids To Have Experimental "RAR" Feature - Phoronix

www.phoronix.com

This AMD's solution should be able to support the SPR feature as well since it can basically modify the TLB of the CPU cores.

Now watch everyone copying this in a few years, especially server cores

Update for more clarity:
I think @andermans explanation is a more likely scenario for this usage

My memory of this topic is quite rusty

andermans · Jul 7, 2021

DisEnchantment said:
But he missed what could arguably be one of the most significant ideas since virtual addressing

https://www.freepatentsonline.com/y2021/0182206.html
View attachment 46848

What it does is to offload page table walking to a fixed function unit. This is very important with the advent of 57 bit virtual addressing and CXL.
Your pages are very small (for historical reasons) but the system memory can go to peta bytes due to CXL memory pooling for example, this means there are tons of PTEs to scan.
This thing steps in and takes away the job from the OS (i.e. CPU core doing this) and perform the look up of the page tables and load the page table into the TLBs
Basically eliminating the OS job of handling the TLB miss and going through the the PTEs

The TLB miss stuff already gets handled by dedicated fixed function page walkers, not by the OS. (There were some architectures where the OS did it. IIRC MIPS and SPARC). There is some stuff that the OS indeed does (like flushing the TLB when the pagetables get changed, often called TLB shootdown, which the Intel RAR optimize)

The coprocessor here seems used to for e.g. tracking the LRU lists for demand paging and swapping memory out (or for migrating between NUMA nodes), which typically is done by the OS. Most references to the TLB in the patent seem to refer to the TLB shootdown process. Seems like some of the duties of this new coprocessor can do TLB shootdowns behind the back of the OS.

Curious how AMD would integrate this into Linux/Windows.

Bigos · Jul 7, 2021

Wait, what? Hardware TLB is a thing since 80386, unlike on some architectures where the hardware engineers thought that software TLB makes any sense. The hardware walks the page table by itself on TLB miss.

The OS almost never walks the page table structure from top to bottom. There are some structures for other way around (struct page) and VMAs describe the mappings in a more granular fashion, but that's it. When the context is switched, the only thing the kernel does is to switch the page table pointer and set a separate ASID (address space ID) so that the current TLB entries are invalidated.

The only thing that could be enhanced is to prefetch the new TLB entries from the new page table up-front, but that is not easy (how would you know which entries to load?). This would also be implemented by a prefetch system and not "a separate coprocessor".

Unless I have completely misunderstood the premise, this doesn't make any sense to me.

DisEnchantment · Jul 7, 2021

Bigos said:
Wait, what? Hardware TLB is a thing since 80386, unlike on some architectures where the hardware engineers thought that software TLB makes any sense. The hardware walks the page table by itself on TLB miss.

The OS almost never walks the page table structure from top to bottom. There are some structures for other way around (struct page) and VMAs describe the mappings in a more granular fashion, but that's it. When the context is switched, the only thing the kernel does is to switch the page table pointer and set a separate ASID (address space ID) so that the current TLB entries are invalidated.

The only thing that could be enhanced is to prefetch the new TLB entries from the new page table up-front, but that is not easy (how would you know which entries to load?). This would also be implemented by a prefetch system and not "a separate coprocessor".

Unless I have completely misunderstood the premise, this doesn't make any sense to me.

The page walk is handled by the CPU automatically, but this is done by the CPU no fixed function units

Bigos · Jul 7, 2021

DisEnchantment said:
The page walk is handled by the CPU automatically, but this is done by the CPU no fixed function units

What is the difference between "CPU" and "fixed functional units"? Isn't page walker, which is part of the CPU, a fixed functional unit?

If you meant that page walk is performed by CPU instructions then that is wrong on every architecture that employs hardware TLB, so the entirety of x86 among other architectures.

I much more likely would buy andermans' explanation that this is meant to partially invalidate TLB of sibling cores without invoking an IPI (inter-processor interrupt). I still don't understand though why would a "coprocessor" be needed for that, though.

Maybe the key difference is whether the whole cache hierarchy (L1, L2, L3) is being used or only part of it. Maybe the coprocessor has its own L1 cache, or something like that. That might actually make some sense to attach a separate L1 cache to the page table walker, but calling the whole solution a coprocessor is strange. But we are talking about patents, they always have strange wording.

DisEnchantment · Jul 7, 2021

I think @andermans explanation is the most likely usage scenario for this.

Bigos · Jul 7, 2021

Ok, after reading a few first paragraphs this sounds like allowing OS to use the page table walker (now called "a coprocessor") in more ways than just to fill TLB entries automatically. The OS will thus offload some tasks that are performed per-page to be done in background by the coprocessor which it will synchronize with from time to time.

This is arguably better than the current scheme when the OS has to switch to kernel mode and fiddle with page flags to employ things like checking which pages are used, etc. The cost is a vast complication of the OS kernel that now will need to work with a heterogenous system with twice as many CPUs where half of them perform specialized tasks only.

Sounds interesting, but AMD would need a lot of kernel development effort in order to implement a use of that.

scineram · Jul 7, 2021

They also should move to 16 or 64 KB pages like a sane platform.

jamescox · Jul 7, 2021

scineram said:
They also should move to 16 or 64 KB pages like a sane platform.

You can use 2 MB and 1 GB pages, but the OS support still seems to be a mess. The hardware has supported larger page sizes for a long time though. I have been hitting issues with transparent huge pages (2 MB) on centos 6 and 7. The problem is that they get fragmented into 4K pages and the defrag doesn’t seem to be effective. I don’t know if it had been improved significantly in later kernels.

jamescox · Jul 7, 2021

LightningZ71 said:
I agree, it does mess up the "one CCD to rule them all" strategy that they have been using... Except that they aren't doing that, are they? They currently produce Dali, the Zen2 CCD (long support life obligations), the Zen3 cc'd, Lucienne, Cezanne, and, on a related family, the Xbox and Playstation SOCs. I think that I'm also forgetting at least one or two chips for integrated systems in there.

That's a LOT of dies for a company that was at death's door a few years ago.

We know, for a fact, that the physical interconnect on EPYC is a pain for AMD, and that it's pushing things to get what they have now. Do we REALLY believe that AMD is gooing to try to connect 16 CCDs to an IOD to achieve a 128 core EPYC SKU in the next 18 months? I doubt that. I also doubt that they will go with four interconnected IODs with four CCDs connected to each. It makes a TON more sense to just design a high density CCD with 16 cores that can neatly fit a four stack of L3 cache die on top of it over the reduced L3 on the CCD. Then, for less dense, but higher clocked solutions, they can use a smaller and less expensive 8 core CCD for those products. That's two different CCDs that both maximize wafer production and ASP per mm^2.

I would agree that it is entirely plausible that they would make a separate 16-core die, although stacking might also be a possibility.

If the layout is similar to Zen 3, with cache in the middle, then a single die 16-core part could very easily be made by just mirroring the cores and cache on each side of the IFOP link for maximum design reuse. Having a mirrored CCX may interfere with using the same cache chip though. Other option would be no mirroring and the second CCX just has some longer connections to get to the IFOP link. Cache chips are the same with 2 used per die.

For what you suggest, they would probably redesign the base chiplet to put cache on the edge. It might still require mirroring. They would probably have a specialized cache chip used only in this high end product.

Some solution using interposers, LSI, or actual 3D stacking may be more likely than multiplying the number of chip types used (multiple different CCD and cache chips). They probably want to make an HPC device that includes HBM, so you have to fit that in somehow also. That may be done with LSI. They could possibly just connect multiple CCD together with LSI and then use one link to the IO die.

Adding TSVs to the IFOP die area to allow at least a 2 high cpu core stack seems like a good option. It would only be for extreme core count at low clocks due to thermal issues. They could presumably still stack the standard cache die on top. Just like all Zen 3 have TSVs for cache chips, they would only make use of them on certain versions.

edit: not sure how they would handle only having the cache chip on the top die. Perhaps they would pass through some TSVs to allow both to connect or the super high core count version just doesn’t get stacked caches. Right now they have single core optimized where you get all 8 CCD for the full 256 MB if L3 cache, but with only 1, 2, or 3 cores active per CCX. The F-parts also have higher clocks since a smaller number of cores get the entire CCD power budget. It would make sense that the super high core count version just gets 32 MB, or whatever it is on Zen 4.

Tuna-Fish · Jul 7, 2021

scineram said:
They also should move to 16 or 64 KB pages like a sane platform.

It's not possible. The 4kB page size is assumed by so much legacy software, that the moment it's raised all backwards compatibility goes out the window. At that point, might as well switch away from x86 entirely.

The same is true for the 64B cache line size. They are just part of the x86 spec, they can never be changed.

zir_blazer · Jul 7, 2021

Bigos said:
What is the difference between "CPU" and "fixed functional units"? Isn't page walker, which is part of the CPU, a fixed functional unit?

If you meant that page walk is performed by CPU instructions then that is wrong on every architecture that employs hardware TLB, so the entirety of x86 among other architectures.

I much more likely would buy andermans' explanation that this is meant to partially invalidate TLB of sibling cores without invoking an IPI (inter-processor interrupt). I still don't understand though why would a "coprocessor" be needed for that, though.

Maybe the key difference is whether the whole cache hierarchy (L1, L2, L3) is being used or only part of it. Maybe the coprocessor has its own L1 cache, or something like that. That might actually make some sense to attach a separate L1 cache to the page table walker, but calling the whole solution a coprocessor is strange. But we are talking about patents, they always have strange wording.

Not directly related since I'm not up-to-date, but everything Paging related is managed by the MMU (Memory Management Unit), which is built-in into the CPU but should be considered as a specialized fixed function unit. In x86, the MMU has been part of the CPU since its introduction in the 80286 (It had a built-in MMU with a Segmentation Unit. The 80386 added a Paging Unit on top of it. Both could be used simultaneously in 32 Bits Protected Mode, until 64 Bits Long Mode deprecated the original Segmentation Unit, so you can do only Paging), albeit in other vendors platforms the MMU was a dedicated Coprocessor (Like the Motorola MC68451).
The MMU actually has its own Cache, as the TLB fills that role, and each CPU architecture usually has a specific amount of slots for cacheable entries of each supported Page size. If I recall correctly, in PAE and Long Mode each Page entry in a Page Table was 8 Bytes in size, so you can easily guess how much Cache memory the MMU has according to how many slots for Page entries it has.

The MPU (Memory Processing Unit) that they are proposing is a sort of external accelerator for the CPU's MMU Coprocessor. Somehow TLB Cache coherency comes to my mind or anything else that may require to coordinate a stupid amount of Cores scattered around multiple Processors.

jamescox · Jul 7, 2021

Tuna-Fish said:
It's not possible. The 4kB page size is assumed by so much legacy software, that the moment it's raised all backwards compatibility goes out the window. At that point, might as well switch away from x86 entirely.

The same is true for the 64B cache line size. They are just part of the x86 spec, they can never be changed.

The page size seems to be an issue, at least in Linux, but for the OS, not user mode applications. On Centos6, the transparent huge pages would get fragmented into 4K pages and the defrag was trying to copy pages around to free up enough 4K pages to make a 2 MB page. That was causing ridiculously long memory allocation delays and it often would not manage to actually get a 2 MB page back, it would just give up and fill the allocation with 4K pages. Part of the issue was swapping. The swap system only supported 4K pages, so swapping out a 2 MB page resulted in 2 MB pages all getting fragmented into 4K pages. I don’t know if later versions of the OS fixed this. We just switched to centos 7. We are leaving huge pages enabled, since they do increase performance, but disabling the defrag since it causes too long of delays. The performance will degrade somewhat over time as the pages get fragmented, at least until the machine is rebooted.

It seems like this should almost all be OS stuff though, not anything in user space. The cache line size shouldn’t be very visible in user space or even OS space as page size is. I don’t see why it would matter that much if they did something like increasing the cache line to 128-bytes. Modern systems are probably going to prefetch more than 1 consecutive 64-byte cache line anyway. Do you have any links to article indicating what the issues would be for user space applications?

We are getting to where a lot of machines have hundreds of GB of memory if not multiple TB, so managing it with 4K pages is getting to be ridiculous. It seems like they need to have mode where 2 MB pages are the default.

Doug S · Jul 7, 2021

Just because we have hundreds of gigabytes or more DRAM in some systems doesn't mean that much larger page sizes make sense. 8K, 16K, sure. But 64K - let alone making 2 MB the default would be very wasteful due to fragmentation.

Not only of filesystem mappings, but also stuff like per process mappings. Do you really want to waste almost all of 2 MB for every stack in every process? The data segment of every process?

x86 provides larger pages, so you get the benefit of it where it can make sense without taking all the hits where it is stupid.

Tuna-Fish · Jul 8, 2021

jamescox said:
The cache line size shouldn’t be very visible in user space or even OS space as page size is. I don’t see why it would matter that much if they did something like increasing the cache line to 128-bytes. Modern systems are probably going to prefetch more than 1 consecutive 64-byte cache line anyway. Do you have any links to article indicating what the issues would be for user space applications?

... The cache line size is immediately visible to any multi-threaded software because of false sharing. That is, if you for example have 4 threads count something, you cannot have them all update a single counter because that would cause the cache line containing it to bounce between cores every time it's written. So the best practice is to use one counter per thread, and pad the counter size to 64 bytes so that you know each has it's own cache line, then merge the counts in the end. Every single synchronization primitive and every lock-free data structure out there assumes 64-byte lines. If that assumption is wrong, it will cause massive false sharing and contention in multithreaded workloads.

As to the page size, a lot of software uses things like circular buffers where the buffer is remapped next to itself in order to eliminate the corner cases, and other stuff like that related to mappings. They cannot move to larger smallest page size without failing. Moving to wider use of THP is an option.

JoeRambo · Jul 8, 2021

jamescox said:
You can use 2 MB and 1 GB pages, but the OS support still seems to be a mess. The hardware has supported larger page sizes for a long time though. I have been hitting issues with transparent huge pages (2 MB) on centos 6 and 7. The problem is that they get fragmented into 4K pages and the defrag doesn’t seem to be effective. I don’t know if it had been improved significantly in later kernels.

We have found the best strategy with THP is using as new kernel as possible during validation ( and then "freezing" it obv. ), keeping THP on madvice default setting, so it does not do crazy stuff on it's own.
And then explicitely using THP + up front allocation in our custom programs and JVM instances. When it works it is amazing, esp with medium sized JVM heaps ( think tens of GBs) that are busy with allocation and garbage collection. Even better when bound to NUMA node with numactl.

-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages -XmxNNg -XmsNNg ( where NN is desired size of heap in GB )

Tells newish ( since 11 or so) JVM to advice kernel to use THP on heap setup, uses fixed heap and immediately allocates NN GB of virtual memory and touches it all to make physical memory allocation for all of it in 2MB THP pages.

Some good memory saving in page tables and performance gain due to lower TLB misses and reduced cache footprint of page tables backing memory to be had for free.

Vattila · Jul 8, 2021

Tuna-Fish said:
[...much software has hard dependency on memory page size and cache line size, so we're stuck with these sizes for compatibility reasons...]

Is it so hard to query the system for the memory page size and cache line size?

c++ - detecting the memory page size - Stack Overflow
c++ - Programmatically get the cache line size? - Stack Overflow

It is sad that bad software is impeding progress.

That is, if you for example have 4 threads count something, you cannot have them all update a single counter because that would cause the cache line containing it to bounce between cores every time it's written.

Intense use of shared memory and locks, including false sharing due to cache line sharing, probably is a big reason why so many (poorly written) games and applications perform badly. Then gamers and PC enthusiasts blame the hardware for "high latency" between cores. That's my suspicion, anyway.

It is sad that so much effort has to be spent in hardware to make bad software run well. But programming hardware optimally is hard, so this is a fact of life, I guess.

LightningZ71 · Jul 8, 2021

My programming background is admittedly quite out of date, and never experienced the fun of living in VMs. Might it be a useful approach to have the bare metal and the hypervisor speak in very large page sizes and abstract that down to smaller pages for the client VMs? I realize that there will be some memory overhead with this, but, from the 10,000 foot view, it seems to me that the hypervisor could handle dynamically allocating and reallocating large pages to and from the VMs (in dynamically sized ones) and represent those pages to the VM as a collection of smaller pages? The client program wouldn't care, as it sees the smaller pages that it wants, and the hypervisor just updates the pages that get touched as needed. Kind of like storage block sub allocation?

I suspect that I'm overlooking a massive performance hit from the translation and constant rewriting of large pages...

Hans de Vries · Jul 8, 2021

DisEnchantment said:
But he missed what could arguably be one of the most significant ideas since virtual addressing

https://www.freepatentsonline.com/y2021/0182206.html

The crucial thing is here "Page Migration" from one memory to the other.

Preferably by a special purpose hardwired co-processor on the IO-die.

It does not make much sense to include HBM2e memory without this. The co-processor monitors the access bandwidth to individual pages and determines if these pages should be moved from Main Memory (using DDR5 DIMMS) to Main Memory using HBM2e or back.

The OS would set a number of general control parameters but not be involved in any real time migration actions, that would be a nightmare.

https://www.freepatentsonline.com/20210182206.pdf

For those who talk about a HBM2e cache: Realize that you need about 12% extra memory for the cache tags. So a 32 GByte HBM2e cache needs ~4 GByte of tags.

jamescox · Jul 8, 2021

Tuna-Fish said:
... The cache line size is immediately visible to any multi-threaded software because of false sharing. That is, if you for example have 4 threads count something, you cannot have them all update a single counter because that would cause the cache line containing it to bounce between cores every time it's written. So the best practice is to use one counter per thread, and pad the counter size to 64 bytes so that you know each has it's own cache line, then merge the counts in the end. Every single synchronization primitive and every lock-free data structure out there assumes 64-byte lines. If that assumption is wrong, it will cause massive false sharing and contention in multithreaded workloads.

As to the page size, a lot of software uses things like circular buffers where the buffer is remapped next to itself in order to eliminate the corner cases, and other stuff like that related to mappings. They cannot move to larger smallest page size without failing. Moving to wider use of THP is an option.

While I don’t doubt that is how it works, that seems like a terrible solution to add a dependency on something that is not supposed to be visible at that level of abstraction.

jamescox · Jul 8, 2021

Doug S said:
Just because we have hundreds of gigabytes or more DRAM in some systems doesn't mean that much larger page sizes make sense. 8K, 16K, sure. But 64K - let alone making 2 MB the default would be very wasteful due to fragmentation.

Not only of filesystem mappings, but also stuff like per process mappings. Do you really want to waste almost all of 2 MB for every stack in every process? The data segment of every process?

x86 provides larger pages, so you get the benefit of it where it can make sense without taking all the hits where it is stupid.

If they could have multiple page sizes and have everything play nice then that is obviously the best solution. That doesn’t seem to work in the current implementation. I thought that Apple went to default 64K page sizes, which might be a good intermediary. In researching the THP issues (long stalls on allocation, up to 90 seconds due to defrag or compact operations), a lot of people seem to recommend just disabling them completely. I wouldn’t say that I know enough about the workings of the system to have an informed opinion; I just note that they don’t really seem to work for our application.

Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Senior member

Senior member

Platinum Member

Diamond Member

Diamond Member

Golden Member

Member

Senior member

Golden Member

Senior member

Golden Member

Senior member

Senior member

Senior member

Senior member

Golden Member

Golden Member

Senior member

Diamond Member

Golden Member

Golden Member

Senior member

Platinum Member

Senior member

Senior member

Senior member