After a few days of testing, here's what I've discovered:
The most essential thing is that the CPU is initialized WITHOUT a microcode. Allegedly it is possible to initialize the CPU with an extremely old microcode version, but so far I haven't been able to find such a version (hence allegedly). Microcode version 0x1F (06/03/2014) is already "new" enough to prevent this exploit from working. Since each and every motherboard BIOS is supplied with a microcode present (for obvious reasons), initializing the CPU without a microcode mandates that the microcode is completely removed from the BIOS binary. This naturally involves modifying the BIOS and flashing it, which in some cases can be a little tricky.
After testing all of the different microcodes I could find, I've found that there are rather large differences between them. The most important thing is that Intel appears to have no direct or indirect means to completely prevent this exploit from working. Technically they can reduce the "yield" (clocks) in certain workloads, but not prevent it completely, as it is too late once the CPU has already been initialized. Newer microcode builds generally contain workarounds for errata, and because of that it is generally recommended to use the newest build available. When using this exploit you'll need to decide whether you want the highest possible performance in all workloads, possibly at the expense of reliability, or slightly lower performance at the best known reliability (i.e. with the most recent microcode update).
Haswell was the first "wide" core from Intel (256-bit FP). In order to conserve power, the Power Management Unit (PMU) power gates the upper 128 bits of the FP when 256-bit instructions are not being executed. Sometime between August and September of 2014 Intel changed the behavior of the Turbo on Haswell. Previously the Turbo behavior was identical regardless of whether the upper 128 bits of the FP were executing or not (i.e. same clocks for 128-bit and 256-bit workloads). In the microcode released in September 2014 the Turbo behavior was changed significantly, from static to workload dependent. In this microcode and all the newer ones the Turbo clocks are exactly the same for 128-bit workloads as before, but significantly lower for 256-bit workloads. On my CPU the difference is 400MHz.
The newest microcode version for the Haswell-E/EP/EX/EN production stepping (CPUID 0x306F2) is version 0x39 (10/07/2016). This microcode can be used for this exploit, however it will result in lower yield (clocks) than the earlier ones. This microcode is highly recommended if you are satisfied with a more modest boost, or require maximum reliability (professional use). This microcode also has an additional advantage on systems which lack both the "Power Limit" and "CPU telemetry feature" (SVID) options in the BIOS. Version 0x39 is one of the few versions which doesn't have the bug I call the "LFM bug". The best way to describe the "LFM bug" is this: when you use this exploit, load a newer microcode in flight and then try adjusting any of the CPU parameters (frequency, voltage, power limits, etc.), the CPU will lock to the LFM state (typically 800MHz).
I personally ended up using microcode version 0x27 (08/08/2014), and this is the version which offers the best performance. This version still features the static Turbo behavior (same for 128/256-bit workloads) and has some of the most critical Haswell-Ex errata (such as TSX) already fixed.
Additionally there appear to be some Turbo rules which are core configuration dependent and completely fixed.
These apply on my Haswell-E HCC, but they might be different on other variants:
- >= 10 cores == Maximum Turbo Ratio available
- >= 12 cores == Maximum Turbo Ratio - 100MHz
- >= 14 cores == Maximum Turbo Ratio - 200MHz
- >= 16 cores == Maximum Turbo Ratio - 400MHz
- >= 18 cores == Maximum Turbo Ratio - 500MHz
This means that when 0x27 microcode is used, I can run my 2699 at 3.6GHz (1-10 cores), 3.5GHz (with 12 cores), 3.4GHz (with 14 cores), 3.2GHz (with 16 cores), 3.1GHz (with 18 cores), regardless of the workload.
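For anyone who wants to plug in their own numbers, here is a minimal sketch of the rules above as a lookup. The function and table names are made up for illustration, the 3600MHz default is simply my 2699's maximum Turbo, and treating each ">= N cores" line as a lower bound (so 11 cores falls under the 10-core rule) is my literal reading of the list, not something verified for odd core counts.

```python
# Offsets (in MHz) below the maximum Turbo clock, keyed by the
# ">= N cores" thresholds from the list above, highest threshold first.
TURBO_OFFSETS_MHZ = [(18, 500), (16, 400), (14, 200), (12, 100), (10, 0)]


def turbo_clock_mhz(active_cores, max_turbo_mhz=3600):
    """Return the expected Turbo clock for a given number of enabled cores.

    Assumption: each rule is a lower bound, and fewer than 10 cores
    get the full maximum Turbo clock.
    """
    if active_cores < 1:
        raise ValueError("active_cores must be at least 1")
    for threshold, offset in TURBO_OFFSETS_MHZ:
        if active_cores >= threshold:
            return max_turbo_mhz - offset
    return max_turbo_mhz


if __name__ == "__main__":
    for cores in (10, 12, 14, 16, 18):
        print(cores, "cores:", turbo_clock_mhz(cores), "MHz")
```

On other variants the thresholds and offsets may well be different, so treat the table as something to edit rather than as a fact.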
Since the microcode can be updated in flight, controlling the microcode version in Windows might be slightly harder.
For Windows 7 - 8.1 (including their server variants), update KB3064209 must be uninstalled if it is present on the system. This is a microcode update, which contains microcode version 0x2E for Haswell-Ex.
Windows 10 meanwhile is distributed with microcode version 0x36. To remove it, the file named "mcupdate_GenuineIntel.dll" found in the System32 folder must be renamed so that the system no longer finds it. Note that I haven't tested this procedure personally, since I'm still using Windows 7.
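The rename itself could be scripted roughly like this. This is an untested sketch: the function name and the ".bak" suffix are my own choices, and on a real system the file is normally protected, so expect a permission error unless you run elevated and have sorted out ownership of the file first (not covered here).

```python
from pathlib import Path


def disable_os_microcode(system32_dir, name="mcupdate_GenuineIntel.dll"):
    """Rename the OS microcode DLL so the system no longer finds it.

    Returns the new path, or None if the DLL was not there to begin with.
    The ".bak" suffix is an arbitrary choice; any other name works too.
    """
    src = Path(system32_dir) / name
    dst = src.with_name(name + ".bak")
    if not src.exists():
        return None
    if dst.exists():
        raise FileExistsError(f"{dst} already exists, refusing to overwrite")
    src.rename(dst)
    return dst


if __name__ == "__main__":
    # Typical location; adjust if Windows is installed elsewhere.
    print(disable_os_microcode(r"C:\Windows\System32"))
```

Renaming instead of deleting keeps the change easy to undo: rename the file back and reboot.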
For Linux, using a specific microcode version should be quite well documented elsewhere.
The microcode in Windows can be updated with a driver released by VMware:
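Whichever way you load it on Linux, it is worth checking which revision the cores actually ended up with. On x86 the kernel reports it per core in the "microcode" field of /proc/cpuinfo, so a small sketch like the following can confirm it. The function names are made up for illustration, and 0x27 is just my preferred version.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    pattern = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", re.MULTILINE)
    return {int(m.group(1), 16) for m in pattern.finditer(cpuinfo_text)}


def all_cores_at(cpuinfo_text, wanted):
    """True only if every core reports exactly the wanted revision."""
    return microcode_revisions(cpuinfo_text) == {wanted}


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        text = f.read()
    print("revisions seen:", sorted(hex(r) for r in microcode_revisions(text)))
    print("all cores at 0x27:", all_cores_at(text, 0x27))
```

If more than one revision shows up, or the set is empty, the update was not applied the way you intended.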
https://labs.vmware.com/flings/vmware-cpu-microcode-update-driver
Here are version 0x27 & 0x39 microcodes for Haswell-Ex (0x306F2) in VMware driver / Linux compatible format:
https://1drv.ms/u/s!Ag6oE4SOsCmDhFnET3uw9wHeV4EA
Rename the desired version to microcode.dat, and proceed as instructed by VMware.
Personally I gained around 28% more performance with this exploit.
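Before loading anything, you may want to double check that the file you renamed really is the version you think it is. The sketch below assumes the usual text layout of these .dat files: comma-separated hex dwords, with comment lines starting with "/" or ";". It also relies on the microcode update header as Intel documents it, where the second dword is the revision, the third is the date in BCD (mmddyyyy) and the fourth is the CPU signature. The function name is made up, and I have not checked it against every file floating around.

```python
import re


def parse_dat_header(dat_text):
    """Decode revision, date and CPU signature from a text-format microcode file."""
    words = []
    for line in dat_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("/", ";")):
            continue  # skip blank lines and comment lines
        words.extend(int(w, 16) for w in re.findall(r"0x[0-9a-fA-F]{1,8}", line))
        if len(words) >= 4:
            break  # the first four dwords are all that is needed
    if len(words) < 4:
        raise ValueError("fewer than four header dwords found")
    _header_version, revision, date_bcd, signature = words[:4]
    # The date is stored as BCD digits, so printing it as hex gives mm/dd/yyyy.
    month = date_bcd >> 24
    day = (date_bcd >> 16) & 0xFF
    year = date_bcd & 0xFFFF
    return {
        "revision": revision,
        "date": f"{month:02x}/{day:02x}/{year:04x}",
        "signature": signature,
    }


if __name__ == "__main__":
    with open("microcode.dat") as f:
        info = parse_dat_header(f.read())
    print(f"revision 0x{info['revision']:X}, date {info['date']}, "
          f"CPUID 0x{info['signature']:X}")
```

If the file holds several updates back to back, this only reads the first one.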