Ryzen: Strictly technical

Discussion in 'CPUs and Overclocking' started by The Stilt, Mar 2, 2017.

  1. The Stilt

    The Stilt Golden Member

    Joined:
    Dec 5, 2015
    Messages:
    1,216
    Likes Received:
    1,415
    This is by no means a full-blown review. It just provides some of the more in-depth information, along with some test results.


    - CCX - Compute Complex. Consists of four Zen cores and a shared 8MB L3 cache.
    - PState - Performance State. Specifies the CPU multiplier and voltage when residing in the state.
    - Zeppelin - Codename of the die design used in Summit Ridge (AM4), Snowy Owl (SP4) and Naples (SP3) Zen-based CPUs.
    - dLDO - Digital low-dropout voltage regulator.
    - XFR - Extended Frequency Range.
    - MACF - Maximum all core frequency.
    - MSCF - Maximum single core frequency.
    - ACXFRC - All core XFR ceiling.
    - SCXFRC - Single core XFR ceiling.
    - SMU - System management unit.

    SMU – The master of puppets

    Due to the seamless integration of the on-die system management unit (SMU, similar in functionality to PMU on Intel), there are quite a few differences compared to the previous AMD desktop microarchitectures when it comes to overclocking.

    At stock, Ryzen has all of the power management features enabled, and the SMU runs the whole operation and is in charge of everything. These power management features include various power, current and thermal limiters, voltage controllers and power gating features.

    All of these can be completely ignored, if you have no plans to overclock the CPU.

    For overclocking purposes the engineers at AMD have included a special mode (the "OC Mode"), which disables all of the limiters, voltage controllers and protections (except the CPU thermal protection) upon activation.

    The "OC Mode" is automatically activated when the user raises the base frequency (P0 PState) of the CPU. The SMU indicates the activation of the "OC Mode" by sending the code "0C" to the diagnostic display (Port 80) of the motherboard.

    Understanding the different CPU frequency states (PStates), their voltages and especially the actual effective voltage is harder than ever before with Zeppelin. Unlike with the older designs (15h family) the boosted PStates (Turbo & XFR) are completely invisible.

    Because of that, they are officially called "Shadow PStates". This means that unlike with the previous designs these PStates are not defined in the standard MSR registers and cannot be modified (or even seen) by the user. The only way the user can verify their presence is to see them actually firing (i.e. from the actual effective frequency & voltage).

    Understanding the voltages specified for the standard PStates can be confusing as well. That's because in the normal operating mode (i.e. "non-OC") the SMU controls the voltages automatically through the voltage controllers.

    For example, the P0 PState might specify 1.37500V voltage, while the actual effective voltage during the residency in this state is 1.26250V or slightly higher. This is not a glitch, but the normal operation of the CPU. Basically, the voltage specified in the MSR is just the upper limit and the SMU will automatically add a dynamic negative offset to this value, reducing the actual effective voltage. The amount of the negative offset varies depending on load and the temperature. For the tested sample the offsets were -120mV & -144mV for the two highest base PStates (3.6 & 3.2GHz).
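    As a small worked example of the relationship described above, the following sketch computes the offset implied by the two example voltages in the text (1.37500V specified, 1.26250V effective). The function name is hypothetical and the code only does the subtraction; it does not read any MSR or SMU telemetry.

```python
def implied_offset_mv(msr_voltage_v: float, effective_voltage_v: float) -> float:
    """Offset (in mV) between the PState voltage in the MSR and the voltage
    actually observed. A negative result means the SMU is undervolting."""
    return round((effective_voltage_v - msr_voltage_v) * 1000.0, 2)


# Example values taken from the text: P0 specifies 1.37500V, effective is 1.26250V.
offset = implied_offset_mv(1.37500, 1.26250)
print(f"Implied dynamic offset: {offset} mV")
```

    The result is of the same order as the -120mV and -144mV offsets quoted for the tested sample; the exact value depends on load and temperature.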

    When the "OC Mode" is activated, the SMU disables the voltage controllers, which among other things disables the automatic voltage offsets. This can create the illusion that power consumption increases heavily because of the "OC Mode" itself. While that is technically accurate, it is more a consequence than the actual cause. The vast majority of the increased power consumption comes from the now disabled automatic negative voltage offsets, which causes the actual CPU voltage to increase by anywhere between 50 and 150mV. Because of this behavior, it is advised that the user doesn't increase the CPU voltage right away when overclocking, but only upon actual demand (as usual).

    One of the major downsides of the "OC Mode" is that upon activation both Turbo and XFR are disabled as well. Basically, this means that unless you are able to reach at least the default MSCF / XFR frequency on all cores, you will essentially be losing single threaded performance compared to the stock configuration.

    The XFR

    The XFR (Extended Frequency Range) is essentially an extension or enhancement to the standard CPB (Turbo) algorithm. In scenarios where all of the various limiters (power, current and thermal) have margins, the CPU is allowed to boost above its nominal base and boost speeds. Just like the standard CPB algorithm, the XFR has separate clock ceilings for a single core and all core operation.

    For example, for the 1800X SKU the clock configuration is as follows: 3.6GHz all core frequency (MACF), 4.0GHz single core frequency (MSCF), 3.7GHz maximum all core XFR ceiling (ACXFRC) and 4.1GHz maximum single core XFR ceiling (SCXFRC).

    The number of XFR bins (n x 100MHz) might vary between the different SKUs, however for the 1800X model there is a single XFR bin available for both all core and single core operations. In typical consumer workloads, the CPU will generally be able to reside in the XFR states (3.7GHz / 4.1GHz) constantly, however in certain specialized workloads (such as Linpack or Prime95) the frequency usually decreases towards the base frequencies (3.6GHz / 4.0GHz).
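    The relation between the base ceilings and the XFR ceilings can be written down in a couple of lines. This is only an illustrative sketch: the function name is made up, and the 1800X figures (3.6GHz MACF, 4.0GHz MSCF, one 100MHz bin) are the ones quoted above.

```python
XFR_BIN_MHZ = 100  # size of one XFR bin, per the "n x 100MHz" description


def xfr_ceiling_mhz(base_ceiling_mhz: int, xfr_bins: int) -> int:
    """XFR ceiling = regular ceiling (MACF or MSCF) plus n XFR bins."""
    return base_ceiling_mhz + xfr_bins * XFR_BIN_MHZ


# 1800X values from the text: MACF 3.6GHz, MSCF 4.0GHz, a single XFR bin.
macf_mhz, mscf_mhz, bins = 3600, 4000, 1
acxfrc_mhz = xfr_ceiling_mhz(macf_mhz, bins)  # all core XFR ceiling
scxfrc_mhz = xfr_ceiling_mhz(mscf_mhz, bins)  # single core XFR ceiling
print(acxfrc_mhz, scxfrc_mhz)
```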

    The base-clock (BCLK)

    Overclocking the base clock (BCLK) on the AM4 platform is possible, but generally not recommended. This is due to its frequency relations with other interfaces, such as PCIe. Unlike with Intel's more recent CPUs, there is no asynchronous mode (straps / gears) available, which would allow stepping down the PCIe frequency at certain intervals. The PCIe frequency relation is fixed, and therefore it increases at the same rate as the BCLK. Gen. 3 operation can generally be sustained up to a BCLK of ~107MHz; higher speeds will usually require forcing the links to either Gen. 2 or Gen. 1 mode.
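    To make the fixed relation concrete, here is a rough sketch that reports how far PCIe is pushed for a given BCLK. It assumes the usual 100MHz nominal BCLK, and it treats the ~107MHz figure from the text as a hard threshold purely for illustration; real limits vary by board and device. The function names are hypothetical.

```python
NOMINAL_BCLK_MHZ = 100.0   # assumed nominal base clock
GEN3_LIMIT_MHZ = 107.0     # approximate Gen. 3 limit quoted in the text


def pcie_overclock_percent(bclk_mhz: float) -> float:
    """PCIe runs at a fixed ratio to BCLK, so it is overclocked by the same percentage."""
    return (bclk_mhz / NOMINAL_BCLK_MHZ - 1.0) * 100.0


def gen3_likely_sustainable(bclk_mhz: float) -> bool:
    """Rule of thumb only: above ~107MHz the links usually need Gen. 2 or Gen. 1."""
    return bclk_mhz <= GEN3_LIMIT_MHZ


for bclk in (100.0, 105.0, 110.0):
    print(f"BCLK {bclk:.0f}MHz: PCIe +{pcie_overclock_percent(bclk):.1f}%, "
          f"Gen. 3 likely OK: {gen3_likely_sustainable(bclk)}")
```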

    Unstable PCIe can cause various issues, such as system crashes, data corruption (M.2 SSDs), graphical artifacts and various kinds of other undefined behavior.

    The internal voltage regulation (dLDO)

    Zeppelin is the first design in which AMD has extensively utilized integrated voltage regulators. Unlike the fully integrated voltage regulator (FIVR) used in Haswell and Broadwell CPUs, AMD's regulator implementation isn't based on ultra-high speed switching circuitry. The integrated voltage regulators in Zeppelin are ultra-high efficiency digital low-dropout (dLDO) type of regulators. Most of the different domains (cores, caches, data fabric, etc.) have their own dLDOs and they can all be controlled individually.

    Despite the presence of the dLDOs, consumers can ignore them completely. This is because in the consumer parts most of the dLDOs (all except some of the minor domains) are permanently placed in bypass mode. This means that the actual regulators are disabled and all of the voltage regulation takes place on the motherboard, just like on the previous generation CPUs and APUs.

    The frequency relations of the CCX

    In terms of the internal die frequency relations, Zeppelin is quite different from the previous designs. The core, L1 and L2 cache speeds are permanently linked together as usual, however unlike with the previous designs the L3 cache now operates at core speed as well (i.e. full speed). Since the L3 cache is shared between the cores within the same CCX, the L3 frequency is synchronized with the currently highest clocked core of the CCX it belongs to. In normal conditions, all of the cores within a CCX operate at the same speed, or alternatively are power gated.

    The structure of the CCX sets a few rules that one should know before changing the settings from stock. Each of the four cores within a CCX must be running at the same frequency (i.e. reside in the same PState) or be power gated. While this is the official truth, the rule doesn't fully apply in practice. It is entirely possible to command the individual cores within the same CCX to different PStates, however the results in many cases are not what was originally expected. This is due to the internal frequency relations of the CCXs.

    While each of the cores and their full speed L1 & L2 caches can be clocked independently, the shared L3 cache frequency is linked to the currently highest clocked core speed within the CCX at all times. Because of that, if the cores within the CCX don't share a common frequency, there will be a delta between the requested and the actual frequency, and the size of that delta depends on the frequency difference.

    The effective CPU multiplier consists of two components: CPUFID and CPUDFSId. The CPUFID is an integer value ranging from 16 to 255. The CPUDFSId is a floating-point value between 1 and 6. Due to the natural divider of 8 for the CPUDFSId, its adjustment step is always 0.125 (1/8). The effective multiplier is produced with the following formula: ((CPUFID / (CPUDFSId / 8)) / 4). Note that the CPUDFSId term in the formula is the raw value counted in eighths (8 for a divider of 1.0), so CPUFID = 144 at a divider of 1.0 gives (144 / (8 / 8)) / 4 = 36.0x.
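    The formula can be checked with a few lines of code. This is a sketch under one assumption: for the formula to reproduce the 36.0x example given in the text (CPUFID = 144, divider 1), the CPUDFSId has to be entered as its raw count of eighths (8 for a divider of 1.0, up to 48 for 6.0). The function name and the range checks are illustrative, and not every raw step is necessarily valid on real hardware.

```python
def effective_multiplier(cpu_fid: int, cpu_dfs_id_raw: int) -> float:
    """Effective multiplier = ((CPUFID / (CPUDFSId / 8)) / 4).

    cpu_fid:        integer, 16..255 (per the text)
    cpu_dfs_id_raw: divider in eighths, 8..48, i.e. 1.0..6.0 in 0.125 steps
                    (assumption: raw encoding, see the lead-in)
    """
    if not 16 <= cpu_fid <= 255:
        raise ValueError("CPUFID must be in 16..255")
    if not 8 <= cpu_dfs_id_raw <= 48:
        raise ValueError("raw CPUDFSId must be in 8..48")
    return (cpu_fid / (cpu_dfs_id_raw / 8)) / 4


print(effective_multiplier(144, 8))  # divider 1.0
print(effective_multiplier(144, 9))  # divider 1.125
```

    The second call shows why the divider matters: the same CPUFID yields a lower multiplier once the divider is stepped up by one eighth.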

    In cases where the cores within a CCX are clocked differently, calculating the effective multiplier is somewhat more complex. If all of the different PStates have the same CPUDFSId value, the effective multiplier can be calculated with the following formula: Target core CPUFID / (1 + ((highest core CPUFID - target core CPUFID) / highest core CPUFID)). For example, if the highest core multiplier is 36.0x (CPUFID = 144 & CPUDFSId = 1) and the target multiplier for the other cores is 32.0x (CPUFID = 128 & CPUDFSId = 1): 128 / (1 + ((144 - 128) / 144)) = 115.2 (28.8x). Further rules and limitations may apply, depending on the used CPUDFSId values and the actual frequency.
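    The mixed-clock formula above is easy to verify numerically. The sketch below reproduces the 144 / 128 example; the function name is hypothetical, and the final division by 4 assumes a divider of 1.0 for every PState, as in the example.

```python
def mixed_ccx_effective_fid(target_fid: int, highest_fid: int) -> float:
    """Effective CPUFID of a slower core when another core in the same CCX
    runs at a higher CPUFID (all PStates sharing the same CPUDFSId)."""
    return target_fid / (1 + (highest_fid - target_fid) / highest_fid)


# Example from the text: highest core at CPUFID 144 (36.0x), target at 128 (32.0x).
effective_fid = mixed_ccx_effective_fid(128, 144)
effective_mult = effective_fid / 4  # divider of 1.0 assumed, as in the example
print(f"effective CPUFID {effective_fid:.1f}, multiplier {effective_mult:.1f}x")
```

    When target and highest CPUFID are equal, the correction term is zero and the requested multiplier is delivered unchanged.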

    The synchronization of the data fabric dictates that each of the enabled CCXs has an identical number of cores enabled at all times. The available configurations are 1 (1:0), 2 (2:0 or 1:1), 3 (3:0), 4 (4:0 or 2:2), 6 (3:3), 8 (4:4).
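    The list of configurations follows directly from the rule. As a sketch, assuming two CCXs of four cores each (which is what the x:y notation above implies), the valid layouts can be enumerated like this; the function name is made up for the example.

```python
from collections import defaultdict

CORES_PER_CCX = 4  # per the CCX definition at the top of the post


def valid_core_configs() -> dict:
    """Map total core count -> list of (ccx0, ccx1) layouts.

    Rule from the text: every *enabled* CCX has the same number of cores,
    so the second CCX is either disabled (0) or mirrors the first.
    """
    configs = defaultdict(list)
    for first in range(1, CORES_PER_CCX + 1):
        for second in (0, first):
            configs[first + second].append((first, second))
    return dict(configs)


for total, layouts in sorted(valid_core_configs().items()):
    print(total, layouts)
```

    Note that 5 and 7 core totals never appear, since they cannot be split evenly across two enabled CCXs or fit into one.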

    The data fabric

    The northbridge of Zeppelin is officially called the data fabric (DF). The DF frequency is always linked to the operating frequency of the memory controller with a ratio of 1:2 (e.g. DDR4-2667 MEMCLK = 1333MHz DFICLK). This means that the memory speed will directly affect the data fabric performance as well. In some cases, it may appear that the performance of Zeppelin scales extremely well with the increased memory speed, however that is not necessarily the case.

    In many of these cases the abnormally good scaling is caused by the higher data fabric clock (DFICLK) resulting from the higher memory speed, rather than the increased performance of the memory itself.
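    The 1:2 relation between the memory rating and DFICLK can be tabulated quickly. This is a minimal sketch with a hypothetical function name; the memory speeds are the ones mentioned in this post. The nominal DDR4-2667 rating is itself a rounded figure, so the computed value lands slightly above the 1333MHz quoted above.

```python
def dficlk_mhz(ddr_rating: float) -> float:
    """Data fabric clock for a given DDR4 rating, using the 1:2 ratio."""
    return ddr_rating / 2.0


for rating in (2400, 2667, 2933, 3200):
    print(f"DDR4-{rating}: DFICLK ~{dficlk_mhz(rating):.1f}MHz")
```

    Going from DDR4-2400 to DDR4-3200 therefore raises the data fabric clock by a third, independently of any gain from the memory itself.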

    The highest officially supported memory speed for consumer (AM4) Zeppelin parts is 2667MHz (two single rank / sided modules in total) or 2400MHz (two dual rank / sided modules in total), however memory ratios for 2933MHz and 3200MHz speeds are available (not officially supported), at least on some motherboards.

    Overclocking

    The overclocking headroom for the higher-end Ryzen models is rather slim. This was expected due to the relatively high stock frequencies, high-density orientation of the design and the low power targeted manufacturing process used for the Zeppelin die (Samsung 14nm LPP).

    [​IMG]


    As indicated by the Vmin-Fmax curve, Zeppelin's voltage scaling is perfectly linear up to 3.3GHz (25mV per 100MHz). The first deviation ("Critical 1") from this linear behavior can be seen at 3.3GHz. The second and final deviation ("Critical 2") can be seen at 3.5GHz. Beyond this point the voltage scaling is no longer linear and does not recover even temporarily, and the CPU requires higher voltage in increasingly larger steps to scale further.

    The ideal frequency range for the process or the design (as a whole) appears to be 2.1 - 3.3GHz (25mV per 100MHz). Above this region (>= 3.3GHz) the voltage scaling gradually deteriorates to 40 - 100mV+ per 100MHz.

    This means that at ~3.8GHz pushing further usually becomes extremely costly (power / thermal wise).
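    For the linear part of the curve, the voltage cost of a frequency change is simple arithmetic. The sketch below only covers the 2.1 - 3.3GHz region, where the 25mV per 100MHz figure applies; it deliberately refuses anything outside that range, because the text gives only a rough 40 - 100mV+ band there. Names are illustrative.

```python
LINEAR_MV_PER_100MHZ = 25          # slope quoted for the ideal range
LINEAR_RANGE_MHZ = (2100, 3300)    # ideal frequency range from the text


def linear_region_delta_mv(from_mhz: int, to_mhz: int) -> float:
    """Extra voltage (mV) needed to go from one frequency to another,
    valid only inside the linear 2.1 - 3.3GHz region."""
    low, high = LINEAR_RANGE_MHZ
    if not (low <= from_mhz <= to_mhz <= high):
        raise ValueError("both frequencies must lie in the linear region")
    return (to_mhz - from_mhz) / 100 * LINEAR_MV_PER_100MHZ


# Crossing the whole ideal range costs 12 steps of 25mV.
print(linear_region_delta_mv(2100, 3300))
```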

    In comparison, the "critical" points for the two previous AMD desktop designs were at:

    - Orochi Rev. C aka Vishera, 32nm SHP SOI - (1 = 4.4GHz, 2 = 4.7GHz)

    - Kaveri / Godavari, 28nm "SHP" HPP Planar - (1 = 4.3GHz, 2 = 4.5GHz)

    The voltage scaling indicated by the Vmin-Fmax curve (above) can be also clearly seen in the default voltages for the different frequency states (PStates) of the CPU.

    On the high-end models the actual (effective) voltage for the base frequency (e.g. 3.6GHz on 1800X SKU) can be anywhere between 1.200 and 1.300V. Meanwhile the actual (effective) voltage for the highest single core boosted PState (XFR, e.g. 4.1GHz) can be as high as 1.47500V.

    In the tested sample the actual default voltage for the base frequency (P0, 3.6GHz) was ~1.25000V, while the highest single core boost state (XFR, 4.1GHz) defaulted to 1.4625V.

    While AMD has not revealed the highest safe (sustainable) VDDCR_CPU (CCX) or VDDCR_SOC (data fabric & peripheral) voltage levels, it can be speculated that voltages higher than 1.4500V are generally not advisable for sustained use, at least in conditions / workloads which result in high power consumption (i.e. all cores fully stressed).

    Although it is true that the high-end models can have their default voltage set up to 1.47500V during their maximum single core boost (XFR) operation, the power consumption / dissipation, the amount of current flowing and the temperatures are very different between the scenarios where only a single core is fully stressed and where all of the cores are fully stressed.

    Pushing to or even beyond the factory MSCF (4.1GHz / XFR) frequency is entirely possible on all cores, however in my personal opinion it is not worth the significantly higher power consumption resulting from the significantly increased supply voltage. Personally, I find it more intriguing to try making the CPU even more efficient than it already is at stock.

    Overclocking Ryzen, at least the higher-end models, is kind of a double-edged sword. Due to how the Turbo / XFR operates in Zeppelin and the rather slim overclocking margins, the user might end up actually losing single core performance when the CPU is overclocked. Since the Turbo / XFR will always be disabled when the CPU is overclocked (upon entering the "OC Mode"), the single core performance might actually be lower than at stock if the user is unable to reach the same speed on all cores as the CPU reached under single core stress at default (e.g. 4.1GHz on 1800X SKU).

    The power consumption

    All of the power consumption measurements have been made with the DCR method. The figures represent the total combined power consumed by the CPU cores (VDDCR_CPU, Plane 1) and the data fabric / the peripherals (VDDCR_SOC, Plane 2). These figures do not include switching or conduction losses.

    Peak power (i.e. worst-case) figures were measured during Firestarter FMA/AVX binary execution. On average the resulting power consumption is around 30% higher than the power consumption resulting from any other real world consumer, fully multithreaded workload.

    Note: Current versions of Prime95 (28.10) do not stress Ryzen CPUs properly. The resulting power consumption is abnormally low, and both Firestarter and Linpack result in significantly higher power consumption.

    [​IMG]

    "MCRT" (Monte Carlo raytracer, based on SmallPT) was chosen as a more real world representative workload. It provides extremely good and linear multithreaded scaling and is a relatively modern workload.
    Rather than just measuring the average power consumption, performance per watt metric was included as well to provide an additional data point.

    [​IMG]

    [​IMG]

    An easter egg

    Zeppelin features highly advanced power management, as stated many times before. Just like Carrizo / Bristol Ridge, which feature a very similar PM, Zeppelin can in fact support cTDP as well. cTDP is not officially supported (or available) on any consumer Zeppelin based SKU (AFAIK). The lack of official support is merely a distraction ;)

    [​IMG]

    850 points in Cinebench 15 at 30W is quite telling. Or not telling, but absolutely massive. Zeppelin can reach absolutely monstrous and unseen levels of efficiency, as long as it operates within its ideal frequency range.
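    Put as a performance per watt figure, the result above works out as follows. This is plain arithmetic on the two numbers quoted (850 points, 30W); the function name is hypothetical, and the 30W value is taken at face value as the power figure.

```python
def points_per_watt(score: float, power_w: float) -> float:
    """Benchmark points delivered per watt."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return score / power_w


# Figures from the text: 850 points in Cinebench 15 at 30W.
print(f"{points_per_watt(850, 30):.1f} points per watt")
```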

    In case you have "any" questions, just ask.

    EDIT: 3/6/2017

    Ok, so I've now changed the charts to have common Y-axes, where possible.
    An additional data point was added to the charts to indicate performance with 256-bit workloads excluded (the ones which have actual gains from 256-bit code).
    The excluded 256-bit workloads are: Blender, Bullet (IPC only), Embree, Euler3D, Himeno, Linpack, NBody & X265.

    I also noticed that there is an error, affecting Kaby Lake results in 4C/4T, 4C/8T and SKU vs. SKU tests. The absolute data is correct, however the summary views are affected by a tiny amount.
    This is because I made a typo in "Caselab Euler3D" calculation for Kaby Lake (reversed calculation). Because of that the summary views will show small changes for Kaby Lake only. In 4C/4T summary the difference will reduce by >= 0.5% and in 4C/8T increase by ~1% (same applies to SKU vs. SKU).

    I'll leave the gallery codes (Imgur) for the original charts in the OP after changing the charts, so anyone can inspect the results if necessary.

     
    #1 The Stilt, Mar 2, 2017
    Last edited: Mar 6, 2017
  3. The Stilt

    The results

    NOTE: Many of the chart legends state "(ER)". "ER" stands for extremities removed. This means that the single absolute best and worst results have been excluded.
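    The "ER" treatment can be sketched as follows. The text only says that the single best and single worst results are excluded; averaging the remainder with an arithmetic mean is an assumption here, and the sample scores are invented for illustration.

```python
def mean_extremities_removed(results: list) -> float:
    """Drop the single best and single worst result, then average the rest.

    Assumption: arithmetic mean; the post does not say which average is used.
    """
    if len(results) < 3:
        raise ValueError("need at least three results to drop both extremes")
    trimmed = sorted(results)[1:-1]
    return sum(trimmed) / len(trimmed)


# Hypothetical relative scores (percent) for five workloads.
scores = [90.0, 100.0, 110.0, 150.0, 40.0]
print(mean_extremities_removed(scores))
```

    With the outliers (150 and 40) dropped, the summary is no longer dominated by a single unusually good or bad workload.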

    Core vs. Core (IPC) 3.5GHz

    [​IMG]

     
    #2 The Stilt, Mar 2, 2017
    Last edited: Mar 2, 2017
  4. The Stilt

    Core vs. Core (IPC) 3.5GHz - Continued

    [​IMG]


    MT - 4C/4T 3.5GHz

    [​IMG]


     
    #3 The Stilt, Mar 2, 2017
    Last edited: Mar 6, 2017
  5. The Stilt

    MT - 4C/4T 3.5GHz - Continued

    [​IMG]


    MT - 4C/8T 3.5GHz

    [​IMG]


     
    #4 The Stilt, Mar 2, 2017
    Last edited: Mar 6, 2017
  6. The Stilt

    MT - 4C/8T 3.5GHz - Continued

    [​IMG]


    SKU vs. SKU MT

    [​IMG]


     
    #5 The Stilt, Mar 2, 2017
    Last edited: Mar 6, 2017
  7. The Stilt

    SKU vs. SKU MT - Continued

    [​IMG]


    The SMT Yield


    [​IMG]


     
    #6 The Stilt, Mar 2, 2017
    Last edited: Mar 6, 2017
  8. The Stilt

    The SMT Yield - Continued

    [​IMG]


    It is pretty easy to tell from the results which workloads utilize FMA instructions (Bullet, Himeno, NBody, Linpack). Hopefully AMD is able to improve the FMA performance in the newer Zen-based architectures.

     
    #7 The Stilt, Mar 2, 2017
    Last edited: Mar 6, 2017
  9. The Stilt

    I'm done.
     
    #8 The Stilt, Mar 2, 2017
    Last edited: Mar 2, 2017
  10. R0H1T

    R0H1T Platinum Member

    Joined:
    Jan 12, 2013
    Messages:
    2,200
    Likes Received:
    67
    So the secret sauce is SMT, pleasantly surprised by this.
    I guess 4c/8t or 6c/12t & 8c/16t would be better buys than their "lesser" counterparts.

    I must've missed this, but I don't see any power consumption numbers apart from Ryzen? Just interested in perf/w of Intel's best vs Zen.
     
    #9 R0H1T, Mar 2, 2017
    Last edited: Mar 2, 2017
  11. iBoMbY

    iBoMbY Member

    Joined:
    Nov 23, 2016
    Messages:
    113
    Likes Received:
    59
    Very interesting. Could you maybe also post the Coreinfo output for your Ryzen, if you have the time? Unfortunately I will get mine tomorrow at the earliest.
     
  12. lolfail9001

    lolfail9001 Senior member

    Joined:
    Sep 9, 2016
    Messages:
    932
    Likes Received:
    317
    Wait a minute, why does Zen get stomped by Excavator in Himeno? In other FMA loads (Linpack, duh) it does not look like that.
     
  13. Atari2600

    Atari2600 Senior member

    Joined:
    Nov 22, 2016
    Messages:
    408
    Likes Received:
    286
    Great work!
     
  14. The Stilt

    I didn't provide the numbers for anything but Ryzen.
    That's because it would be rather hard to do so. I could measure the same DCR powers with Haswell-E and Excavator, however not with Kaby Lake.
    The issue with Haswell is that its power wouldn't represent only the power consumed by the CPU cores and the uncore, but would also include switching losses (due to FIVR).
    On Kaby Lake I can only access the SVID data, which technically could be biased if Intel wanted that. That's because even though the motherboard itself is rather OK, it doesn't have an IR controller on it, which would allow direct DCR telemetry.

    I might provide the figures for Excavator and Haswell-E later, regardless.
     
  15. The Stilt

    That's a very good question.
    It happens regardless of the compiler used (GCC, MSVC, ICL), so it isn't a compiler-specific issue. Himeno binaries were compiled with GCC 6.3 as it is the superior compiler for the test.
     
  16. w3rd

    w3rd Senior member

    Joined:
    Mar 1, 2017
    Messages:
    218
    Likes Received:
    50
    Thank you.
     
  17. lolfail9001

    Is there a chance the rumored latency issues have anything to do with it? Because I see such weird results on a few synthetics that I only have memory in mind to explain that.
     
  18. The Stilt

    Can you refresh me on the latency issue (LLC or DRAM)?
    The DRAM latency is almost exactly the same as on Excavator, so that's not the reason.
    Cache latencies appear to be fine as far as I can tell, at least when tested properly.

    Most things (including the microcodes and co-processor firmwares) will continue to evolve. There will definitely be some performance and functionality improvements.
     
  19. lolfail9001

    DRAM. LLC looks in line with the 6900K and the bunch.
    Makes sense. And does not make sense at the same time.
     
  20. Zucker2k

    Zucker2k Member

    Joined:
    Feb 15, 2006
    Messages:
    133
    Likes Received:
    26
    What are your findings on power consumption? As you stated, Prime 95 doesn't do a good job yet.
     
  21. tamz_msc

    tamz_msc Senior member

    Joined:
    Jan 5, 2017
    Messages:
    842
    Likes Received:
    554
    Tom's said that the way AIDA64 measures latency does not pertain to real-world scenarios. Better software support is necessary. Though the SMT bug is a sore point in gaming.
     
  22. MajinCry

    MajinCry Golden Member

    Joined:
    Jul 28, 2015
    Messages:
    1,918
    Likes Received:
    319
  23. The Stilt

    Code:
    AMD Ryzen: ZD3601BAM88F4_40/36_Y               
    AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
    HTT           *   Multicore
    HYPERVISOR   -   Hypervisor is present
    VMX           -   Supports Intel hardware-assisted virtualization
    SVM           *   Supports AMD hardware-assisted virtualization
    X64           *   Supports 64-bit mode
    
    SMX           -   Supports Intel trusted execution
    SKINIT       *   Supports AMD SKINIT
    
    NX           *   Supports no-execute page protection
    SMEP         *   Supports Supervisor Mode Execution Prevention
    SMAP         *   Supports Supervisor Mode Access Prevention
    PAGE1GB       *   Supports 1 GB large pages
    PAE           *   Supports > 32-bit physical addresses
    PAT           *   Supports Page Attribute Table
    PSE           *   Supports 4 MB pages
    PSE36         *   Supports > 32-bit address 4 MB pages
    PGE           *   Supports global bit in page tables
    SS           -   Supports bus snooping for cache operations
    VME           *   Supports Virtual-8086 mode
    RDWRFSGSBASE   *   Supports direct GS/FS base access
    
    FPU           *   Implements i387 floating point instructions
    MMX           *   Supports MMX instruction set
    MMXEXT       *   Implements AMD MMX extensions
    3DNOW         -   Supports 3DNow! instructions
    3DNOWEXT      -   Supports 3DNow! extension instructions
    SSE           *   Supports Streaming SIMD Extensions
    SSE2         *   Supports Streaming SIMD Extensions 2
    SSE3         *   Supports Streaming SIMD Extensions 3
    SSSE3         *   Supports Supplemental SIMD Extensions 3
    SSE4a         *   Supports Streaming SIMDR Extensions 4a
    SSE4.1       *   Supports Streaming SIMD Extensions 4.1
    SSE4.2       *   Supports Streaming SIMD Extensions 4.2
    
    AES           *   Supports AES extensions
    AVX           *   Supports AVX intruction extensions
    FMA           *   Supports FMA extensions using YMM state
    MSR           *   Implements RDMSR/WRMSR instructions
    MTRR         *   Supports Memory Type Range Registers
    XSAVE         *   Supports XSAVE/XRSTOR instructions
    OSXSAVE       *   Supports XSETBV/XGETBV instructions
    RDRAND       *   Supports RDRAND instruction
    RDSEED       *   Supports RDSEED instruction
    
    CMOV         *   Supports CMOVcc instruction
    CLFSH         *   Supports CLFLUSH instruction
    CX8           *   Supports compare and exchange 8-byte instructions
    CX16         *   Supports CMPXCHG16B instruction
    BMI1         *   Supports bit manipulation extensions 1
    BMI2         *   Supports bit manipulation extensions 2
    ADX           *   Supports ADCX/ADOX instructions
    DCA           -   Supports prefetch from memory-mapped device
    F16C         *   Supports half-precision instruction
    FXSR         *   Supports FXSAVE/FXSTOR instructions
    FFXSR         *   Supports optimized FXSAVE/FSRSTOR instruction
    MONITOR       *   Supports MONITOR and MWAIT instructions
    MOVBE         *   Supports MOVBE instruction
    ERMSB         -   Supports Enhanced REP MOVSB/STOSB
    PCLMULDQ      *   Supports PCLMULDQ instruction
    POPCNT       *   Supports POPCNT instruction
    LZCNT         *   Supports LZCNT instruction
    SEP           *   Supports fast system call instructions
    LAHF-SAHF    *   Supports LAHF/SAHF instructions in 64-bit mode
    HLE           -   Supports Hardware Lock Elision instructions
    RTM           -   Supports Restricted Transactional Memory instructions
    
    DE           *   Supports I/O breakpoints including CR4.DE
    DTES64       -   Can write history of 64-bit branch addresses
    DS           -   Implements memory-resident debug buffer
    DS-CPL       -   Supports Debug Store feature with CPL
    PCID         -   Supports PCIDs and settable CR4.PCIDE
    INVPCID       -   Supports INVPCID instruction
    PDCM         -   Supports Performance Capabilities MSR
    RDTSCP       *   Supports RDTSCP instruction
    TSC           *   Supports RDTSC instruction
    TSC-DEADLINE   -   Local APIC supports one-shot deadline timer
    TSC-INVARIANT   *   TSC runs at constant rate
    xTPR         -   Supports disabling task priority messages
    
    EIST         -   Supports Enhanced Intel Speedstep
    ACPI         -   Implements MSR for power management
    TM           -   Implements thermal monitor circuitry
    TM2           -   Implements Thermal Monitor 2 control
    APIC         *   Implements software-accessible local APIC
    x2APIC       -   Supports x2APIC
    
    CNXT-ID       -   L1 data cache mode adaptive or BIOS
    
    MCE           *   Supports Machine Check, INT18 and CR4.MCE
    MCA           *   Implements Machine Check Architecture
    PBE           -   Supports use of FERR#/PBE# pin
    
    PSN           -   Implements 96-bit processor serial number
    
    PREFETCHW    *   Supports PREFETCHW instruction
    
    Maximum implemented CPUID leaves: 0000000D (Basic), 8000001F (Extended).
    
    Logical to Physical Processor Map:
    **--------------  Physical Processor 0 (Hyperthreaded)
    --**------------  Physical Processor 1 (Hyperthreaded)
    ----**----------  Physical Processor 2 (Hyperthreaded)
    ------**--------  Physical Processor 3 (Hyperthreaded)
    --------**------  Physical Processor 4 (Hyperthreaded)
    ----------**----  Physical Processor 5 (Hyperthreaded)
    ------------**--  Physical Processor 6 (Hyperthreaded)
    --------------**  Physical Processor 7 (Hyperthreaded)
    
    Logical Processor to Socket Map:
    ****************  Socket 0
    
    Logical Processor to NUMA Node Map:
    ****************  NUMA Node 0
    
    No NUMA nodes.
    
    Logical Processor to Cache Map:
    *---------------  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
    *---------------  Instruction Cache   0, Level 1,   64 KB, Assoc   4, LineSize  64
    *---------------  Unified Cache       0, Level 2,  512 KB, Assoc   8, LineSize  64
    *---------------  Unified Cache       1, Level 3,   16 MB, Assoc  16, LineSize  64
    -*--------------  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
    -*--------------  Instruction Cache   1, Level 1,   64 KB, Assoc   4, LineSize  64
    -*--------------  Unified Cache       2, Level 2,  512 KB, Assoc   8, LineSize  64
    -*--------------  Unified Cache       3, Level 3,   16 MB, Assoc  16, LineSize  64
    --*-------------  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
    --*-------------  Instruction Cache   2, Level 1,   64 KB, Assoc   4, LineSize  64
    --*-------------  Unified Cache       4, Level 2,  512 KB, Assoc   8, LineSize  64
    --*-------------  Unified Cache       5, Level 3,   16 MB, Assoc  16, LineSize  64
    ---*------------  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
    ---*------------  Instruction Cache   3, Level 1,   64 KB, Assoc   4, LineSize  64
    ---*------------  Unified Cache       6, Level 2,  512 KB, Assoc   8, LineSize  64
    ---*------------  Unified Cache       7, Level 3,   16 MB, Assoc  16, LineSize  64
    ----*-----------  Data Cache          4, Level 1,   32 KB, Assoc   8, LineSize  64
    ----*-----------  Instruction Cache   4, Level 1,   64 KB, Assoc   4, LineSize  64
    ----*-----------  Unified Cache       8, Level 2,  512 KB, Assoc   8, LineSize  64
    ----*-----------  Unified Cache       9, Level 3,   16 MB, Assoc  16, LineSize  64
    -----*----------  Data Cache          5, Level 1,   32 KB, Assoc   8, LineSize  64
    -----*----------  Instruction Cache   5, Level 1,   64 KB, Assoc   4, LineSize  64
    -----*----------  Unified Cache      10, Level 2,  512 KB, Assoc   8, LineSize  64
    -----*----------  Unified Cache      11, Level 3,   16 MB, Assoc  16, LineSize  64
    ------*---------  Data Cache          6, Level 1,   32 KB, Assoc   8, LineSize  64
    ------*---------  Instruction Cache   6, Level 1,   64 KB, Assoc   4, LineSize  64
    ------*---------  Unified Cache      12, Level 2,  512 KB, Assoc   8, LineSize  64
    ------*---------  Unified Cache      13, Level 3,   16 MB, Assoc  16, LineSize  64
    -------*--------  Data Cache          7, Level 1,   32 KB, Assoc   8, LineSize  64
    -------*--------  Instruction Cache   7, Level 1,   64 KB, Assoc   4, LineSize  64
    -------*--------  Unified Cache      14, Level 2,  512 KB, Assoc   8, LineSize  64
    -------*--------  Unified Cache      15, Level 3,   16 MB, Assoc  16, LineSize  64
    --------*-------  Data Cache          8, Level 1,   32 KB, Assoc   8, LineSize  64
    --------*-------  Instruction Cache   8, Level 1,   64 KB, Assoc   4, LineSize  64
    --------*-------  Unified Cache      16, Level 2,  512 KB, Assoc   8, LineSize  64
    --------*-------  Unified Cache      17, Level 3,   16 MB, Assoc  16, LineSize  64
    ---------*------  Data Cache          9, Level 1,   32 KB, Assoc   8, LineSize  64
    ---------*------  Instruction Cache   9, Level 1,   64 KB, Assoc   4, LineSize  64
    ---------*------  Unified Cache      18, Level 2,  512 KB, Assoc   8, LineSize  64
    ---------*------  Unified Cache      19, Level 3,   16 MB, Assoc  16, LineSize  64
    ----------*-----  Data Cache         10, Level 1,   32 KB, Assoc   8, LineSize  64
    ----------*-----  Instruction Cache  10, Level 1,   64 KB, Assoc   4, LineSize  64
    ----------*-----  Unified Cache      20, Level 2,  512 KB, Assoc   8, LineSize  64
    ----------*-----  Unified Cache      21, Level 3,   16 MB, Assoc  16, LineSize  64
    -----------*----  Data Cache         11, Level 1,   32 KB, Assoc   8, LineSize  64
    -----------*----  Instruction Cache  11, Level 1,   64 KB, Assoc   4, LineSize  64
    -----------*----  Unified Cache      22, Level 2,  512 KB, Assoc   8, LineSize  64
    -----------*----  Unified Cache      23, Level 3,   16 MB, Assoc  16, LineSize  64
    ------------*---  Data Cache         12, Level 1,   32 KB, Assoc   8, LineSize  64
    ------------*---  Instruction Cache  12, Level 1,   64 KB, Assoc   4, LineSize  64
    ------------*---  Unified Cache      24, Level 2,  512 KB, Assoc   8, LineSize  64
    ------------*---  Unified Cache      25, Level 3,   16 MB, Assoc  16, LineSize  64
    -------------*--  Data Cache         13, Level 1,   32 KB, Assoc   8, LineSize  64
    -------------*--  Instruction Cache  13, Level 1,   64 KB, Assoc   4, LineSize  64
    -------------*--  Unified Cache      26, Level 2,  512 KB, Assoc   8, LineSize  64
    -------------*--  Unified Cache      27, Level 3,   16 MB, Assoc  16, LineSize  64
    --------------*-  Data Cache         14, Level 1,   32 KB, Assoc   8, LineSize  64
    --------------*-  Instruction Cache  14, Level 1,   64 KB, Assoc   4, LineSize  64
    --------------*-  Unified Cache      28, Level 2,  512 KB, Assoc   8, LineSize  64
    --------------*-  Unified Cache      29, Level 3,   16 MB, Assoc  16, LineSize  64
    ---------------*  Data Cache         15, Level 1,   32 KB, Assoc   8, LineSize  64
    ---------------*  Instruction Cache  15, Level 1,   64 KB, Assoc   4, LineSize  64
    ---------------*  Unified Cache      30, Level 2,  512 KB, Assoc   8, LineSize  64
    ---------------*  Unified Cache      31, Level 3,   16 MB, Assoc  16, LineSize  64
    
    Logical Processor to Group Map:
    ****************  Group 0
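
    For anyone who wants to post-process a dump like the one above, here is a minimal Python sketch that turns the "Logical to Physical Processor Map" lines into a core-to-thread table. It assumes each line has the shape "<mask>  Physical Processor <n> ..." as shown in the output; the function names (mask_to_cpus, parse_processor_map) are made up for illustration and are not part of any tool.

```python
def mask_to_cpus(mask):
    """Return the logical processor indices marked with '*' in a mask string."""
    return [i for i, ch in enumerate(mask) if ch == "*"]


def parse_processor_map(lines):
    """Map each physical processor number to its logical processor indices.

    Assumes lines shaped like: '<mask>  Physical Processor <n> (Hyperthreaded)'.
    Lines that do not match this shape are ignored.
    """
    cores = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 4 and parts[1] == "Physical" and parts[2] == "Processor":
            cores[int(parts[3])] = mask_to_cpus(parts[0])
    return cores


# Three lines copied from the dump above.
sample = [
    "**--------------  Physical Processor 0 (Hyperthreaded)",
    "--**------------  Physical Processor 1 (Hyperthreaded)",
    "--------------**  Physical Processor 7 (Hyperthreaded)",
]

core_map = parse_processor_map(sample)
for core, threads in sorted(core_map.items()):
    print(f"core {core}: logical processors {threads}")
```

    On the sample lines this pairs logical processors 0/1 with core 0, 2/3 with core 1 and 14/15 with core 7, i.e. the two SMT threads of a core are enumerated next to each other.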
    
     
  24. The Stilt

    The Stilt Golden Member

  25. inf64

    inf64 Platinum Member

    Joined:
    Mar 11, 2011
    Messages:
    2,733
    Likes Received:
    932
    Impressive article Stilt! You detailed ups and downs of Zen really well, better than any large media outlet out there. This should be posted on some website like RWT or similar. Kudos!
     
  26. MajinCry

    MajinCry Golden Member

    Sweet. The optimal settings are to set Ships to 1, Rocks to 16000, and disable instancing.