Recent content by BenchPress

  1. B

    Intel extends AVX to 512-bit

    We have 10-core processors right now. You have the (lack of) competition to blame for keeping the prices high on anything beyond quad-core. AMD's Steamroller architecture with four modules might finally perform a little closer to an 8-core. So that would make Intel release affordable 6 or 8-core...
  2. B

    Intel extends AVX to 512-bit

    They tried but failed, due to the inherent heterogeneous overhead and programming complications. With the Kepler architecture they're focusing on graphics again, which is where the money is for them, and they've taken a serious step back from consumer GPGPU. AVX-512, by contrast, is homogeneous, which...
  3. B

    Intel extends AVX to 512-bit

    What makes you think that? Same question. I don't perceive the lack of dedicated masking registers as that big of an issue. AVX has 'blend' instructions for predication, and Intel CPUs have two execution ports for them. Masking can reduce power consumption by disabling unused lanes, and it...
  4. B

    Intel extends AVX to 512-bit

    That's a bit like asking what programs benefit from having more execution units per core. Sure, some have more instruction level parallelism (ILP) than others, but you can't draw a line between ones that do and ones that don't benefit from it. Likewise, these wide vector instructions are very...
  5. B

    Intel extends AVX to 512-bit

    Interestingly it's not exactly new. It's for the most part the Xeon Phi ISA, made compatible with the legacy 256-bit and 128-bit instructions. There's a new zeroing behavior option when using the mask registers, which seems of particular interest to out-of-order execution architectures, to...
  6. B

    Intel extends AVX to 512-bit

    AVX only extended floating-point operations to 256-bit. x264 uses integer operations. AVX2 offers 256-bit integer vector operations. AVX-512 does not extend them to 512-bit, for now. Just to be clear though, GPGPU is not efficient at processing small integer elements.
  7. B

    Intel extends AVX to 512-bit

    It's going to revolutionize computing as we know it. It brings all of the general-purpose computing power of the GPU, into the CPU cores. No more heterogeneous overhead. R.I.P. GPGPU.
  8. B

    Intel extends AVX to 512-bit

    http://software.intel.com/en-us/blogs/2013/07/10/avx-512-instructions
  9. B

    Company of Heroes 2 - fascinating CPU benchmarks

    With more threads, you get more interactions between threads that need synchronization. More synchronization means more overhead. This is exactly why Intel introduced TSX. It optimizes synchronization by assuming that most interactions won't be interrupted by a conflict.
  10. B

    Knight's Landing, Skylake to unify instruction sets?

    Why would you want to run Windows on it? Windows is targeted at consumers and servers, not at HPC systems that have barely any need for an OS. Even so, Microsoft could easily create a Phi version of Windows, if there was enough demand. No need for Intel to bend over backwards to support Windows...
  11. B

    Knight's Landing, Skylake to unify instruction sets?

    Fat chance. 14 nm is a node and a half smaller, and they'll probably bring 6 or 8-core to the mainstream market. They'll have enough on their plate to not want to be bothered with a new architecture at the same time. The tick-tock model has worked really well so far. There could be a handful of...
  12. B

    Knight's Landing, Skylake to unify instruction sets?

    Not likely. Xeon Phi is targeted exclusively at the HPC market, and runs software by and for that market. So it doesn't have to be binary compatible with legacy CPU extensions. You may not even want that. Xeon Phi is an in-order execution architecture with hundreds of threads, while desktop...
  13. B

    Hyperthreading Revisited

    Note that Haswell has improved Hyper-Threading performance. It has four arithmetic execution ports, instead of three (which we were stuck with since Core 2). What's more, they're arranged so that it's really two pairs of ports with equal capabilities (for scalar integer instructions). This is...
  14. B

    Yes another Haswell thread. Let's have a look at tock-to-tock IPC.

    But it requires a lot from the hardware! You can't claim IPC scales much more easily by just looking at what it means to developers. That's only half the story, the good part. The bad part is that beyond modest increments it takes a very large amount of hardware to extract more ILP, and worse...
  15. B

    Yes another Haswell thread. Let's have a look at tock-to-tock IPC.

    That certainly couldn't have been your original point: Note that this is what started our discussion. If "inherent scaling problems" refer to "a loss of work, no matter how small", that implies you expect IPC to scale much more easily. In reality IPC only scales by roughly 10% every...
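
The AVX-512 posts above contrast blend-based predication in AVX with dedicated mask registers and the new zeroing option. As a rough illustration only, the lane semantics can be modeled in plain scalar C, with each loop iteration standing in for one vector lane. The function names (`add_blend`, `add_mask_merge`, `add_mask_zero`) are hypothetical and are not real intrinsics; the sketch assumes at most 32 lanes so the mask fits in a `uint32_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model only: each loop iteration stands in for one vector lane.
   All function names are hypothetical, not real intrinsics. */

/* Blend-style predication: compute the sum for every lane, then select
   per lane between the new result and the old destination value. */
void add_blend(float *dst, const float *a, const float *b,
               const uint8_t *select, size_t lanes)
{
    for (size_t i = 0; i < lanes; ++i) {
        float sum = a[i] + b[i];            /* work is done in all lanes */
        dst[i] = select[i] ? sum : dst[i];  /* separate blend step */
    }
}

/* Merge-masking: lanes whose mask bit is clear keep the old destination,
   so the result depends on the previous contents of dst. */
void add_mask_merge(float *dst, const float *a, const float *b,
                    uint32_t mask, size_t lanes)
{
    assert(lanes <= 32);  /* mask is modeled as a 32-bit value */
    for (size_t i = 0; i < lanes; ++i) {
        if ((mask >> i) & 1u) {
            dst[i] = a[i] + b[i];
        }
    }
}

/* Zero-masking: lanes whose mask bit is clear are written as zero,
   so the result does not depend on the previous contents of dst. */
void add_mask_zero(float *dst, const float *a, const float *b,
                   uint32_t mask, size_t lanes)
{
    assert(lanes <= 32);  /* mask is modeled as a 32-bit value */
    for (size_t i = 0; i < lanes; ++i) {
        dst[i] = ((mask >> i) & 1u) ? a[i] + b[i] : 0.0f;
    }
}
```

In this model, the zeroing variant never reads the old destination. That is one plausible reading of why the post calls it interesting for out-of-order execution, though the truncated snippet does not spell out the reason.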