4th Generation Intel Core, Haswell summarized

Makaveli · Sep 12, 2012

Ajay said:
"comfortably faster"? I'm guessing you didn't word that the way you meant. I think a 10+% increase in performance on the same node is pretty impressive.

In any case, Haswell isn't compelling for those that won't see much benefit for the work/play that's important to them if they already have SB/IB. But coming from Nehalem, as I am, that added 10% just makes Haswell all the more attractive.

I a bit torn between IB-E** (more cores) and Haswell when I start a new build next summer (Yay!). The software engineer in me would like to play with AVX2 and TSX (does TSX work on standard DDR3? I'm a little unclear about that). But I like to contribute to F@H in my father's memory, so more cores is usually the way to go, unless F@H is going to take advantage of AVX2. So, I'll need to wait for benchmarks and more info before deciding. Although I have to wait another year to upgrade, at least I will have some exciting high performance CPUs to choose from :thumbsup:

** I betting that IB-E will use a better TIM than IB, so it stands a good chance of being a better overclocker than IB.

i'm in the same boat skipped both SB and IVY.

Anyone on Nehalem/lynnfield/westmere or new builds even a 10% IPC will be a huge boost. If you are on SB/IVY I wouldn't even look at haswell and should wait for its refresh.

Based on all the details i've read so far its looks like AVX2 will have more of an impact then AVX.

Quad's are still the sweet spot so I think we will see the same thing that happend with SB-E and IVY.

So it will be Haswell 4 cores and maybe IVY-E for 6 cores.

NTMBK · Sep 12, 2012

BenchPress said:
Haswell adds a fourth scalar execution port, and the vector throughput had doubled. So these cores are much more powerful.

I'm not entirely sure but I don't think Intel has made any official IPC claims yet... Architecturally it should be capable of more than 10%. And there's TSX to further improve mult-threading efficiency. Last but not least, the type of consumer software that would benefit from more cores, can more easily get more performance with AVX2. So there's really no need to go beyond quad-core just yet.

Look at it this way: When GPUs double their vector width, they advertise it as twice the number of 'compute' cores. Haswell does exactly that. So there's no need to be disappointed about a quad-core Haswell. It's going to be way better than a quad-core Sandy/Ivy Bridge.

Haswell only doubles vector width for integer instructions, not floating point.

Your point with GPUs is misleading- GPUs interact with code through drivers and abstraction layers, so they can use their doubled vector width on legacy code right now. AVX(2) requires that code is at least recompiled for very marginal gains from autovectorisation- take a look at the MSDN pages on autovectorisation to get some idea at how few cases they can actually automatically do, and notice how many cases at least require some form of #pragma hint, if not code restructuring.

In order to actually exploit the full throughput potential of AVX(2) you will need to dive deep into your code and fill it with intrinsics, which may as well be assembly for all their ease of use and readability. Not a single program on the market today uses AVX2, and only a tiny, tiny handful use AVX. We will be waiting years for any serious benefits from it in our everyday computer usage- and even then, the majority of code won't bother with the hand optimisation necessary to actually get the claimed theoretical speedup. Most coders will get their code working, turn on all compiler optimisations they can get their hands on and ship the damn thing.

Just take a look at the current situation in games for a more realistic picture. Go through the requirements of top Steam games. How many of them require any processor extensions beyond SSE1 or SSE2? How many of them even ship 64 bit binaries? And this is gaming, a segment notorious for craving any performance edge they can get. What do you think the situation is like in other market segments? Developers want to sell to as many people as they possibly can, which means not making software incompatible with 99% of the computers in peoples' homes.

CPU instruction extensions are very, very slow to take any real effect in the market beyond incredibly niche segments like HPC. Whereas a graphics manufacturer can throw their entire chip architecture under the bus and replace it with something new, and still get basically maximum performance out of it. Don't let people like BenchPress get you overly hyped over the latest shiny version of AVX; it will mean nothing for years to come, and by the time software really starts using it your Haswell will be slow and outdated.

inf64 · Sep 12, 2012

@BenchPress

What you describe is running a thread on a very fat core. This is not reverse HT... Reverse HT myth describes a situation where 2 separate cores run inherently serial single threaded workload(hard to extract ILP) in a "split" way ,providing the speedup over a situation where only one of those cores runs the same workload.

Being able to utilize newly available resource within the single core(like more exec. ports in Haswell) doesn't make it reverse HT capable. Intel just took their very wide core and made it even wider,that's all. Whether all the resources will be fully utilized when it runs a single threaded code is debatable. Average single thread IPC on Core generation chips is between 1 and 1.2 (instrcutions per cycle) . This is measured on 4way core that has 3 ALU ports. Haswell may improve on this average number by another 10-15% which is realistic. The gains from going wider will be smaller and smaller as time goes by(you should know this).

pelov · Sep 12, 2012

NTMBK said:
Haswell only doubles vector width for integer instructions, not floating point.

Your point with GPUs is misleading- GPUs interact with code through drivers and abstraction layers, so they can use their doubled vector width on legacy code right now. AVX(2) requires that code is at least recompiled for very marginal gains from autovectorisation- take a look at the MSDN pages on autovectorisation to get some idea at how few cases they can actually automatically do, and notice how many cases at least require some form of #pragma hint, if not code restructuring.

In order to actually exploit the full throughput potential of AVX(2) you will need to dive deep into your code and fill it with intrinsics, which may as well be assembly for all their ease of use and readability. Not a single program on the market today uses AVX2, and only a tiny, tiny handful use AVX. We will be waiting years for any serious benefits from it in our everyday computer usage- and even then, the majority of code won't bother with the hand optimisation necessary to actually get the claimed theoretical speedup. Most coders will get their code working, turn on all compiler optimisations they can get their hands on and ship the damn thing.

Just take a look at the current situation in games for a more realistic picture. Go through the requirements of top Steam games. How many of them require any processor extensions beyond SSE1 or SSE2? How many of them even ship 64 bit binaries? And this is gaming, a segment notorious for craving any performance edge they can get. What do you think the situation is like in other market segments? Developers want to sell to as many people as they possibly can, which means not making software incompatible with 99% of the computers in peoples' homes.

CPU instruction extensions are very, very slow to take any real effect in the market beyond incredibly niche segments like HPC. Whereas a graphics manufacturer can throw their entire chip architecture under the bus and replace it with something new, and still get basically maximum performance out of it. Don't let idiots like BenchPress get you overly hyped over the latest shiny version of AVX; it will mean nothing for years to come, and by the time software really starts using it your Haswell will be slow and outdated.

It's refreshing to see a post from someone with a level head

cheers.

NTMBK · Sep 12, 2012

Somewhat pertinent to this discussion, here are the stats from the latest Steam hardware survey (August 2012) on what instruction extensions are available on users' PCs:

FCMOV100.00%0.00%
SSE2 99.74% +0.03%
NTFS 99.51% -0.01%
SSE3 99.07% +0.12%
SSE4.1 54.33% +1.74%
SSE4.2 41.64% +1.80%
HyperThreading 34.70% +1.06%
SSE4a 20.93% +0.15%

SSE4.1 is only in half of all gamers' PCs (assuming that all gamers use Steam, which isn't far from the truth). This has been in processors since Penryn, which shipped in 2007. Five years on, almost half of the gaming target audience cannot use those instructions. Why would a developer waste money on making specialised code branches for anything beyond SSE3?

pelov · Sep 12, 2012

I think an even more telling point is that Penryn was shipped when a majority of development was focused around the PC and x86 whereas now it's shifted to mobile and ARM.

Win8 may change that but only if it succeeds and judging by a lot of enthusiasts' own perception of the OS and the platforms, it hasn't exactly had a ringing endorsement.

I'm far more interested in what Haswell will bring to its target area of Ultrabooks and mobile as a whole along with its GPU and efficiency increase. For AVX2 to take a strong foothold you would need a time machine or an alternate universe.

kleinkinstein · Sep 12, 2012

Dribble said:
Mmm, 10% ipc same node and still quad core is not very exciting. Looks like my old i5-2500K is going to stay competitive (well comfortably faster then any non-oc intel cpu) for another generation.

Heck, all those with i7-920's are still in the game. Haswell looks to be a real stretch for a desktop "tock"!

BenchPress · Sep 12, 2012

NTMBK said:
Haswell only doubles vector width for integer instructions, not floating point.

Haswell also doubles floating-point throughput using FMA. I didn't say it doubled vector width for floating-point.

Your point with GPUs is misleading- GPUs interact with code through drivers and abstraction layers, so they can use their doubled vector width on legacy code right now.

You can get twice the performance with OpenCL code running on the CPU. This also merely requires the OpenCL "driver" to be updated. The same is true for other JIT-compiled languages.

AVX(2) requires that code is at least recompiled for very marginal gains from autovectorisation- take a look at the MSDN pages on autovectorisation to get some idea at how few cases they can actually automatically do, and notice how many cases at least require some form of #pragma hint, if not code restructuring.

AVX2 is far more suitable for auto-vectorization thanks to the addition of gather support and vector-vector shift. Every relevant scalar operation got a vector equivalent, so it becomes straightforward to vectorize things in an SPMD fashion. You can't compare AVX2 to anything that has come before it.

NTMBK · Sep 12, 2012

BenchPress said:
You can get twice the performance with OpenCL code running on the CPU. This also merely requires the OpenCL "driver" to be updated. The same is true for other JIT-compiled languages.

Yes, OpenCL running on the CPU is going to benefit from it very nicely- I'm interested to see some benchmarks comparing Haswell against GCN for raw OpenCL performance. I suspect that Haswell won't do that great, but that'd be more due to the fact that most OpenCL code won't make the most of the CPU's strengths in branching code and prediction, and the algorithms will be optimised to minimise branching.

AVX2 is far more suitable for auto-vectorization thanks to the addition of gather support and vector-vector shift. Every relevant scalar operation got a vector equivalent, so it becomes straightforward to vectorize things in an SPMD fashion.

I'm inclined to agree with you; autovectorization will go from essentially useless to being helpful in at least a handful of cases. But you still won't see anything close to the promised 2x,4x,8x performance gains unless developers go in and code these instructions by hand, and even then only in certain scenarios where you actually have 8 simultaneous data elements to work on (or 32, if you want to go across the cores). I remain dubious, until I see some benchmarking of the improvements that autovectorisation will have.

You can't compare AVX2 to anything that has come before it.

I can compare AVX2 to what came before it, because nobody will use it for years to come due to the market factors I already explained. No-one will start shipping code which won't run on the majority of their target audience's computers.

AVX2 is a big improvement for the right use cases, in the same was SSE2 was a big improvement- but that means nothing if its not in anyone's code. The main tipping point for SSE2 adoption came because you could guarantee that it would be in every single x64 processor.

LogOver · Sep 12, 2012

NTMBK said:
Somewhat pertinent to this discussion, here are the stats from the latest Steam hardware survey (August 2012) on what instruction extensions are available on users' PCs:

FCMOV100.00%0.00%
SSE2 99.74% +0.03%
NTFS 99.51% -0.01%
SSE3 99.07% +0.12%
SSE4.1 54.33% +1.74%
SSE4.2 41.64% +1.80%
HyperThreading 34.70% +1.06%
SSE4a 20.93% +0.15%

SSE4.1 is only in half of all gamers' PCs (assuming that all gamers use Steam, which isn't far from the truth). This has been in processors since Penryn, which shipped in 2007. Five years on, almost half of the gaming target audience cannot use those instructions. Why would a developer waste money on making specialised code branches for anything beyond SSE3?

That's funny, but it seems that percentage of SSE4 on steam has grown a little bit since your last visit (today, I guess):
FCMOV 100.00%0.00%

SSE2 99.75% +0.03%

SSE3 99.11% +0.12%

NTFS 95.07% -0.01%

SSE4.1 56.11% +1.69%

SSE4.2 42.55% +1.89%

HyperThreading 35.48% +1.14%
SSE4a 19.99% +0.14%

Any way, games are not the only software in the world. I guess companies which develop performance-critical software will like it.

LogOver · Sep 12, 2012

pelov said:
I think an even more telling point is that Penryn was shipped when a majority of development was focused around the PC and x86 whereas now it's shifted to mobile and ARM.

The situation with ISA in ARM world in fact is much more complicated then in x86 world. There are a lot of different extensions to support: armv5, armv6, armv7, NEON, vfp2/3, upcoming armv8 etc.

NTMBK · Sep 12, 2012

LogOver said:
That's funny, but it seems that percentage of SSE4 on steam has grown a little bit since your last visit (today, I guess):

Hah, my bad. I was looking at Windows only numbers, the Mac and Windows numbers are indeed what you say. My point still stands though, 54% or 56% regardless.

Any way, games are not the only software in the world. I guess companies which develop performance-critical software will like it.

Only companies which have complete control over the hardware platform they deploy on, and which don't have to support any form of legacy hardware- basically HPC, and the financial sector (those crazy guys will do anything to get a millisecond edge on their rivals, replacing a few racks of servers is a drop in the ocean for them).

Mark R · Sep 12, 2012

meloz said:
Haswell looks like a must upgrade. Improvements in performance/watt are great. Reading about PSR got me giggling with happiness, I hope this feature does not need any support in kernel or a recompile and works transparently for all OS and programs.

From a technical perspective, it doesn't sound like there should be any reason for it to require any software support whatsoever. It sounds like it should be completely transparent given connection to a PSR compatible screen.

ZeroRift · Sep 12, 2012

On-die GPU cache does exist!

link

I, too, thought that this tech was an enthusiast fabrication....

blastingcap · Sep 12, 2012

http://semiaccurate.com/2012/09/07/haswell-gt3-uses-shaders-to-save-power/

http://semiaccurate.com/2012/09/10/crystalwell-is-very-wide-memory-for-haswell-gt3/

http://semiaccurate.com/2012/09/07/intel-to-do-away-with-dram-in-pcs/

BenchPress · Sep 12, 2012

Pilum said:
What symmetry? No integer MUL/DIV on p5+6.

I'm not sure about integer MUL, but yes, integer DIV is an exception. But note that everything else is identical, including branch!

DIV isn't a problem because it's a high latency instruction anyway so its ok to not have its result forwarded in the same cycle.

Also, limiting port forwarding to the p0+1 and p5+6 pairs would mean a regression in single-threaded performance vs. IVB.

No. The only effect of having no forwarding is increased latency (for having to go through the register file). That's not a problem as long as you're executing instructions on the second port pair, that the next couple of instructions on the first port pair don't depend on. Keep in mind that it's fairly rare to achieve an IPC of more than 2 instructions anyway. And when you do, it's going to be because of plenty of independent instructions. For instance in a loop where the CPU starts executing multiple independent iterations. In those cases Haswell achieve an IPC of 4. Also note that Haswell increases every single buffer which affects out-of-order execution, further increasing the chances of finding independent instructions which can execute on the second port pair.

So I don't think eliminating forwarding between 0+1 and 5+6 would be a regression. It's going to prevent them from claiming a 33% single-threaded IPC improvement, but that appears to be exactly the case...

exar333 · Sep 12, 2012

Edit: Someone already posted the link...my mistake!

Interesting as well. 128MB on-package memory for GPU.

http://www.anandtech.com/show/6277/haswell-up-to-128mb-onpackage-cache-ulv-gpu-performance-estimates

exar333 · Sep 12, 2012

pelov said:
In synthetic benchmarks it was doubled, in real world gaming scenarios the figures weren't quite as high. You sort of expect that because of how it's going to depend on drivers and the diversity of titles that the GPU is tasked with.

Correct me if I'm wrong, but didn't Anand state that at most the improvement from IB was in the low teens and not 10%+? Let's not Bulldozer ourselves here. Despite the FPU improvements this processor looks to be mostly perf-per-watt and GPU centric. Not that I'm complaining, that's exactly what I wanted

Take a look at AT's IV review. Quicksync was more than 50% in many cases, but gaming was around 41% higher than SB. That's pretty close to 50%...

http://www.anandtech.com/show/5771/the-intel-ivy-bridge-core-i7-3770k-review/9

MisterMac · Sep 12, 2012

Why did anandtech take down the "crystalwell" article about 128mb onpackage mem for Gt3?

BenchPress · Sep 12, 2012

inf64 said:
What you describe is running a thread on a very fat core. This is not reverse HT... Reverse HT myth describes a situation where 2 separate cores run inherently serial single threaded workload

That difference is pretty artificial (and I don't recall ever seeing a strict definition of it). Why would two cores with two arithmetic ports each, cooperating to execute a single thread, count as reverse Hyper-Threading, while one with four ports (two pairs of symmetrical ones) that can run two threads that don't have to compete over ports, and can use all ports in case of a single thread, not count as reverse Hyper-Threading?

Your narrow definition of reverse Hyper-Threading is indeed a myth. It was never feasible to have two separate cores cooperate on a single thread. But the loose definition is to have a single thread use two sets of symmetrical execution resources, which are otherwise used by two threads. And that's pretty much what Haswell appears to be doing.

If you don't want to call it reverse Hyper-Threading, fine, but then I doubt we can call it a very fat core either. Very fat cores are not power efficient, and I don't see any signs of Haswell making any compromises on that front. On the contrary.

Being able to utilize newly available resource within the single core(like more exec. ports in Haswell) doesn't make it reverse HT capable. Intel just took their very wide core and made it even wider,that's all.

And have same cycle result forwarding between all four scalar arithmetic ports? Very doubtful.

Olikan · Sep 12, 2012

On the CPU side you can expect a ~10% increase in performance on average over Ivy Bridge. As always we'll see a range of performance gains, some benchmarks will show less and others will show more.

the rumor was sadly true than?

Pilum · Sep 12, 2012

BenchPress said:
I'm not sure about integer MUL, but yes, integer DIV is an exception. But note that everything else is identical, including branch!

If you look at SF12_ARCS001_100.pdf p. 12, it seems obvious to me there's no MUL on p6... I'm sure they would have added that in the diagram if it was there.

DIV isn't a problem because it's a high latency instruction anyway so its ok to not have its result forwarded in the same cycle.

Yeah, integer DIV is so rare anyway that latency doesn't matter much.

No. The only effect of having no forwarding is increased latency (for having to go through the register file). That's not a problem as long as you're executing instructions on the second port pair, that the next couple of instructions on the first port pair don't depend on. Keep in mind that it's fairly rare to achieve an IPC of more than 2 instructions anyway. And when you do, it's going to be because of plenty of independent instructions. For instance in a loop where the CPU starts executing multiple independent iterations.

I don't think that an average IPC of 2 is really rare on modern architectures; if that'd be true, Bulldozer should fare better in many workloads, but it gets creamed in single-threaded perf by SNB/IVB nearly everywhere.

And considering how much effort Intel puts into improving single-threaded perf, I don't see them taking the additional latency hit. You're right that there are many cases where increased latency doesn't matter; but I think that Intel doesn't care for the best or the average case, but for the worst case. Only the paranoid get excellent IPC. Of course, it depends on the additional latency. Is there any detailed information on the pipeline? If it's 1 extra cycle, no problem. If it's three, that could be a significant problem in some cases.

In those cases Haswell achieve an IPC of 4. Also note that Haswell increases every single buffer which affects out-of-order execution, further increasing the chances of finding independent instructions which can execute on the second port pair.

True, but the buffer increases are a few percent, they won't be able to compensate for drastic changes in the pipeline. Which reminds me Intel states in SF12_SPCS001_100.pdf p. 20: "No change in key pipelines". I think that a crippling of the forwarding network would constitute such a change.

So I don't think eliminating forwarding between 0+1 and 5+6 would be a regression. It's going to prevent them from claiming a 33% single-threaded IPC improvement, but that appears to be exactly the case...

I'm certain it would be a regression on some workloads. And IMO it just doesn't seem like Intels style (since Core2) to compromise on any aspect of IPC or single-threaded performance.

But we'll know in ~6 months, so we just need a little patience.

Pilum · Sep 12, 2012

MisterMac said:
Why did anandtech take down the "crystalwell" article about 128mb onpackage mem for Gt3?

I also couldn't access it at first, but it's available now.

LogOver · Sep 12, 2012

Olikan said:
the rumor was sadly true than?

It worth to mention, that 10% increase is expected on existing software. It's actually large gain, considering that these days the thing which is holding performance back is software and not hardware. Usually, no matter how wide your CPU is, the average IPC is low because of the way the most software is written. You still can extract few percent of performance by applying different heuristics (what Intel is doing right now) but you never will be able to get large performance increase on existing software by improving uarch only. The next performance gain will come from utilizing new instruction and rewriting software.

BenchPress · Sep 12, 2012

NTMBK said:
Somewhat pertinent to this discussion, here are the stats from the latest Steam hardware survey (August 2012) on what instruction extensions are available on users' PCs:

FCMOV100.00%0.00%
SSE2 99.74% +0.03%
NTFS 99.51% -0.01%
SSE3 99.07% +0.12%
SSE4.1 54.33% +1.74%
SSE4.2 41.64% +1.80%
HyperThreading 34.70% +1.06%
SSE4a 20.93% +0.15%

SSE4.1 is only in half of all gamers' PCs (assuming that all gamers use Steam, which isn't far from the truth). This has been in processors since Penryn, which shipped in 2007. Five years on, almost half of the gaming target audience cannot use those instructions. Why would a developer waste money on making specialised code branches for anything beyond SSE3?

And yet the things that benefit from SSE4, already have support for SSE4. Ever since SSE2 the improvements have been very minor though. Until AVX2.

So please stop looking at the past adoption rate of instruction set extension. Even the major SSE2 extension had a slow adoption rate because the 128-bit instructions were initially executed on 64-bit execution units!

Furthermore, I think it's silly to expect a significant number of applications to take advantage of more than four cores, but not AVX2. I can tell your first hand that scaling beyond four cores, without the help of TSX, gets very hard. In comparison it will be a breeze to take advantage of AVX2 to increase throughput. So rest assured that the developers who wish their application to run faster, will make use of AVX2 very quickly.

In other words, AVX2 is in most cases every bit as good as having twice the number of cores, if not better.

4th Generation Intel Core, Haswell summarized

Diamond Member

Lifer

Diamond Member

Diamond Member

Lifer

Diamond Member

Senior member

Senior member

Lifer

Member

Member

Lifer

Diamond Member

Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Senior member

Senior member

Platinum Member

Member

Member

Member

Senior member