
4th Generation Intel Core, Haswell summarized


BenchPress

Senior member
Nov 8, 2011
392
0
0
Yes, OpenCL running on the CPU is going to benefit from it very nicely. I'm interested to see some benchmarks comparing Haswell against GCN for raw OpenCL performance. I suspect that Haswell won't do that great, but that would be more because most OpenCL code won't play to the CPU's strengths in branchy code and prediction, and the algorithms will be optimised to minimise branching.
Here's something to give you an idea: OpenCL Accelerated Handbrake with AMD's Trinity. The APU doesn't stand a chance against Haswell.
I'm inclined to agree with you; autovectorization will go from essentially useless to being helpful in at least a handful of cases. But you still won't see anything close to the promised 2x, 4x, 8x performance gains unless developers go in and code these instructions by hand, and even then only in certain scenarios where you actually have 8 simultaneous data elements to work on (or 32, if you want to go across the cores). I remain dubious until I see some benchmarks of the improvements that autovectorisation actually delivers.
You're probably thinking about 'horizontal' vectorization. AVX2 is particularly suited for 'vertical' vectorization. To have 8 data elements to work on simultaneously, all you need is a scalar loop with independent iterations. Each AVX2 instruction can then execute the same scalar operation for 8 iterations simultaneously!

Pretty much all arithmetic intensive applications have performance critical loops like that. And there's no need to "code these instructions by hand". It's very straightforward for the compiler to perform this form of vectorization.
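To make that concrete, here is a minimal sketch of the kind of loop I mean (a made-up function, plain C++, no intrinsics); an AVX2-capable compiler with something like -O3 -mavx2 -mfma can vectorize it on its own:

#include <cstddef>

// Every iteration is independent, so the compiler can execute the same
// scalar operation for 8 floats per AVX2 instruction instead of 1 per
// scalar instruction: the 'vertical' vectorization described above.
void scale_and_add(float a, const float* x, float* y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // same scalar op, different data each iteration
}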
I can compare AVX2 to what came before it, because nobody will use it for years to come due to the market factors I already explained. No-one will start shipping code which won't run on the majority of their target audience's computers.
Yes they will. People who are the first to buy new hardware are also the first to buy new software. So developers have to implement support for new features early on or they'll lose an important piece of the market to the competition. Benchmarks of competing software products are typically released not long after the new hardware is released. So they can't afford to look bad.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Process maturation alone would imply higher overclocks, if not higher stock clock speeds themselves, and perhaps even both.

Good point! No GT3 on desktop quad i7s, if I understand correctly - they'll be GT2.

8-core Haswell will demolish the 10-core IB-EX :). But to get Haswell-EX you need to wait quite a while (Intel needs to launch and milk IB-EX first).

Yeah, I don't want to wait another two years, I don't think I can take it. :'(

TSX doesn't have anything to do with RAM. It provides transactional synchronization between the L1 caches.

Folding@home will absolutely use AVX2.

Duh, of course - anything read from or written to RAM will be in the L1$ for CPU operations - thanks! Oh, and awesome, I can play with TSX right away - so long as M$ VS supports it next year. Heck, maybe it's already there, haven't checked - I was playing with C++ AMP. :thumbsup:
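For reference, the RTM side of TSX is exposed as intrinsics in immintrin.h (_xbegin/_xend/_xabort), so a rough, untested sketch of the usual lock-elision pattern would look something like the below - the toy counter and names are just mine for illustration, and it needs a Haswell CPU plus a compiler with RTM support (an -mrtm style switch):

#include <immintrin.h>   // _xbegin / _xend / _xabort
#include <atomic>

std::atomic<bool> lock_taken{false};   // simple fallback spinlock
long shared_counter = 0;

// Try the critical section as a hardware transaction first; read the
// fallback lock inside the transaction so that a real lock holder
// forces an abort and we serialize behind it.
void increment()
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (lock_taken.load(std::memory_order_relaxed))
            _xabort(0xff);               // someone took the fallback path
        ++shared_counter;                // tracked in L1 until the commit
        _xend();
    } else {
        while (lock_taken.exchange(true, std::memory_order_acquire))
            ;                            // spin until the fallback lock is free
        ++shared_counter;
        lock_taken.store(false, std::memory_order_release);
    }
}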

Well, with F@H using it (and why not, since a compiler flag will do most of the heavy lifting), 4 HW cores will likely have around the same throughput as 8 IB-E cores, and for a lot less money (well, maybe not - I'll spend some extra dosh on a proper water-cooled system so that the H100 can stay with my Ci7 920 :) ).
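The back-of-the-envelope peak numbers support that, assuming the F@H kernels are FMA-friendly: an IB core can issue one 256-bit FP multiply plus one 256-bit FP add per cycle (16 single-precision FLOPs per clock), while a Haswell core's two 256-bit FMA units manage 2 x 8 x 2 = 32 FLOPs per clock. So at equal clocks, 4 HW cores match 8 IB-E cores on paper; real throughput will of course depend on clocks, memory, and how FMA-friendly the code actually is.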
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
So the 128MB cache is likely only going to be sitting on the desktop chips, which have the TDP headroom for it? I don't see it being possible that the 35W and 17W ULVs would get anywhere near that amount.

So a ~10% average CPU performance gain and a still crappy and expensive on-die GPU for the desktop? They weren't kidding about Haswell being mobile-focused :/
 

NTMBK

Lifer
Nov 14, 2011
10,456
5,843
136
And yet the things that benefit from SSE4 already have support for SSE4. Ever since SSE2 the improvements have been very minor though. Until AVX2.

So please stop looking at the past adoption rate of instruction set extensions. Even the major SSE2 extension had a slow adoption rate because the 128-bit instructions were initially executed on 64-bit execution units!

I fear you may be misunderstanding what those numbers are - those are not the percentage of applications using instruction set extensions, they are the percentage of active Steam users' PCs which can utilise those instruction set extensions. And given the decline in PC sales growth of late, I imagine that the rate at which end users get access to new ISEs will only get slower.

Furthermore, I think it's silly to expect a significant number of applications to take advantage of more than four cores, but not AVX2. I can tell you first hand that scaling beyond four cores, without the help of TSX, gets very hard. In comparison it will be a breeze to take advantage of AVX2 to increase throughput. So rest assured that the developers who want their application to run faster will make use of AVX2 very quickly.

In other words, AVX2 is in most cases every bit as good as having twice the number of cores, if not better.

The nature of large vector units and SIMD is that you need to be performing the same calculations across multiple chunks of data. This can also be done with separate cores, but separate cores can also perform entirely independent tasks.

I am not, however, especially advocating vastly increased core counts. In server chips, yes, of course - we need all the cores we can get there. But in consumer devices core count is not the most important thing. The single most useful improvement for everyday users is transparent speedup of existing applications - that is, increased clock speeds and increased IPC.


You're probably thinking about 'horizontal' vectorization. AVX2 is particularly suited for 'vertical' vectorization. To have 8 data elements to work on simultaneously, all you need is a scalar loop with independent iterations. Each AVX2 instruction can then execute the same scalar operation for 8 iterations simultaneously!

Pretty much all arithmetic intensive applications have performance critical loops like that. And there's no need to "code these instructions by hand". It's very straightforward for the compiler to perform this form of vectorization.

Yes, I am well aware of how vectorisation works, and how to use vector units. I write code full of __m128's for a living. And outside of matrix multiplications, there really are not that many situations with long loops of truly independent iterations. Of course I don't have access to a vast range of different codebases outside of my speciality, so I can't reliably predict how well AVX2 autovectorisation will perform. I reserve judgement for the time being, but I am certainly not pinning my hopes upon it.
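To be clear, the mechanics aren't the hard part - widening today's __m128 code to hand-written AVX2 looks something like this toy sketch (a made-up function, not from any real codebase, needing AVX2/FMA hardware and the matching compiler switches); the question is how often real code actually has loops shaped like this:

#include <immintrin.h>
#include <cstddef>

// Hand-vectorized y[i] = a*x[i] + y[i]: 8 floats per iteration with AVX2/FMA,
// versus 4 with the __m128/SSE code most of us ship today.
// Assumes n is a multiple of 8; a real version needs a scalar tail loop.
void scale_and_add_avx2(float a, const float* x, float* y, std::size_t n)
{
    const __m256 va = _mm256_set1_ps(a);
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);   // a*x + y in a single FMA
        _mm256_storeu_ps(y + i, vy);
    }
}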

Yes they will. People who are the first to buy new hardware are also the first to buy new software. So developers have to implement support for new features early on or they'll lose an important piece of the market to the competition. Benchmarks of competing software products are typically released not long after the new hardware is released. So they can't afford to look bad.

Developers will need to write code utilising these new extensions in addition to the existing code path(s) which give the same functionality (but slightly slower), or else the large majority of their potential users will be utterly unable to use their software. Repeat this every time Intel brings out a new ISE, and you have an utterly unmanageable hierarchy of different branches of code using different levels of instruction set for all the potential permutations. The costs for this are huge.

I do not deny that in certain very specific segments, this is indeed the case. Small but mission-critical algorithms that companies pay a lot of money to have run as fast as possible will indeed use any edge they can get over their rival developers. But the majority of software written does not bother with marginal performance increases for a tiny fraction of their potential market (who have the newest processors anyway, and hence are the least likely to complain about lack of performance). They will use an extension if the vast majority of computers in use support it, and if the only computers which do not support it would be too rubbish to run the software properly anyway. At the moment, that title belongs to SSE2 and arguably SSE3.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
If you look at SF12_ARCS001_100.pdf p. 12, it seems obvious to me there's no MUL on p6... I'm sure they would have added that in the diagram if it was there.
There's no mention of a scalar integer MUL on port 0 either. So there's not much we can conclude from it. Given the fairly high importance of it, I expect each thread to get its own MUL unit. But I guess it wouldn't be a deal-breaker if that wasn't the case either.
I don't think that an average IPC of 2 is really rare on modern architectures; if that were true, Bulldozer should fare better in many workloads, but it gets creamed in single-threaded perf by SNB/IVB nearly everywhere.
The average is actually typically lower than 2, as also pointed out by inf64. But execution is pretty 'bursty' so having a high peak IPC is indeed important.

Anyway, "rare" wasn't the best choice of word. What I really meant was that during these intermittent bursty periods a lack of same-cycle result forwarding between ports 0+1 and 5+6 probably matters less because you only achieve an IPC above 2 when there are few instruction dependencies anyway.
And considering how much effort Intel puts into improving single-threaded perf, I don't see them taking the additional latency hit. You're right that there are many cases where increased latency doesn't matter; but I think that Intel doesn't care for the best or the average case, but for the worst case. Only the paranoid get excellent IPC. Of course, it depends on the additional latency. Is there any detailed information on the pipeline? If it's 1 extra cycle, no problem. If it's three, that could be a significant problem in some cases.
Without any forwarding, the result has to be written into the register file and then read again by the next instruction that needs it. So I guess it could take two extra cycles. But note that since the port pairs are largely symmetric, there's a very good chance that a dependent instruction can be executed on the same pair in the next cycle. So critical paths in the code stick to the same pair of execution ports, while everything non-critical around it can execute on the other pair.
True, but the buffer increases are only a few percent; they won't be able to compensate for drastic changes in the pipeline. Which reminds me - Intel states in SF12_SPCS001_100.pdf p. 20: "No change in key pipelines". I think that crippling the forwarding network would constitute such a change.
No, that's about branch miss penalty. Even with no forwarding network at all it would still count as a pipeline with the same number of stages.
I'm certain it would be a regression on some workloads. And IMO it just doesn't seem like Intel's style (since Core 2) to compromise on any aspect of IPC or single-threaded performance.
Since instructions on the critical path would stick to the same pair of execution ports, I think it's extremely hard to come up with any workload that would suffer. Having a second pair of execution ports which can take pretty much any scalar integer operation also largely eliminates port contention. So even if, very rarely and very briefly, an instruction suffers a latency penalty, this wide architecture with big buffers can probably make up for it with fewer penalties in other parts of the code. And that's a realistic worst case. In the average case there should be an IPC increase.
 

NTMBK

Lifer
Nov 14, 2011
10,456
5,843
136
So the 128MB cache is likely only going to be sitting on the desktop chips, which have the TDP headroom for it? I don't see it being possible that the 35W and 17W ULVs would get anywhere near that amount.

The 128MB cache is on-package, not on-die - it's a separate chip under the heat spreader alongside the processor die. As such it can be entirely left out of chips which don't use it (such as the desktop parts).
 

Khato

Golden Member
Jul 15, 2001
1,295
376
136
So the 128MB cache is likely only going to be sitting on the desktop chips, which have the TDP headroom for it? I don't see it being possible that the 35W and 17W ULVs would get anywhere near that amount.

Uhhhh, you kinda have that backwards. Caches actually decrease power usage due to the simple fact that reading/writing to a cache uses far less power than doing the same to main memory. There's an interesting slide from IDF 2012 Beijing in a presentation on correctly sizing precision for power/efficiency (page 10 https://intel.activeevents.com/bj12...8E4BFA518C50B0238AD69900EF354CDC2B1FD9B56C80D ). It claims that reading 64 bits of data from DRAM consumes 4200 pJ, whereas a 64-bit multiply-add only takes 64 pJ, and a read/store of register data is only 6 pJ.

Now the 128 MB cache on Crystalwell is likely nowhere near as energy efficient as the on-die L3 cache, but if accessing it instead of system DRAM is even half the power then it's going to be a power saver.
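To put rough numbers on it: if an eDRAM access really does come in at half of that 4200 pJ figure, that's about 2100 pJ saved per 64-bit access. Assuming, say, 12.8 GB/s of traffic that would otherwise go to system memory (a number I picked purely for illustration), that's 1.6 billion 64-bit accesses per second, or roughly 3.4 W saved - a meaningful slice of a mobile power budget.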
 

Aikouka

Lifer
Nov 27, 2001
30,383
912
126
Good point! No GT3 on desktop quad i7s, if I understand correctly - they'll be GT2.

I really wish that Intel would consider the HTPC market. Some HTPC users don't want to game, but just want a potent media player that isn't restricted to whatever the manufacturer designed their set-top box to do. However, sometimes... users want a modest CPU with a modest GPU, and it sounds like Haswell could really deliver. (EDIT: This is possible now with a lower end dedicated GPU, but you'll most likely sacrifice size.)

Heck, I purchased the newly released i3-3225 earlier this week because it's the only low-end CPU that I could find with the HD4000. While an HTPC could benefit from the i7 S-series that also has the HD4000, I think that's overkill for an HTPC and costs over twice as much.

Although, maybe Intel would prefer that HTPC users go for its new 4x4 setup? If Intel released a 4x4 with Haswell + GT3 and I could get a passively cooled case for it, I'd be golden. Note that the case Intel showed off for the Ivy Bridge-based 4x4 did have what looked like fan grills.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
I really wish that Intel would consider the HTPC market. Some HTPC users don't want to game, but just want a potent media player that isn't restricted to whatever the manufacturer designed their set-top box to do.

The GPU architecture has enough improvements for the GT2 to be a significant improvement by itself. And as I said, GT3 will be clocked significantly lower (about 2/3rds), so the improvement won't be as drastic as we previously thought.

I think even with GT2 up to 50% gains are possible.
 

Hulk

Diamond Member
Oct 9, 1999
5,154
3,763
136
Don't see why you can't do that now.


Frame rate sync, deinterlacing performance, and a host of other image quality parameters are handled much better by discrete cards. Just have a look at some of the AnandTech testing.

So yes, you can do it now, but it's not a high-quality video solution.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
The ULT chip may not have the on-package DRAM. That may be why the performance gain is only 30%, rather than the 2x seen with higher-end (in this case quad-core) parts.

I've wondered how Intel will squeeze the Haswell CPU/MCH, plus a PCH, plus separate DRAM onto the CPU package. It looks like the eDRAM part won't happen.
 

2is

Diamond Member
Apr 8, 2012
4,281
131
106
Frame rate sync, deinterlacing performance, and a host of other image quality parameters are handled much better by discrete cards. Just have a look at some of the AnandTech testing.

So yes, you can do it now, but it's not a high-quality video solution.

Interesting. Do you have a link to that review? I'm curious to see which scenarios it falls behind in, though it seems at least some of those problems could simply be driver-related. I can't say I've personally noticed an issue. I have the IGP enabled on my 3770K and am driving a 1080p TV via HDMI. I've really only used it for sharing YouTube videos in a room full of people and it seemed to do fine there. I'm currently downloading Apple's keynote for the iPhone 5 in 1080p format and will be viewing it on the TV as well.
 

LogOver

Member
May 29, 2011
198
0
0
Interesting. Do you have a link to that review? I'm curious to see which scenarios it falls behind in, though it seems at least some of those problems could simply be driver-related. I can't say I've personally noticed an issue. I have the IGP enabled on my 3770K and am driving a 1080p TV via HDMI. I've really only used it for sharing YouTube videos in a room full of people and it seemed to do fine there. I'm currently downloading Apple's keynote for the iPhone 5 in 1080p format and will be viewing it on the TV as well.

http://www.anandtech.com/show/5773/intels-ivy-bridge-an-htpc-perspective

Seems like HD4000 is one of the best options for HTPC (for now).
 

nategator

Junior Member
Sep 3, 2011
16
0
0
Based on everything I've seen, we're looking at up to 10% performance increases on the CPU side - maybe less once third-party benchmarks are run. So if you already have an SB CPU and a decent discrete graphics card in your PC, is there any reason to get a Haswell chip?
 

Yuriman

Diamond Member
Jun 25, 2004
5,530
141
106
For general use? Probably not. If you have a specific program that will benefit from Haswell's optimizations? Probably.

I'm betting that emulators compiled to take advantage of AVX2 will really fly on Haswell, for instance. Dolphin already makes good use of AVX.

Don't forget to factor in the power (and heat) savings. Due to the incredibly low idle power use, your CPU fan can probably just turn off most of the time. This is worth something to some people.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I've wondered how Intel will squeeze the Haswell CPU/MCH, plus a PCH, plus separate DRAM onto the CPU package. It looks like the eDRAM part won't happen.

128MB of DRAM would be a rather tiny chip to add to the interposer. If it gets nixed, it will be because Intel wants to save $1 per CPU and not because of space considerations.

Consider their change in TIM for IB vs SB: it's all about the margins, and if Haswell doesn't need 2x the performance, they won't spend the money to deliver 2x the performance.
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,692
136
15-25% according to CPU-world sources.
GT2 in Haswell will be roughly 40% faster than the top GPU in IB on the desktop. For mobile, the difference (with GT3) may even come close to 2x.
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,692
136
It's just simple math: 20 improved EUs (more IPC) at a higher clock versus 16 EUs in IB. Summed up: 20/16 x 1.15 = 1.44, or roughly 40%. The 15% is the combined effect of the higher clock (1.2GHz vs 1.15GHz) and more IPC (Intel claims the new EU is more efficient than the old one).
 

mikk

Diamond Member
May 15, 2012
4,308
2,395
136
It's just simple math: 20 improved EUs (more IPC) at a higher clock versus 16 EUs in IB. Summed up: 20/16 x 1.15 = 1.44, or roughly 40%. The 15% is the combined effect of the higher clock (1.2GHz vs 1.15GHz) and more IPC (Intel claims the new EU is more efficient than the old one).


The simple math didn't work for IVB and I bet it won't work for Haswell either, because the bandwidth limitation keeps getting bigger. It might work if Intel increased the bandwidth by a similar amount, but not with just DDR3-1600. The GPU compute power in GFLOPS might increase in the ~50% range, but not the average performance in games.
 

cytg111

Lifer
Mar 17, 2008
26,232
15,641
136
Seeing how BP presses the importance of AVX2, I've been googling around a bit.
See

http://mail.openjdk.java.net/pipermail/jdk8-dev/2011-December/000534.html

That is pretty straightforward: when an update is ready for Java (C# and others too, I'm sure), we shall see exactly what AVX2 brings to the table... For these exact test cases we should see up to an 8x increase in performance going from one small JDK revision to the next.
If that holds true, consider my mind blown.
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,226
590
126
Was there any info on Broadwell at the IDF?

Didn't they mention Haswell already at last year's IDF?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
128MB of DRAM would be a rather tiny chip to add to the interposer. If it gets nixed, it will be because Intel wants to save $1 per CPU and not because of space considerations.

It's pretty big if you believe this: http://assets.vr-zone.net/15272/dieimage.jpg

Given the amount of effort they are putting into pushing Ultrabooks, there must be more reason than just cost behind the gains being only 30% despite having GT3.