David Kanter dissects Haswell

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Recompile all the programs!



Wonder if we should start pestering software companies now to actually do so; it can take quite a while for retail software.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
Recompile all the programs!



Wonder if we should start pestering software companies now to actually do so; it can take quite a while for retail software.

VS2012 and Intel compilers for example already support AVX2 code today.
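
For anyone wondering what that looks like in practice, here is a toy sketch using the AVX2 integer intrinsics (my own made-up example, nothing from the article; builds with something like gcc -mavx2 -std=c11):

// avx2_add.c - toy AVX2 intrinsics example
#include <immintrin.h>
#include <stdio.h>

// Add two arrays of 32-bit ints, eight lanes per instruction (vpaddd).
// n is assumed to be a multiple of 8 just to keep the sketch short.
static void add_i32(const int *a, const int *b, int *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
}

int main(void)
{
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    int out[8];
    add_i32(a, b, out, 8);
    printf("%d %d\n", out[0], out[7]);   // 9 9
    return 0;
}

The 256-bit integer add is the sort of thing that is actually new in AVX2; plain AVX only gave you 256-bit floating point.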
 

Makaveli

Diamond Member
Feb 8, 2002
5,026
1,624
136
Great article... a little too tech heavy for me right now, had a couple drinks.

But will make for a great read tomorrow at work!
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Recompile all the programs!



Wonder if we should start pestering software companies now to actually do so; it can take quite a while for retail software.
AVX2 has features that have been desired for quite some time now, and Intel has had the spec and software tools out for a while; so makers of content creation applications have already been able to make a decision about it, and have likely been working on adding support if they decided to.

There shouldn't be any need to pester anybody. Remember how Intel wanted everyone to follow their high-speed dream with NetBurst, and everybody more or less went, "Uh, Hell no, guys," and it wasn't Intel's finest hour? Yeah, well, Intel didn't repeat that mistake. AVX2 and HTM are examples of giving the customers what they want, plain and simple.

But there's also TSX and RTM.
Those will take a good bit of time. A compiler might be able to automatically elide a few locks, but generally you're going to need to carefully implement that kind of thing (on the bright side, careful is a matter of verifying correctness: very few code changes are needed for elision, and it shouldn't break anything on uarches that don't support it, x86 or not). And unlike AVX2, it is more of a future need as far as our client PCs go. HPC and big DB users could start taking advantage of it ASAP. Like 64-bit support almost 10 years ago (or half of what went into the 386, for that matter), it's a case of adding something before it's really needed.

Transactional memory has been known for at least a decade now as the best way to improve scaling without giving up the benefits of a lock-based system; but, sadly, while straightforward, it's not remotely simple or elegant. By around '07, pretty much all the hardware-level worries had been figured out, so now it goes into our CPUs. HPC users may start utilizing it within months of getting Haswell-based clusters, and the rest of us will find it trickling into our sync-limited multithreaded applications over the course of the next 5-10 years. We're not yet in desperate need of it, but when that time will come is a big question mark, and we can go ahead and make use of it now, so nobody wants to be late in supporting it.
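
For the curious, here is a minimal sketch of what the RTM fallback pattern tends to look like (my own toy example with made-up names, assuming GCC's -mrtm intrinsics and a CPU that actually has RTM; real code would check CPUID first):

// rtm_elide.c - toy lock-elision-style pattern, gcc -mrtm -std=c11 rtm_elide.c
#include <immintrin.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock_word = 0;   // 0 = free, 1 = held (the fallback lock)
static long counter = 0;           // the data the lock protects

static void lock_acquire(void)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock_word, &expected, 1))
        expected = 0;              // spin until we swap 0 -> 1
}

static void lock_release(void) { atomic_store(&lock_word, 0); }

static void increment(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Transactional path: only *read* the lock word, so threads taking
        // this path don't conflict with each other unless their data does.
        if (atomic_load(&lock_word) != 0)
            _xabort(0xff);         // someone holds the real lock: bail out
        counter++;
        _xend();
    } else {
        // Aborted (conflict, capacity, explicit abort...): take the real lock.
        lock_acquire();
        counter++;
        lock_release();
    }
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        increment();
    printf("counter = %ld\n", counter);   // 1000
    return 0;
}

The point is that the lock word is only read on the transactional path, so critical sections touching independent data don't serialize; any conflict just falls back to the plain lock, which is why so little existing code has to change.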
 

meloz

Senior member
Jul 8, 2008
320
0
76
Absolutely love David's work. This was another gem.

His article on TSX is still the best explanation I have found on the 'net.
 

2is

Diamond Member
Apr 8, 2012
4,281
131
106
I'm looking forward to Haswell. I don't think I'll be upgrading my desktop to one, but I'm long overdue for a new laptop.
 

blastingcap

Diamond Member
Sep 16, 2010
6,654
5
76
With CPU idle power usage getting lower and lower, it's time mobo makers caught up. It's sad to have such low-idling CPUs and then have the mobo and RAM and everything else eat up so much more wattage.
 

cytg111

Lifer
Mar 17, 2008
26,860
16,123
136
Recompile all the programs!



Wonder if we should start pestering software companies now to actually do so; it can take quite a while for retail software.


Hrmmm... one of these genius compiler guys should invent the binary compiler. Take a binary, say one targeted at the 386, and recompile it for a new arch.
Should be doable.
 

2is

Diamond Member
Apr 8, 2012
4,281
131
106
With CPU idle power usage getting lower and lower, it's time mobo makers caught up. It's sad to have such low-idling CPUs and then have the mobo and RAM and everything else eat up so much more wattage.

This is likely the reason Intel took matters into their own hands and will have the VRMs integrated into Haswell.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Hrmmm... one of these genius compiler guys should invent the binary compiler. Take a binary, say one targeted at the 386, and recompile it for a new arch.
Should be doable.
But, what benefit will it have? Only with SSE2 code would you get any benefit, and those programs are likely to have even better AVX2 support added in the near future anyway. For everything else it would be a chore, and probably not much better than having the OS thunk it instead, if it can't run it natively (pure 32-bit 386 code already runs quite well on modern Intel CPUs).

Interpreted and JIT VMs have been made to handle converting to new systems, but they simply don't have the benefit of the original source code's ASTs to help target the new machine optimally, and they must mimic the effects of every single instruction, in case a side effect was being used for some purpose (eliminating such cases should be possible, of course, but probably at a very high development and compile-time cost).
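
To make the "mimic every instruction" point concrete, here is a toy interpreter for a made-up two-instruction machine (my own sketch, nothing to do with any real translator): most programs never read the flags after an add, but the translator can't prove that, so it has to compute them every time.

// tiny_interp.c - toy sketch of why naive binary translation is expensive
#include <stdint.h>
#include <stdio.h>

enum { OP_ADD, OP_JZ, OP_HALT };          // made-up opcodes
struct insn { uint8_t op, dst, src; int8_t target; };

struct cpu { int32_t reg[4]; int zf; };   // registers plus a zero flag

static void run(struct cpu *c, const struct insn *code)
{
    for (int pc = 0; code[pc].op != OP_HALT; ) {
        const struct insn *i = &code[pc];
        switch (i->op) {
        case OP_ADD:
            c->reg[i->dst] += c->reg[i->src];
            c->zf = (c->reg[i->dst] == 0);   // side effect mimicked every time
            pc++;
            break;
        case OP_JZ:
            pc = c->zf ? i->target : pc + 1; // ...because someone might use it
            break;
        }
    }
}

int main(void)
{
    struct cpu c = { .reg = {5, -5, 0, 0} };
    struct insn prog[] = {
        { OP_ADD, 0, 1, 0 },   // reg0 += reg1 -> 0, sets the zero flag
        { OP_JZ,  0, 0, 3 },   // taken, because the flag happens to be set
        { OP_ADD, 0, 0, 0 },   // skipped
        { OP_HALT, 0, 0, 0 },
    };
    run(&c, prog);
    printf("reg0=%d zf=%d\n", c.reg[0], c.zf);   // reg0=0 zf=1
    return 0;
}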
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
But, what benefit will it have? Only with SSE2 code would you get any benefit, and those programs are likely to have even better AVX2 support added in the near future anyway. For everything else it would be a chore, and probably not much better than having the OS thunk it instead, if it can't run it natively (pure 32-bit 386 code already runs quite well on modern Intel CPUs).

Interpreted and JIT VMs have been made to handle converting to new systems, but they simply don't have the benefit of the original source code's ASTs to help target the new machine optimally, and they must mimic the effects of every single instruction, in case a side effect was being used for some purpose (eliminating such cases should be possible, of course, but probably at a very high development and compile-time cost).

Reminds me of DEC's FX!32 software.

Emulation has been around for a while as a concept, but FX!32 went one stage further. It analysed the way programs worked and, in real time, developed dynamic-link library (DLL) files of native Alpha code that the application could call upon the next time it ran.
 

cytg111

Lifer
Mar 17, 2008
26,860
16,123
136
Reminds me of DEC's FX!32 software.

Not a bad idea IMO, and in 'our' case it's not a totally different arch.

But, what benefit will it have?
- You don't have to wait for your favorite software vendor before you get any benefit from your new arch. Your software vendor may never get around to it, or may not even be in business anymore. There's a ton of scenarios where this makes sense IMO.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
I really don't understand the 10 watt thing. I have a Penryn CULV notebook that has a 10W CPU. It is a die shrink of a 65nm core originally designed about 8 years ago. In that time, we have quadrupled the transistor budget and reduced voltage by 20%. Everything seems to indicate we should be able to get 2.0GHz Core 2 CPU performance, plus i3-2310M GPU performance, from a 5 watt package. If Intel cannot raise the bar at least to that level after 8 years, they deserve to bleed another billion or two to Apple.
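
For what it's worth, the voltage part of that is easy to put a number on (my own back-of-envelope sketch, dynamic power only, leakage ignored):

// vscale.c - rough check of the voltage claim above
#include <stdio.h>

int main(void)
{
    double v_scale = 0.80;                // voltage reduced by 20%
    double p_scale = v_scale * v_scale;   // P ~ C * V^2 * f, so scale by V^2
    printf("dynamic power drops to %.0f%% of the original\n", p_scale * 100.0);
    return 0;                             // prints 64%
}

So the 20% voltage cut alone only gets you about a third of the power back; the rest has to come from somewhere else.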
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I really don't understand the 10 watt thing. I have a Penryn CULV notebook that has a 10W CPU. It is a die shrink of a 65nm core originally designed about 8 years ago. In that time, we have quadrupled the transistor budget and reduced voltage by 20%. Everything seems to indicate we should be able to get 2.0GHz Core 2 CPU performance, plus i3-2310M GPU performance, from a 5 watt package. If Intel cannot raise the bar at least to that level after 8 years, they deserve to bleed another billion or two to Apple.

The difference of course is the amount of work, i.e. compute, being done with those 10 watts.

Haswell at 10W will probably do 2-3x the total calculations per second that a 10W CULV Penryn would achieve.

Power is definitely a challenge, though. My 3770K, for example: when I underclock it to the lowest multi (16x) and optimize the voltage to be as low as possible while remaining stable under LinX, it still consumes 12-13W (at 1.6GHz, 0.636V, 36°C).

That means my 3770K could be clocked at 1.2GHz at most if it were to fit inside a 10W power envelope. It isn't easy to scale down, which is why Atom was created.
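
Back-of-envelope version of that estimate (my numbers from above; assumes active power scales roughly linearly with frequency at that fixed voltage, and ignores leakage and uncore power, so it's optimistic):

// fscale.c - scale the measured clock down to fit a 10W budget
#include <stdio.h>

int main(void)
{
    double measured_watts = 12.5;   // ~12-13W observed at the settings above
    double measured_ghz   = 1.6;    // 16x multi, 0.636V
    double budget_watts   = 10.0;   // target envelope

    double max_ghz = measured_ghz * (budget_watts / measured_watts);
    printf("max clock in budget ~ %.2f GHz\n", max_ghz);   // ~1.28 GHz
    return 0;
}

Which lines up with the roughly 1.2GHz figure above.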
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
This is one of the reasons I'm very excited about the new Atoms. For many tasks, C2D-level performance is fine, and I bet Atom will be able to reach lower power levels than Haswell will.

That being said, Haswell looks like it'll be fantastic for Ultraportables.
 

moonbogg

Lifer
Jan 8, 2011
10,736
3,454
136
He says he estimates Haswell to have 10% better performance than Sandy Bridge for current software. Does that mean 5% better than Ivy Bridge, then? I'm hoping for a beast gaming chip that will ruin my 3930K. I fear that won't happen.
 

fov001

Junior Member
Nov 15, 2012
24
0
0
Hey guys... I am new here, so please go easy on the judgment :)

Question: when will Haswell be available?

I'm about to purchase a set of SSDs (Samsung 840 Pro x2, 128GB, RAID 0) but would rather wait for this bad boy instead.
 

Haserath

Senior member
Sep 12, 2010
793
1
81
He says he estimates Haswell to have 10% better performance than Sandy Bridge for current software. Does that mean 5% better than Ivy Bridge, then? I'm hoping for a beast gaming chip that will ruin my 3930K. I fear that won't happen.

He mentions branch prediction improvements and memory improvements. Those are probably the two biggest enhancements.

We'll see what Haswell can do with clocks on 22nm, since Intel says they opted for execution speed over IC quality on IVB.
It would be nice to see them do something with cache sizes next time. That may at least improve things some more, even though it's probably not completely efficient. :p Though I did hear Haswell might have 1MB L2 caches, which would be kind of nice for us gamers.