Intel "Haswell" Speculation thread


BenchPress

Senior member
Nov 8, 2011
I think it's possible that they just don't. When they were talking about the added execution port, they only mentioned that it frees ports 0 and 1 for FMA, not that it would increase IPC for existing integer code. (Which, if it were fully connected, it absolutely would. Why not advertise it?)

This leads me to believe that perhaps it doesn't forward to/from 0 and 1, and just exists so that loop counters, branches and such can be managed while 0 and 1 are dedicated to vector loads.
According to ARCS001 slide 12, ports 0+1 and 6+5 are symmetric when it comes to scalar integer operations and branch. So it seems possible that there's no forwarding between these pairs. Perhaps there's some instruction dependency analysis going on before scheduling, so that dependent ones are dispatched to the same pair of ports. With Hyper-Threading it's trivial to know which are independent. That way the second branch unit also starts to make a lot more sense...

It would also mean the IPC gain for single-threaded code would be minimal or even non-existent if there weren't other improvements. Hence they wouldn't advertise it. Heck, it would bear some resemblance to Bulldozer. :hmm:
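To picture the scenario being described, here is a hypothetical example (mine, not from the post): a vectorized multiply-accumulate loop whose only scalar work is the loop counter and the branch. The intrinsics are real AVX2/FMA intrinsics, but the port assignments in the comments are this thread's speculation, not confirmed Haswell behavior.

```c
#include <immintrin.h>
#include <stddef.h>

void fma_loop(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {     /* increment + compare: scalar integer ops */
        __m256 vb = _mm256_loadu_ps(b + i);      /* loads: AGU/load ports                   */
        __m256 vc = _mm256_loadu_ps(c + i);
        __m256 va = _mm256_loadu_ps(a + i);
        va = _mm256_fmadd_ps(vb, vc, va);        /* FMA: the work ports 0 and 1 would do    */
        _mm256_storeu_ps(a + i, va);             /* store                                   */
    }                                            /* loop branch: what a separate branch/ALU
                                                    port could absorb, per the speculation  */
}
```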
 

BenchPress

Senior member
Nov 8, 2011
Engineer 1: Add Port 6
Engineer 2: No
Engineer 1: Do it, it's really awesome if we do
Engineer 2: Ok

:D

Edit: So on one hand you make a complicated forwarding network to simplify the scheduler (it doesn't have to care which ALU forwards to which ALU)... or you have a simple network and a complicated scheduler. Just saying you have to pick the lesser of two evils.
If the added latency of having no bypass between 0+1 and 5+6 at all is small enough for that to be an option, then I believe the scheduler doesn't have to get any more complicated. It already has to be able to issue instructions which have operands coming from the register file anyway.

But that's a big 'if' of course. I don't know what the latency is for writing a PRF and reading from it again these days.
 

TuxDave

Lifer
Oct 8, 2002
If the added latency of having no bypass between 0+1 and 5+6 at all is small enough for that to be an option, then I believe the scheduler doesn't have to get any more complicated. It already has to be able to issue instructions which have operands coming from the register file anyway.

But that's a big 'if' of course. I don't know what the latency is for writing a PRF and reading from it again these days.

I want to write a bunch.... but I can't. You're really overemphasising a reduction in design complexity in exchange for a lot of architectural complexity (the scheduler WILL have a bad day) AND a potential performance hit. And Intel has a super awesome design team. Just saying...
 

BenchPress

Senior member
Nov 8, 2011
I want to write a bunch.... but I can't. You're really overemphasising a reduction in design complexity in exchange for a lot of architectural complexity (the scheduler WILL have a bad day) AND a potential performance hit. And Intel has a super awesome design team. Just saying...
Thanks for the hints. I read up on (unified) scheduler design as much as I could find and I now realize how uniform bypass latencies indeed keep things way simpler.

It still seems like a big feat to have four integer execution ports which could all execute instructions back-to-back. Has that even been done before? I believe it either sacrifices clock speed, or increases power consumption, or the "super awesome design team" has outdone itself and maximized the potential of the 22 nm process to squeeze it all in with no major compromises. Haswell is very power efficient so I guess that narrows it down.

One other suggestion I've stumbled upon is to use a form of width pipelining to save on bypass time. But maybe I'm making things more complicated than they have to be again. I'm just baffled by the addition of another arithmetic execution port, and hope it doesn't come at a significant cost.
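For readers unfamiliar with the term, here is a toy model of what width pipelining could look like for a 64-bit add: the operation is split into two 32-bit halves produced on consecutive cycles, so the critical bypass path only has to carry 32 bits plus a carry per cycle. This is only a sketch of the idea as I read it, not a description of any real Haswell datapath.

```c
#include <stdint.h>

typedef struct {
    uint32_t lo;      /* available after "cycle 1" */
    uint32_t hi;      /* available after "cycle 2" */
} split64;

split64 width_pipelined_add(uint64_t a, uint64_t b)
{
    split64 r;
    /* cycle 1: low 32-bit half, produce the carry */
    uint64_t lo_sum = (uint64_t)(uint32_t)a + (uint32_t)b;
    r.lo = (uint32_t)lo_sum;
    uint32_t carry = (uint32_t)(lo_sum >> 32);
    /* cycle 2: high 32-bit half consumes the carry */
    r.hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32) + carry;
    return r;
}
```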
 

BenchPress

Senior member
Nov 8, 2011
How about this idea: Most arithmetic code is 32-bit, while 64-bit is mainly used for pointers and thus typically not on a critical path. So it would probably work out fine if 32-bit arithmetic had a latency of 1 cycle and 64-bit arithmetic had a latency of 2 cycles.

Unlike making the bypass latency between ports 0+1 and 5+6 longer, I don't think it would complicate scheduling. The latency of the operations wouldn't depend on which ALU they're coming from. It would just depend on their width.

One complication is that you can't have a 64-bit operation started two cycles ago be completed at the same time as a 32-bit operation started a cycle ago. So a 64-bit operation can only be followed by an independent 64-bit operation. But given that each execution port has a twin, that doesn't seem like an issue either!

The end result would be that Haswell doesn't have to sacrifice clock speed, and IPC could be slightly higher!
 

Nemesis 1

Lifer
Dec 30, 2006
Even though I really like what Haswell brings to the table as far as compute goes, it also brings a heavy heart. I see Broadwell as the end of the big case sitting alongside your desk. Haswell really is a step forward, and sadly we as a community have to move with that step. It looks like that Dick Tracy watch is near. Along with new compute power comes new form. Haswell accelerates that new form factor change. Broadwell brings an end to cabinet PCs.
 

Tuna-Fish

Golden Member
Mar 4, 2011
How about this idea: Most arithmetic code is 32-bit, while 64-bit is mainly used for pointers and thus typically not on a critical path.

This is not true at all. Pointers are very much on the critical path -- in fact, since pointer operations are often followed by loads, and since the CPU scheduler often empties while waiting for a load (even an L2 hit takes so long that the execution units can usually clear the scheduler), every cycle you delay a pointer operation (and the issuing of the load that follows) means a completely lost cycle of execution for all execution units. Combine that with code that really likes objects (so practically every pointer access involves at least an add), and 2-cycle 64-bit ops would be a disaster.
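A minimal illustration of this point, using a hypothetical example of my own: in a pointer-chasing loop, the pointer update and the load it feeds form a serial dependency chain, so any extra latency on the pointer arithmetic is paid on every single iteration.

```c
#include <stddef.h>

struct node {
    int          payload;
    struct node *next;
};

long sum_list(const struct node *p)
{
    long sum = 0;
    while (p) {
        sum += p->payload;   /* address math + load for the field            */
        p = p->next;         /* dependent load feeding the next iteration    */
    }                        /* the next trip cannot start until p arrives   */
    return sum;
}
```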
 

cytg111

Lifer
Mar 17, 2008
"broadwell brings an end to cabinet pcs"

- the second the magic code fairy reveals the one true language to parallelize them all is the second it really becomes all about moar coars. The second after that, the big-ass power-venting cabinets will be back.
Wait for it ... wait for it ...
 

Ajay

Lifer
Jan 8, 2001
One complication is that you can't have a 64-bit operation started two cycles ago be completed at the same time as a 32-bit operation started a cycle ago. So a 64-bit operation can only be followed by an independent 64-bit operation. But given that each execution port has a twin, that doesn't seem like an issue either!

Why not? Especially since "Increasing size of buffers internally, giving us larger OoO window" from the AT Blog. So as long as the 64b & 32b operations are independent, I see no problem.

Again, my experience @ the ISA level is with RISC (mainly i860/960, PPC 750 & 32b Mips). Given the load/store nature of RISC, I would think that 64b would be in the critical path, but I don't know much about CISC architectures (aside from what I've already forgotten from my MPU Design & Comp Arch classes :$).
 

BenchPress

Senior member
Nov 8, 2011
This is not true at all. Pointers are very much on the critical path -- in fact, since pointer operations are often followed by loads, and since the CPU scheduler often empties while waiting for a load (even an L2 hit takes so long that the execution units can usually clear the scheduler), every cycle you delay a pointer operation (and the issuing of the load that follows) means a completely lost cycle of execution for all execution units. Combine that with code that really likes objects (so practically every pointer access involves at least an add), and 2-cycle 64-bit ops would be a disaster.
Thanks for pointing that out, but note that anything that fits the (many) x86 addressing modes will be executed by the AGUs at ports 2 and 3, with no added latency. So I don't think having other 64-bit arithmetic take two cycles would be anywhere near a disaster. Having an extra execution port should offset that and even offer higher IPC.
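As a hedged illustration of the addressing-mode argument (the example code is mine, not from the thread): the field access below involves base + index*scale + displacement arithmetic, and compilers typically fold all of it into the load's addressing mode, e.g. something like mov eax, dword ptr [rdi + rsi*8 + 4] on x86-64, so the AGU on a load port does the math instead of a general-purpose ALU.

```c
/* Hypothetical example: indexing an array of 8-byte structs and reading a
 * field at offset 4, pointer math that an x86 addressing mode can absorb. */
struct pair { int key; int value; };

int get_value(const struct pair *arr, long i)
{
    return arr[i].value;   /* base + i*8 + 4, all inside one addressing mode */
}
```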
 

BenchPress

Senior member
Nov 8, 2011
Why not? Especially since "Increasing size of buffers internally, giving us larger OoO window" from the AT Blog. So as long as the 64b & 32b operations are independent, I see no problem.
The problem is that if you start a 1-cycle operation on the same port right after a 2-cycle operation, the results would be ready on the same cycle. And you simply can't send two results down the same result bus. So the 1-cycle operation simply has to wait, or execute on another port. Starting another 2-cycle operation on the same port is no problem though, so there's no loss of throughput. And as I noted before, having 'twin' ports makes port contention between 1-cycle and 2-cycle operations very unlikely.
Again, my experience @ the ISA level is with RISC (mainly i860/960, PPC 750 & 32b Mips). Given the load/store nature of RISC, I would think that 64b would be in the critical path, but I don't know much about CISC architectures (aside from what I've already forgotten from my MPU Design & Comp Arch classes :$).
Indeed, for RISC this would be a problem. But with CISC the majority of pointer arithmetic is part of the addressing mode, which can be executed by an independent AGU instead of requiring the generic ALUs.
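Here is a tiny sketch of the write-back constraint described above, under the stated assumption of one result bus per port. It is a toy model of the proposed scheduling rule, not of actual Haswell hardware: a 1-cycle op issued the cycle after a 2-cycle op on the same port would try to complete in the same cycle, while another 2-cycle op would not.

```c
#include <stdio.h>

int main(void)
{
    /* latencies of ops issued on one port in consecutive cycles */
    int lat[] = { 2, 1, 2, 2, 1 };
    int n = (int)(sizeof lat / sizeof lat[0]);
    int prev_done = -1;    /* completion cycle of the previous op on this port */

    for (int issue = 0; issue < n; issue++) {
        int done = issue + lat[issue];
        if (done == prev_done)
            printf("op %d (lat %d, issued cycle %d): completes cycle %d -> "
                   "collides with the previous op on the result bus\n",
                   issue, lat[issue], issue, done);
        else
            printf("op %d (lat %d, issued cycle %d): completes cycle %d -> ok\n",
                   issue, lat[issue], issue, done);
        prev_done = done;
    }
    return 0;
}
```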
 

Cerb

Elite Member
Aug 26, 2000
Even though I really like what Haswell brings to the table as far as compute goes, it also brings a heavy heart. I see Broadwell as the end of the big case sitting alongside your desk. Haswell really is a step forward, and sadly we as a community have to move with that step. It looks like that Dick Tracy watch is near. Along with new compute power comes new form. Haswell accelerates that new form factor change. Broadwell brings an end to cabinet PCs.
No, that will take a while yet. The primary reason for our boxes, today, is not anything that Haswell removes. We have them for drives, special bay devices, and expansion cards.

AIOs will only get more popular, and even power users will use MicroATX more and more, so our cabinets will shrink, and low-profile cards will become ever more popular as well, but it will be quite some time yet before they go away.
 

BenchPress

Senior member
Nov 8, 2011
By the way, not all 64-bit operations would have to take 2 cycles. Bitwise operations are trivial so they can still be 1 cycle.

But having things like 64-bit addition/subtraction and shift/rotate take 2 cycles would make it a lot easier to have enough time to bypass the results between four ALUs without having to lower the clock frequency. And that would mean there's hope for overclockers after all!
 

Revolution 11

Senior member
Jun 2, 2011
No, that will take a while yet. The primary reason for our boxes, today, is not anything that Haswell removes. We have them for drives, special bay devices, and expansion cards.

AIOs will only get more popular, and even power users will use MicroATX more and more, so our cabinets will shrink, and low-profile cards will become ever more popular as well, but it will be quite some time yet before they go away.
People have been saying for years (decades?) that the desktop is going to die. It never does. Granted, the market is slowly declining and in a mature state, but I think there will always be room for desktop PCs. Scaling up performance is much harder in a compact form (tablet/laptop) for several reasons already mentioned.
 

Ajay

Lifer
Jan 8, 2001
The problem is that if you start a 1-cycle operation on the same port right after a 2-cycle operation, the results would be ready on the same cycle. And you simply can't send two results down the same result bus. So the 1-cycle operation simply has to wait, or execute on another port. Starting another 2-cycle operation on the same port is no problem though, so there's no loss of throughput. And as I noted before, having 'twin' ports makes port contention between 1-cycle and 2-cycle operations very unlikely.

OK, I think I must have misread something. You are talking about pushing two instructions down one port, a 2-cycle op and a 1-cycle op - so you wind up with a pipeline hazard. Am I following you correctly now? And thanks for the link to pipeline widening. I understand the P4 architecture much better now (and I remember going over this in my Comp. Arch. class).

So, is the L1 cache ported 8 ways so that the CPU can sustain 8 single cycle ops? [at least for bursty code, I realize it can't be sustained since L2$ and L3$ can't keep up, never mind main memory]

Thanks.
 

TuxDave

Lifer
Oct 8, 2002
So, is the L1 cache ported 8 ways so that the CPU can sustain 8 single cycle ops? [at least for bursty code, I realize it can't be sustained since L2$ and L3$ can't keep up, never mind main memory]

It doesn't have to be. You still have physical register files for the next hierarchy of memory. And if you want to go balls to the wall of design, remember that each op has multiple sources and so by my math you'll actually need far more than 8 ways total.
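A back-of-the-envelope version of this point, with assumed numbers rather than disclosed Haswell figures: even a modest count of sources per op across four arithmetic ports adds up to far more operand reads per cycle than a couple of L1D ports could supply, which is why most operands come from the register file and the bypass network.

```c
#include <stdio.h>

int main(void)
{
    int alu_ports      = 4;  /* arithmetic ports (0, 1, 5, 6)                    */
    int srcs_per_op    = 3;  /* assumed worst case, e.g. an FMA reads 3 sources  */
    int l1d_read_ports = 2;  /* assumed L1D read ports                           */

    int operand_reads = alu_ports * srcs_per_op;
    printf("worst-case operand reads per cycle: %d\n", operand_reads);
    printf("that already exceeds 8, and the L1D has only %d read ports,\n"
           "so most sources must come from the register file or bypass network\n",
           l1d_read_ports);
    return 0;
}
```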
 

Ajay

Lifer
Jan 8, 2001
It doesn't have to be. You still have physical register files for the next hierarchy of memory. And if you want to go balls to the wall of design, remember that each op has multiple sources and so by my math you'll actually need far more than 8 ways total.

Thanks! Seems like Intel will need a balls-to-the-wall design to hit their targets on AVX2 performance, at least. I suppose there could be fewer (or equal) ports to the register files, but AVX ops could be prioritized in the scheduler. I should read RW's SB uArch overview to get a clear picture of what existed b/4 Haswell, so I can make more sense of this. It's just that, from a top-level view, Haswell is the first CPU in recent history that has piqued my interest (since Core 2).
 

CHADBOGA

Platinum Member
Mar 31, 2009
It's just that, from a top-level view, Haswell is the first CPU in recent history that has piqued my interest (since Core 2).

Haswell is the proper next gen over Core 2.

Everything since Core 2 has just been a Core 2 derivative, so Haswell sets the CPU platform for Intel's performance-oriented line of CPUs for the next 5 to 6 years.
 

BenchPress

Senior member
Nov 8, 2011
OK, I think I must have misread something. You are talking about pushing two instructions down one port, a 2-cycle op and a 1-cycle op - so you wind up with a pipeline hazard. Am I following you correctly now?
Yes. The scheduler would be prevented from issuing a 1-cycle operation right after a 2-cycle operation, to avoid this hazard.

But because Haswell's port 6 can execute the same operations as port 0, and port 5 the same operations as port 1, blocking the scheduler shouldn't have much of an effect, at least not until it runs out of 64-bit instructions. At that point you lose a cycle and the port can start taking 32-bit instructions. But it should still be better than having only three execution ports, which suffer from more contention.

Anyway, I'm just theorizing out loud. Purely single-threaded workloads that are performance-critical have become very rare. So they may just have solved the increased bypass latency problem by lowering the clock frequency a tad and relying on a sufficient increase in IPC for other workloads. With 33% extra execution ports and increased out-of-order execution buffers that shouldn't be too hard.
And thanks for the link to pipeline widening. I understand the P4 architecture much better now (and I remember going over this in my Comp. Arch. class).
You're welcome. I've only just recently discovered that myself, actually. Oh, and it's "width pipelining", not "pipeline widening". The latter actually refers to increasing the number of (micro-)instructions the overall CPU pipeline can handle. Width pipelining is done at the ALU level.
So, is the L1 cache ported 8 ways so that the CPU can sustain 8 single cycle ops?
It can probably access 8 cache banks per cycle, to efficiently support the gather operation, but that's not the same thing as having 8 ports. Sandy Bridge could access 6 banks per cycle...

Anyway, I think you're confused, so to avoid any more confusion let me quickly recap: Haswell has 8 execution ports, but only 4 of them are for arithmetic operations. So even if every instruction accesses memory, you need 4 cache ports tops, not 8. Haswell has two read ports and one write port. But a lot of instructions only use registers, not memory accesses. The register file has lots of ports, and the bypass network also supplies many operands that are the result of a recently executed instruction.

The reason the cache has many banks and few ports is that each bank can only service one memory operation per cycle. Trying to do more than one results in a bank conflict. The slides on Haswell claim it avoids all bank conflicts (for aligned data), and the cache line size remained the same, which means each 64-byte cache line must be split across 16 32-bit banks. This fits in with the gather support, if they increased the number of simultaneous bank accesses to 8. The reason Sandy Bridge could already access 6 64-bit banks was to efficiently support unaligned 128-bit accesses.
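To make the banking speculation concrete, here is a small sketch under the assumptions stated above (16 banks of 32 bits each, bank chosen by the low address bits); none of this is confirmed Haswell behavior. A gather of 8 elements would be conflict-free exactly when its addresses fall into 8 distinct banks.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS 16u                 /* assumed: 16 x 32-bit banks        */

static unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr >> 2) % NUM_BANKS);   /* 4-byte granularity    */
}

int gather_is_conflict_free(const uintptr_t addr[8])
{
    unsigned used = 0;                /* bitmask of banks already touched  */
    for (int i = 0; i < 8; i++) {
        unsigned b = bank_of(addr[i]);
        if (used & (1u << b))
            return 0;                 /* two lanes hit the same bank       */
        used |= 1u << b;
    }
    return 1;
}

int main(void)
{
    uintptr_t stride4[8], same[8];
    for (int i = 0; i < 8; i++) {
        stride4[i] = 0x1000 + 4u * i;  /* consecutive 32-bit elements      */
        same[i]    = 0x1000 + 64u * i; /* same bank every 64 bytes         */
    }
    printf("unit-stride gather conflict-free: %d\n", gather_is_conflict_free(stride4));
    printf("64-byte-stride gather conflict-free: %d\n", gather_is_conflict_free(same));
    return 0;
}
```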
 

Nemesis 1

Lifer
Dec 30, 2006
"broadwell brings an end to cabinet pcs"

- the second the magic code fairy reveals the one true language to parallelize them all is the second it really becomes all about moar coars. The second after that, the big-ass power-venting cabinets will be back.
Wait for it ... wait for it ...

Ya, I was referring to PC cabinets, not server PCs. Hell, as good as Haswell sounds, the next shrink brings what exactly? You're correct, the servers are going to need smaller leak-proof cabinets in the future.

Wait for it . Wait for it.
 

Ajay

Lifer
Jan 8, 2001
It can probably access 8 cache banks per cycle, to efficiently support the gather operation, but that's not the same thing as having 8 ports. Sandy Bridge could access 6 banks per cycle...

Anyway, I think you're confused, so to avoid any more confusion let me quickly recap: Haswell has 8 execution ports, but only 4 of them are for arithmetic operations. So even if every instruction accesses memory, you need 4 cache ports tops, not 8. Haswell has two read ports and one write port. But a lot of instructions only use registers, not memory accesses. The register file has lots of ports, and the bypass network also supplies many operands that are the result of a recently executed instruction.

The reason the cache has many banks and few ports is that each bank can only service one memory operation per cycle. Trying to do more than one results in a bank conflict. The slides on Haswell claim it avoids all bank conflicts (for aligned data), and the cache line size remained the same, which means each 64-byte cache line must be split across 16 32-bit banks. This fits in with the gather support, if they increased the number of simultaneous bank accesses to 8. The reason Sandy Bridge could already access 6 64-bit banks was to efficiently support unaligned 128-bit accesses.

Thanks for that, very helpful :thumbsup:

Another question I have - would it be worth it to Intel to increase pipeline lengths a bit to maintain higher clocks (mainly for marketing) though with a higher branch misprediction penalty? Does Haswell improve on branch prediction or do anything to reduce the penalty? I read something about Haswell and branch prediction, but I can't find it (may have been speculation anyway).
 

Edrick

Golden Member
Feb 18, 2010
Even though I really like what Haswell brings to the table as far as compute goes, it also brings a heavy heart. I see Broadwell as the end of the big case sitting alongside your desk. Haswell really is a step forward, and sadly we as a community have to move with that step. It looks like that Dick Tracy watch is near. Along with new compute power comes new form. Haswell accelerates that new form factor change. Broadwell brings an end to cabinet PCs.

Want to bet on this?

I don't know what you have been smoking lately, but you have been making these grand assumptions on the forums with not much fact to back them up.

What does shrinking a CPU process have to do with PC cases? They still run hot (think IB) and need proper cooling. And we will be seeing 8-core desktop CPUs soon, which will require a well-vented PC case. Also, as 4K monitors become common, graphics cards will need to continue to grow in power, which in turn puts out a lot of heat. And as SSD prices keep coming down, PCIe SSDs will become commonplace in the next 2-3 years, bypassing the SATA bottleneck, which will require more case space.

So please think of the larger picture.
 

Cerb

Elite Member
Aug 26, 2000
What does shrinking a CPU process have to do with PC cases? They still run hot (think IB) and need proper cooling.
IB CPUs just have crappy TIM. That doesn't make them hotter, when it comes to cooling the enclosure (1 Watt = 1 Watt, though degC may vary). Indirect cooling of anything but the fastest CPUs out there has been doable for years, now, and can be done affordably, today, even without water. If you want to go all out, you could use a Scythe Ninja or TR HR-02 to cool most OCed CPUs, without even having a fan on the heatsink, and without many loud fans forcing lots of CFM through the case (well, depending on the case--in an SFF you might have to force things a bit more). It would take some care, especially with the cheaper Ninja, but it's doable. You've basically got to pull the kind of airflow control tricks that OEMs do.

An i3 would be no trouble to cool with only a quiet case fan, even in a cramped SFF case (say, 2-3x the volume of a nettop). You end up wanting or needing a regular computer case even if you have passive CPU cooling and a passive power supply, just because USB can't always do everything. If we had an external interface as cheap as USB, while being as performant as PCIe 4-8x (especially if it could be ganged across more cables), then we could start doing tiny cases, with whole-case HSFs (IE, like notebooks) and modular multi-box systems all over the place, as almost everything could be a universally changeable external peripheral.

The cooling issues we have are mostly matters of shaving pennies and of maintaining component interchangeability without loss of reliability, and less matters of real technical difficulty, except for the fastest and hottest CPUs. The Dick Tracy watch may be far off, but a nettop- or Mac Mini-sized powerful computer is not commonly available more because of a lack of sufficient demand than because it is too hard to do.

How many people want a computer smaller than a Shuttle type, with one or no PCIe slots, limited panel IO (there's just not room!), and then having to rely mainly on USB for every non-special-function device added after the first two SATAs? OK, now of those people, how many are willing to pay more than for a standard ATX setup in order to get less system flexibility? A small percentage of Apple PC customers, and anyone buying a notebook/tablet/etc.

If I'm paying more, I want faster PCIe (more lanes), more PCIe slots, faster SATA, more SATA ports (maybe even SAS), more maximum memory, lots of options to cool cards using those expansion slots, and lots of panel IO options (which includes being able to stuff spare expansion slot covers and drive bay covers with extra goodies, if I want).

You're right about Nemesis 1 being wrong about them going away, but I argue that cooling is a secondary, maybe even tertiary, problem with respect to standard large cases going away. Cooling is an annoying issue because big cases in which parts can be swapped for any other parts prevent part makers from making assumptions about cooling, and allow Intel and AMD to use small and cheap HSFs; that in turn means 3rd-party cooler makers and case designers must make fairly generic, flexible coolers and enclosures, which end up more robust and expensive than if they were made with a specific implementation in mind (a known mobo and PSU, for instance).
 

Nemesis 1

Lifer
Dec 30, 2006
Want to bet on this?

I don't know what you have been smoking lately, but you have been making these grand assumptions on the forums with not much fact to back them up.

What does shrinking a CPU process have to do with PC cases? They still run hot (think IB) and need proper cooling. And we will be seeing 8-core desktop CPUs soon, which will require a well-vented PC case. Also, as 4K monitors become common, graphics cards will need to continue to grow in power, which in turn puts out a lot of heat. And as SSD prices keep coming down, PCIe SSDs will become commonplace in the next 2-3 years, bypassing the SATA bottleneck, which will require more case space.

So please think of the larger picture.

Gee whiz, guy, I'm just going by what Intel said at IDF. What's inside the computer is going to be less known than the new form factors. Intel said this, along with the desktop becoming less important and small compute devices being the way forward. Intel's words, not mine. I read all the IDF stuff. If you didn't, don't hang it on me. Do your own reading (research).