For some problems, we need a GHz war; supercomputers won't help. As Amdahl knew many years ago.
And one more thing: I need to check whether my understanding is correct.
If a certain architecture introduces an instruction that can do a multiplication and an addition simultaneously (is that the so-called FMA in Haswell?), then 2+3*7 is just one operation, the same as 3+5 is a single operation. But if the problem is 2+5-9+21, the new instruction won't help, and the solution will be a three-step operation (two additions, one subtraction), unless there is also an instruction that can treat a three-fold addition-subtraction problem as one operation. So in that situation an older platform should perform similarly to a newer one at the same operating frequency.
Well, yes and no. Yes, FMA does a multiply and an add at once, though most of the time when you see "FMA", including for Haswell, it refers to floating-point, not integers. And yes, 2+5-9+21 requires three operations. However, some platforms will do this faster than others, because of varying levels of instruction-level parallelism.
Very old processors just worked on one instruction at a time. Add, add, add, three cycles. Very simple. I think this lasted up to about the 286 or 386 for Intel.
Around the 486, Intel started using a pipeline. Think of separate units to decode an instruction, load data, add, and store results. This led to faster clock cycles, but required more cycles - four in my hypothetical example - to complete a single instruction. Leave those instructions in order, and each depends on the previous one, so that takes 10 cycles (4 + 3 + 3: with no branch instructions the decodes can proceed without delay, so each dependent instruction adds only 3 cycles). But write your code like this:
a = 2+5
b = 9+21
c = a-b
and the first add and the second add run almost in parallel, so that takes 4+1+3 or 8 cycles. (The subtract needs to wait for both adds to be done.)
Then along came the Pentium with another wrinkle. It could decode and run two simple instructions at the same time, as long as one wasn't dependent on the other. Original code with my 4-stage pipeline: still 10 cycles. Modified code: now only 7 cycles.
Modern processors can, at least in theory, run up to about half a dozen independent instructions of specific types in one cycle. And latency has also been reduced so that, unless there's a conditional, results from one step are ready much more quickly for the next step. Anything from about Core 2 up would run the original code in 3 cycles (one per operation, since each depends on the previous result), and the modified code in just 2 cycles!
Modern processors can also search for independent instructions over a wide range - getting wider with each CPU version. So in your original code, a hypothetical modern processor might have 3*7 and 2+5 run at the same time for cycle 1, more of 3*7 (multiplies almost always take longer than one cycle) and (the-result-of-2+5)-9 in cycle 2, and (the-result-of-3*7)+2 and (the-result-of-2+5-9)+21 in cycle 3.