CPU recommendation for learning multiprocessor programming

lgratian

Junior Member
Aug 26, 2012
1
0
0
I posted the same question on programmers.stackexchange.com but got no helpful response.

I am very interested in working with multiprocessor programming and parallel algorithms, mostly for research purposes. I want to build a computer specifically for this, and it should have at least 8 cores (many algorithms only start to show contention problems at 8 cores).

Looking at what Intel and AMD offer, I think the 8-core Intel CPUs are far too expensive, so I would have to choose between:
- 6-core Intel i7 980X (3.33 GHz)
- 8-core AMD FX 8150 (3.6 GHz)
- 2x 8-core AMD Opteron 4248 (3.0 GHz, server version of the FX 8150)
- 1x or 2x 12-core AMD Opteron 6172 (2.1 GHz; expensive, but sometimes affordable on eBay)

I'm inclined toward the 2x 8-core Opteron, but I'm not sure how the Bulldozer architecture compares to Intel's; from what I understand, the 8 cores actually share some parts and are not completely independent like the Intel ones (ignoring the shared L3 cache). I'm not sure the results I get would reflect those obtained on a "classical" CPU.

On the other hand, the i7 is much more powerful and might be sufficient for testing the parallel algorithms. The 2x 12-core Opteron would probably be the best for testing, but it is also by far the slowest, and I would like to use this computer as a workstation too.

What would be the best solution? Is the Bulldozer architecture suitable for research (mostly for the massive parallelization of a compiler I wrote)?
 
Dec 30, 2004
12,553
2
76
not sure about this: "many algorithms have contention problems only starting with 8 cores"
Hadn't heard of that...
I know that if you write multicore code but only run it on a single core, it will still execute sequentially, so you can "get away" with using a single resource from "multiple" threads; then when you move to a multicore system you'll have instability, because your concept was flawed.
On a dual core you could write code using up to 2 cores and be sure you've coded it right; on a quad, up to 4; etc.
But the principles you learn to max out a dual core are the same for a quad, etc.
I would get a quad.
I personally wouldn't dump my money into a server system. Bad use of resources IMO; it's not going to make me smarter, unless the codebase is huge, takes longer than 30 seconds to compile, and you're going to be compiling frequently as you make changes.

regarding which chip to get, what kind of code are you writing?
edit: oh, a compiler. Bulldozer or a Hyperthreaded chip would help your compiling a lot. There are a lot of mispredicted branches and "dead ends" in compiling while the CPU waits on system RAM. That's where hyperthreading comes in: it switches the CPU over to another thread so execution can continue while the first thread waits on the RAM. So your 980X would have 12 threads on 6 processors. An Intel will be fastest, but not the most "bang for your buck" in your situation, where you're just looking to test multithreading.

In Bulldozer, two cores share a 256-bit-wide FPU. Lots of code doesn't need the FPU to calculate numbers all the time, so they were able to share it between the two cores. I don't think compiling makes much use of the FPU (compiling code that uses the FPU != running code that uses the FPU).
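The hyperthreading point above can be sketched with a worker pool sized to the logical (not physical) thread count — a minimal Python illustration, where `compile_unit` is a hypothetical stand-in for invoking a real compiler on one translation unit:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def compile_unit(name):
    """Stand-in for compiling one source file (a real build would
    shell out to the compiler here); assumed for illustration only."""
    return f"{name}.o"

def build_all(sources):
    # os.cpu_count() reports *logical* CPUs, so a 6-core chip with
    # Hyperthreading gets 12 workers, letting another thread make
    # progress while one stalls on a RAM miss or mispredicted branch.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(compile_unit, sources))
```

The pool size is the whole trick: with only 6 workers on a 12-thread chip, the extra hardware threads sit idle during those stalls.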
 
Last edited:

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
Both AMD and Intel have made the distinction between physical cores and virtual, shared-resource ones a little vague: Intel with its Hyperthreading, AMD with its shared FPU. For the purposes of multiple-CPU programming, AMD doesn't have anything above a quad core despite its marketing (just halve the core counts, or go by the module count), while Intel does have genuine 8-core chips, but they are not exactly cheap.

That out of the way, what hardware do you need to actually do multithreaded programming? In practice, multicore programming these days is about thousands of PCs all working in a cluster on a single problem. Those machines are normally commodity PCs with a pretty standard cheap CPU, i.e. 4 cores at a mid clock speed.

That said, sometimes you are writing code to scale up as well as possible on a single box, and in those cases more than a quad might be worthwhile, but it depends greatly on your targeted hardware. Performance testing should always be done on something representative of production; you don't necessarily need that for your dev box, but in a few cases it can be helpful.

When it comes to seeing contention issues and the like, they do show up on smaller boxes, just not as prevalently. It's easy to miss certain types of problems on low-core-count machines, which is one of the ways I argued myself up to a 6-core SB-E.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I am very interested in working with multiprocessor programming and parallel algorithms, mostly for research purposes. I want to build a computer specially for this and it should have at least 8 cores (many algorithms have contention problems only starting with 8 cores).

Welcome to the AnandTech forums lgratian :thumbsup:

I have some experience in this area and can understand your dilemma.

To better answer your question, though, I need to understand the scope you are aiming to cover with your research.

As you are likely aware, the limitations of parallelized algorithms and applications break down into those dependent on hardware and those dependent on software, and naturally there is interplay between the two (what we refer to as fine-grained vs. coarse-grained).

grainvsIPC.png


Amdahl's Law captures the first-order impact on thread-scaling of the split between parallelized and serial code. This part of thread-scaling has been researched to death over the past 40 years and won't be of much interest to you beyond running a few simulations in Excel as a learning exercise:

AmdahlsLaw.png


AmdahlsLawInfiniteProcessors.png
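For reference, Amdahl's Law is simple enough to sanity-check in a few lines rather than a spreadsheet — a quick Python sketch:

```python
def amdahl_speedup(p, n):
    """Ideal speedup of a workload whose parallelizable fraction is p
    (0 <= p <= 1) when run on n processors: S = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 95%-parallel workload tops out at 1 / 0.05 = 20x speedup
# no matter how many processors you throw at it.
```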


However, a much more interesting subject, and a relevant one in today's multi-core-dominated microprocessor world, is the inescapable issue of interprocessor communication (dubbed IPC, not to be confused with instructions per clock), aka the network fabric, which causes a slowdown in thread-scaling as more and more data contention comes to bear.

Almasi and Gottlieb captured the first-order effects of IPC on thread-scaling by augmenting Amdahl's Law by adding a serial communication component (Tis) and a parallel communication component (Tip):

AmdahlsLawaugmentedbyAlmasiandGottlieb.png
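The augmented form can be sketched in the same way; the exact formulation varies by textbook, so this assumes Tis adds to the serial term and Tip divides across the n processors like the parallel compute term:

```python
def augmented_speedup(s, p, n, t_is=0.0, t_ip=0.0):
    """Amdahl's Law with communication terms bolted on, in the spirit of
    the Almasi/Gottlieb augmentation described above. s + p = 1 is the
    normalized compute time; t_is is serial communication and t_ip is
    communication that parallelizes across the n processors.
    (Exact form assumed for illustration; textbooks state it variously.)"""
    return 1.0 / (s + t_is + (p + t_ip) / n)
```

With t_is = t_ip = 0 this reduces to plain Amdahl; any nonzero communication cost strictly lowers the scaling ceiling, which is exactly the Tis/Tip effect being described.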


It is the Tip and the Tis that make or break thread scaling nowadays when looking at various platform architectures: cores per socket, sockets per mobo, mobos per rack, etc.

Impactofbroadcastprotocolonscaling.gif


Each tier of communication adds latency and decreases bandwidth, increasing Tis and Tip, increasing IPC, and thus decreasing thread scaling.

For example, simply reducing the bandwidth and increasing the latency of the RAM alone can have a dramatic first-order effect on thread-scaling:

LinXThreadScaling.png


(note the RAM speed was reduced from 800 MHz to 533 MHz and the impact of IPC nearly doubled, from 2.5% to 4.3%, as expected based on the percentage change in IPC speeds)

It also explains the differences in real-world scaling between platforms when multiple sockets come into the equation:

Euler3DBenchmarkScaling.gif


(notice how well the AMD systems in this case study scale across the socket versus the Intel systems; this issue has largely been mitigated now that Intel has gone to QPI)

LinxScalingNehalemDenebKentsfield.png


So, getting back to your question - the answer to your question depends on how deeply you are wanting to explore the thread-scaling limiting impact of interprocessor communications and the tradeoffs you make as a programmer when crafting your algorithms from a fine-grained to a coarse-grained methodology.

If you want to really deep dive into the practical considerations then you want a research testbed that gives you the ability to generate data that spans three specific network fabric regimes:
  1. Interprocessor within the same socket (shared L2$ or L3$, ram dependencies, etc)
  2. Interprocessor within the same mobo (data sharing at the ram, socket-to-socket)
  3. Interprocessor across the network (box to box, via ethernet fabric or quadrics, beowulf type stuff)
Ideally you'd get a minimum of two boxes so you can explore the node-to-node dependencies on the thread-scaling of your algorithms, each node would contain at least 2 sockets, and each socket would have a CPU that has at least 2 cores (preferably 4 or more cores).
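A crude way to probe regime 1 versus regime 2 is a ping-pong latency test between two OS processes — a hypothetical Python sketch using a multiprocessing pipe (assumes a fork-capable platform like Linux); the number you get depends entirely on where the two processes land, so pin them (e.g. with taskset) to compare same-core, same-socket, and cross-socket placements:

```python
import time
from multiprocessing import Pipe, Process

def _echo(conn, rounds):
    # Child side: bounce every message straight back to the sender.
    for _ in range(rounds):
        conn.send_bytes(conn.recv_bytes())
    conn.close()

def pingpong_latency(rounds=1000):
    """Average one-way latency (seconds) for a 1-byte message between
    two processes. Run the two ends pinned to different cores/sockets
    to see the communication tiers described above."""
    here, there = Pipe()
    child = Process(target=_echo, args=(there, rounds))
    child.start()
    start = time.perf_counter()
    for _ in range(rounds):
        here.send_bytes(b"x")
        here.recv_bytes()
    elapsed = time.perf_counter() - start
    child.join()
    return elapsed / (2 * rounds)   # one round trip = two one-way hops
```

Regime 3 is the same idea with the pipe swapped for a socket between two boxes.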

Also, you want the clockspeeds of the cores to be manually adjustable so that you have control over the ratio of processor clockspeed to network fabric speed; that way you can generate the data you'll need to make some projections about the utility of today's algorithms in, say, a decade's time.

One last point of concern: modern x86 processors further convolute the thread-scaling picture by way of SMT (Hyperthreading for Intel) and CMT (Bulldozer's FPU sharing for AMD). I'd recommend the Intel platform solely because 80% of people's algorithms are going to run on an Intel platform and only 20% on an AMD platform; you may as well explore the case with the most relevance to the general populace of x86 programmers.

Corei79204GHzwithHT.png


Whichever platform you decide to go with, just be sure you can generate data that contrasts core-resource sharing enabled versus disabled or mitigated (either by BIOS switch or by careful planning in your thread-spawning algorithm).
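On Linux, the thread-spawning side of that mitigation can be done in software: pin each worker to a chosen logical CPU so you decide which threads share a core's resources (both logical CPUs of one hyperthreaded core, two cores of one Bulldozer module, or cores in different sockets). A minimal sketch; `os.sched_setaffinity` is Linux-only, hence the guards:

```python
import os
import threading

def run_pinned(fn, cpu_ids):
    """Run fn once per entry in cpu_ids, each call on its own thread
    pinned (where supported) to the given logical CPU."""
    results = [None] * len(cpu_ids)

    def worker(i, cpu):
        if hasattr(os, "sched_setaffinity"):       # Linux only
            try:
                os.sched_setaffinity(0, {cpu})     # pid 0 = calling thread
            except OSError:
                pass                               # CPU not in our cpuset
        results[i] = fn()

    threads = [threading.Thread(target=worker, args=(i, c))
               for i, c in enumerate(cpu_ids)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Running the same workload with `cpu_ids` set to two sibling logical CPUs versus two CPUs in different physical cores is one way to expose the shared-resource penalty without touching the BIOS.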
 

SammichPG

Member
Aug 16, 2012
171
13
81
What about a virtualized machine?
It would be convenient, since you should be able to add virtual CPU cores to your system to test "scaling", and you'd just buy the fastest machine you can afford without caring about shared resources.
 

2is

Diamond Member
Apr 8, 2012
4,281
131
106
I'd use something with HT; that way you can experiment between physical and virtual cores by turning HT on or off. An i7 if budget isn't much of a concern, or an i3 if it is.
 

Gryz

Golden Member
Aug 28, 2010
1,551
204
106
If you write programs to research stuff, I can think of two goals you might pursue:
1) Testing the correctness of the algorithms.
2) Testing the performance of the algorithms.

If you care about correctness, then you could run your program on a single-core processor! As long as you have an OS with a scheduler that does true pre-emptive scheduling, you will run into all the same correctness problems that you would on a multi-core machine. So I don't think you need 8 cores, or any specific number of cores, to test correctness first.

I used to write real-time software. Distributed systems. Long time ago. Robustness of the algorithms was the most important feature: you didn't want a bunch of processors to melt down because of huge amounts of (unnecessary) communication between themselves.
I preferred to test my software on the oldest, slowest, cheapest hardware I could find, because if the old machines ran fine, I knew the faster machines would cope as well. So if I were you, I wouldn't care about the clockspeed of the processors, IPC efficiency, etc. If you care about cores, just get the cheapest machines with the most cores for your dollar.
 

GammaLaser

Member
May 31, 2011
173
0
0
If you write programs to research stuff, I can think of two goals you might pursue:
1) Testing the correctness of the algorithms.
2) Testing the performance of the algorithms.

If you care about correctness, then you could run your program on a single-core processor! As long as you have an OS with a scheduler that does true pre-emptive scheduling, you will run into all the same correctness problems that you would on a multi-core machine. So I don't think you need 8 cores, or any specific number of cores, to test correctness first.

I don't think this is entirely true. A lot of subtle multiprogramming bugs (races with narrow timing windows, some livelocks, unpredictable memory ordering) may never show up when the threads are forced to serialize on a single-core CPU. So not only do you lose on performance, you may not be able to reproduce these types of bugs.
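The classic example is an unguarded read-modify-write on shared state: on one core the threads often serialize so neatly that the lost-update window is never hit, while on real parallel hardware it fires regularly. A minimal sketch of the guarded (correct) version in Python:

```python
import threading

def count_to(total, workers=4):
    """Increment a shared counter from several threads. Without the
    lock, `state["value"] += 1` is a read-modify-write race that can
    silently lose updates; with it, the result is deterministic on
    any core count."""
    lock = threading.Lock()
    state = {"value": 0}

    def worker(n):
        for _ in range(n):
            with lock:              # remove this to get the racy version
                state["value"] += 1

    per_thread = total // workers
    threads = [threading.Thread(target=worker, args=(per_thread,))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["value"]
```

The racy version may pass every run on a single core and then shed updates the first time the threads genuinely overlap, which is exactly why single-core testing can give false confidence.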