That graph is wrong.
The Core i7-920 has four physical cores, eight virtual cores. The dots fit exactly what I see making parallel computations on a Core i7-2600K.
I use the programming language Haskell for parallel programming on various machines. The language itself has a bit of a learning curve, but one can parallelize a program with a handful of extra lines of code. It is particularly sensitive to resource contention, an effect known as the "last core slowdown": on a machine with eight physical cores, performance scales nicely up to seven cores and craters if one uses all eight.
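As a minimal sketch of what that "handful of extra lines" looks like (the `expensive` worker and the numbers are made-up stand-ins; the real change is just the `using` clause and the RTS flags):

```haskell
-- Minimal sketch: parallelizing a pure map with the parallel package
-- (Control.Parallel.Strategies). The only parallelism-specific code is
-- the `using` line and the run-time flags at the bottom.
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Made-up stand-in for whatever per-item work the real program does.
expensive :: Int -> Int
expensive n = sum [(n * k) `mod` 7919 | k <- [1 .. 200000]]

results :: [Int]
results = map expensive [1 .. 5000] `using` parListChunk 50 rdeepseq

main :: IO ()
main = print (sum results)

-- Compile: ghc -O2 -threaded -rtsopts Par.hs
-- Run on seven OS threads: ./Par +RTS -N7
```

The RTS flag is where the slowdown shows itself: on an eight-core machine, `-N7` behaves as described above, while `-N8` craters.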
With four physical cores and eight virtual cores, one still gets worse performance with eight threads than with seven. Performance scales nicely up to four threads, with a bit of a hill for 5, 6, or 7 threads.
In other words, virtual cores are a scheduling convenience that allows one to use all four cores of a four-core processor. It's a second-order effect, even if a rather significant one.
So I believe the dots; I recognize their pattern. LinX looks like it's implemented in a less efficient language for parallel computations, but the appearance of the graph is qualitatively the same as I see with other problems.
I question the two lines. They don't apply to virtual cores.
I almost sprung for 6 cores this time around, but Sandy Bridge was too amazing to pass on. The 2600K rocks. Next year?
Fascinating. So who has a 6-core to test? So I would expect pretty good scaling up to 5 cores, the 6th one starts deviating (the last of the physical cores), and then from 7 to 11 we get a slight hill and it really tanks on the 12th? AND it indicates that quad cores are valuable in 2-thread applications, since they wouldn't suffer from the last core problem?
Interesting that these graphs treat interprocessor communication slowdowns as inevitable.
Functional languages have the advantage that once a value is created, it is treated as read-only. Every core can keep a copy of this value in its cache if it likes, with no possibility of the cache getting invalidated by a write elsewhere. Other computational models stall a lot while waiting for a cache line to update after a write on a different core invalidated it. It's fairly mind-boggling what has to take place at the hardware level for SMP parallelism to work correctly.
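A toy illustration of that read-only sharing, assuming a shared lookup table (the table and workload here are invented):

```haskell
-- Toy illustration: many sparks read the same immutable table. Since
-- nothing ever writes to it, no core's cached copy is ever invalidated.
import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Data.Map.Strict as Map

table :: Map.Map Int Int   -- built once, read-only thereafter
table = Map.fromList [(k, k * k) | k <- [0 .. 9999]]

lookups :: [Int] -> [Int]
lookups = parMap rdeepseq (\k -> Map.findWithDefault 0 k table)

main :: IO ()
main = print (sum (lookups [0 .. 9999]))
```

Every spark touches `table`, and since it can never change after construction, the cores never have to talk to each other about it.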
GHC Haskell still suffers from a "stop the world" garbage collector. (That will soon change.) So the effects of a tardy last core can be more pronounced than in other models. Like waiting to resume a meeting while a participant is in the loo.
It's all older stuff; I haven't bothered updating the graphs to include the even higher thread-count stuff from Magny-Cours and so on. They don't show a "last core slowdown" either. Strange?
LinX is Linpack X, I believe, and I think it is a Windows test.
http://www.xtremesystems.org/forums/showthread.php?t=201670
So your LinX (I dunno, is that a Linux-based test?) shows what he shows.
Not many applications are extremely coarse-grained, nor are there many applications that lend themselves to embarrassing levels of parallelism.
That's why the corner cases don't make for reliable generalizations.
That's one way to look at it. Funny, I've got just such a corner case on my other screen, mapping out a Vcore vs Turbo Ratio minesweeper grid as I overclock my new build.
I'm an optimist, and lower bound proofs in computer science are rather rare. If I can't program corner-case efficiency, I conclude I'm a friggin' idiot who hasn't figured out a good way to look at the problem yet. I don't conclude that my failure is par for the course, and something I should accept.
Of course, I tend to enumerate possibilities and filter the results. That's generally a slam-dunk to parallelize. Haskell can efficiently handle millions of threads in its computational model, mapped onto the handful of available operating system threads. One faces essentially a statistical problem: figuring out how to break a problem into millions of chunks, of variable and unknown complexity, so that the largest chunk turns out to be no more than 1% of the total work.
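A sketch of that enumerate-and-filter pattern, chopped into many small chunks so no single chunk can dominate the total (the `keep` predicate, chunk size, and range are all made up):

```haskell
-- Sketch: enumerate a search space, chop it into many small chunks,
-- and filter each chunk in parallel. 100 chunks of comparable cost
-- keep every core busy until the very end of the run.
import Control.Parallel.Strategies (parBuffer, rdeepseq, withStrategy)

-- Hypothetical filter: keep numbers whose digit sum is divisible by 9.
keep :: Int -> Bool
keep n = digitSum n `mod` 9 == 0
  where digitSum 0 = 0
        digitSum m = m `mod` 10 + digitSum (m `div` 10)

survivors :: [Int]
survivors = concat
          . withStrategy (parBuffer 64 rdeepseq)  -- sparks, streamed
          . map (filter keep)
          $ chunksOf 10000 [1 .. 1000000]
  where chunksOf _ [] = []
        chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

main :: IO ()
main = print (length survivors)
```

The statistical framing above is the `chunksOf 10000` choice: small enough that the slowest chunk is a tiny fraction of the whole, large enough that spark overhead doesn't dominate.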
That is a bold assertion.
Oh, am I allowed to play here too? How about an algorithm for sparse graphs in the PRAM model that is practically relevant? Let's say for computing SCCs. And that's one problem for which parallel algorithms exist and, depending on the language, aren't that hard to come up with. But I think Coppersmith showed for that particular problem an algorithm (the best one I know of) with O(|E| lg |V|), while Tarjan's, for example, runs in O(|E| + |V|), which makes the parallel one practically pretty much irrelevant.
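For contrast, the sequential linear-time SCC computation is already a single library call in GHC's containers package; a minimal example (the three-node graph is made up):

```haskell
-- The sequential, linear-time SCC computation is a library call in
-- containers (Data.Graph). The tiny graph below is invented.
import Data.Graph (flattenSCC, stronglyConnComp)

main :: IO ()
main = do
  let graph = [ ("a", 1, [2])     -- (node payload, key, out-edges)
              , ("b", 2, [1, 3])  -- keys 1 and 2 form a cycle: one SCC
              , ("c", 3, [])
              ]
  mapM_ (print . flattenSCC) (stronglyConnComp graph)
```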
Make this discussion more tangible. What's a scalable problem that interests you? I don't care how poorly it parallelizes for five-minute cases, but for a case that takes an hour, or a day?
Sparse matrix manipulations involving frequent result collection and updating.
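One hypothetical shape for that problem, to make it concrete: a row-parallel sparse matrix-vector multiply is the easy half; it's folding each step's results back into shared state, iteration after iteration, that fights the read-only-sharing advantage described earlier (all names and data below are illustrative):

```haskell
-- Hypothetical sketch: per-row dot products are independent and
-- parallelize cleanly; the collect-and-update step between iterations
-- is the global synchronization point.
import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Data.IntMap.Strict as IM

type SparseRow    = IM.IntMap Double  -- column index -> value
type SparseMatrix = IM.IntMap SparseRow

mulV :: SparseMatrix -> IM.IntMap Double -> IM.IntMap Double
mulV m v = IM.fromList (parMap rdeepseq dot (IM.toList m))
  where dot (r, row) =
          (r, sum [x * IM.findWithDefault 0 c v | (c, x) <- IM.toList row])

main :: IO ()
main = do
  let m = IM.fromList [ (0, IM.fromList [(0, 2), (1, 1)])
                      , (1, IM.fromList [(1, 3)]) ]
      v = IM.fromList [(0, 1), (1, 4)]
  -- Each iteration's collected result feeds the next iteration,
  -- so the cores must resynchronize on every step.
  print (IM.toList (iterate (mulV m) v !! 3))
```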
