Giga Flops, lol


Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
That graph is wrong.

The Core-i7 920 has four physical cores, eight virtual cores. The dots fit exactly what I see, making parallel computations on a Core-i7 2600K.

I use the programming language Haskell for parallel programming on various machines. The language itself has a bit of a learning curve, but one can parallelize a program with a handful of extra lines of code. It is particularly sensitive to resource contention, known as the "last core slowdown." On a machine with eight physical cores, performance scales nicely up to seven cores, and craters if one uses all eight cores.
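For readers who haven't seen it, here is a minimal sketch of what "a handful of extra lines" looks like, using only `par` and `pseq` from GHC's base library (real code would more likely use the Strategies combinators from the `parallel` package):

```haskell
import GHC.Conc (par, pseq)

-- An intentionally simple, expensive pure function to parallelize.
sumSquares :: Int -> Int -> Int
sumSquares lo hi = sum [x * x | x <- [lo .. hi]]

-- The "handful of extra lines": split the range, spark one half on
-- another capability with `par`, force the other half locally with
-- `pseq`, then combine the results.
sumSquaresPar :: Int -> Int -> Int
sumSquaresPar lo hi = left `par` (right `pseq` (left + right))
  where
    mid   = (lo + hi) `div` 2
    left  = sumSquares lo mid
    right = sumSquares (mid + 1) hi
```

Compile with `ghc -O2 -threaded` and run with `+RTS -N7` to restrict the program to seven capabilities, which is exactly the "leave the last core alone" tuning being described.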

With four physical cores, eight virtual cores, one still gets worse performance using eight threads over seven. Performance scales nicely up to four threads, with a bit of a hill for 5, 6, or 7 threads.

In other words, virtual cores are a scheduling convenience that allows one to use all four cores of a four-core processor. It's a second-order effect, even if a rather significant one.

So I believe the dots; I recognize their pattern. LinX looks like it's implemented in a less efficient language for parallel computations, but the appearance of the graph is qualitatively the same as I see with other problems.

I question the two lines. They don't apply to virtual cores.

Yeah you're probably right, I never was any good at this stuff.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
That graph is wrong.

The Core-i7 920 has four physical cores, eight virtual cores. The dots fit exactly what I see, making parallel computations on a Core-i7 2600K.

I use the programming language Haskell for parallel programming on various machines. The language itself has a bit of a learning curve, but one can parallelize a program with a handful of extra lines of code. It is particularly sensitive to resource contention, known as the "last core slowdown." On a machine with eight physical cores, performance scales nicely up to seven cores, and craters if one uses all eight cores.

With four physical cores, eight virtual cores, one still gets worse performance using eight threads over seven. Performance scales nicely up to four threads, with a bit of a hill for 5, 6, or 7 threads.

In other words, virtual cores are a scheduling convenience that allows one to use all four cores of a four-core processor. It's a second-order effect, even if a rather significant one.

So I believe the dots; I recognize their pattern. LinX looks like it's implemented in a less efficient language for parallel computations, but the appearance of the graph is qualitatively the same as I see with other problems.

I question the two lines. They don't apply to virtual cores.

Fascinating. So who has a 6 core to test? I would expect pretty good scaling up to 5 cores, with the 6th starting to deviate (the last physical core), then a slight hill from 7 to 11, and a real tank on the 12th. AND it indicates that quad cores are valuable in 2-thread applications, since they wouldn't suffer from the last-core problem?
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
So who has a 6 core to test
I almost sprung for 6 cores this time around, but Sandy Bridge was too amazing to pass on. The 2600K rocks. Next year?

Funny thing, Apple's OS X does a better job of scheduling the last core. It would make sense that's something they'd try to get right. I did run a Hackintosh for a while to get this advantage, but went back to the simplicity of maintaining Ubuntu.

I'm saying these virtual core chips do an even better job of scheduling in hardware, so yes, one gets better use of two physical cores by scheduling two or three threads, and leaving the last virtual core to other jobs.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Fascinating. So who has a 6 core to test? I would expect pretty good scaling up to 5 cores, with the 6th starting to deviate (the last physical core), then a slight hill from 7 to 11, and a real tank on the 12th. AND it indicates that quad cores are valuable in 2-thread applications, since they wouldn't suffer from the last-core problem?

We never tested a hex-core, but here's a post that covers the quad-cores.

http://forums.anandtech.com/showpost.php?p=29012784&postcount=107


edit: I didn't realize this was being extended as a "general case"

Here's some thread-scaling with more physical cores:
MyriMatchBenchmarkScaling.gif


Euler3DBenchmarkScaling.gif


It's all older stuff; I haven't bothered updating the graphs to include the even higher thread-count stuff from Magny-Cours and so on. They don't have a "last core slowdown" either, strange?
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Interesting that these graphs analyze interprocessor communication slowdowns as inevitable.

Functional languages have the advantage that once a value is created, it is treated as read-only. Every core can keep a copy of that value in its cache, with no possibility of the cache line being invalidated by a write elsewhere. Other computational models stall a lot, waiting for a cache line that was changed by a different core to update. It's fairly mind-boggling what has to take place at the hardware level for SMP parallelism to work correctly.

GHC Haskell still suffers from a "stop the world" memory management garbage collector. (That will soon change.) So the effects of a tardy last core could be more pronounced than in other models. Like waiting to resume a meeting while a participant is in the loo.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Interesting that these graphs analyze interprocessor communication slowdowns as inevitable.

Functional languages have the advantage that once a value is created, it is treated as read-only. Every core can keep a copy of that value in its cache, with no possibility of the cache line being invalidated by a write elsewhere. Other computational models stall a lot, waiting for a cache line that was changed by a different core to update. It's fairly mind-boggling what has to take place at the hardware level for SMP parallelism to work correctly.

GHC Haskell still suffers from a "stop the world" memory management garbage collector. (That will soon change.) So the effects of a tardy last core could be more pronounced than in other models. Like waiting to resume a meeting while a participant is in the loo.

Unless you are restricting yourself to extremely coarse-grained applications, or using hardware that has infinite bandwidth and zero latency, "interprocessor communication slowdowns" are as unavoidable as signal propagation itself.

Not many applications are extremely coarse-grained, nor are there many applications that lend themselves to embarrassing levels of parallelism.

That's why the corner-cases don't make for reliable generalizations.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
It's all older stuff; I haven't bothered updating the graphs to include the even higher thread-count stuff from Magny-Cours and so on. They don't have a "last core slowdown" either, strange?

Did a quick search and apparently the "last core slowdown" is a Linux problem.

http://www.haskell.org/pipermail/glasgow-haskell-users/2009-April/017050.html

So your LinX (I dunno, is that a Linux-based test?) shows what he shows. So I would guess a 6 core would do what I described, at least in a LinX test.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
So your LinX (I dunno, is that a Linux-based test?) shows what he shows.

No, the example I showed is what you'd expect for the scaling of any shared-resource architecture when the resources being shared actually happen to dominate the time-to-completion of the individual threads.
3.jpg


It even happens with dedicated-resource architectures (pure CMP), simply from the induced latency of data locality when spawning threads and retrieving the computed results. (It is physically unavoidable in real systems, save for quantum-entanglement-enabled computing constructs.)
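That shape falls out of a one-line model: Amdahl's law plus a toy per-thread contention term. To be clear, this is a generic illustration, not a fit to the LinX data:

```haskell
-- Amdahl's law with a toy linear contention term.
-- serialFrac: fraction of the work that cannot be parallelized.
-- commCost:   per-thread overhead from shared resources, expressed as a
--             fraction of the single-threaded runtime (a toy model).
speedup :: Double -> Double -> Int -> Double
speedup serialFrac commCost n =
  1 / (serialFrac
       + (1 - serialFrac) / fromIntegral n
       + commCost * fromIntegral n)
```

With `serialFrac = 0.05` and `commCost = 0.01`, speedup peaks near ten threads and then declines: the marginal thread eventually costs more in contention than it contributes in compute, which is qualitatively the "tanks on the last core" shape in the graphs.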

https://share.sandia.gov/news/resou...lower-supercomputing-sandia-simulation-shows/

Impactofbroadcastprotocolonscaling.gif


This is why phenomenal quantities of money are spent on network fabrics in supercomputer clusters, and it is the primary advantage of moving to CMP (many cores on one die): the undesired effects of interprocessor communication overhead are reduced (but never eliminated).
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Not many applications are extremely coarse-grained, nor are there many applications that lend themselves to embarrassing levels of parallelism.

That's why the corner-cases don't make for reliable generalizations.

That's one way to look at it. Funny, I've got just such a corner case on my other screen, mapping out a Vcore vs Turbo Ratio minesweeper grid as I overclock my new build.

I'm an optimist, and lower bound proofs in computer science are rather rare. If I can't program corner-case efficiency, I conclude I'm a friggin' idiot who hasn't figured out a good way to look at the problem yet. I don't conclude that my failure is par for the course, and something I should accept.

Of course, I tend to enumerate possibilities, and filter the results. That's generally a slam-dunk to parallelize. Haskell can efficiently handle millions of threads in its computational model, mapped onto the handful of available operating system threads. One faces essentially a statistical problem: figuring out how to break a problem into millions of chunks, of variable and unknown complexity, so that the largest chunk turns out to be no more than 1% of the total work.
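A sketch of that chunking style, using only base (real code would use `parListChunk` from the `parallel` package): spark one lightweight evaluation per chunk and let the runtime map the sparks onto however many OS threads are available.

```haskell
import GHC.Conc (par)
import Data.List (foldl')

-- Split a list into fixed-size chunks; each chunk is one unit of work.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = let (h, t) = splitAt k xs in h : chunksOf k t

-- Spark the evaluation of each chunk's partial sum, then combine.
-- With many more chunks than cores, no single slow chunk dominates:
-- the statistical load-balancing described above.
parChunkSum :: Int -> [Int] -> Int
parChunkSum k xs = foldl' (+) 0 sparked
  where
    partials = map sum (chunksOf k xs)
    sparked  = foldr (\p acc -> p `par` (p : acc)) [] partials
```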
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
That's one way to look at it. Funny, I've got just such a corner case on my other screen, mapping out a Vcore vs Turbo Ratio minesweeper grid as I overclock my new build.

I'm an optimist, and lower bound proofs in computer science are rather rare. If I can't program corner-case efficiency, I conclude I'm a friggin' idiot who hasn't figured out a good way to look at the problem yet. I don't conclude that my failure is par for the course, and something I should accept.

Of course, I tend to enumerate possibilities, and filter the results. That's generally a slam-dunk to parallelize. Haskell can efficiently handle millions of threads in its computational model, mapped onto the handful of available operating system threads. One faces essentially a statistical problem, figuring out how to break a problem into millions of chunks, of variable and unknown complexity, so that the largest chunk turns out to be no more that 1% of the total work.

Ah, I see now.

The difference, in your opinion, between an application being fine-grained versus coarse-grained and embarrassingly parallel is simply a matter of intelligence and effort on behalf of the programmer?

That is a bold assertion.
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
Make this discussion more tangible. What's a scalable problem that interests you? I don't care how poorly it parallelizes for five minute cases, but for a case that takes an hour, or a day?
Oh, am I allowed to play here too? How about an algorithm for sparse graphs in the PRAM model that is practically relevant? Let's say for computing SCCs. That's one problem for which parallel algorithms exist and, depending on the language, aren't that hard to come up with. But I think Coppersmith showed an algorithm for that particular problem (the best one I know of) at O(|E| lg |V|), while Tarjan's, for example, is O(|E| + |V|). Which makes it practically pretty much irrelevant.

And before you ask: Sparse graph algorithms are extremely important in several interesting fields, computational biology for one.
Anyway, that just goes to show that there are problems that are hard to parallelize work-efficiently and others where it's almost trivial (trivial for a parallel algorithm, that is; still much harder than the efficient sequential solution).
A parallel sort algorithm is easy: implement a parallel merge sort and use a sequential quicksort for the leaf case. Even there, as soon as we leave the nice cozy SMP world it gets quite interesting (and I think that goes for a Haskell program as well; a naive solution won't work especially well on a cluster).
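That merge-sort recipe can be sketched with `par`/`pseq` from base; the cutoff value is arbitrary, and `Data.List.sort` stands in for the sequential leaf sort (Voo suggests quicksort, but `sort` ships with GHC):

```haskell
import GHC.Conc (par, pseq)
import qualified Data.List

-- Parallel merge sort: recurse in parallel above a cutoff, fall back to
-- the sequential library sort at the leaves.
parMergeSort :: Ord a => Int -> [a] -> [a]
parMergeSort cutoff xs
  | length xs <= cutoff = Data.List.sort xs
  | otherwise           = l' `par` (r' `pseq` merge l' r')
  where
    (l, r) = splitAt (length xs `div` 2) xs
    l'     = parMergeSort cutoff l
    r'     = parMergeSort cutoff r
    merge [] bs = bs
    merge as [] = as
    merge (a:as) (b:bs)
      | a <= b    = a : merge as (b:bs)
      | otherwise = b : merge (a:as) bs
```

On a cluster, as Voo notes, this naive shared-memory version no longer applies; the split and merge steps would have to move data across the interconnect.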
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Make this discussion more tangible. What's a scalable problem that interests you? I don't care how poorly it parallelizes for five minute cases, but for a case that takes an hour, or a day?

Anything using Hartree-Fock, MPn, or Density Functional Theory (DFT) algorithms in ab initio computational chemistry.

Specific applications include Gaussian and pretty much anything on this page.

Furthermore, I'd love to have access to a highly parallelized foreign currency exchange trading backtester. My current one is single-threaded, owing to the very reasons highlighted by Voo: sparse matrix manipulations involving frequent results collection and updating.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
I remember writing code to filter and remove noise in graphical data. I was horrible at it: I had the hardest time writing Fourier transforms to calculate the patterns in the picture, and when I did manage to remove the noise, my outputs just looked blurry. I don't think I have a link to that code, but I think it would be relatively easy to make parallel. As it was, it took maybe a half hour each run to filter the picture, and I believe I went pixel by pixel, checking the surrounding pixels for patterns. It was really a kind of software antialiasing program I made, which sort of worked, but that wasn't the goal; it was supposed to find the noise pattern and compensate for it.
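A per-pixel neighborhood filter like that is indeed a natural fit: each output pixel reads only the (immutable) input image, so rows can be computed independently. A toy sketch, using a hypothetical 1-D stand-in for the real filter (a proper version would also average vertical neighbors):

```haskell
import GHC.Conc (par)

-- Hypothetical stand-in for the noise filter: average each pixel with
-- its horizontal neighbours (rows are assumed non-empty).
smoothRow :: [Double] -> [Double]
smoothRow row = zipWith3 avg (head row : row) row (tail row ++ [last row])
  where avg a b c = (a + b + c) / 3

-- Each output row depends only on the read-only input image, so the
-- rows can be sparked independently: an embarrassingly parallel loop.
smoothImage :: [[Double]] -> [[Double]]
smoothImage =
  foldr (\row acc -> let row' = smoothRow row in row' `par` (row' : acc)) []
```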

It has been a decade since I have written any control algorithms, but I would be interested in how you would parallelize the control algorithm for the autonomous vehicles we created 10 years ago: http://eecapstone.udmercy.edu/2002_Platooning/vehicle_trailing/sourcecode.html


(Here is the flow chart describing what the software was doing: http://eecapstone.udmercy.edu/2002_Platooning/vehicle_trailing/software.html
You need to click on the blocks to break each block down into more detail. It is just the way the website programmer made the site.)

It is likely an easy thing to do, as I am pretty sure our final control algorithm ended up being very simple (I think we found that the simpler we made it, the better it worked). But I have no experience writing parallel code; the HC12 was definitely a very serial processor.

This may not have anything to do with the topic at hand, but I find it interesting going back into my old coding days before I became a pure hardware engineer.
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Sparse matrix manipulations involving frequent results collection and updating.

First, as much as I'm enjoying this (and I'm thrilled to know not every overclocker is a 19-year-old gamer), my apologies for helping take this thread so far off the rails.

Remember, we were once talking about the dots on a performance graph, breaking an easy problem up into a small number of pieces on a bog-standard desktop multicore machine. Then I went all religious, talking about functional programming. I know dozens of non-functional languages, but my bad. At least I didn't say something about Bush.

The strongly connected component problem looks fascinating, right up my alley, but I worry that interesting cases have graph descriptions that fill memory. If so, the bottleneck will be forever waiting on RAM to see more of the graph, and dividing the work between cores won't help much. Abstract complexity bounds don't capture real-world realities like this; cache behavior is everything, the whole ball of wax.
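For the SCC problem specifically, GHC already ships a Tarjan-style sequential implementation in the containers package; a toy usage, just to make the sequential baseline concrete:

```haskell
import Data.Graph (stronglyConnComp, flattenSCC)

-- Two 2-cycles joined by a one-way edge (1 -> 2): the SCCs are {0,1}
-- and {2,3}. stronglyConnComp takes (node, key, out-neighbour keys)
-- triples and returns the components in reverse topological order.
sccs :: [[Int]]
sccs = map flattenSCC (stronglyConnComp
  [ (0, 0, [1])
  , (1, 1, [0, 2])
  , (2, 2, [3])
  , (3, 3, [2])
  ])
```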

Modifying the graph is another story. A functional approach could be particularly effective at noting properties of a fixed graph. Start updating the graph, and all hell breaks loose.

Are there public benchmark examples for any of these problems? The way that linear programming and integer programming code can be tested on famous examples with names people recognize?