This is a great discussion. Let's keep it going!
Originally posted by: Ken g6
In theory, every execution of every function in a functional program should be its own thread. That's a lot of threads! I imagine the threading overhead is huge on our current 2-4 core processors. I seem to recall one of the Slashdot posters saying that most FP languages don't do threading unless you explicitly enable it.
Yes, that is a boatload of threads. But the problem isn't finding threads, it's finding parallelism. They aren't the same thing, because executing every function as a separate thread is enormously wasteful. That's why we have variables -- to cache the results of computations. FP is fine as a model, but it breaks down in practice. In short: executing FP functions in parallel works and will speed up nicely, but only because the baseline is so bad to begin with. The sketch below illustrates the overhead.
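Here's a minimal C++ sketch of what "a thread per function call" looks like in practice -- a toy Fibonacci, not anyone's real FP runtime, and the use of std::async is my own choice for illustration. The task-per-call version spawns hundreds of threads to do work that a couple of cached variables finish instantly; the "parallelism" is eaten by scheduling overhead.

```cpp
// A minimal sketch (toy example, not any real FP runtime): what "a task per
// function call" costs. The function names and use of std::async are assumptions.
#include <cstdint>
#include <future>
#include <iostream>

// Every recursive call becomes its own asynchronous task: lots of threads,
// almost no useful parallelism once the subproblems get tiny.
std::uint64_t fib_task_per_call(unsigned n) {
    if (n < 2) return n;
    auto left  = std::async(std::launch::async, fib_task_per_call, n - 1);
    auto right = std::async(std::launch::async, fib_task_per_call, n - 2);
    return left.get() + right.get();
}

// The boring version "caches results in variables" and wins easily at this grain.
std::uint64_t fib_plain(unsigned n) {
    std::uint64_t a = 0, b = 1;
    for (unsigned i = 0; i < n; ++i) { std::uint64_t t = a + b; a = b; b = t; }
    return a;
}

int main() {
    // Keep n small: even here the task-per-call version spawns a few hundred threads.
    std::cout << fib_task_per_call(12) << ' ' << fib_plain(12) << '\n';
}
```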
Originally posted by: Crusty
I don't think it's going to matter how many cores you have, there will always be way more threads executing than cores.
I disagree. Tilera has 80 cores already. I'm only running 42 processes at the moment, 523 threads in all, and as Markbnj has pointed out, most of them are sleeping. Give Moore's law two or three more generations and you get hundreds of on-chip threads. Will anyone actually build it? Probably not. Why? Because nobody knows how to write code for it (outside of a few very specialized codes).
Originally posted by: Markbnj
Various Windows frameworks may use multiple threads in message processing code, or wherever, but I don't know that the platform itself forces you into multiple threads. Most of my Winapps have been single threaded.
This is a good point: the API was purposely 'threaded', for some definition of 'threaded'. Windows didn't invent the event-driven interface, of course, but they sure made it a fact of life for hundreds of thousands of programmers. Arguably, that's a good thing. Event-driven programming is akin to threading -- it has a lot of the same horrible problems.
And now to what I agree is the heart of the matter:
Where does performance really matter? And where it does matter, how can the workload be partitioned such that threads make a difference?
Exactly. Let us consider three kinds of boundedness when it comes to computation:
*I/O bound computation: Threads are reading files, waiting for network traffic, responding to mouse movements, etc. How can parallelism help here? Actually, it can help a lot under the right circumstances. Issuing many file requests to a network file system gets them serviced in parallel, which increases throughput (see the first sketch after this list). But there's a limit -- you can't improve a network link's fundamental bandwidth by beating it with threads, nor is it useful to process mouse movements at a faster rate than the user generates them.
A special case of I/O boundedness is graphics computation. Some may call this computation bounded, but since it happens across the I/O bus, I'm lumping it in with I/O. Explicit threading isn't going to help here -- it's a good thing the GPU guys already have such a good handle on parallelism in their architectures, APIs, and JITs. Parallelism helps a LOT in that kind of extremely regular, data-parallel raw computation.
*Memory bound computation: The limiting factor for these computations is either:
1) The round-trip time to memory (memory latency): e.g.: linked list traversal
2) The rate at which values can be fetched from memory (memory bandwidth): e.g.: array traversal.
In either case, Little's law sets the bound: the number of memory requests you can have in flight equals bandwidth times latency. For case 1 that's a hard limit on a single thread -- each load depends on the previous one, so the thread never has more than one request outstanding and its throughput is capped at one access per round trip. For case 2 it's a hard limit on any number of threads, since no amount of extra concurrency raises the memory system's peak bandwidth. The memory sketch after this list contrasts the two.
*Computation bound computation: These are actually exceedingly rare. To be truly computation bound, an application must have a working set small enough to reside almost entirely in on-chip cache (so it suffers from neither memory latency nor bandwidth constraints), and it must also saturate the computational capabilities of its host (Nehalem, for instance, is no slouch). Threads can help a lot here as well: partitioning the computation among multiple threads in this category can yield linear speedup (see the compute-bound sketch after this list).
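To make the I/O-bound case concrete, here's a minimal C++ sketch of overlapping independent file reads, one outstanding request per file. The file names are placeholders, and std::async is just the simplest way to get concurrent blocking reads; the point is that concurrency raises throughput only until the link or server saturates.

```cpp
// A minimal sketch (hypothetical file names): overlapping independent file reads.
// On a network file system, issuing the requests concurrently lets the server
// service them in parallel; it does not raise the link's raw bandwidth.
#include <cstddef>
#include <fstream>
#include <future>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::string read_whole_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();               // blocks on I/O; the thread mostly just waits
    return buf.str();
}

int main() {
    std::vector<std::string> paths = {"a.dat", "b.dat", "c.dat"};  // placeholder names
    std::vector<std::future<std::string>> pending;
    for (const auto& p : paths)      // one outstanding request per file
        pending.push_back(std::async(std::launch::async, read_whole_file, p));
    std::size_t total = 0;
    for (auto& f : pending) total += f.get().size();
    std::cout << "read " << total << " bytes\n";
}
```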
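For the memory-bound case, a small sketch contrasting the two sub-cases: pointer chasing, where each load's address depends on the previous load and a single thread can never have more than one request in flight, versus streaming through an array, where the loads are independent and the limit is raw bandwidth. The data layout here is made up purely for illustration.

```cpp
// A minimal sketch contrasting the two memory-bound cases from the list above.
// Case 1 (latency bound): dependent loads, serialized round trips to memory.
// Case 2 (bandwidth bound): independent loads, easily prefetched and pipelined.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Case 1: traverse a "linked list" encoded as a chain of indices.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t start, std::size_t steps) {
    std::size_t i = start;
    for (std::size_t s = 0; s < steps; ++s)
        i = next[i];                 // each load's address depends on the previous load
    return i;
}

// Case 2: stream sequentially through an array.
long long stream_sum(const std::vector<long long>& a) {
    long long sum = 0;
    for (long long x : a) sum += x;  // independent loads; the limit is bandwidth
    return sum;
}

int main() {
    const std::size_t n = 1 << 20;
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});  // randomized chain
    std::vector<long long> a(n, 1);
    std::cout << chase(next, 0, n) << ' ' << stream_sum(a) << '\n';
}
```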
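And for the compute-bound case, a minimal sketch of static partitioning across hardware threads: a tiny working set, pure arithmetic, and one disjoint slice of the iteration space per thread. The kernel itself (summing sines) is an arbitrary stand-in; the partitioning structure is what matters.

```cpp
// A minimal sketch: a compute-bound kernel (tiny working set, lots of arithmetic)
// partitioned statically across hardware threads. With essentially no memory
// traffic to fight over, this is the case where speedup tends toward linear.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    const long total_iters = 20000000;  // enough work to be worth partitioning
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([t, nthreads, total_iters, &partial] {
            double acc = 0.0;
            // Each thread gets a disjoint slice of the iteration space.
            for (long i = t; i < total_iters; i += nthreads)
                acc += std::sin(i * 1e-6);   // pure arithmetic, no shared data
            partial[t] = acc;                // one write per thread at the end
        });
    }
    for (auto& w : workers) w.join();
    double sum = 0.0;
    for (double p : partial) sum += p;
    std::cout << "sum = " << sum << " using " << nthreads << " threads\n";
}
```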
Something we haven't considered much so far is communication and synchronization. I think we'll probably all agree that the latter is pretty d*mn hard, and the former rather complicates simple models of how computation works -- but they're there, and both affect how well an application parallelizes.
So, in closing: to find the parallelism, you must understand where your bottlenecks are and parallelize accordingly. The biggest challenge is this: how do we marry a deep understanding of these fundamental performance bounds with a convenient high-level abstraction for everyday programmers?