Functional Programming


degibson

Golden Member
Mar 21, 2008
1,389
0
0
Originally posted by: Modelworks
The answer I got a lot of the time was that the program was designed with single cores in mind and it is too hard to go back and break up the task to target multiple cores. I think the holy grail will be a processor that can divide a single task among multiple cores without the aid of the programmer. Until then I am really starting to look more at the GPU and things like OpenCL, which just released its specs. http://www.khronos.org/registr...pecs/opencl-1.0.29.pdf

For those that are inclined, look into a research proposal from the 90's called 'Multiscalar'. Just google it -- it will pop up.

Originally posted by: Ken g6
There's a lot more history to CPU design than just adding instructions or the GHz race.

The funny thing about all this history is that it led back to thread-level parallelism. It turns out that no matter what the advances are in computer architecture, the advances in VLSI always trumped them -- hence the GHz race.

Unfortunately, we then hit the power wall. We just can't clock transistors that fast and provide them with power at the same time. Major bummer. Stupid CMOS...

Believe me, I'd love to invent an architecture that trumps all previous architectures at leveraging parallelism in serial code. It turns out that's a really hard thing to do.

Originally posted by: Markbnj
I bring this up because the UI and the way it interacts with the application and user offers a whole set of its own opportunities and challenges for parallelism.

This is actually pretty awesome. Here's an idea: what can we do to leverage the existing expertise in UI optimization and turn that into performance optimization? It's actually the same insight, but for a different reason: instead of optimizing the user experience through less latent I/O handling, let's use that same expertise to optimize computation. Key questions:

*Will that expertise survive a few orders of magnitude of latency? For a delay to be noticeable to a human, it has to take ~100 ms. For thread-level parallelism to be useful, we're talking 1-10 ms (or more) worth of computation.

*Will the same intuition apply? If so, how do we express it?

Orthogonally: Would the entire thing be fixed if we changed the event handling model to allow simultaneous handling of multiple events?

 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: degibson
Orthogonally: Would the entire thing be fixed if we changed the event handling model to allow simultaneous handling of multiple events?

If I am not mistaken, this is already done - at least in .NET. You can have a multicast delegate pointing to multiple methods, and as soon as the underlying event is raised, all the event handlers can be invoked concurrently (assuming you're on a multicore/multiprocessor PC and the ThreadPool has enough resources). In essence, the same notion as FP - multiple functions being called in tandem.
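For readers outside .NET, the fan-out Dhaval00 describes can be sketched in any language with a thread pool. A toy Python analogue (the `Event` class and its method names are invented for illustration; this is not the .NET delegate API):

```python
from concurrent.futures import ThreadPoolExecutor

class Event:
    """Toy multicast 'delegate': raising the event fans handlers out to a pool."""
    def __init__(self, pool):
        self._pool = pool
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def raise_event(self, payload):
        # Each handler runs on its own pool thread, concurrently.
        return [self._pool.submit(h, payload) for h in self._handlers]

with ThreadPoolExecutor(max_workers=4) as pool:
    ev = Event(pool)
    ev.subscribe(lambda x: x + 1)
    ev.subscribe(lambda x: x * 2)
    results = [f.result() for f in ev.raise_event(10)]

print(results)  # [11, 20]
```

The handlers here are independent pure functions, which is exactly why the fan-out is safe; the trouble starts when they touch shared state.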
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Originally posted by: Dhaval00
Originally posted by: degibson
Orthogonally: Would the entire thing be fixed if we changed the event handling model to allow simultaneous handling of multiple events?

If I am not mistaken, this is already done - at least in .NET. You can have a multicast delegate pointing to multiple methods, and as soon as the underlying event is raised, all the event handlers can be invoked concurrently (assuming you're on a multicore/multiprocessor PC and the ThreadPool has enough resources). In essence, the same notion as FP - multiple functions being called in tandem.

Big difference: The functions aren't functional. I assume even beautiful old C# can still have data races -- how is synchronization done in this case?
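A minimal sketch of the race degibson is asking about, in Python (illustrative; the same lost-update pattern and the same lock-based fix apply to C# event handlers):

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_handler():
    global counter
    for _ in range(100_000):
        counter += 1              # read-modify-write; interleaved threads lose updates

def safe_handler():
    global counter
    for _ in range(100_000):
        with lock:                # serializes the read-modify-write
            counter += 1

# Run the synchronized version: four "handlers" hitting shared state.
threads = [threading.Thread(target=safe_handler) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 every time; with unsafe_handler, often less
```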
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: degibson
Originally posted by: Dhaval00
Originally posted by: degibson
Orthogonally: Would the entire thing be fixed if we changed the event handling model to allow simultaneous handling of multiple events?

If I am not mistaken, this is already done - at least in .NET. You can have a multicast delegate pointing to multiple methods, and as soon as the underlying event is raised, all the event handlers can be invoked concurrently (assuming you're on a multicore/multiprocessor PC and the ThreadPool has enough resources). In essence, the same notion as FP - multiple functions being called in tandem.

Big difference: The functions aren't functional. I assume even beautiful old C# can still have data races -- how is synchronization done in this case?

Of course. That's an issue I already alluded to in one of my earlier posts. There will be race conditions/synchronization issues, unless the functions are themselves mutually exclusive - which is hard to imagine, especially when you're doing things like working on the same backend or working on a common file system.
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
But the key is that outside of data acquisition and physical control applications events are almost always caused by user actions. Who cares if we can handle multiple events simultaneously if the user can't think about two functions at the same time? I think the current UI model works well, and I don't expect a big push for more multi-threading in that layer.

For that matter, who says performance is even an issue? Why is Moore's law even relevant anymore, to anyone not involved in marketing next-generation processors? Leaving specialized engineering and scientific applications out of it, I would argue that current generation CPU performance is sufficient for all desktop and mobile computing needs, and will be for some time to come.
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Originally posted by: Markbnj
For that matter, who says performance is even an issue? Why is Moore's law even relevant anymore, to anyone not involved in marketing next-generation processors? Leaving specialized engineering and scientific applications out of it, I would argue that current generation CPU performance is sufficient for all desktop and mobile computing needs, and will be for some time to come.

You could be right about this. If you are, it dramatically changes the business model for chip manufacturers.
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: Markbnj
But the key is that outside of data acquisition and physical control applications events are almost always caused by user actions. Who cares if we can handle multiple events simultaneously if the user can't think about two functions at the same time? I think the current UI model works well, and I don't expect a big push for more multi-threading in that layer.

For that matter, who says performance is even an issue? Why is Moore's law even relevant anymore, to anyone not involved in marketing next-generation processors? Leaving specialized engineering and scientific applications out of it, I would argue that current generation CPU performance is sufficient for all desktop and mobile computing needs, and will be for some time to come.

I can agree with you if we're only talking about UI applications. But there is so much out there beyond just UI - I have hardly seen an application in the past year or so that came out of the team I work on which doesn't have multicast delegates.

Thinking outside of forms (be it Windows or Web), I could seriously use more horsepower. I mean we have backend applications that do datawarehousing tasks and that crap is back-logged by two days. And these are systems running on top-notch SAN's, multiple quad cores and gigs and gigs of RAM. Of course, there are other bottlenecks, but as the data needs grow (along with the data itself), the number of threads and processors will only grow.

I could give a more elaborate example, but I don't feel like typing right now. But to get to the gist of it, I think we need more processors... of course, there is the whole business side - you need to fund R&D, and the current generation feeds the next (I know it sounds childish, but in a way that's how it is). I guess what I am getting at is that I don't care about Moore's Law, but I care about innovation (which comes from competition, and vice versa).


In other news: http://buffalobeast.com/133/bigthree.jpg :)
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
Thinking outside of forms (be it Windows or Web), I could seriously use more horsepower. I mean we have backend applications that do datawarehousing tasks and that crap is back-logged by two days. And these are systems running on top-notch SAN's, multiple quad cores and gigs and gigs of RAM. Of course, there are other bottlenecks, but as the data needs grow (along with the data itself), the number of threads and processors will only grow.

That's a fairly loose definition of horsepower, though. There are a lot of areas in which information technology has room to achieve order of magnitude improvements. Storage is amazingly slow given the data speeds we're getting on the mainboard now. So is the network. I'm sure those things will continue to improve, allowing the interfaces to catch up to the potential speed of the devices. I'd go out on a limb and bet your data warehousing processes aren't CPU bound.

At the UI... there's a lot worth talking about with respect to the user-facing layer. But I'm afraid it would devolve into a philosophical debate. What I would actually be interested in is hearing about some of the problems your team has solved using multiple threads to respond to events, in as much detail as your time and disclosure restrictions will allow.
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Here is an example - there are many others, but this is one of the most recent ones:

We needed to gather data from disparate sources and populate three separate DB's (each department has its own DB that it budgeted for). Sources of data include XML files, MSMQ messages, Web Service calls, and four external DB's. The three dept. DB's archive billions of rows - we account for Type 2 updates (here is a link for folks who don't know what this is: http://en.wikipedia.org/wiki/Slowly_Changing_Dimension), which was a business decision. In 95% of the data warehousing (DW) tasks I have worked on, most customers go with Type 2 updates. Each dept. DB has a corresponding (in DW terms) staging DB. A staging DB allows you to bring in updates on an incremental basis - no one wants to scan a billion rows. As a side note, most of the backend is hosted on SAN's, so disk IO lag is relatively on the lower side.
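For readers following the Wikipedia link, a Type 2 update in miniature is: close out the current row, append a new versioned row, never overwrite history. A toy in-memory sketch (invented field names, not the actual schema):

```python
from datetime import date

# Toy dimension table: one row per version of a customer attribute.
dim = [
    {"key": 1, "city": "Austin", "valid_from": date(2007, 1, 1),
     "valid_to": None, "current": True},
]

def scd2_update(table, key, new_city, as_of):
    """Type 2 update: close the current version, append a new one."""
    for row in table:
        if row["key"] == key and row["current"]:
            if row["city"] == new_city:
                return                      # nothing changed; keep history as-is
            row["valid_to"] = as_of         # close out the old version
            row["current"] = False
    table.append({"key": key, "city": new_city, "valid_from": as_of,
                  "valid_to": None, "current": True})

scd2_update(dim, 1, "Dallas", date(2008, 6, 1))
# The Austin row is closed, Dallas is now the current version; both survive.
```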

The primary tool used for ETL (Extract-Transform-Load) operations is SSIS. The whole design from top-to-bottom is too complex to explain, but I'll try and concentrate on the multithreading pieces:

1) Of the four external DB's, two are independent while the other two are related. When an ETL job starts, SSIS takes parallel paths and spawns off 'tasks' (threads) assigned to begin gathering data from the four DB's. These tasks are independent and can execute parallel to each other. These tasks ultimately populate the staging DB's.
2) We have to call around seven Web Services at the same time to retrieve XML blobs - we do various things ranging from running XPath queries to cleaning up the data. This, in my opinion, is the one task where we had the finest control with regards to threading. The XML blobs are sent in chunks because of XML's verbose nature and other reasons I can't explain here. The Windows service has a timer which goes off every night. The eventhandler spawns off multiple threads, each responsible for the Web Service calls. The threads call the Web Services asynchronously which, in essence, puts them at the mercy of the ThreadPool, but we have never had problems with this logic. The files (chunks) are written to disk in the order in which they're received. At the tail end, the data selection and cleaning is done using separate threads - the idea being that one thread can be receiving and saving the data while another starts cleaning up files in parallel. Once the cleaning is done, the thread spawns off an SSIS package (XML task) which ultimately imports that data into the staging DB's - SSIS is built on top of .NET 2.0, so it has its own managed API. There is way too much logic involved here, and for the purposes of this discussion, some of the things I have said don't reflect completely the things we actually do - they simply portray a high-level view. One thing I forgot: this step can execute concurrently with step 1 - I say 'can' because as the DB's have grown in size, we've seen the two overlap each other.
3) Yet another SSIS package is responsible for pulling messages from an MSMQ queue. This is similar to step 2 in that ultimately it is an XML blob that we're importing from the MSMQ.
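The fan-out/join shape of step 1 - independent extracts run in parallel, then joined before the staging load - can be sketched as (stand-in fetch functions, not SSIS tasks):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the four independent external-DB extracts.
def fetch(source):
    return f"rows from {source}"   # imagine a query against one external DB

sources = ["db_a", "db_b", "db_c", "db_d"]
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    # The extracts run concurrently; map() joins them back in order.
    staged = list(pool.map(fetch, sources))

print(staged)
```

Because the four extracts share nothing until the join, no synchronization is needed beyond the join itself - which is what makes this the easy kind of parallelism.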

Once the staging DB is done loading, we make incremental loads to the three behemoths. This is the biggest bottleneck, even though the three DB's are loaded in parallel. Because we needed to maintain type 2 updates, the data files are partitioned. This complicates the reporting piece, which can be a story of its own, but ultimately it eases the overall burden in the ETL phase.

Believe me, I have stripped out so many details that the explanation above is totally childish. But I hope it portrays the bigger picture. We're currently working on making use of things like SQL Server Service Broker (don't know the performance implications yet), but over the lifespan of such projects, the complexity of running tasks in parallel has only increased.

I say we need more horsepower from the standpoint of fitting all these components on a single machine - we can 'scale out' or 'scale up'. With the scale out approach, the bandwidth bottleneck is by far the biggest pet peeve. We already have clients who desire real-time datawarehousing! I hardly think the scale out approach will be feasible in that case (but testing may have other things in store for us).

In any case, let me know if something isn't clear in the discussion above. As a side note, we're in the process of replacing the current Web Service handlers with WCF-based calls... testing shows WCF-binary calls are 3-4 times faster over the wire.
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
Interesting. Thanks for posting that description. I can see that it took some time and thought. My first reaction is that I'm curious why you decided to create threads instead of processes for those tasks? Do you have some inter-task control channel that you didn't want the IPC overhead on? I would think processes would be a more robust way to deal with what you've described, although perhaps not the XML fetch-assemble task.

I'm also curious as to why you have a goal of running this stuff on a single box. Perhaps the same control-channel overhead I mentioned above. It sounds like a perfect situation for horizontal scalability on cheap pizza boxes.

In any case what you have are a bunch of I/O tasks that are a prereq for the database updates, so obviously a perfect candidate for a parallel approach. But it still really sounds like the limit is I/O throughput, and what you're accomplishing with your task-based architecture is to overcome those I/O limitations through parallel execution. In other words, you have a lot more processor and memory bandwidth than you do disk or network I/O bandwidth, so you stack the tasks up.

Utilizing parallelism through tasks and services (processes) to overcome I/O limitations has been around a long time. I would say that most of the time it really doesn't involve the same issues that spawned the discussion of FP as a solution, because the tasks probably aren't hitting shared data structures. They are hitting shared database and file system resources, but I'm going to argue that this is a larger-grained, better-understood problem to solve.

If we're talking about multiple-cores and parallelism as the new performance paradigm, then almost by definition what we're talking about is using those techniques to make a single unit of work process faster than it could if serially executed on a single CPU. So I think we can stipulate for the remainder of this discussion that parallelism in the I/O and UI layers has its accepted uses, and previously described challenges.

The question that remains, then, is whether once we are at the limit of how fast a CPU can run a single thread, whether putting threads to work on the same payload will allow that limit to be exceeded (assuming additional cores to do the work)?

I can think of clear examples where it should work very well, and others where I'm pretty sure it won't ever work. I mentioned decoding streams and encrypting files in a previous reply. One of the areas where I think it should work very well is search. The benefits are obvious if the search is I/O bound, but even if it's in-memory you should be able to decompose the workload by carving up the real estate to be searched.
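The search case above decomposes cleanly because the "real estate" can be carved into disjoint slices. A minimal Python sketch of that decomposition (toy data, not a real index):

```python
from concurrent.futures import ThreadPoolExecutor

def search_chunk(args):
    chunk, needle = args
    return [rec for rec in chunk if needle in rec]

data = [f"record-{i}" for i in range(1_000)]
needle = "99"
n_workers = 4
# Carve up the search "real estate": each worker scans a disjoint slice.
chunks = [data[i::n_workers] for i in range(n_workers)]
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    parts = pool.map(search_chunk, [(c, needle) for c in chunks])
    hits = [h for part in parts for h in part]

print(len(hits))  # 19 records contain "99"
```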

But what percentage of the applications out there are really just I/O and storage systems waiting on stuff to happen? It has to be 75%+. What percentage are actually running CPU-intensive tasks and saturating a single core with a single thread of execution? I'd bet less than 10%. So are we only having this discussion because the chip guys and the tech media get into a circle jerk over CPU speed?

Not to deprecate that 10%, but all that stuff is in areas that are massively researched and pretty damn well understood. They are already using massively parallel architectures to do the really computationally difficult stuff anyway.
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Originally posted by: Dhaval00
Here is an example - there are many others, but this is one of the most recent ones...

Neat. Thanks for the description.

Originally posted by: Markbnj
Interesting. Thanks for posting that description. I can see that it took some time and thought. My first reaction is...

I agree with all of this. Good questions, observations.


 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
I definitely omitted the details, and thanks for noticing. In SSIS, there is this whole notion of 'parent' and 'child' packages. When you invoke a 'child' package, SSIS will launch it in its own process space - at that point the maximum resources that are allowed by the OS are allocated to the process.

Even the Windows Service - as I said, there are too many details not included. For example, we have our own generic 'plugin' architecture. At run time, Reflection is used to launch some of these plugins into their own AppDomains [again, there is the whole debate over Process vs. AppDomain] - the reasoning being that some of the AppDomains need to talk to each other, and it makes sense to not induce the overhead of communicating between processes (there are some very nice white papers on MSDN regarding this topic). Plus, on 64-bit OSs, processes are not limited by the 2GB cap, so it doesn't make sense to create, say, 10 AppDomains as opposed to 10 processes.

Yes, and I agree the biggest limit is the IO throughput, but the SAN approach also gives us craploads of network/disk bandwidth. We're already thinking of improving our future designs. One idea we came up with was to think in terms of multiple partition files - right now the entire DB is updated [since it is done incrementally, only the last partition is looked at]. However, we could fetch multiple partitions, and assuming there is still space available on those partitions [SQL Server's FILL FACTOR], we could launch updates in tandem. But again, all good on paper.

We have thought of the approach of scaling out [in fact a couple of earlier implementations I worked on were scaled out], but it increases complexity greatly, and as I mentioned earlier, network bandwidth is our biggest pet peeve. Also, we can't justify the costs once we start involving things like SAN's, backups, maintenance in general (updates, patches, etc.).


Back to FP, and your question: I think multiple cores is definitely the way to go. I took a couple of VLSI design courses, but don't remember all the nitty-gritty details - one thing that stood out, however, was how much easier the manufacturing/certification process was with multiple cores. Also, one reason why Intel/AMD have been increasing the cache is so that context switches across cores are super-fast [as opposed to across independent processors]. But even then, I think there will be an upper limit, unless we keep decreasing the die size (pico someone!). I guess this is somewhat analogous to that whole argument of AppDomain vs. Process at the OS level. As a side note, VLSI design is not my forte any more and things might have changed drastically over the past three years - please correct me if something I said was off-base.

When it comes to CPU-intensive tasks, I'd say there has to be more than 10% - at least from the place where I stand. For example, I observe 16/24 cores at 100% on a daily basis :). When you think about it, almost all the front-end UI's talk to a backend server - be it Web, DB, etc. Even those entities require CPU - we have cores dedicated to DB processing and ASP.NET processing. So, I will agree that IO and storage systems are the bigger bottleneck, but the numbers you posted have to be from the year 2002 ;).

Regarding FP, I already think it can't be mainstream on its own. Simply put, it will be used as a catalyst in the overall programming realm. Doesn't the fact that the whole idea is based on immutable objects make FP the new C# [value-based] struct? We are already using F# constructs in C# via things like lambda expressions and LINQ. Maybe we're all stuck in state, but state is something that can't be ignored - hence the reason why HTML and CSS need a server backend. Also, FP, to a certain level, advertises recursion as being good. I have nothing against recursion, but I can already visualize the huge number of bugs if this is not handled properly at the compiler level and left to the programmer.
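The "FP constructs via lambdas and LINQ" point translates to any language with first-class functions. A small pure pipeline in Python terms (hypothetical order data; each stage builds a new value instead of mutating the input):

```python
from functools import reduce

# Hypothetical order data; a tuple keeps the input collection immutable.
orders = ({"id": 1, "total": 40}, {"id": 2, "total": 120}, {"id": 3, "total": 75})

# Pure pipeline: filter then fold, no order is mutated anywhere.
big = tuple(filter(lambda o: o["total"] > 50, orders))
revenue = reduce(lambda acc, o: acc + o["total"], big, 0)

print(revenue)  # 195
```

Because no stage writes to shared state, the filter step could in principle be split across cores without locks - which is the parallelism argument for FP in a nutshell.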

I forget where I was reading this, but I believe the discussion was about Haskell - someone was saying that it uses monads to manage state and that state is a 'side effect'. I was furious for a moment, because to me state is not a side effect. More so because the individual was somehow trying to prove that FP is somehow greater than imperative programming (IP). I still think that FP will have to co-exist with IP in order for it to become mainstream.
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Originally posted by: Dhaval00
I observe 16/24 cores at 100% on a daily basis :). When you think about it, almost all the front-end UI's talk to a backend server - be it Web, DB, etc. Even those entities require CPU - we have cores dedicated to DB processing and ASP.NET processing.

Wow! If you see 100% utilization, then it would be really, really cheap to increase your performance by adding more cores and memory -- probably in another box or two, or more.


 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: degibson
Originally posted by: Dhaval00
I observe 16/24 cores at 100% on a daily basis :). When you think about it, almost all the front-end UI's talk to a backend server - be it Web, DB, etc. Even those entities require CPU - we have cores dedicated to DB processing and ASP.NET processing.

Wow! If you see 100% utilization, then it would be really, really cheap to increase your performance by adding more cores and memory -- probably in another box or two, or more.

Sure. Usually, we contract with clients to upgrade their core systems every three years on average. And some of these projects involve too many requirements - anyone who has experience dealing with projects should be able to relate. Most of the time, during the span of a project, hardware is like the last thing on management's mind (even in cases like the one above where hardware should be the first thing spec'd out). Usually, this is something that is weighed in with other factors like costs of extending the SAN topology, programming overheads to synchronize across machines, etc. I, personally, have only scratched the surface - there are people on the team who have done this thing for 15 years. I usually weigh in on the hardware front, but I save my intensity and anger for code and design reviews ;).

In any case, I guess I exaggerated a bit with the '100%'. When I say 100%, I don't mean the system is pegged at 100% for, say, 20 hours a day. The majority of the time this happens when some of the jobs end up clashing with each other at night (which is about three hours). The performance counters on a monthly basis average to ~50% CPU utilization, which is still below the threshold of 75-80%.

Also, I specifically say 16 cores because that is how the SQL Server DB instances are allocated to the CPU cores.
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
It's a very interesting set of applications, Dhaval00, and I don't doubt that you are using quite a lot of CPU to do all that work. From my perspective most of what you're doing is multitasking. In systems where you basically want to do lots of different things at the same time you should benefit more or less linearly from added cores/threads. After thinking about this discussion a bit yesterday I think we can divide performance into several categories:

1. Heterogeneous distributed systems like yours, where running more processes across more cores, with faster I/O channels, is generally the accepted approach to speeding things up.

2. Event-driven applications that spend most of their time waiting. Few of these really have any performance demands that aren't satisfied by current generation CPUs.

3. I/O intensive applications like a DBMS where threads and processes are utilized on top of a sophisticated shared file system. These are already highly parallel.

4. Computationally intensive algorithms, like codecs, encryption, searching and sorting, etc. Most of these are capable of saturating a single core on a modern CPU, and most are really hard to move to a parallel architecture.

5. Advanced scientific or engineering computation where even massively parallel architectures aren't fast enough.

The big benefits from increasing the core count over time will come in 1, 3, and 5. 2 will never get faster, and doesn't need to. 4 is where the hard questions about parallelism come into play.

So, to haul the discussion back around to its starting point: does FP offer any breakthrough techniques that make it easier to parallelize "hard" transformation algorithms like encryption or a codec? What makes those things hard to chunk is that they are stateful streams.
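A toy illustration of why stateful streams resist chunking: in a CBC-style chained transform, block i needs block i-1's output, so the loop is inherently sequential; a counter-style transform depends only on the block index, so blocks can be processed independently. (XOR stand-ins below, not real encryption.)

```python
def chained(blocks, iv):
    # CBC-like: each output feeds the next step, so block i waits on block i-1.
    out, prev = [], iv
    for b in blocks:
        prev = b ^ prev
        out.append(prev)
    return out

def counter_mode(blocks, key):
    # CTR-like: block i depends only on (key, i), so blocks parallelize trivially.
    return [b ^ (key + i) for i, b in enumerate(blocks)]

blocks = [5, 9, 12]
print(chained(blocks, iv=7))        # [2, 11, 7]
print(counter_mode(blocks, key=3))  # [6, 13, 9]
```

This is why real-world codecs and ciphers that want parallelism are designed around index-addressable state (keyframes, counter modes) rather than pure chaining.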
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: Markbnj
So, to haul the discussion back around to it's starting point: does FP offer any breakthrough techniques that make it easier to parallelize "hard" transformation algorithms like encryption or a codec? What makes those things hard to chunk is that they are stateful streams.

Well, I don't know the ins and outs of FP, but if managing 'state' in FP will be like installing a plugin for a browser, I think it'll hardly be groundbreaking at the onset. I mean I am not against it, especially now that I have seen how concepts of FP have been put to use in .NET 3.5. It's just that I'd rather have a huge incentive if I am going to invest crap loads of time trying to learn something. So, on one hand, it *does* offer breakthrough concepts, but on the other it falls short along the 'state' front. Which is just hard for me to envision because I live and breathe 'state'.

When it comes to actual usage, I'd rather test it to see how it works on things like encryption (part of the example above includes a task that encrypts and decrypts things like users' SSNs). I won't be surprised if there are ways to chunk information and then amalgamate it once a function [in FP] finishes execution. I mean we see it in things like MSMQ in WCF and BizTalk Server, where messages suffer from the 4MB limit - if the MSMQ pipe comes across a file greater than 4MB, it'll chunk the file into transactional pieces and append a header to each piece indicating 'to whom' the chunk belongs. At the receiving end, the client/server puts these chunks together and sends the file off to its destination.
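The chunk-with-header scheme described above can be sketched generically (invented header fields; not the actual MSMQ/WCF wire format):

```python
import random

def chunk(payload, msg_id, size):
    pieces = [payload[i:i + size] for i in range(0, len(payload), size)]
    # Each chunk carries a header saying which message it belongs to and its order.
    return [{"msg": msg_id, "seq": i, "data": p} for i, p in enumerate(pieces)]

def reassemble(chunks):
    ordered = sorted(chunks, key=lambda c: c["seq"])
    return b"".join(c["data"] for c in ordered)

payload = b"x" * 10_000            # pretend this exceeds the queue's message limit
parts = chunk(payload, msg_id="job-42", size=4_096)
random.shuffle(parts)              # chunks may arrive out of order
assert reassemble(parts) == payload
```

Note the 'state' lives entirely in the headers (`msg`, `seq`), which is exactly the dependency that the transactional layer has to protect.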

But then in the approach above, you're still dependent on the 'state' of each chunk. Not to mention, you induce the transactional layer. So yeah, it is doable, but can it be done in ways we haven't seen before?

As a side note, I'd be interested in thoughts from people who work on *nix variants - my discussion has been specific to the PC world.
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
I work almost exclusively on Linux and Solaris these days. None of my thoughts at the moment are *nix-specific, however.

Originally posted by: Dhaval00
Well, I don't know the ins and outs of FP, but if managing 'state' in FP will be like installing a plugin for a browser, I think it'll hardly be groundbreaking at the onset. I mean I am not against it, especially now that I have seen how concepts of FP have been put to use in .NET 3.5. It's just that I'd rather have a huge incentive if I am going to invest crap loads of time trying to learn something. So, on one hand, it *does* offer breakthrough concepts, but on the other it falls short along the 'state' front. Which is just hard for me to envision because I live and breathe 'state'.

I, too, live and breathe 'state'. I think that's why I've always found it hard to grasp FP. I don't think 'state' can really be abstracted as a plugin to a functional language -- I think that when you add state to FP, FP loses some of its desirable parallelism properties (though, not being a language person, I have no idea what those are). I do know that whenever state is added to a parallel execution, that state needs to be synchronized: either by not sharing the state at all, by structured temporal sharing, or by unstructured temporal sharing with locks or TM.

Unshared data is easy, provided we can somehow communicate to the compiler that it is unshared and can also guarantee that at runtime. Shared data brings us back to one of the fundamental problems with parallel von Neumann programming, and compilers still have no idea how to handle it both fast and correctly.
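The 'unshared data is easy' case in miniature: give each worker a private accumulator, and synchronize only once, at the combine step (a Python sketch of the map-reduce shape, not a compiler guarantee):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    s = 0                  # private, unshared accumulator: no locks needed
    for x in chunk:
        s += x
    return s

data = list(range(1_000))
chunks = [data[i::4] for i in range(4)]         # disjoint slices, nothing shared
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))  # the only synchronized step
print(total)  # 499500
```

Here the programmer, not the compiler, is the one guaranteeing the slices are disjoint - which is precisely the guarantee FP tries to get by construction.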

In short, I think if FP is the answer, then adding explicit state to FP is NOT the way to go.