programming and multi-core cpu's

her34

Senior member
Dec 4, 2004
581
1
81
1) i read people say that some programs are more inherently parallel than others. what makes some programs more easily optimized for multi-core and others not? can you give me an example of a desktop program that can't easily be made to utilize multi-cores?

i tend to think of things like word, where if you run spell check, you can just divide the pages by number of cores available and run task. when can you not just divide the work into equal pieces?

2) will optimizing for multi-cores hurt performance on cpu's with less cores? so for example if we reach the point where 4 cores are getting popular. app/game is programmed to divide out work evenly for 4 cores. if you now run the app on a 2 core or single core cpu, i would imagine the app to run really horrible because timing would be off. performance on 2 core would be worse than if the program was written with 2 cores in mind. so programmer will have to make a choice of how many cores to optimize for (optimize for 2 cores and 4 cores gain no performance, or optimize for 4 cores and alienate 2 core cpu's). and this gets worse when 6 or 8 cores come out.

this would be a headache compared to past where better cpu meant better performance without any work.
 

Soccerman06

Diamond Member
Jul 29, 2004
5,830
5
81
I would believe that in games, for example, you should be able to switch between 1, 2, 4, 8, and so on processes. This would make coding much more difficult but would benefit the consumer a lot more than just single-threading everything. Then again, I have no idea how coding works, so it might be just as hard/simple as making an app multi-threaded
 

pakigang

Member
Oct 31, 2004
51
0
0
the way i think about multicore is that the OS should be optimized for it rather than the application. Word doesn't know whether there's a single core or a multicore processor; it's the os's - the kernel's - job to handle how to divide the time of the cpu to process the information and at what priority.

As far as games are concerned, i think the game might not get full access to the processor, so dual core programming can be tough. Remember HAL (Hardware Abstraction Layer or something). I don't know how directx works internally, but if it could provide some api to the game programmers to access the processor directly then maybe there might be some chance.
 

itachi

Senior member
Aug 17, 2004
390
0
0
Originally posted by: her34
1) i read people say that some programs are more inherently parallel than others. what makes some programs more easily optimized for multi-core and others not? can you give me an example of a desktop program that can't easily be made to utilize multi-cores?
easier to explain with code..

typedef double real; // 'real' isn't a built-in type, so define it

real A [3][3] = {
{1, 1, 1},
{2, 4, 8},
{4, 8, 6}
};
real B [3][3] = {
{1, 1, 1},
{2, 2, 4},
{4, 8, 2}
};
real F [3][3];

// F = A x B
#define V_RMUL(rs) (A[0][_i] * B[rs][0] + A[1][_i] * B[rs][1] + A[2][_i] * B[rs][2]) // no space before the parameter list, or it isn't a function-like macro
for (int _i = 0; _i < 3; _i++) {
F [0][_i] = V_RMUL (0); F [1][_i] = V_RMUL (1); F[2][_i] = V_RMUL (2);
}

can easily be converted to run parallel by expanding the procedures.

struct Vector {real v1,v2,v3;};
struct Matrix {Vector a, b, c;};
Vector operator*(const Vector& v, const Matrix& m) { // return by value; returning a reference to a local would dangle
Vector nv;
nv.v1 = v.v1 * m.a.v1 + v.v2 * m.b.v1 + v.v3 * m.c.v1;
nv.v2 = v.v1 * m.a.v2 + v.v2 * m.b.v2 + v.v3 * m.c.v2;
nv.v3 = v.v1 * m.a.v3 + v.v2 * m.b.v3 + v.v3 * m.c.v3;
return nv;
}
Matrix operator*(const Matrix& m1, const Matrix& m2) { // likewise by value, and multiply by m2, not the global B
Matrix res;
res.a = m1.a * m2;
res.b = m1.b * m2;
res.c = m1.c * m2;
return res;
}
Matrix A = {{1, 2, 4}, {1, 4, 8}, {1, 8, 6}};
Matrix B = {{1, 2, 4}, {1, 2, 8}, {1, 4, 2}};
Matrix F = A * B;
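
To make the "run parallel" part concrete, here's a minimal sketch, assuming C++11 std::thread is available (it isn't part of the original snippet), that puts the three independent row products on their own threads and reuses the Vector/Matrix types above:

#include <thread>

// each row of the result is independent, so each one can go to its own core
Matrix parallel_mul(const Matrix& m1, const Matrix& m2) {
    Matrix res;
    std::thread t1([&] { res.a = m1.a * m2; });
    std::thread t2([&] { res.b = m1.b * m2; });
    std::thread t3([&] { res.c = m1.c * m2; });
    t1.join();
    t2.join();
    t3.join();
    return res;
}

For a 3x3 matrix the thread startup cost dwarfs the arithmetic, so this only pays off when the independent pieces are much bigger - which ties into the point below about the parallel version needing more resources for no measurable gain on a single core.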

i tend to think of things like word, where if you run spell check, you can just divide the pages by number of cores available and run task. when can you not just divide the work into equal pieces?
that's not something that would really utilize the benefits of multiple cores.. and i don't think spellcheck would really be able to benefit much, it would be limited a lot by memory bandwidth (unless somehow they manage to store the whole dictionary in cache).
2) will optimizing for multi-cores hurt performance on cpu's with less cores? so for example if we reach the point where 4 cores are getting popular. app/game is programmed to divide out work evenly for 4 cores. if you now run the app on a 2 core or single core cpu, i would imagine the app to run really horrible because timing would be off. performance on 2 core would be worse than if the program was written with 2 cores in mind. so programmer will have to make a choice of how many cores to optimize for (optimize for 2 cores and 4 cores gain no performance, or optimize for 4 cores and alienate 2 core cpu's). and this gets worse when 6 or 8 cores come out.

this would be a headache compared to past where better cpu meant better performance without any work.
programs that use multiple independent threads shouldn't be hurt much, if at all, when running on a single core architecture. ones that are designed to run on parallel architectures, like the matrix multiplication crap above, will most likely get hurt by the parallel optimizations. in the first part of the example i gave, everything in the loop will execute in parallel or relatively parallel internally. considering the number of instructions that would have to be computed, there'd be a decent gap between each branch computation and the cost of a branch won't be that high. in the second part, the code is significantly larger.. when run on a single core system, the program wouldn't be able to execute faster by any measurable amount compared to the first part.. but it would need more than double the bandwidth, which it probably wouldn't get.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: pakigang
the way i think about multicore is that the OS should be optimized for it rather than the application. Word doesn't know whether there's a single core or a multicore processor; it's the os's - the kernel's - job to handle how to divide the time of the cpu to process the information and at what priority.
It doesn't work that way. The OS just provides some functions for applications ("read from this file", "draw a window on the screen here", "send this data to IP abc.def.ghi.jkl"). For computationally-intensive apps (seti@home, prime95, media encoding), the vast majority of the time it's just the application's code running. The OS isn't really doing anything*. The program has a series of instructions, and the OS doesn't really know what they do, or how they accomplish their task (be it searching for aliens, finding primes, or playing a game). It can't split the instructions up into parallel streams (threads) to take advantage of multiple CPUs/cores - the application has to present threads to the OS if it wants them to run in parallel, and then the OS just has to figure out which threads to put on which cores.

*that isn't entirely true, but you'll have to trust me on it for now.

As far as games are concerned, i think the game might not get full access to the processor, so dual core programming can be tough. Remember HAL (Hardware Abstraction Layer or something). I don't know how directx works internally, but if it could provide some api to the game programmers to access the processor directly then maybe there might be some chance.
That's not really relevant here. You (the programmer) create a few threads, and Windows will do a pretty good job of spreading them over the various processors. If you want more control, there are a lot of hints you can give it to get it to do exactly what you want. "Direct access" doesn't help (and wouldn't do what you seem to think it would).

programs that use multiple independent threads shouldn't be hurt much, if at all, when running on a single core architecture.
That's not entirely true for two reasons. 1) if you have a single-threaded program, and break it up into threads, then run it on a single processor, you're going to be paying some overhead in context switches between threads. 2) There are many situations where the parallel algorithm to accomplish a task is slower than the sequential algorithm for the same task, unless you have multiple processors. If you wrote a sort routine that was 50% slower than quicksort but could take advantage of many threads, you'd use that if you targeted a multiple-processor machine, but unless you switched back to quicksort for the single-processor machine, you're going to be taking a big unnecessary performance hit.
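
For instance, a minimal sketch of picking the algorithm to match the machine at run time (the parallel_sort here is a hypothetical stand-in, not any particular library's routine):

#include <algorithm>
#include <thread>
#include <vector>

// sorts each half on its own thread, then pays for an extra merge pass -
// more total work than a single std::sort, but it can use a second core
void parallel_sort(std::vector<int>& v) {
    std::vector<int>::iterator mid = v.begin() + v.size() / 2;
    std::thread left([&] { std::sort(v.begin(), mid); });
    std::sort(mid, v.end());               // right half on the current thread
    left.join();
    std::inplace_merge(v.begin(), mid, v.end());
}

void sort_for_this_machine(std::vector<int>& v) {
    if (std::thread::hardware_concurrency() > 1)
        parallel_sort(v);                  // the extra merge pass is worth it with 2+ cores
    else
        std::sort(v.begin(), v.end());     // single core: stick with the plain sort
}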
 

cquark

Golden Member
Apr 4, 2004
1,741
0
0
Originally posted by: her34
1) i read people say that some programs are more inherently parallel than others. what makes some programs more easily optimized for multi-core and others not? can you give me an example of a desktop program that can't easily be made to utilize multi-cores?

i tend to think of things like word, where if you run spell check, you can just divide the pages by number of cores available and run task. when can you not just divide the work into equal pieces?

You've got the basics of the idea right there. If you can divide a task into largely independent pieces, then it's highly parallelizable. If you can't, it's not. To use your example, spell checking only cares about individual words, so it's highly parallelizable, while grammar checking needs to look at sentences, so it's not as parallelizable.

To use a more realistic example, many physics simulations divide the world into a grid and assign chunks of the grid to each processor on massively parallel supercomputers (thousands of processors). Performance is good as long as there's not a lot of interactions between chunks of the grid. When there are interactions, one thread has to wait for the other to complete, then they have to share data, and both actions hurt performance.
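
A toy sketch of that kind of decomposition on one machine, assuming std::thread and a 1-D grid (real codes are 3-D and usually use MPI, but the pattern is the same): each thread owns a chunk, only the cells at a chunk's borders need data from the neighbouring chunk, and the join at the end of every step is where the waiting happens.

#include <functional>
#include <thread>
#include <vector>

// one step: every cell becomes the average of itself and its two neighbours
void step(const std::vector<double>& cur, std::vector<double>& next,
          size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) {
        double left  = (i == 0) ? cur[i] : cur[i - 1];
        double right = (i + 1 == cur.size()) ? cur[i] : cur[i + 1];
        next[i] = (left + cur[i] + right) / 3.0;
    }
}

void simulate(std::vector<double>& grid, int steps, unsigned nthreads) {
    std::vector<double> next(grid.size());
    for (int s = 0; s < steps; ++s) {
        std::vector<std::thread> workers;
        size_t chunk = grid.size() / nthreads;   // assumes nthreads >= 1
        for (unsigned t = 0; t < nthreads; ++t) {
            size_t begin = t * chunk;
            size_t end = (t + 1 == nthreads) ? grid.size() : begin + chunk;
            workers.emplace_back(step, std::cref(grid), std::ref(next), begin, end);
        }
        for (auto& w : workers)
            w.join();                            // every chunk must finish before the next step
        grid.swap(next);
    }
}

On a cluster that border exchange is network traffic rather than a shared vector, which is why the amount of interaction between chunks decides how well it scales.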

Languages like Erlang, where the fundamental unit of programming is the thread, may help us devise parallel software more easily.
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
A kind of thing that's not easy to run in parallel would be some kinds of recurrences. Like the Fibonacci sequence, let's say, where F(n) = F(n-1) + F(n-2).
If you compute that linearly (starting with F(0)=1, F(1)=1), at every step you find the next term.
How would you do that in parallel? You would have to communicate the latest term from one processor to the other. Assuming you had fast interprocessor communication and a lot of computing work, you could rewrite the recurrence as F(n) = 2*F(n-2) + F(n-3). So one processor computes the odd-indexed terms, the other computes the even-indexed terms, and each has to read values computed a couple of steps earlier by the other.
Now, in the simplest case on today's processors, for small Fibonacci numbers, you just have to make a few adds - and that takes, let's say, some 25 cycles (taking into account loads, stores and pipeline length). However, an access to the cache of the other processor would take hundreds of cycles, so more time is wasted on waiting for the next result than on computing it.
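
For reference, the plain linear version is tiny, and the dependency is easy to see (a sketch, with F(0) = F(1) = 1 as above):

// every term needs the two terms computed immediately before it,
// so there is no independent work to hand to a second core
long long fib(int n) {
    long long prev = 1, cur = 1;          // F(0), F(1)
    for (int i = 2; i <= n; ++i) {
        long long next = prev + cur;      // F(i) = F(i-1) + F(i-2)
        prev = cur;
        cur = next;
    }
    return cur;
}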

In this particular case, you could devise a clever algorithm that could work twice as fast on two processors as on a single one. However, sometimes this simply is not possible.

Calin
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
And to answer your first question, the most easily optimised workloads for multiprocessor/multicore/etc are the Seti@home kind (workloads that can be loaded in several seconds, unloaded in even less, and are processed for hours and hours).
One of the processes I feel to be hard to run in parallel would be nuclear explosion simulations - when you start to have thousands and thousands (or more) of neutrons, moving in all directions through a load of nuclear fuel, each being able (or not) to create a fission in one of the atoms. Then, every neutron would need to interact with a part of the nuclear fuel that could have been affected (forced into fission) in the previous step by another neutron. This amounts to a small workload that needs to access huge data (the entire chunk of nuclear fuel), and if possible a version of the data that is current or at most a few "computing steps" old.
That was the difference between supercomputers of the past and supercomputers of the present - old ones seemed to have a better communication power/processing power ratio. And while some of the supercomputers need a lot of processing power, some need that communication power more than anything else (as clever algorithms cannot always compensate for delays in refreshing data).
 

itachi

Senior member
Aug 17, 2004
390
0
0
Originally posted by: Calin
One of the processes I feel to be hard to run in parallel would be nuclear explosion simulations - when you start to have thousands and thousands (or more) of neutrons, moving in all directions through a load of nuclear fuel, each being able (or not) to create a fission in one of the atoms. Then, every neutron would need to interact with a part of the nuclear fuel that could have been affected (forced into fission) in the previous step by another neutron. This amounts to a small workload that needs to access huge data (the entire chunk of nuclear fuel), and if possible a version of the data that is current or at most a few "computing steps" old.
mm.. i'd have to disagree. i don't think any single processor system would be capable of computing such simulations in a reasonable amount of time. what you described there would be a parallel job.. multiple computers would need to be running the simulation at the same time. communication between each process would be done through either a shared memory architecture, or MPI or some variant of it. multiprocessor and dual core systems have a shared memory architecture and would be more than capable of running the simulation (although.. for a single computer, having an extra cpu/core wouldn't change the fact that the simulation requires computations on millions, if not billions or trillions, of particles).
That was the difference between supercomputers of the past and supercomputers of the present - old ones seemed to have a better communication power/processing power ratio. And while some of the supercomputers need a lot of processing power, some need that communication power more than anything else (as clever algorithms cannot always compensate for delays in refreshing data).
now that's just wrong.. processing power is limited by the communications interface. if a processor can perform a certain computation in 1 ms, what good would it be if it took 1 s for it to reach the main node? another processor that's waiting on the results of the other computation would have to wait 2 seconds for the data to reach it.. now if the computations took an hour and would be independent of one another, the communications interface would be less important.. but if that were the case, the job wouldn't require a supercomputer and would run just as well on a cluster.
there are also physical limitations on how fast the communication interface can be and the number of cpus that can be placed in 1 main node.. which is why supercomputers are faster than clusters and why a lot of "supercomputers" are clusters of supercomputers.
 
Aug 23, 2005
200
0
0
mmm just think age of empires with the ai controlling the baddies in one core and the second core running the game, man fast and deadly ai .....

1 of about a trillion things they will be able to do in a few months...
 

Future Shock

Senior member
Aug 28, 2005
968
0
0
Originally posted by: Calin
And to answer your first question, the most easily optimised workloads for multiprocessor/multicore/etc are the Seti@home kind (workloads that can be loaded in several seconds, unloaded in even less, and are processed for hours and hours).
One of the processes I feel to be hard to run in parallel would be nuclear explosion simulations - when you start to have thousands and thousands (or more) of neutrons, moving in all directions through a load of nuclear fuel, each being able (or not) to create a fission in one of the atoms. Then, every neutron would need to interact with a part of the nuclear fuel that could have been affected (forced into fission) in the previous step by another neutron. This amounts to a small workload that needs to access huge data (the entire chunk of nuclear fuel), and if possible a version of the data that is current or at most a few "computing steps" old.
That was the difference between supercomputers of the past and supercomputers of the present - old ones seemed to have a better communication power/processing power ratio. And while some of the supercomputers need a lot of processing power, some need that communication power more than anything else (as clever algorithms cannot always compensate for delays in refreshing data).


As it turns out, some of the largest parallel processing machines are at Los Alamos simulating nuclear explosions and weapons design. So yes, it may be slightly inefficient, but it's the only way to get that much horsepower anymore. Before, they DID use non-parallel vector supercomputers (Cray X-MPs and older CDC Cyber 205s) for that type of work, but vector supercomputers are basically a dead breed - highly efficient at superpipelining complex math, but not good at much else...

Future Shock
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
Originally posted by: itachi
Originally posted by: Calin
One of the processes I feel to be hard to run in parallel would be nuclear explosion simulations - when you start to have thousands and thousands (or more) of neutrons, moving in all directions through a load of nuclear fuel, each being able (or not) to create a fission in one of the atoms. Then, every neutron would need to interact with a part of the nuclear fuel that could have been affected (forced into fission) in the previous step by another neutron. This amounts to a small workload that needs to access huge data (the entire chunk of nuclear fuel), and if possible a version of the data that is current or at most a few "computing steps" old.
mm.. i'd have to disagree. i don't think any single processor system would be capable of computing such simulations in a reasonable amount of time. what you described there would be a parallel job.. multiple computers would need to be running the simulation at the same time. communication between each process would be done through either a shared memory architecture, or MPI or some variant of it. multiprocessor and dual core systems have a shared memory architecture and would be more than capable of running the simulation (although.. for a single computer, having an extra cpu/core wouldn't change the fact that the simulation requires computations on millions, if not billions or trillions, of particles).
What I am trying to suggest is that those multiple computers are held back by the communication time. Think of all the benchmarks for supercomputers, and how they are only able to reach 10% to, let's say, 50% of the sum of the speeds of their individual processors. That is just time lost on communication tasks.

That was the difference between supercomputers of the past and supercomputers of the present - old ones seemed to have a better communication power/processing power ratio. And while some of the supercomputers need a lot of processing power, some need that communication power more than anything else (as clever algorithms cannot always compensate for delays in refreshing data).
now that's just wrong.. processing power is limited by the communications interface. if a processor can perform a certain computation in 1 ms, what good would it be if it took 1 s for it to reach the main node? another processor that's waiting on the results of the other computation would have to wait 2 seconds for the data to reach it.. now if the computations took an hour and would be independent of one another, the communications interface would be less important.. but if that were the case, the job wouldn't require a supercomputer and would run just as well on a cluster.
there are also physical limitations on how fast the communication interface can be and the number of cpus that can be placed in 1 main node.. which is why supercomputers are faster than clusters and why a lot of "supercomputers" are clusters of supercomputers.

I really don't have hard figures, but I think the supercooled supercomputers of the past (Cray) had better inter-processor communication capability, especially considering they had a smaller number of processors than the current 10,000-processor supercomputers.
The "speed" of a supercomputer in supercomputer benchmarks depends on the speed and bandwidth of the inter-processor communication. It seems possible that more "processing power" is now lost on communication tasks.
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
Originally posted by: Future Shock
Originally posted by: Calin
And to answer your first question, the most easily optimised workloads for multiprocessor/multicore/etc are the Seti@home kind (workloads that can be loaded in several seconds, unloaded in even less, and are processed for hours and hours).
One of the processes I feel to be hard to run in parallel would be nuclear explosion simulations - when you start to have thousands and thousands (or more) of neutrons, moving in all directions through a load of nuclear fuel, each being able (or not) to create a fission in one of the atoms. Then, every neutron would need to interact with a part of the nuclear fuel that could have been affected (forced into fission) in the previous step by another neutron. This amounts to a small workload that needs to access huge data (the entire chunk of nuclear fuel), and if possible a version of the data that is current or at most a few "computing steps" old.
That was the difference between supercomputers of the past and supercomputers of the present - old ones seemed to have a better communication power/processing power ratio. And while some of the supercomputers need a lot of processing power, some need that communication power more than anything else (as clever algorithms cannot always compensate for delays in refreshing data).


As it turns out, some of the largest parallel processing machines are at Los Alamos simulating nuclear explosions and weapons design. So yes, it may be slightly inefficient, but it's the only way to get that much horsepower anymore. Before, they DID use non-parallel vector supercomputers (Cray X-MPs and older CDC Cyber 205s) for that type of work, but vector supercomputers are basically a dead breed - highly efficient at superpipelining complex math, but not good at much else...

Future Shock

It's all the effect of the huge performance increase and significant price decrease in "desktop-level" (or workstation-level) processors compared to what dedicated supercomputer companies were able to design. So, aggregating together 10,000 "desktop" processors (even if Itanium is by no means a desktop processor, its price and performance are closer to the "desktop" range than the "supercomputer" range) is cheaper, uses less electricity (no cryogenic cooling), and delivers many more MIPS, both theoretical and in the application it needs to run.
On the other hand, optimizing workloads for massively multiprocessor supercomputers is more difficult than for vector supercomputers, but the decreases in run times are spectacular.
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
I just checked: the first three supercomputers (www.top500.org) use IBM PowerPC 440 processors @ 700MHz (65,536 and 40,000 of them for 1st and 2nd) and 10,000 Itanium 2 processors (the third). Only the fourth uses some NEC 1000MHz processors (5,000 of them), being the first entry with a less common processor.
Interestingly, #10 is a Cray XT3 running 2000MHz Opterons, one more example of a classical supercomputer builder going the route of "desktop" processors.
 

Titan

Golden Member
Oct 15, 1999
1,819
0
0
I'll write this reply from the PC world perspective. IBM has been doing multi-cores and multi-way supercomputing for years, as is noted above. Scientific things can be broken up nicely because there is so much information to crunch, but when it comes to personal uses, it's hard to make use of many cores to help a single task. We will get there, but probably very slowly.

Writing multithreaded apps is, in general, much harder than writing single-threaded ones if you want to take full advantage of all processors. It's easy to write a windows app (say MS Word) that runs spell-checking in the background in a separate thread, but from the outset, Word doesn't use a lot of CPU while it is waiting for you to type. The key to having multiple CPUs/cores is taking full advantage of them. There are a number of obstacles that must be overcome to thread a program easily.

The first thing you must realize is that a modern CPU is designed to run one program. Now if you have 2 CPUs running one program, they aren't designed to help each other at a low, hardware level so there is a lot of system task overhead like context switching that must be managed by the OS.

Another way of saying things is that having one CPU do one thing is easy, and having many CPUs each do one of many things is also easy. Having one CPU do many things is what we do today in a multitasking windows OS and it works pretty well, mainly because most things don't hog the CPU. Now as we move to multicore, things are turned on their head as we want one thing to be done by many CPUs and there are some basic computer science reasons why this is very hard, if not impossible in some cases.

A practical problem is that there is no universal standard for writing multithreaded code. If you write a C++ app in windows, you make it multithreaded in Windows by calling code from a Windows library to create a thread. Now if you wanted to port that app to Linux, you have to use different libraries for your threads (pthreads). This is not fun work to do. If we do end up with scaled multi-core CPUs (4, 8, 16 cores) it would be extremely useful if the x86 instruction set added hardware level instructions for creating and managing threads in assembly so there would be no need to ratify a software standard for use by programmers. This would require pushing some tasks that windows does (like task scheduling) to the hardware level. Right now, if you write a windows app, it's very hard to find out how many CPUs you have at run-time and create that many threads. That's some nasty low-level system code to conjure up.

When you start programming a multithreaded app, the big problem you have to worry about is serialization of data. If you have 2 threads that need to write to one piece of data in memory, they have to wait for each other, and which one gets to go first? This is done by blocking one thread and having it wait on a lock. In your example of a spell checker that splits the number of pages in 2, the efficiency of the code is determined by how your data is structured, and how each CPU has access to that data. If all the words were stored in memory, there are a number of different algorithms you could write to split the work up into 2 threads, but they would have to know where to stop and only access their own data. On some systems, you may be using double the ram depending on how memory sharing is implemented. If, on the other hand, the spell checker was reading the words from a hard disk, it would have to do a lot of I/O reads to check each word, and would probably buffer some of it. In that case, if you have 2 threads doing I/O reads from different places, performance will slow down as your single disk is now doing random access instead of sequential access, and then you have to manage multiple buffers which could overlap or become outdated.
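
A minimal sketch of the locking part (hypothetical code, not from any real spell checker): two threads each count words in their own half of a buffer, and only the update of the shared total needs the lock.

#include <cstring>
#include <mutex>
#include <thread>

long totalWords = 0;          // the one piece of data both threads write
std::mutex totalLock;

void countChunk(const char* chunk, size_t len) {
    long local = 0;
    for (size_t i = 0; i < len; ++i)
        if (chunk[i] == ' ')
            ++local;          // purely local work: no lock needed here
    std::lock_guard<std::mutex> guard(totalLock);   // one thread at a time past this line
    totalWords += local;      // without the lock, two updates could interleave and one would be lost
}

int main() {
    const char* text = "two threads counting words in one shared buffer";
    size_t len = std::strlen(text), half = len / 2;
    std::thread t1(countChunk, text, half);
    std::thread t2(countChunk, text + half, len - half);
    t1.join();
    t2.join();
    return 0;
}

Note the lock is taken once per thread, not once per word - locking around every word would leave the two threads spending most of their time waiting on each other.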

Simply put, multithreading makes things complicated. An analogy: imagine your buddy came along to help you with everyday tasks, and you want to get the most things done. Well, for some things, you don't need your buddy (like, say, performing necessary bodily functions). For other things, you'd have to tell your buddy so much information for him to be able to help that it's not worth telling him - you just do them yourself. Otherwise, your buddy can be helpful, but having him do the same thing you are doing often presents many problems. If you think about it, it comes down to how clever you and your buddy are at getting work done and helping each other out in optimal ways.

There's a lot of CS you can get into to answer your question. Look up semaphores and race conditions.

In case you or anyone else is curious, games tend to be highly single-threaded, and very hard to multithread. The reason is that every game is just a big (single) loop that repeats and performs a sequence of steps. And since (most) games are not locked to a particular frame-rate, a game is constantly repeating tasks, never giving the CPU a break. A typical game loop for an FPS might look like this:

while(1)
{
// handle event handlers to get user input from mouse/kbd
// update game world with new user input
// call AI functions to determine new enemy positions (can be CPU intensive)
// Update world with AI positions
// reposition moving objects in the world
// check world for collisions (CPU intensive, 3D space is hard to search)
// resolve collisions, redetermine positions
// draw graphics for this frame (CPU / GPU intensive)
}

Note that each CPU intensive thing depends on the thing that came before it.

Since we are constantly repeating this loop, and there is only one game world that is constantly changing, multithreading is very hard. I'm not saying it can't be done, but you need to be a skilled programmer, well above most average coders of today. The problem is keeping everything in sync and allowing multiple threads to access the same game world without stepping on each other's toes. While thread 1 may be the main loop that has already resolved collisions, thread 2 making AI decisions is using information from the old game state - what if one thread outraces the other so the AI falls behind by minutes? It's tricky; it can be done, but in a complicated way. There are things that can be done in separate threads, but often the question is "why?" since they require very little cpu to begin with, like networking and I/O. In order to really take full advantage, you need to split up the CPU-intensive tasks of the game loop, and keeping them in sync is tricky to say the least.

By the time we have 16-way cores, we may have some games that are good at using 2-4 of your cores. Hardware develops fast, software develops slowly.

I don't mean to rain on anybody's parade, but I'm hoping I can offer some insight.

-Titan
 

redhatlinux

Senior member
Oct 6, 2001
493
0
0
Back in the 'old days', multi-cpu designs got up to 1.7x the power of a single cpu. I don't have any reason to believe that this is not attainable. Do the boys in 'black' use massively parallel processors to get the 'job' done? I know for sure, but can't post as to how I know.
The earlier post about the OS not using up much processor time doesn't ring true. Maybe Linux doesn't, but Winblows certainly does. Writing large applications for multi-processors for desktop use is not very profitable unless you write games. CPU power is generally NOT what the average Joe needs. More memory, faster hdds etc. make much more difference.

Inter-processor communication is handled much better by AMD. Their CPUs were designed with multi-core in mind. Intel has stitched two cores together and still uses the bus for inter-core chat.

Consider this: Linux running on one core, supporting the 'strange stuff' that it can. Windows on the second core with a 'tricked out bios', maybe needing a couple of other 'tricks' too. The tricked BIOS makes these 'odd devices' look like something Windows supports. Fakes out Win and 'Poof', Win will now support the electric Toaster, Microwave, Automatic Dog Washer of the future. A waste of a core, maybe.
 

her34

Senior member
Dec 4, 2004
581
1
81
Originally posted by: tkotitan2
In case you or anyone else is curious, games tend to be highly single-threaded, and very hard to multithread. The reason is that every game is just a big (single) loop that repeats and performs a sequence of steps. And since (most) games are not locked to a particular frame-rate, a game is constantly repeating tasks, never giving the CPU a break. A typical game loop for an FPS might look like this:

while(1)
{
// handle event handlers to get user input from mouse/kbd
// update game world with new user input
// call AI functions to determine new enemy positions (can be CPU intensive)
// Update world with AI positions
// reposition moving objects in the world
// check world for collisions (CPU intensive, 3D space is hard to search)
// resolve collisions, redetermine positions
// draw graphics for this frame (CPU / GPU intensive)
}

Note that each CPU intensive thing depends on the thing that came before it.

Since we are constantly repeating this loop, and there is only one game world that is constantly changing, multithreading is very hard. I'm not saying it can't be done, but you need to be a skilled programmer, well above most average coders of today. The problem is keeping everything in sync and allowing multiple threads to access the same game world without stepping on each other's toes. While thread 1 may be the main loop that has already resolved collisions, thread 2 making AI decisions is using information from the old game state - what if one thread outraces the other so the AI falls behind by minutes? It's tricky; it can be done, but in a complicated way. There are things that can be done in separate threads, but often the question is "why?" since they require very little cpu to begin with, like networking and I/O. In order to really take full advantage, you need to split up the CPU-intensive tasks of the game loop, and keeping them in sync is tricky to say the least.
-Titan


instead of trying to parallelize the loop, can't game makers just parallelize each step of the loop?

so in your example there are 8 steps. 3 are cpu intensive. instead of trying to give one thread to ai, another thread to collisions, and trying to synchronize the threads, make it so that when the loop hits the collisions step all the cores work on it, then the loop moves on, then when the loop hits the ai step, all the cores work on that, etc.
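
Something like this, say - a hypothetical sketch of that "fan out inside one step" idea, with the Object struct and the collision test made up purely for illustration:

#include <cmath>
#include <functional>
#include <thread>
#include <vector>

struct Object { float x, y, z; bool hit; };   // stand-in for whatever the game tracks

// naive all-pairs test; each worker only writes the hit flags in its own range
void collideRange(std::vector<Object>& world, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i)
        for (size_t j = 0; j < world.size(); ++j)
            if (i != j
                && std::fabs(world[i].x - world[j].x) < 1.0f
                && std::fabs(world[i].y - world[j].y) < 1.0f
                && std::fabs(world[i].z - world[j].z) < 1.0f)
                world[i].hit = true;
}

// one step of the frame: the collision work fans out over all cores, then the loop moves on
void collisionStep(std::vector<Object>& world, unsigned cores) {
    std::vector<std::thread> workers;
    size_t chunk = world.size() / cores;
    for (unsigned c = 0; c < cores; ++c) {
        size_t begin = c * chunk;
        size_t end = (c + 1 == cores) ? world.size() : begin + chunk;
        workers.emplace_back(collideRange, std::ref(world), begin, end);
    }
    for (auto& w : workers)
        w.join();   // everyone finishes before the next step of the frame
}

The catch is that each step has to be big enough to cover the cost of handing the work out and waiting for the slowest core to finish, and the data each core touches has to be kept from overlapping.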
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
Originally posted by: her34
Originally posted by: tkotitan2
In case you or anyone else is curious, games tend to be highly single-threaded, and very hard to multithread. The reason is that every game is just a big (single) loop that repeats and performs a sequence of steps. And since (most) games are not locked to a particular frame-rate, a game is constantly repeating tasks, never giving the CPU a break. A typical game loop for an FPS might look like this:

while(1)
{
// handle event handlers to get user input from mouse/kbd
// update game world with new user input
// call AI functions to determine new enemy positions (can be CPU intensive)
// Update world with AI positions
// reposition moving objects in the world
// check world for collisions (CPU intensive, 3D space is hard to search)
// resolve collisions, redetermine positions
// draw graphics for this frame (CPU / GPU intensive)
}

Note that each CPU intensive thing depends on the thing that came before it.

Since we are constantly repeating this loop, and there is only one game world that is constantly changing, multithreading is very hard. I'm not saying it can't be done, but you need to be a skilled programmer, well above most average coders of today. The problem is keeping everything in sync and allowing multiple threads to access the same game world without stepping on each other's toes. While thread 1 may be the main loop that has already resolved collisions, thread 2 making AI decisions is using information from the old game state - what if one thread outraces the other so the AI falls behind by minutes? It's tricky; it can be done, but in a complicated way. There are things that can be done in separate threads, but often the question is "why?" since they require very little cpu to begin with, like networking and I/O. In order to really take full advantage, you need to split up the CPU-intensive tasks of the game loop, and keeping them in sync is tricky to say the least.
-Titan


instead of trying to parallelize the loop, can't game makers just parallelize each step of the loop?

so in your example there are 8 steps. 3 are cpu intensive. instead of trying to give one thread to ai, another thread to collisions, and trying to synchronize the threads, make it so that when the loop hits the collisions step all the cores work on it, then the loop moves on, then when the loop hits the ai step, all the cores work on that, etc.

Let's say you have a square picture, a big one. You want to find a line that is a quarter of the side long. You can make 4 smaller squares, and search in each one. If you're lucky, you've found it. If not, "move" the squares just half their side to the right and down, and fill the "empty" space they cover with the data from the left/top. Search in those small squares also, and you will find the line. This is a data load that can be shared across 8 cores.
Can it be shared across 32 cores? I really don't think so (the picture you have is a bitmap, and you can access any one of the individual pixels).
Let's assume your picture is 10 thousand by 10 thousand pixels. This would be 100 million dots at 32 bits per pixel, or 400 MB. Your computer can keep it in memory. If you go dual core now, you would need either to share the data between the cores (halving the bandwidth, or the speed with which the processors are able to retrieve data) or to double the memory footprint (the data is read-only). You might be able, with the memory you have in the system, to do this, but if you go quad-core, you're out of luck.

Let's say you want to find a perfect green line in the picture. Everything works just fine.
Let's say you want to find all the perfect green lines in the picture. Just the same, everything works fine, even if the lines intersect each other.
Now let's assume you want to color all the perfect green lines red. If one processor finds one line and colors it, all the other lines that intersected it suddenly don't have the correct length (they are cut in the middle by a red spot).

I know my requests are a little strange, but this is a situation where you can't easily share a workload to more cores than a certain number.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: tkotitan2
The first thing you must realize is that a modern CPU is designed to run one program. Now if you have 2 CPUs running one program, they aren't designed to help each other at a low, hardware level so there is a lot of system task overhead like context switching that must be managed by the OS.
Ideally having multiple CPUs reduces context switching, since you can have each time slice be twice as long.

A practical problem is that there is no universal standard for writing multithreaded code. If you write a C++ app in windows, you make it multithreaded in Windows by calling code from a Windows library to create a thread. Now if you wanted to port that app to Linux, you have to use different libraries for your threads (pthreads). This is not fun work to do.
I wrote a multithreaded raytracer on Windows, and only had to add/modify a few lines of code to make it use either Windows' thread API or pthreads depending on the platform. I'd post the code so you could see, but it's on a computer with no net connection right now (Comcast sucks). It wasn't hard to port. Besides, IIRC there are pthreads implementations for Windows if you really don't want to do any work yourself.
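
The shape of that kind of portability layer, roughly - a hypothetical shim, not the actual raytracer code; the worker is written once, and only thread creation and joining differ per platform:

#ifdef _WIN32
  #include <windows.h>
  typedef HANDLE thread_handle;
  typedef DWORD (WINAPI *thread_fn)(void *);
  thread_handle spawn(thread_fn fn, void *arg) {
      return CreateThread(NULL, 0, fn, arg, 0, NULL);
  }
  void join(thread_handle t) { WaitForSingleObject(t, INFINITE); CloseHandle(t); }
  #define THREAD_FUNC(name) DWORD WINAPI name(void *arg)
  #define THREAD_RETURN return 0
#else
  #include <pthread.h>
  typedef pthread_t thread_handle;
  typedef void *(*thread_fn)(void *);
  thread_handle spawn(thread_fn fn, void *arg) {
      pthread_t t;
      pthread_create(&t, NULL, fn, arg);
      return t;
  }
  void join(thread_handle t) { pthread_join(t, NULL); }
  #define THREAD_FUNC(name) void *name(void *arg)
  #define THREAD_RETURN return 0
#endif

// the worker itself is platform-neutral
THREAD_FUNC(render_strip) {
    int strip = *(int *)arg;   // which horizontal strip of the image to trace
    (void)strip;               // ... trace the rays for this strip ...
    THREAD_RETURN;
}

int main() {
    int strips[2] = {0, 1};
    thread_handle t0 = spawn(render_strip, &strips[0]);
    thread_handle t1 = spawn(render_strip, &strips[1]);
    join(t0);
    join(t1);
    return 0;
}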

If we do end up with scaled multi-core CPUs (4, 8, 16 cores) it would be extremely useful if the x86 instruction set added hardware level instructions for creating and managing threads in assembly so there would be no need to ratify a software standard for use by programmers. This would require pushing some tasks that windows does (like task scheduling) to the hardware level.
That's a pretty huge deal, and opens truckloads of cans of worms.

Right now, if you write a windows app, it's very hard to find out how many CPUs you have at run-time and create that many threads. That's some nasty low-level system code to conjure up.
No it isn't. There's sample code on MSDN, and you just create a variable (one line), call a function (another line), and read the dwNumberOfProcessors field from the variable (one final line). It doesn't get much easier than that.
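
For reference, roughly the pattern being described (Windows-specific, error handling omitted):

#include <windows.h>

// one variable, one call, one field
DWORD countProcessors() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwNumberOfProcessors;
}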

When you start programming a multithreaded app, the big problem you have to worry about is serialization of data. If you have 2 threads that need to write to one piece of data in memory, they have to wait for each other, and which one gets to go first? This is done by blocking one thread and having it wait on a lock. In your example of a spell checker that splits the number of pages in 2, the efficiency of the code is determined by how your data is structured, and how each CPU has access to that data.
Locking is very important. If you don't use enough locks, you can corrupt your data or crash the program. These problems are often hard to reproduce, and as a result, hard to debug. If you lock too aggressively, you can take major performance hits. Locking/serialization is probably one of the central problems related to multithreaded programming.

On some systems, you may be using double the ram depending on how memory sharing is implemented.
All systems of interest are shared-memory multiprocessors.

Originally posted by: redhatlinux
Back in the 'old days', multi-cpu designs got up to 1.7x the power of a single cpu. I don't have any reason to believe that this is not attainable. Do the boys in 'black' use massively parallel processors to get the 'job' done? I know for sure, but can't post as to how I know.
Everybody knows that the government uses massively parallel machines, because it's public information. Have you heard of the ASCI machines? All modern supercomputers are massively parallel multiprocessors. A linear speedup is attainable on some tasks, and there is no speedup available on other tasks.

The earlier post about the OS not using up much processor time doesn't ring true. Maybe Linux doesn't, but Winblows certainly does.
I suspect you don't know what you're talking about. The simplest way to look would be to watch task manager in Windows or "top" in linux to see where the CPU time goes. For most tasks, the OS just sits there doing nothing. The OS only does something when:
1. an app makes a system call (requires a context switch)
2. an interrupt occurs (e.g. you hit a key on the keyboard, move the mouse, a network packet arrives, etc)
3. a timeslice expires (the current thread has used up its available time, and the OS picks another thread to run for a while - note that I believe this is actually implemented as a timer interrupt).

As long as you're doing actual computation (rather than, say, doing a lot of communication over a network), the OS will do nothing for your timeslice (say, 10ms), then a context switch (a handful of microseconds) and then nothing again until the next timeslice expires.


I did not reply to statements I couldn't understand.
 

tinyabs

Member
Mar 8, 2003
158
0
0
Dynamic execution in the Pentium family has provided instruction-level parallelism since 1995. Many applications never take advantage of that because of possible bugs or because the programmer didn't know.

Actually, the hurdle to processor parallelism is simply the programming language. Multithreading is not natural to programming languages. As far as I know, only Ada supports multithreading in the language itself. There are only a few places where you need to spawn threads, usually the core part of the application.

Even if a programmer can create a thread easily in code, he still needs to check the result of the threads and whether they have ended; it's not calling a function and getting a return code anymore. So the language has to support this generically to make things easier. The complexity increases exponentially as more threads are written.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Dynamic execution in the Pentium family has provided instruction-level parallelism since 1995. Many applications never take advantage of that because of possible bugs or because the programmer didn't know.
Huh? A major point of out-of-order execution is that you don't have to order the instructions optimally - the CPU will execute them in the best order it can... you shouldn't really have to be doing anything to "take advantage" of it.
 

Velk

Senior member
Jul 29, 2004
734
0
0

Hmm, my original reply seems to have vanished into the void. Oh well.

Originally posted by: her34
1) i read people say that some programs are more inherently parallel than others. what makes some programs more easily optimized for multi-core and others not? can you give me an example of a desktop program that can't easily be made to utilize multi-cores?

i tend to think of things like word, where if you run spell check, you can just divide the pages by number of cores available and run task. when can you not just divide the work into equal pieces?

A spellchecker is actually one of the things that is quite poorly suited to being converted to multicore. There are a few reasons why; I'll run through some major ones:

1) User input. It's the bane of any multi-threaded process, and it is no different here. When you run your spellcheck and split it off into the different threads, the first time one of the work threads encounters a spelling error, it needs to pause itself and all the other work threads and signal the UI thread that there is an error and what it is; the UI thread then has to prompt the user for what to do about the spelling error (ignore, add, replace with... etc). You can, to some extent, get around this by having the threads build some sort of list of suspect words and, when they are all finished, present the user with a list to work through (see the sketch after this list), but this has the downside of having to perform the entire spellcheck before starting on the actual work, plus needing a possibly large amount of additional memory to store the list.

2) Multiple threads working on the same data. Regardless of how your document is organised, your threads need some way of keeping track of where they are. The problem with this is that the threads are actually modifying the document as they go, changing the length, number of words etc. If thread 1 changes 'its' to 'it is', the other threads are suddenly editing a different part of the document. To avoid this you need to be able to split the document up into multiple copies which they can work on without corrupting each other, which both adds to the memory requirements and requires an additional step to recombine them when they are finished. The other weak point is the dictionary that they are comparing against. If thread 1 adds 'multithreadedness' to the dictionary, the other threads need to know that the dictionary has changed and restart checking (otherwise you run the risk of having the user prompted multiple times for the same word, or having the other threads skip a word).

3) No consistency. Unless you have some fairly reliable evaluation mechanism, you are going to end up with situations like using 4 threads to spellcheck equal portions of a document which contains the first 20 pages of dense text and then 800 pages of technical diagrams. In this particular case, you could probably evaluate it by word count, but it's one example of how the load balancing mechanism for threads is not always as obvious as it might seem.

4) Sequential ordering. For the user to be able to get the context of the misspelt word, the spellchecker generally provides the text around it, usually by scrolling the document down as it goes. With several threads performing this function, the user view of the document will scroll erratically backwards and forwards. Not a problem from a technical perspective, but fairly bewildering for the person actually using it.

5) Simply not necessary. This ties in some with the first point, but spellcheckers have switched from bulk check of large documents to marking errors as the document is entered. Even a slow single processor can spellcheck thousands of times faster than the user can actually type new words in, so for that kind of spellchecking, multithreading is entirely redundant.
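
To make the workaround in point 1 concrete, a minimal sketch (hypothetical code - it assumes the document's words and the dictionary are already split out into containers): each thread checks its own slice and builds a private list of suspects, and nothing touches the document or prompts the user until every thread has finished.

#include <set>
#include <string>
#include <thread>
#include <vector>

// purely local work: read-only lookups against the shared dictionary
std::vector<std::string> findSuspects(const std::vector<std::string>& words,
                                      const std::set<std::string>& dictionary,
                                      size_t begin, size_t end) {
    std::vector<std::string> suspects;
    for (size_t i = begin; i < end; ++i)
        if (dictionary.count(words[i]) == 0)
            suspects.push_back(words[i]);
    return suspects;
}

std::vector<std::string> spellcheck(const std::vector<std::string>& words,
                                    const std::set<std::string>& dictionary,
                                    unsigned nthreads) {
    std::vector<std::vector<std::string> > partial(nthreads);
    std::vector<std::thread> workers;
    size_t chunk = words.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        size_t begin = t * chunk;
        size_t end = (t + 1 == nthreads) ? words.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            partial[t] = findSuspects(words, dictionary, begin, end);   // each thread writes only its own slot
        });
    }
    for (auto& w : workers)
        w.join();
    std::vector<std::string> all;                 // merge once everyone is done
    for (unsigned t = 0; t < nthreads; ++t)
        all.insert(all.end(), partial[t].begin(), partial[t].end());
    return all;
}

This dodges the user-input problem, at exactly the price mentioned above: the whole document gets checked before the user sees anything, and the suspect lists take extra memory.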

2) will optimizing for multi-cores hurt performance on cpu's with less cores? so for example if we reach the point where 4 cores are getting popular. app/game is programmed to divide out work evenly for 4 cores. if you now run the app on a 2 core or single core cpu, i would imagine the app to run really horrible because timing would be off. performance on 2 core would be worse than if the program was written with 2 cores in mind. so programmer will have to make a choice of how many cores to optimize for (optimize for 2 cores and 4 cores gain no performance, or optimize for 4 cores and alienate 2 core cpu's). and this gets worse when 6 or 8 cores come out.

Absolutely yes.

Firstly, there is the obvious proviso that programs that take full advantage of a multi-core system will run like a dog on a single core, for the same reason that a program that takes full advantage of a FX-77 will run like a dog on an athlon 1200.

Secondly, context switching multiple threads on a smaller number of CPUs than expected consumes processor time, and possibly memory, that isn't needed. Optimising for two cores or four cores is somewhat different to dynamically and efficiently allocating threads based on current capacity, which requires another level of sophistication again.
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
Originally posted by: CTho9305
Dynamic execution in the Pentium family has provided instruction-level parallelism since 1995. Many applications never take advantage of that because of possible bugs or because the programmer didn't know.
Huh? A major point of out-of-order execution is that you don't have to order the instructions optimally - the CPU will execute them in the best order it can... you shouldn't really have to be doing anything to "take advantage" of it.

That "instruction parallelism" - in fact ordering the instructions in the best possible way for a processor to work on them - is a compiler-related task - see ASM instructions of the type "jump and execute", when AFTER the jump instruction the processor has one or more instruction it has to execute. This helps keeping the pipeline filled (pipeline would be emptied after a "jump" instruction, as the microprocessors weren't complex enough to know beforehand about the jump instructions. Now this isn't such a big problem, as most of the jumps are correctly identified and the pipeline isn't emptied.
An "out of order" architecture makes that task much easier for the compiler, and this "instruction order optimisation phase" can be eliminated from the compiler with small/very small speed penalties.
For an "in-order" architecture, instruction order optimisation can bring easily speed improvements in the order of tens of percents.
 

youph

Member
Sep 13, 2005
32
0
0
Originally posted by: her34
1) i read people say that some programs are more inherently parallel than others. what makes some programs more easily optimized for multi-core and others not? can you give me an example of a desktop program that can't easily be made to utilize multi-cores?

i tend to think of things like word, where if you run spell check, you can just divide the pages by number of cores available and run task. when can you not just divide the work into equal pieces?

2) will optimizing for multi-cores hurt performance on cpu's with less cores? so for example if we reach the point where 4 cores are getting popular. app/game is programmed to divide out work evenly for 4 cores. if you now run the app on a 2 core or single core cpu, i would imagine the app to run really horrible because timing would be off. performance on 2 core would be worse than if the program was written with 2 cores in mind. so programmer will have to make a choice of how many cores to optimize for (optimize for 2 cores and 4 cores gain no performance, or optimize for 4 cores and alienate 2 core cpu's). and this gets worse when 6 or 8 cores come out.

this would be a headache compared to past where better cpu meant better performance without any work.
Algorithms have many different formulations.

There are parts that are serial in nature (meaning things need to happen in order) and parts that lend themselves to parallel programming.

Programmers need to find ways to vectorize their algorithms, but, as this example will show, there are some problems that, by nature, are non-vectorizable.

For example,

Say the program is modeling the motion of a particle in space. This problem is very serial in the sense that to find the location of the particle at time 1 requires that you know the location of the particle at time 0, hence you cannot find the location of the particle at time 0 and 1 at the same time.

This program would not lend itself well to multi-core cpus. (Yes, I'm sure there are fancy optimizations available, but the general algorithm is not parallelizable.)
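
The dependency looks like this in code - a trivial sketch using explicit Euler steps, with the acceleration just a constant for illustration:

struct Particle { double x, v; };   // position and velocity

// step i+1 cannot start until step i has finished:
// the new position needs the position and velocity from the previous step
void simulate(Particle& p, double accel, double dt, int steps) {
    for (int i = 0; i < steps; ++i) {
        p.x += p.v * dt;     // uses the result of the previous iteration
        p.v += accel * dt;
    }
}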

I think most computing tasks can be broken down further into atomic units of processing, like the spell check example. It's just a matter of programmers thinking in a different way.