I think I know what you mean by "staying parallel": one set of threaded instructions executing simultaneously with another set, right?
If there aren't enough instructions happening simultaneously, the benefit of additional cores goes to waste. If one set of threaded instructions is no longer used by the game, another threaded set needs to take its place in order to preserve the maximum potential of the processor.
Let's say you have a game with 4 primary steps:
The first step has 8 things happening, and we'll divide by half for each next step.
xxxxxxxx
xxxx
xx
x
With a quad core processor, that first step can be done 4x as fast, the 2nd also 4x as fast, but the 3rd gets no scaling over a dual core, and the 4th none over a single core:
xx xx xx xx
x x x x
x x
x
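The scaling above can be sketched with a toy cost model (my own illustration, assuming each 'x' takes one tick and the work within a step splits evenly across cores):

```python
import math

def parallel_time(step_sizes, cores):
    # Ticks for the whole frame: each step's work splits across
    # the cores, and a step can't start until the previous finishes.
    return sum(math.ceil(work / cores) for work in step_sizes)

steps = [8, 4, 2, 1]  # the four steps drawn above

for cores in (1, 2, 4):
    print(cores, "cores:", parallel_time(steps, cores), "ticks")
```

One core takes 15 ticks, two take 8, and four take 5, so doubling cores stops paying off once the narrow steps dominate.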
This will eventually happen for any app, it just depends on how quickly it happens.
Even for a simple example like adding a bunch of numbers together:
If I have the numbers 1 through 10 (and thus a limited data set), I would add them together like this on a single core processor:
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
10 + 5 = 15
15 + 6 = 21
21 + 7 = 28
28 + 8 = 36
36 + 9 = 45
45 + 10 = 55
Of course there are little tricks to do this faster, but ignoring those, this is how a single core processor would do it. (One trick: 1 + 10 = 11, 2 + 9 = 11, and so on; you have 5 pairs of 11, so just multiply 5 * 11, if you can program something to recognize this.)
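That pairing trick is just the usual closed form for summing 1 through n; a quick sketch:

```python
def gauss_sum(n):
    # Pair the ends: 1+n, 2+(n-1), ... gives n/2 pairs of (n+1),
    # i.e. the closed form n * (n + 1) / 2.
    return n * (n + 1) // 2

print(gauss_sum(10))  # 55, same as the running total above
```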
On a multi-core processor, you could do:
1 2 3 4 5 6 7 8 9 10
1+2 3+4 5+6 7+8 9+10
3+7 11+15 19
10+26 19
36+19
55
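The pairwise reduction above can be sketched like this (a toy sequential simulation of the parallel rounds, not actual threaded code):

```python
def tree_sum(values):
    # One "round" sums adjacent pairs; a leftover odd element
    # just carries into the next round, as the 19 does above.
    values = list(values)
    rounds = 0
    while len(values) > 1:
        nxt = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            nxt.append(values[-1])
        values = nxt
        rounds += 1
    return values[0], rounds

print(tree_sum(range(1, 11)))  # (55, 4): same answer, in 4 rounds
```

On real hardware each round's additions would run on separate cores, so the wall-clock cost is the number of rounds, not the number of additions.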
Not only does going multicore sometimes mean giving up tricks, it can mean using a less efficient algorithm in general. Not in the two examples above: both took 9 additions, but the single core needed 9 steps, while the multicore needed only 4 (10 numbers, then 5, 3, 2, 1). So 5x the processing power gave us only about a 2x speed up, and this is how most algorithms are, even before scaling difficulties from the lack of instantaneous communication come into play. Some algorithms are even recursive by nature and can't be parallelized at all (assuming they work on the same data).
Some algorithms, in order to do them in parallel, require a less efficient implementation entirely.
A contrived example would be multiplication or division. What I just did was 11 * 5, but if addition takes only 1 cycle while multiplication takes, say, 6, then replacing the single multiply with parallel additions achieves a speed up, but just barely.
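Under those assumed costs (1 cycle per addition, 6 per multiply, both numbers made up for illustration), the comparison works out like this:

```python
import math

ADD_CYCLES = 1   # assumed: an addition costs 1 cycle
MUL_CYCLES = 6   # assumed: a multiply costs 6 cycles

# The pairing trick on one core is a single multiply: 5 * 11.
trick_cycles = MUL_CYCLES

# Pairwise parallel additions cost one cycle per round, and each
# round roughly halves the count of numbers left to add.
n = 10
rounds = 0
while n > 1:
    n = math.ceil(n / 2)
    rounds += 1
parallel_cycles = rounds * ADD_CYCLES

print(trick_cycles, parallel_cycles)  # the parallel version wins, but not by much
```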
Edit:
When dual core processors first came out, you needed at least a 50% advantage in clock speed for a single core to be worthwhile over a dual. Now you need 100%. The same will eventually hold true for quad over dual. Games will become designed around the limitations and strengths of multi core processors; even if there's a 'more efficient' way to do things on fewer cores, the preference will be to program for what runs best with more hardware, not less.
The only thing that will hold this back is communication bandwidth. The Core 2 Quad design wouldn't have scaled very well going to more cores, but the phenom and i7 should scale pretty well. It's feasible that someday we could be at the level of parallelism shown above, with each thread's granularity small enough to do individual additions or multiplications.