Memory is cheap. CPU time is not. So short answer, I prefer wasting memory to free up the CPU :evil: See below for example.
When I write multi-core stuff I double buffer object state, so that any CPU can read the current state from the object's public copy while the object itself is running on another core updating its own internal private state; then I sync and swap.
Call it 'serial multithreading' or 'scatter/gather' multithreading, because you control the program from a single thread, but scatter the work to be done across multiple threads, then gather the results back and synchronize. The main loop runs on one CPU, then all of a sudden there is an explosion of threadable work tasks to keep all available cores busy.
For example say you have 1000 objects:
for (i = 0; i < 1000; ++i)
    QueueJobTask(objects[i], DoPhysics);  // #1

WaitForMultipleObjects(objects, 1000);    // #2

for (i = 0; i < 1000; ++i)
    objects[i].SwapStates();              // #3
#1) this loop runs extremely fast: it does nothing more than perform a locked increment, add a job to the job queue, and return. In fact, by the time you are adding the second or third job, a worker thread has already woken up and started processing the first job on another CPU core.
#2) this waits for all jobs in the queue to be completed before moving on (sync). It also puts this control loop to sleep, so the CPU that issued the call can join in on the job queue, and the main loop won't be awakened until all the work jobs have finished. All 2, 4, or 8 cores are working on your queued jobs and nothing else.
#3) after all jobs are finished and no objects are updating, swap the public/private states so the next iteration of the main loop can observe the updated 'current' state that was computed this frame.
This uses a technique known as thread pooling: instead of creating and destroying threads (extremely expensive), you create one worker thread per core at program start and put each in a loop that sleeps until a job is added to the queue. A 'job' is just a function pointer that the worker thread loop calls; the worker returns, retires the job, grabs the next job, and so on. Your main program is still serial in nature (which games are; there's no avoiding the fact that a game loop has to occur in order every time, and each step depends on the previous step), but it exploits parallelism at each step requiring extensive work. For example, you can't run AI until object positions and environmental factors have been decided (ie: physics). So you can't run AI and physics together, but you can thread the physics by creating threaded jobs to resolve the 100 collision pairs you just discovered... You can picture it like a Protoss carrier in Starcraft performing a 'scatter/gather' maneuver with its drones as cores are put to work at each step.
The double buffering of data means that no locking is needed whatsoever, other than the locking required to interface with the job queue when adding or removing jobs. So if object 103 needs to read the state of object 456 (a Cat AI tracking a Bird it wants to eat), the states of all objects are guaranteed consistent and unchanging for the duration of the frame. You don't stall your other cores by entering and leaving critical sections, and at the same time you can read x from another object that is currently updating itself (only its internal state) without the risk of a torn read (ending up with the old x but the new y and z). It also guarantees that two Cats won't see the same Bird at different locations (new and old) while the Bird's job runs in parallel and moves the Bird around. This is very similar to keeping x, oldx, y, oldy, etc. Double buffer everything that another thread might need to access while you are updating it.
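The double-buffered object might look something like this (a sketch with made-up field names; real state would hold positions, meshes, whatever another thread might read):

```cpp
// Two copies of the state: readers always see the stable 'public' copy from
// last frame; the object's update job writes only the 'private' copy; after
// the sync point, SwapStates() flips which copy is public.
struct GameObject {
    struct State { float x = 0, y = 0; };
    State states[2];
    int pub = 0;  // index of the read-only public state

    const State& Current() const { return states[pub]; }      // any thread may read
    State&       Working()       { return states[pub ^ 1]; }  // only the owner writes

    void Update(float dx, float dy) {          // runs as a job during the frame
        Working() = Current();                 // start from last frame's state
        Working().x += dx;
        Working().y += dy;
    }
    void SwapStates() { pub ^= 1; }            // called only after all jobs finish
};
```

The key property: Update() never touches the copy that Current() returns, so no reader can ever observe a half-written state, which is why no lock is needed on the object itself.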
So you don't waste time and resources on things like critical sections and locks, because reads of other objects are read only. If you need to modify another object (ie: a bullet collision with the player avatar subtracts 50 life), you send it a message, and that message is processed internally on its next update. Adding and removing jobs from the queue are the only operations that need a critical section, and since that's simple incrementing/decrementing of a job count and adding/removing a job from a list, it locks and unlocks very fast. Basically I could run something like this on a 4 core CPU and see all 4 cores at 100% with none of the usual pitfalls associated with multithreading. It also means seeing roughly double the performance when going from 4 cores to 8 cores, etc.
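The message-passing part could be sketched like this (again my own names; the post doesn't specify a message format, so I'm assuming a simple per-object inbox guarded by a small lock):

```cpp
#include <mutex>
#include <vector>

// Cross-object mutations are never applied directly: any thread may post a
// message into the target's inbox, and only the target's own update job
// drains the inbox and applies the changes, so writes never race.
struct Avatar {
    int life = 100;
    std::mutex inboxLock;
    std::vector<int> damageInbox;  // pending messages (here: damage amounts)

    // Callable from any thread, e.g. a bullet's physics job.
    void SendDamage(int amount) {
        std::lock_guard<std::mutex> lk(inboxLock);
        damageInbox.push_back(amount);
    }
    // Called only by the avatar's own update job, next frame.
    void Update() {
        std::vector<int> msgs;
        {
            std::lock_guard<std::mutex> lk(inboxLock);
            msgs.swap(damageInbox);  // grab and clear the inbox in one cheap step
        }
        for (int d : msgs) life -= d;  // apply mutations single-threaded
    }
};
```

Like the job queue, the lock protects only a quick push or swap, never the actual work, so contention stays negligible.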
So back to the topic: double buffering all object states consumes much more memory (remember, even things like meshes need to be doubled, so that physics can access the 'current' mesh for interference detection while the AI animates the 'soon to be current' copy), but memory is cheap, and this method lets multiple cores and thousands of objects scale extremely well, as they were meant to. Double buffering states is the only method I am aware of that completely decouples threads from the shared data they need to access and update simultaneously, and allows uninhibited, all-out thread crunching without expensive locks and exclusion methods on shared data.