A primary core, is there?

Drsignguy

Platinum Member
Mar 24, 2002
2,264
0
76
It is a question that I have been contemplating for a while now, and some good information and insight would be greatly appreciated... Is there a primary core in duals and quads? Most programs are single-threaded and don't utilize a quad core to its fullest potential just yet, with the exception of some video encoding programs, etc. Is there a specific core that the processor utilizes primarily over the second core in a dual, or the other 3 in a quad?
 

Ricemarine

Lifer
Sep 10, 2004
10,507
0
0
(Someone correct me.) But with single-threaded apps, the processor will use core 0, since that should be the primary core in multi-core processors.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Ricemarine
(Someone correct me.) But with single-threaded apps, the processor will use core 0, since that should be the primary core in multi-core processors.

That is incorrect, at least for Windows- and Linux-based OSes.

Linux- and Windows-based OSes actively switch (or "hop," as it is sometimes called) the thread from core to core many, many times per second.

Open Task Manager while running a single-threaded app and you'll see all the cores are nearly equally loaded. If you have a quad-core, each core will be 25% loaded. If you have a dual-core, each core will be 50% loaded.

This actually hurts the performance of the single-threaded app: as the thread migrates from core to core, the cache data it was accessing experiences a lag, because the data must go through the FSB (or through the HT link) to get to the core which now has the thread (albeit briefly). This is sometimes called "cache thrashing".

Folks who are sensitive to the performance of their single-threaded apps will usually assign a processor affinity to the application (Task Manager, etc.) to ensure thread hopping isn't degrading their single-threaded performance.

The best I have been able to determine is that the OSes are intentionally designed to not have a primary core (i.e. they are designed to thread-hop and cache-thrash).
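(For reference, the same pinning that Task Manager's "Set Affinity" does can be done programmatically. A minimal Win32 C sketch for illustration; the mask value is just an example, and the snippet is not from the original post:)

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* The affinity mask is a bitfield: bit 0 = core 0, bit 1 = core 1, ... */
    DWORD_PTR mask = 1;  /* example: restrict this process to core 0 */

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());
        return 1;
    }

    /* ...run the single-threaded workload here; it will no longer hop cores... */
    return 0;
}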
 

Drsignguy

Platinum Member
Mar 24, 2002
2,264
0
76
OK, thanks for that insight, nicely put. So, as for single-threaded apps, it hurts performance as the thread "hops" from core to core, since there is no "set" or primary core. How about multi-threaded apps? How does the processor split the threads across the cores? Does it isolate (encoding as an example) audio, video, etc.?
 

firewolfsm

Golden Member
Oct 16, 2005
1,848
29
91
You can easily see thread hopping on a quad core: if I run three worker threads in Prime95, I get mixed CPU usage on all 4 cores.
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
Originally posted by: Drsignguy
So, as for single-threaded apps, it hurts performance as the thread "hops" from core to core, since there is no "set" or primary core.

Yes, but it's fairly minimal. I believe the numbers I've seen from credible sources were between 8 and 15%, depending on the application, with ~10% on average. That's why the majority of us with quads don't bother trying to manually set affinity.

How about multi-threaded apps? How does the processor split the threads across the cores? Does it isolate (encoding as an example) audio, video, etc.?

That depends on the app. Games that are multi-threaded do it the way you describe, but apps that do audio/video encoding, video conversion, or photo editing merely break the total amount of work into two parts for dual-cores, or four parts for quad-cores, and each core does its portion.
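(For illustration, a minimal C sketch of that split-the-work pattern using Win32 threads; the array, the worker function, and the core count are made up for the example, not taken from any particular encoder:)

#include <windows.h>
#include <stdio.h>

#define CORES 4                     /* example: a quad-core split */
#define N (1024 * 1024)

static int data[N];

typedef struct { int begin, end; } Slice;

/* Each worker handles its own contiguous slice of the total work. */
static DWORD WINAPI worker(LPVOID arg)
{
    Slice *s = (Slice *)arg;
    for (int i = s->begin; i < s->end; i++)
        data[i] *= 2;               /* stand-in for the real encode/convert work */
    return 0;
}

int main(void)
{
    HANDLE threads[CORES];
    Slice slices[CORES];
    int chunk = N / CORES;

    for (int c = 0; c < CORES; c++) {
        slices[c].begin = c * chunk;
        slices[c].end = (c == CORES - 1) ? N : (c + 1) * chunk;
        threads[c] = CreateThread(NULL, 0, worker, &slices[c], 0, NULL);
    }
    WaitForMultipleObjects(CORES, threads, TRUE, INFINITE);
    for (int c = 0; c < CORES; c++)
        CloseHandle(threads[c]);
    printf("all %d slices done\n", CORES);
    return 0;
}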
 

v8envy

Platinum Member
Sep 7, 2002
2,720
0
0
The hops aren't so bad on Linux -- you get a pretty significant chunk of a second per core. You can see the 0-100% plateaus on alternating cores quite easily. Now desktop Windows -- yeah, it looks to switch many times a second.
 

Ratman6161

Senior member
Mar 21, 2008
616
75
91
Originally posted by: v8envy
The hops aren't so bad on Linux -- you get a pretty significant chunk of a second per core. You can see the 0-100% plateaus on alternating cores quite easily. Now desktop Windows -- yeah, it looks to switch many times a second.

Hmm. I've noticed the hopping behavior, but each core does seem to get a significant amount of time - enough to be noticeable, anyway. This is on Vista Enterprise x64.

On the processor-affinity issue, I know you can right-click the process in Task Manager and set the affinity. But that seems to affect only the current run, i.e. the next time you start that program, you have to set the affinity again. So the question is: is there a way to set it so that it uses the assigned core every time the program runs?
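(One answer, as it turns out, appears further down the thread: launch the program from a shortcut or batch file using cmd's start /affinity switch, e.g. start /affinity 1 test.exe, and the mask is applied on every run; test.exe here stands in for whatever program you want pinned.)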
 

Nathelion

Senior member
Jan 30, 2006
697
1
0
How the multiple cores are managed depends a lot on the OS; in most modern OSes the kernel handles resource allocation transparently, and the apps themselves don't have a whole lot of say in when and where they get to execute. IIRC XP uses a modified round-robin scheme, so our hypothetical single-threaded app will switch processors between every time quantum, assuming there's nothing else going on in the background. There can be a significant performance gain from setting core affinity manually.

Of course, once you get to multi-socket NUMA systems things get a bit more complicated...
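(For completeness, the manual-affinity equivalent on Linux: a minimal C sketch using glibc's sched_setaffinity; the choice of core 0 is just an example:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                /* example: allow core 0 only */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ...the workload now stays on core 0 instead of hopping... */
    return 0;
}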
 

Lonyo

Lifer
Aug 10, 2002
21,938
6
81
Originally posted by: myocardia
Originally posted by: Drsignguy
So, as for single-threaded apps, it hurts performance as the thread "hops" from core to core, since there is no "set" or primary core.

Yes, but it's fairly minimal. I believe the numbers I've seen from credible sources were between 8 and 15%, depending on the application, with ~10% on average. That's why the majority of us with quads don't bother trying to manually set affinity.

How about multi-threaded apps? How does the processor split the threads across the cores? Does it isolate (encoding as an example) audio, video, etc.?

That depends on the app. Games that are multi-threaded do it the way you describe, but apps that do audio/video encoding, video conversion, or photo editing merely break the total amount of work into two parts for dual-cores, or four parts for quad-cores, and each core does its portion.

A 10% perf drop isn't major? 1-2% isn't major, 5% is "not huge", but 10% is pretty significant.
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
Originally posted by: Lonyo
A 10% perf drop isn't major?

It isn't to me. See, when your processor & system are already more than fast enough, what's 8 or 10% (or even 14 or 15%)? Now, if you have a processor that's slow already, that would be a different story.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: myocardia
Originally posted by: Lonyo
A 10% perf drop isn't major?

It isn't to me. See, when your processor & system are already more than fast enough *for your own applications and needs*, what's 8 or 10% (or even 14 or 15%)? Now, if you have a processor that's slow already, that would be a different story.

I agree, provided the statement includes the words I added above (marked with asterisks).

Personally, a vapor-phase-cooled 4GHz quad-core rig was too slow running 4 simultaneous instances of a single-threaded program of relevance to me. Locking threads to cores by setting affinity gave me back another 10%, which I sorely needed.

I suspect most F@H'ers, or really anyone who number-crunches with single-threaded apps while multitasking multiple instances, would probably like 10% more performance.

But if you are mostly gaming, then the GPU is where you are likely bottlenecked, so who cares (too much) about core hopping in that case.
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Originally posted by: myocardia
Originally posted by: Lonyo
A 10% perf drop isn't major?

It isn't to me. See, when your processor & system are already more than fast enough, what's 8 or 10% (or even 14 or 15%)? Now, if you have a processor that's slow already, that would be a different story.

WOW, that's amazing to me that someone who would buy a quad-core would also not worry about 10% performance. I mean, if that's the case, why are you shelling out so much money for the quad to begin with? Its performance isn't gonna be 10% better than a dual-core in 99% of applications.
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
Originally posted by: Idontcare
Originally posted by: myocardia
Originally posted by: Lonyo
A 10% perf drop isn't major?

It isn't to me. See, when your processor & system are already more than fast enough *for your own applications and needs*, what's 8 or 10% (or even 14 or 15%)? Now, if you have a processor that's slow already, that would be a different story.

I agree, provided the statement includes the words I added above (marked with asterisks).

Haha, as do I, hence the reason I used those words. :)

But if you are mostly gaming, then the GPU is where you are likely bottlenecked, so who cares (too much) about core hopping in that case.

You obviously don't own M$'s FSX. It's a CPU-bound monster, and can use >4 cores.

Originally posted by: BrownTown
I mean, if that's the case, why are you shelling out so much money for the quad to begin with? Its performance isn't gonna be 10% better than a dual-core in 99% of applications.

I honestly couldn't care less about 99%, or even 99.999999999%, of applications. As long as using a quad-core benefits me in the app that I bought it to use with, I'm quite happy with my quad.
 
Apr 20, 2008
10,067
990
126
I don't know if cache thrashing is real. I have an X2 4200+, which is really two 3500+'s on one die. My FPS in DoD:S went up a good 25-40% after upgrading from a 3500+. This all depends on game variables, too.

I just ran some quick benches. I did these at 640x480 w/ DX 5.0 to eliminate the bottleneck that is my X1650.

On dod_anzio, in the plaza I get 155-160 fps. That is with the game on both cores.

After assigning the game to affinity 1: 130-150 fps.

This is a single-threaded application, too. When I put the CPU history to one graph for all CPUs: 45-51%. That 1%, I bet, is background services.

I'm keeping it on both cores, you guys.
 

Foxery

Golden Member
Jan 24, 2008
1,709
0
0
Sorry guys, all of your engineering knowledge and experience with Intel CPUs has just been flushed down the drain by some guy with an AMD running one very old game.


Originally posted by: Idontcare
This actually hurts the performance of the single-threaded app: as the thread migrates from core to core, the cache data it was accessing experiences a lag, because the data must go through the FSB (or through the HT link) to get to the core which now has the thread (albeit briefly). This is sometimes called "cache thrashing".

What was the original motivation for doing this? Spreading out the heat dissipation and general wear and tear, or just Microsoft's usual shortsightedness towards future hardware?

I had noticed this on my machine, and it bugged me. :/ Fortunately, I have Folding@Home running on all cores now.
 

Lord Banshee

Golden Member
Sep 8, 2004
1,495
0
0
I tested this with a program I recently wrote in C for a class. There is no threading; it executes a 3x3 image convolution filter five times, takes the avg cycle count from rdtsc for each filter pass, and outputs the avg.

Test 1:
Ran the program 5 times and took the avg of all avg_cycle_out values:
start test.exe

Test 2:
Ran the program with affinity 1, 5 times, and took the avg of all avg_cycle_out values:
start /AFFINITY 1 test.exe

The result: with affinity set to 1 (Test 2), it ran a whopping 0.93% faster. That is <1%.

This was run on my T61 Core 2 Duo laptop with Vista x64. My program probably isn't the best way to test this, but it proves the point to "ME" that it makes no difference. Does someone else want to test some single-threaded, repeatable software and see what you get?
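(A minimal sketch of that kind of rdtsc timing harness, for anyone who wants to repeat the test; the dummy workload stands in for the convolution filter, and __rdtsc is the MSVC intrinsic from intrin.h; GCC users would include x86intrin.h instead:)

#include <stdio.h>
#include <intrin.h>                 /* MSVC: __rdtsc(); on GCC use <x86intrin.h> */

static volatile double sink;        /* keeps the compiler from deleting the loop */

/* Dummy workload standing in for the 3x3 convolution filter. */
static void work(void)
{
    double acc = 0.0;
    for (int i = 0; i < 10000000; i++)
        acc += i * 0.5;
    sink = acc;
}

int main(void)
{
    const int runs = 5;
    unsigned long long total = 0;

    for (int r = 0; r < runs; r++) {
        unsigned long long start = __rdtsc();
        work();
        total += __rdtsc() - start;
    }
    printf("avg cycles per run: %llu\n", total / runs);
    return 0;
}

Run it both ways, as above (start test.exe versus start /AFFINITY 1 test.exe), and compare the averages.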
 

Foxery

Golden Member
Jan 24, 2008
1,709
0
0
Once again, your test program is WAY too small to be useful. Process a 3000x3000 image or an HDTV feed.

Your test platform is also not what we are talking about, since both cores in a C2Duo share the L2 cache. Your data doesn't go out to the bus; on a quad with two separate L2 caches, it does.
 

Lord Banshee

Golden Member
Sep 8, 2004
1,495
0
0
What's funny is you have no idea what size image I was using, good job. It was done with a 1400x1050x24-bit image, so roughly ~1.5 million pixels. I just redid the test with a 3000x3000x24-bit image, and the result showed a 0.5% increase with affinity set to 1. Again, less than 1%, and it most likely has everything to do with the L2 cache being shared. I could try it on my dual-core Opteron, maybe, but that still isn't close to what a quad would have to do.

Now, I never said my program was the best way to show these results, but the OP asked about both dual cores and quad cores, so I tested my dual. Where are your tests? If I had a quad core I would test it, but I don't. So if you don't want to contribute to the thread, maybe you should not post.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Foxery
Once again, your test program is WAY too small to be useful. Process a 3000x3000 image or an HDTV feed.

Your test platform is also not what we are talking about, since both cores in a C2Duo share the L2 cache. Your data doesn't go out to the bus; on a quad with two separate L2 caches, it does.

You hit on the key issue with thread migration. Obviously the issue is a matter of the cache... namely, the penalty for missing the cache.

Missing the cache becomes more likely when a thread migrates to another core which doesn't share cache with the core the thread just migrated from -- if the time it takes for the cache contents to migrate is long enough that thread execution outraces cache-content migration.

So obviously this is a "corner case": the more dependent your single-threaded application is on a dataset that resides inside a given cache level, the more likely its performance is going to degrade when the thread migrates but the cache is slow to catch up.

This is also a situation where I'd expect a B3 Phenom to run circles around a Yorkfield (IPC-wise), because the forced L2->L3 cache evictions (to avoid the TLB issue) serve to ensure that as a thread migrates, the cached data it was accessing is likely no further away than the shared L3$.

Given that C2D performance tends to hinge on its large L2$ and prefetchers hiding latency/bandwidth limitations, I would expect more applications to be impacted by thread migration on C2D (quad) based systems than on Phenom systems, for exactly the reasons you highlighted earlier.

This is but one issue that all NUMA systems (hardware NUMA at the socket level, at least) share in common.
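(To make the corner case concrete: a sketch of a Win32 C microbenchmark that keeps a cache-resident working set and compares a pinned thread against one forced to bounce between two cores. The sizes and pass counts are arbitrary, and on a shared-cache C2D the two numbers should come out nearly identical, which is consistent with the dual-core results above:)

#include <windows.h>
#include <stdio.h>
#include <string.h>

#define INTS (32 * 1024)            /* 128 KB: small enough to sit in one core's cache */
#define PASSES 4096

static int data[INTS];
static volatile long sink;

static double seconds(LARGE_INTEGER a, LARGE_INTEGER b, LARGE_INTEGER f)
{
    return (double)(b.QuadPart - a.QuadPart) / (double)f.QuadPart;
}

static long sweep(void)
{
    long sum = 0;
    for (int i = 0; i < INTS; i++)
        sum += data[i];
    return sum;
}

int main(void)
{
    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    memset(data, 1, sizeof(data));

    /* Case 1: pinned to core 0, so the working set stays warm in its cache. */
    SetThreadAffinityMask(GetCurrentThread(), 1);
    QueryPerformanceCounter(&t0);
    for (int p = 0; p < PASSES; p++)
        sink = sweep();
    QueryPerformanceCounter(&t1);
    printf("pinned:  %.3f s\n", seconds(t0, t1, f));

    /* Case 2: bounce between core 0 and core 1, forcing the data to follow. */
    QueryPerformanceCounter(&t0);
    for (int p = 0; p < PASSES; p++) {
        SetThreadAffinityMask(GetCurrentThread(), (p & 1) ? 2 : 1);
        sink = sweep();
    }
    QueryPerformanceCounter(&t1);
    printf("hopping: %.3f s\n", seconds(t0, t1, f));
    return 0;
}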
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Foxery
Originally posted by: Idontcare
This actually hurts the performance of the single-threaded app: as the thread migrates from core to core, the cache data it was accessing experiences a lag, because the data must go through the FSB (or through the HT link) to get to the core which now has the thread (albeit briefly). This is sometimes called "cache thrashing".

What was the original motivation for doing this? Spreading out the heat dissipation and general wear and tear, or just Microsoft's usual shortsightedness towards future hardware?

I had noticed this on my machine, and it bugged me. :/ Fortunately, I have Folding@Home running on all cores now.

Best I can tell, it's not so much done for any given reason as it was simply never done. In other words, it takes work to make the OS an active thread manager... you have to spend programmer hours creating an OS that actively manages thread affinity and core utilization (with feedback, of course, as you obviously couldn't get away with just setting a thread's affinity and forgetting about it; you need to come back every now and then to see if the core is underutilized or if the thread has finished, etc.).

So I chalk it up to another one of those things where the software guys look at the hardware guys and say "it's your job to make things go fast, so speed up your thread execution already... it's our job to give your customers something to do on your hardware, and that's all."
 

Foxery

Golden Member
Jan 24, 2008
1,709
0
0
Originally posted by: Lord Banshee
What's funny is you have no idea what size image I was using, good job. It was done with a 1400x1050x24-bit image, so roughly ~1.5 million pixels. I just redid the test with a 3000x3000x24-bit image, and the result showed a 0.5% increase with affinity set to 1. Again, less than 1%, and it most likely has everything to do with the L2 cache being shared. I could try it on my dual-core Opteron, maybe, but that still isn't close to what a quad would have to do.

Now, I never said my program was the best way to show these results, but the OP asked about both dual cores and quad cores, so I tested my dual. Where are your tests? If I had a quad core I would test it, but I don't. So if you don't want to contribute to the thread, maybe you should not post.

You said "3x3", and "for a class." This seemed pretty straightforward the way it was written; which apparently is not what you meant at all.

However, you're still missing how radically different an Athlon or Phenom's cache and bus architectures are from a C2Quad. Eat your own attitude about "contributing" to the thread, mkay?

Originally posted by: Idontcare
Best I can tell, it's not so much done for any given reason as it was simply never done. In other words, it takes work to make the OS an active thread manager... you have to spend programmer hours creating an OS that actively manages thread affinity and core utilization

This is beyond my programming experience, but isn't it more complicated to move a thread around than to keep it attached to a CPU for as long as possible? Or maybe what I should really be asking is: what causes a thread to leave its current CPU and not come back? Is it as simple as being pushed onto the stack while another app requests some cycles, then being picked off a shared stack by an arbitrary CPU?

Besides, Microsoft writes OS versions for servers, so it's not like a multi-CPU platform is news to them. It sounds like lazy-ass design to me. (Color me surprised!)

I don't suppose Vista's scheduler handles this scenario better? I don't have any Vista machines to try.
 

Lord Banshee

Golden Member
Sep 8, 2004
1,495
0
0
No i said a"3x3 image convolution filter". i.e. the kernel is 3x3 big, if you type that exact phrase in to Google you see what i was taking about. Not sure what classes you take but mine are not simple and straightforward. Maybe you should not assume so much?

And i know how different the core2duo and athlon/phenom cache's are and that is why i said "but that still isn't close to what a quad would have to do" where quad i meant core2quad, and "to do" is having to go though the FSB. And seeing how there is two companies out there, it might be a good test to see which one has a better implementation of executing single threaded code that is spread out the windows likes to do it. Personally i would agree with Idontcare here, AMD design with all four cores being able to communicate in such a fast connection you would think they would have an upper hand. But unfortunately it isn't the best performance due to its other hardware components that just are not as fast as Intel at the moment. But then again with Intel also having the internal MemController, QuickPath, multi-levels of shared cache with the Nehalem i think there would be no difference very soon.

But anyway is there a test that you would find appealing to run?