
Intel Xeon E5405 Harpertown (Core 2 Quad), multi-core performance issue

atom2626

Junior Member
Hi there,

I have a setup with two Intel Xeon E5405 Harpertown processors (Core 2 Quad class). I'm developing a demonstration of the false-sharing effect to be taught in the parallel processing course at my university.

My test involves two systems, each made of two threads working as producer-consumer on a buffer that may or may not reside in the same cache line as the other system's buffer. Each system always executes on a pair of cores that share an L2 cache.

I wanted to highlight the performance difference between having the two systems work on a single chip and on two separate chips. Given that even when both systems are on the same chip they don't share any cache, I was expecting the same performance in both cases. Surprisingly, the performance obtained in the multi-core configuration is worse than that obtained from the multi-processor setup (both with and without false sharing).

Using the VTune performance analyser, I discovered that the multi-core configuration was generating far more snoop-related traffic, but isn't that protocol only for multi-processor systems?

I also noticed many more L2 cache misses in the multi-core config, but I suspect this is a consequence of snoop requests.

To state it clearly: I was wondering if there is any communication protocol between the two dies of an Intel quad-core, likely related to cache coherency, that affects execution differently from the way it happens when using two processors.

If you have any suggestion as to what could explain my results, feel free to let me know.

Thanks,
Fred
 
First, welcome to the forums. Second, even though I won't move this thread, this question might be better answered in the highly technical forum.

Let's see how this thread develops.
 
To state it clearly: I was wondering if there is any communication protocol between the two dies of an Intel quad-core, likely related to cache coherency, that affects execution differently from the way it happens when using two processors.

I'll tackle this bit first. Yes, there is a coherency mechanism, which requires inter-socket communication.

I also noticed many more L2 cache misses in the multi-core config, but I suspect this is a consequence of snoop requests.
This may be underestimating your competence (my apologies in advance), but it's worth asking how you know when your threads are on the same die, and how you know when they run on different dies. As your post suggests, you seem to be observing the opposite of the effect you might expect.

My test involves two systems, each made of two threads working as producer-consumer on a buffer that may or may not reside in the same cache line as the other system's buffer. Each system always executes on a pair of cores that share an L2 cache.
I'm not sure what you mean by 'always executes on a pair of cores that share an L2 cache', when in the next paragraph you say:
I wanted to highlight the performance difference between having the two systems work on a single chip and on two separate chips

Aside from not quite understanding your setup, I wonder how big your working sets are. If they're big enough to overflow the caches, you might see better speeds from the unshared caches on two different chips.

Lastly, false sharing is very, very hard to capture in the macro-world. It's easier in the micro-world with simulation. The reason for this is that independent executions tend to hold line permissions as long as they need them, while the other execution just waits. That is, it is hard to get concurrent, unsynchronized access to the same cache block to actually interleave, rather than just serialize. This is by design of the coherence mechanism: serializing that behavior is usually the fastest, because of the overhead of cache-to-cache transfers.
 
First, thank you for your help and the welcome. I'm new to this forum; if you feel this thread belongs somewhere else, I defer to your judgement.

Here are some code details about the setup that may help in understanding it.

#define NO_FALSESHARING

/* Semaphores for synchronization. Each sem is 16 bytes, so in the
   false-sharing configuration they all reside in the same cache line. */

sem SemFreeSlotA;
sem SemFilledSlotA;

#ifdef NO_FALSESHARING
char PaddingA[128];
#endif

sem SemFreeSlotB;
sem SemFilledSlotB;


/* Buffer declarations: with false sharing, the two buffers reside
   within the same cache line. */

int BufferA[4];

#ifdef NO_FALSESHARING
char PaddingB[128];
#endif

int BufferB[4];

Threads 1 and 2 read and write BufferA, while threads 3 and 4 read and write BufferB; they don't do any useful work.

/* All threads are made from this template; only the semaphores waited
   on and posted differ, along with the buffer. */

Thread1:
wait on SemFreeSlotA;
read from BufferA;
compute NewData;
write to BufferA;
post SemFilledSlotA;


And about the processor: it's an Intel Core 2 Quad.

It is called this way because each processor (I have two of them) is made out of two Core 2 Duo dies.

The Core 2 Duo has 2 cores, each with its own L1 cache, sharing an L2 cache.


Back to the discussion:

Yes, there is a coherency mechanism, which requires inter-socket communication.

Yeah, I know about this, but I was wondering if there is a coherency mechanism between the two dies of a single processor that exists on top of the inter-socket communication. (Each processor has two "independent" bus interfaces, one from each die; are they really so independent?)


I use pthread_setaffinity_np() to set the mask that defines which cores a thread is allowed to execute on. This should answer:

how you know when your threads are on the same die, and how you know when they run on different dies

And don't worry about my competence, I'm here to learn ^_^

I wonder how big your working sets are

They are only four ints long, 16 bytes. The two buffers can reside in the same cache line, which is 64 bytes long.

I'm not sure what you mean by 'always executes on a pair of cores that share an L2 cache'

I think the description of the processor helps answer this question.

The reason for this is that independent executions tend to hold line permissions as long as they need them, while the other execution just waits. That is, it is hard to get concurrent, unsynchronized access to the same cache block to actually interleave, rather than just serialize.

I think this is the best path to investigate for now; it is consistent with some results obtained from the thread profiler. Do you have any information on which mechanism is responsible for that? I guess it is part of the system bus protocol.

As your post suggests, you seem to be observing the opposite of the effect you might expect.

Not the opposite; I expect the two situations to lead to pretty much the same results, since both systems must access RAM when there is a miss in L2. This wouldn't be the case if there were an L3 cache shared between the two Core 2 Duo dies, but there isn't one.

Thanks again. I hope this post helps in understanding the problem. I'm going to look at the bus sharing protocol; your last reply seems to point at the right place.

I'll let you know what I find, meanwhile let me know if you have any other ideas.

best,
Fred
 
1. Sorry for curt responses. I'm traveling and using my iPhone to type this.
2. I don't follow all the details of Intel's offerings, but I'm pretty sure multi-die single-package offerings are still way out of consumer range. I'm going to assume you have a two-socket Core 2 Quad setup. Please correct me if I am wrong.
3. How do you know what the thread mask means? I.e., how does the mask map to cores and chips? That is, logical p0 might or might not be on the same chip as p1.
4. I don't see your implementation of the semaphore, but I'll bet your code is way too synchronization-limited to see any effect from false sharing.
5. I suggest using [ code] [ /code] blocks for future code segments.
6. Your best chance of forcing false sharing without having your performance destroyed by synchronization is to ping-pong your buffers back and forth without sems or locks. E.g.:

6a. Initialize both buffers to 0.
6b. Thread t0 increments buffA, t1 increments buffB.
6c. Insert a store fence after each increment. That will force the stores out to the memory system. If you don't know what that means, google "memory consistency", or just trust me 🙂
7. MOD: if you want to evict this thread, we do this on the programming forum a lot.
8. Good luck
 
1. No problem, have a nice trip.

2. I know for sure that we are using dual-die single-package processors, and we have two of them (8 cores total). I don't know which resources, if any, are shared on the motherboard, though. I've asked for the motherboard ID; it may give me a hint.

3. I did some tests, before writing this code, to make sure that the masks were the ones I thought, and they were. The cores are numbered in logical order: 0-3 are the first four cores (0-1 sharing an L2), and 4-7 the same for the second processor.

4. My mistake, there is a for loop before the sem_wait making sure that the program executes for long enough (execution time is around 15 seconds).

5. Thanks, I'll do so

6. I like the idea and I will try it, but I was asked to take a look at semaphores, because they are commonly shared among multiple threads and "usually" declared in a serial fashion, no matter their threads' affinity; in short, common candidates for false sharing. Still, I like the idea of simplifying the code while investigating, thanks.

6a. Yes, I simplified the code to lighten the reading; it's already done.
6c. Again, thanks, I'll look at that; it makes sense.

7. Actually, we've drifted onto the code, but my main issue is related to the CPUs. My false sharing example works: I get a significant performance improvement. What I can't explain is why that performance is worse in the single-die situation compared to the dual-die one, when, from my point of view, they should be roughly the same.

8. Thanks a lot, I'm definitely not giving up. I won't work on this for the next two days (I'm only part-time on this project); I'll let you know what comes out of it.
 