question about memory access in NUMA environment

eLiu

Diamond Member
Jun 4, 2001
First, I apologize in advance if my terminology isn't quite right but hopefully my meaning is clear.

I'm specifically interested in the AMD Magny-Cours processor as used in this system:
http://www.nersc.gov/users/computational-systems/hopper/configuration/compute-nodes/
but if the answer varies by architecture I'd be curious about that too.

Hopper compute nodes consist of two Magny-Cours processors, each with 12 cores. Each 12-core processor is really two six-core dies (each with its own memory controller) "glued" together in a single package. So there are 4 NUMA nodes.

So consider one compute node with 32 GB of RAM. Server setups for this processor allow any core on any NUMA node to access all 32 GB of RAM. But each NUMA node (each w/its own mem controller) is only "directly" connected to 8 GB of RAM. And the NUMA nodes are connected to each other by some kind of super-high-bandwidth interconnect (HyperTransport?). That is my understanding anyway based on this:
http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/2
and
http://www.phys.uu.nl/~steen/web10/amd.php

So there's 1.5 full HT channels connecting NUMA nodes on the same package: 1 full HT channel connects nodes "across" from each other, whereas nodes "diagonal" from each other only get 0.5 channels. According to Wikipedia, a single HT channel is 16 bits wide, meaning the bandwidth is 25.6 GB/s (bidirectional), whereas the effective memory bandwidth is 14 GB/s per NUMA node (discussed in prev link).

I know that if cores from different NUMA nodes are trying to write to the same block of memory, slowdowns occur due to concurrency issues.

My question is: how is memory performance affected if I use only ONE core on the whole compute node? For example, say I need 32GB of RAM per core, so 23 cores will sit idle. Say that single core sits on NUMA node 0. Will that core interface with NUMA node 0's 8GB of RAM at optimum "speed" but experience slowdowns when reading/writing to RAM associated with the other 3 NUMA nodes?

My guess is that reaching the remaining 24GB of RAM involves going through the memory controllers on the other NUMA nodes and/or running out of HT bandwidth. In particular, accessing the memory tied to NUMA node 3 from NUMA node 0 would have less bandwidth if it uses the half-channel HT connection, as opposed to routing over full channels from node 3 to node 2 and then to node 0. Or does something else happen when only 1 core (or even 1 NUMA node) is in use? It doesn't seem like the interconnect can be bypassed, since each NUMA node's memory hangs off its own controller; thus the HT channels linking the NUMA nodes (well, just the diagonal links) would be the limiting factor.
 

degibson

Golden Member
Mar 21, 2008
It will be hard for a single core in a non-benchmark setting to saturate 25.6 GB/s across one link without first hitting some other bottleneck -- e.g., the maximum number of outstanding misses per core.

Accessing remote memory will be slower, but that's due to a latency difference.

The answer would be different with many cores participating. Performance interference can be a problem, and a lot of care needs to be put into the memory controllers to provide fairness guarantees (which probably aren't there).
 

eLiu

Diamond Member
Jun 4, 2001
It will be hard for a single core in a non-benchmark setting to saturate 25.6 GB/s across one link without first hitting some other bottleneck -- e.g., the maximum number of outstanding misses per core.

Accessing remote memory will be slower, but that's due to a latency difference.

The answer would be different with many cores participating. Performance interference can be a problem, and a lot of care needs to be put into the memory controllers to provide fairness guarantees (which probably aren't there).

Hmm how did I know you'd be the one to reply :)

My application is extremely memory-bandwidth limited (working with large, sparse matrices--matrix-vector multiplication in particular). Running on one (out of 4) cores of an Intel i7 950 (triple-channel DDR3), we come fairly close (90%+) to using all available memory bandwidth. And scaling the memory frequency sees a proportional change in the runtime. Tweaking latency (CAS, etc.) has little effect.

Here, I've defined "memory bandwidth" as synthetic benchmark results. For example, on that Core i7, triple-channel DDR3-1333 has a theoretical max bandwidth of about 32GB/s. In practice, using DAXPY (y=a*x+y, where x,y are vectors) type operations, I can get a max of 23GB/s using 3 cores running independent threads (this is in line with published results from review sites running things like SiSoft Sandra) and about 17GB/s using just 1 core. So my program gets close to that 17GB/s number in serial.

I had always assumed that the difference between the 1-core & many-core results was due to the fact that 1 core could not issue enough reads/writes to the memory controller per cycle. But I guess it isn't surprising that other issues can factor in too. Is the max number of outstanding misses per core tied to the size of the buffers used for OoO exec?

So it sounds like the answer to my original question is that 1 CPU core in, say, NUMA node 0 can access memory connected to any of the other NUMA nodes without loss of bandwidth but with potential increases in latency?

For your last comment about multiple cores, do you mean that if multiple CPU cores located anywhere try to access the same memory, performance loss could ensue? I understand that having multiple CPU cores (regardless of which NUMA node they sit on) writing memory can slow things down b/c of concurrency issues.

But can multiple CPU cores in different NUMA nodes reading the same memory cause issues? What if multiple CPU cores in a *single* NUMA node are reading the same memory? Are you saying that... say some CPU cores on NUMA node 0 want memory tied to NUMA node 2. Then NUMA node 2's mem controller has to service these requests in some sequence, so some cores will inevitably have to wait?
 

degibson

Golden Member
Mar 21, 2008
Hmm how did I know you'd be the one to reply :)
Because I'm always happy to help, of course! :)

My application is... redacted for brevity ... on that Core i7, triple-channel DDR3-1333 has a theoretical max bandwidth of about 32GB/s. In practice, using DAXPY (y=a*x+y, where x,y are vectors) type operations, I can get a max of 23GB/s using 3 cores running independent threads (this is in line with published results from review sites running things like SiSoft Sandra) and about 17GB/s using just 1 core. So my program gets close to that 17GB/s number in serial.
Without doing the detailed math, ballpark numbers seem OK. DAXPY is something I would consider a 'benchmark', but it does have some real-world scientific applications.

I had always assumed that the difference between the 1-core & many-core results was due to the fact that 1 core could not issue enough reads/writes to the memory controller per cycle. But I guess it isn't surprising that other issues can factor in too. Is the max number of outstanding misses per core tied to the size of the buffers used for OoO exec?

Related but not exactly the same. OoO cores have a set of registers called by a variety of names in industry; I call them MSHRs, or miss status holding registers. The number of MSHRs upper-bounds the number of outstanding cache misses (usually L1 misses) the core can support concurrently. It's not always a hard upper bound; sometimes misses to the same cache line can be coalesced onto a single MSHR entry (though I do not know if any real designs do this -- I have read about it, however).

Bottom line is that there's only a handful of MSHRs (I don't have exact numbers, but 4-16 seems to be my recollection). If you have a lot of misses back-to-back, the OoO core can't service them all concurrently, which is one reason why a single OoO core often can't saturate memory bandwidth except in extremely controlled situations, e.g., with prefetching operations that don't consume MSHRs (though again, I don't know what is available on a given platform).

So it sounds like the answer to my original question is that 1 CPU core in, say, NUMA node 0 can access memory connected to any of the other NUMA nodes without loss of bandwidth but with potential increases in latency?
Right; assuming other cores are idle, there is a perceptible latency difference and probably not much BW difference.

For your last comment about multiple cores, do you mean that if multiple CPU cores located anywhere try to access the same memory, performance loss could ensue? I understand that having multiple CPU cores (regardless of which NUMA node they sit on) writing memory can slow things down b/c of concurrency issues.
It's just queuing theory at this point. Multiple actors accessing the same resource will lead to queuing delays. If one core can pump out 17 GB/s, then 8 cores could easily demand more BW than a memory controller can provide. Requests back up, have to wait, must respect that 32 GB/s bottleneck. I.e., queueing delay = performance interference.

But can multiple CPU cores in different NUMA nodes reading the same memory cause issues?
Depends on the cache hierarchy, really. It's not entirely clear from the link you provided, but I would guess that the L3 cache on your chip in question is 'memory-side', i.e., the first access would populate that L3 and shield the memory controller from subsequent accesses to the same cache line.

What if there are multiple CPU cores in a *single* NUMA node reading the same memory? Are you saying that... say some CPU cores on NUMA node 0 want memory tied to NUMA node 2. Then NUMA node 2's mem controller has to service these requests in some sequence, so some cores will inevitably have to wait?

See above; contention always leads to queuing. Where the queuing actually happens is system dependent; it probably happens in lots of places due to back pressure.

In general, caches help if there's temporal or spatial locality. They both improve the latency of cache hits, and shield the rest of the memory system from bandwidth demand.

Cores fanning in to a single memory controller -- whether it's a NUMA or UMA architecture -- have the same effect: queuing and contention. Sometimes it's not even fair -- sometimes a nearby core will observe twice the bandwidth of a distant core. MCs are, on the whole, pretty simple entities.

Up a level: actually saturating all of your memory bandwidth takes a lot -- a LOT -- of tuning, and you'll have to repeat it for every chip on which your software runs. It's a lot easier for a DAXPY-like workload than it is for other things, though, so it's not hopeless.