First, I apologize in advance if my terminology isn't quite right, but hopefully my meaning is clear.
I'm specifically interested in the AMD Magny-Cours processor as used in this system:
http://www.nersc.gov/users/computational-systems/hopper/configuration/compute-nodes/
but if the answer varies by architecture I'd be curious about that too.
Hopper compute nodes consist of two Magny-Cours processors, each with 12 cores. Each 12-core processor is really two six-core dies (each with its own memory controller) "glued" together in a single package. So each compute node has 4 NUMA nodes.
So consider one compute node with 32 GB of RAM. Server setups for this processor allow any core on any NUMA node to access all 32 GB of RAM. But each NUMA node (each with its own memory controller) is only "directly" connected to 8 GB of RAM, and the NUMA nodes are connected to each other by some kind of high-bandwidth interconnect (HyperTransport?). That is my understanding anyway, based on this:
http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/2
and
http://www.phys.uu.nl/~steen/web10/amd.php
So there are 1.5 full HT channels connecting NUMA nodes in the same package: 1 full HT channel connects the nodes "across" from each other, whereas nodes "diagonal" from each other only get 0.5 channels. According to Wikipedia, a single HT channel is 16 bits wide, which gives a bandwidth of 25.6 GB/s (bidirectional), whereas the effective memory bandwidth is 14 GB/s per NUMA node (discussed in the previous link).
I know that if cores from different NUMA nodes are trying to write to the same block of memory, slowdowns occur due to concurrency issues.
My question is: how is memory performance affected if I use only ONE core on the whole compute node? For example, say I need 32 GB of RAM for that core, so the other 23 cores will sit idle. Say that single core sits on NUMA node 0. Will that core access NUMA node 0's 8 GB of RAM at optimum "speed" but experience slowdowns when reading/writing the RAM attached to the other 3 NUMA nodes?
My guess is that reaching the remaining 24 GB of RAM involves going through the memory controllers on the other NUMA nodes and/or being limited by HT bandwidth. In particular, accessing the memory tied to NUMA node 3 from NUMA node 0 would have less bandwidth if it uses the half-channel HT link, as opposed to hopping over full channels from node 3 to node 2 and then to node 0. Or does something else happen when only 1 core (or even 1 NUMA node) is in use? That doesn't seem possible, since each NUMA node has its own memory connection; thus the HT channels linking the NUMA nodes (well, just the diagonal links) are the limiting factor, unless they are bypassed in certain scenarios.
