The NUMA Architecture and Its Future.

AGodspeed

Diamond Member
Jul 26, 2001
3,353
0
0
In a NUMA (Non-Uniform Memory Access) system, each processor has its own memory, but can also access memory owned by other processors (memory access is faster when a processor is accessing its own memory). Compaq's EV67 Alpha processor was, I believe, the first NUMA-based Alpha processor (released in 2000).

Some NUMA systems have suffered from variable latencies in fetching data from memory. In general, what advantages are there to a NUMA-based system architecture over other current system architectures? How does NUMA relate to MIPS, CISC, and RISC designs, and what are the effects on SMP and OS configurations?

What current microprocessors besides the Alpha EV67 are NUMA-based? The upcoming ClawHammer and SledgeHammer processors are based on a NUMA architecture; what other NUMA designs or NUMA variants are coming in the future?
 

CSoup

Senior member
Jan 9, 2002
565
0
0
NUMA is really a characteristic of the architecture of the memory controller, not the processor. Basically, in a parallel system you can have either a NUMA (non-uniform) or a UMA (uniform) memory model. SMP is an example of UMA, since all processors in an SMP box access memory at the same speed; there is no concept of far vs. close memory in an SMP box. NUMA is better because you use the memory at whatever speed you can instead of slowing everything down to make it UMA. There have been machines that were crippled by not treating local memory specially, so that they could be marketed as SMP machines. The reason for this is that once memory has different access times, programmers feel more obligated to optimize their code for the architecture, and many did not want to do this.
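The UMA vs. NUMA distinction above can be sketched with a toy cost model. All latency numbers here are made-up illustrative values, not measurements from any real machine:

```python
# Toy model of UMA vs. NUMA memory access cost.
# All latencies (in ns) are illustrative assumptions only.

UMA_LATENCY = 120   # on a UMA/SMP box, every access costs the same

NUMA_LOCAL = 80     # access to the processor's own memory
NUMA_REMOTE = 160   # access to memory owned by another node

def uma_access(cpu_node, mem_node):
    """On a UMA machine, the location of the memory doesn't matter."""
    return UMA_LATENCY

def numa_access(cpu_node, mem_node):
    """On a NUMA machine, cost depends on whether the memory is local."""
    return NUMA_LOCAL if cpu_node == mem_node else NUMA_REMOTE

# A NUMA machine rewards locality: code whose accesses mostly stay
# local beats the UMA machine's flat latency.
print(numa_access(0, 0))  # local access:  80
print(numa_access(0, 1))  # remote access: 160
print(uma_access(0, 1))   # always:        120
```

This is also why the "crippled" machines mentioned above were a bad deal: treating all memory as equally far away means paying the remote cost everywhere, just to keep the flat UMA model.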
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0


<< Compaq's EV67 Alpha processor was, I believe, the first NUMA-based Alpha processor....How does NUMA relate to MIPS, CISC, and RISC designs >>

I get the impression that you're a bit confused about NUMA...it is by no means a new concept, and it's not necessarily dependent on any particular microprocessor implementation or microarchitecture. It's merely part of the layers of abstraction for parallel processing: programming model (data parallel, shared address, message passing) -> communication abstraction (compilation or library) -> user/system boundary (OS support) -> hardware/software boundary (communication medium and hardware)....NUMA is all the way down there at the bottom, one of the many factors that go into the construction of a multiprocessing system. There has been a UMA/NUMA hierarchy present in shared-memory multiprocessing for a long time, and shared memory has been the most successful form of parallel computation to date. There's no reason (and it's been done many times) that a large NUMA system couldn't consist of, for example, nodes containing 2-way P3s on a shared bus. HP's latest Superdome series for the PA8700 has (IIRC) 64 nodes, each node supporting 2-way SMP using McKinley's shared bus...when McKinley comes out, it will be supported in the same systems.



<< Some NUMA systems have suffered from variable latencies in fetching data from memory >>

Actually, that's the whole point of NUMA. :)



<< In general, what advantages are there to a NUMA-based system architecture over other current system architectures >>

Sticking strictly to UMA vs. NUMA shared-address multiprocessing (data parallel and message passing each have their own caveats), NUMA scales higher simply because UMA memory access eventually becomes expensive to implement and prohibitive for performance...trying to maintain constant latency as the system scales will only hurt latency. Systems of this type can be configured in a number of ways (bus interconnect, multistage network, crossbar switch) and typically go from 2 to 64 processors (IIRC Sun's Gigaplane bus goes up to 30 MPUs). NUMA systems historically consist of nodes connected to a scalable network, with its own possibilities of funky topologies (ranging up to "4-D" hypercubes :)). NUMA, as the name implies, is characterized by low latency to local memory and higher latency to remote memory. For shared-address machines, this can cause difficulties for multiprogramming and synchronization, since you would like critical data to be accessed locally.
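The scaling argument can be made concrete with a hypercube topology like the one mentioned above: in a binary hypercube of 2^d nodes, the worst-case distance between two nodes is d hops, so remote latency grows only logarithmically with machine size. The per-hop and local latency figures below are assumptions for illustration, not numbers for any real network:

```python
import math

def hypercube_diameter(n_nodes):
    """Worst-case hop count between nodes in a binary hypercube.

    n_nodes is assumed to be a power of two (2^d nodes -> d hops).
    """
    return int(math.log2(n_nodes))

def worst_case_latency(n_nodes, local_ns=80, per_hop_ns=50):
    """Assumed model: each network hop adds a fixed cost on top of
    the local access latency. Both figures are illustrative only."""
    return local_ns + hypercube_diameter(n_nodes) * per_hop_ns

# Worst-case remote latency grows slowly as the machine scales,
# which is why NUMA networks scale where a flat UMA bus cannot.
for n in (2, 16, 64):
    print(n, worst_case_latency(n))
```

A UMA design, by contrast, would have to impose that worst-case cost on every access to keep latency uniform, which is exactly the "hurting latency to keep it constant" problem described above.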

Hammer, among other MPUs such as the Alpha EV7, Sun US-III (IIRC), and IBM Power4, possesses integrated high-bandwidth interconnects for direct CPU -> CPU communication, facilitating "plug-and-play" NUMA-style multiprocessing using a bus network defined by the MPU designer (dubbed "glueless MP", which is a bit of a misnomer since that's an older term specifying an MPU with integrated MP logic). Personally, I wonder whether NUMA is really necessary for Sledgehammer, whose 3 HT links will only enable up to 8-way MP (2 links and 2-way MP for Clawhammer). The EV7, for example, has 4 interconnect links for up to (IIRC) 64-way MP. Although Hammer's dedicated links will provide great bandwidth, the latency issue is a concern. AMD says the average unloaded latency for remote access is 140ns for 4-way systems, and 160ns for 8-way...we'll see if they're right. On the other hand, the integrated memory controllers mean that the average latency could decrease as CPU speeds increase, but by no means with a direct relationship. I do believe, though, that this "glueless MP" style multiprocessing could become universal for small to medium scale systems (after all, that's what Intel has in mind with Infiniband, so I'm sure we'll see the same thing in Itanium in the near future).
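AMD's quoted remote-latency figures can be plugged into a simple blended-latency estimate to see how much the locality of a workload matters. The local latency and local-access fraction below are assumptions for illustration; the post only quotes the remote figures:

```python
def avg_latency(local_ns, remote_ns, local_fraction):
    """Average memory latency, weighted by the fraction of accesses
    that hit local memory."""
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

LOCAL_NS = 80  # assumed local latency for an integrated controller; NOT an AMD figure

# AMD's quoted average unloaded remote latencies, assuming 75% of
# accesses stay local (the 75% figure is also an assumption):
print(avg_latency(LOCAL_NS, 140, 0.75))  # 4-way: 95.0 ns
print(avg_latency(LOCAL_NS, 160, 0.75))  # 8-way: 100.0 ns
```

The point of the sketch: with dedicated links keeping the remote penalty this small, the blended average stays close to local latency even with a fair amount of remote traffic, which supports the idea that Hammer-style "glueless MP" is viable without aggressive NUMA tuning at small scales.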

edit: ah well, I took too long to write this. :)
 

Elledan

Banned
Jul 24, 2000
8,880
0
0
Okay, so the Hammer CPUs will support NUMA.

Next question: can one just take a couple of these CPUs and link them together? What kind of mainboard is required? Anything else?