All current Hammers have 3-bit PIDs (Processor IDs).
Among other things, the PID is used by the HT packet router to identify the sender and receiver of coherency traffic.
Each CPU is assigned its own unique PID, as well as the maximum PID used in the system. The static HT routing tables are then programmed, but only with values less than or equal to this "max PID".
What AMD does is restrict the maximum allowed maxPID of the CPU.
000b is max for the 1xx series, FX-xx and A64.
001b is max for the 2xx series.
111b is max for the 8xx series.
AMD could make a 5xx series by limiting the maxPID value to 100b.
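Just restating that mapping in code (a sketch; the maxPID values are the ones listed above, and reading them as "PIDs 0..maxPID, so maxPID+1 CPUs" is my assumption from how the series line up):

# maxPID is a 3-bit field; a cap of N allows PIDs 0..N, i.e. N+1 coherent CPUs
series_maxpid = {"1xx/FX/A64": 0b000, "2xx": 0b001, "8xx": 0b111, "hypothetical 5xx": 0b100}
for series, maxpid in series_maxpid.items():
    print(f"{series}: maxPID = {maxpid:03b}b -> up to {maxpid + 1} CPUs")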
The reason AMD has limited the K8 to 8-way is the simple and wasteful broadcast coherency protocol. As the number of CPUs goes up, so does the coherency overhead, but by a much larger amount.
2-way topology:
O---O-- I/O
Each memory request generates 20 bytes' worth of ReadRequest-Probe-ProbeResponse-Done traffic. On top of that comes the cacheline that may need to be transferred, which is 68 bytes; however, that is ignored in this analysis.
The aggregate 1GHz HT link bandwidth to coherency overhead ratio is then 8GB/s / 20 bytes = 400 MegaRequests/s = 200 MegaRequests/s/CPU.
If two HT links are used, it's 800 MegaRequests/s = 400 MegaRequests/s/CPU, but no motherboard does this, probably because it's expensive for very little gain.
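A quick back-of-the-envelope check of these numbers (a sketch in Python; the 8GB/s per-link figure and the per-request byte counts are taken straight from the text, everything else is just arithmetic):

# Aggregate coherent-link bandwidth divided by average protocol bytes per request
def mega_requests(links, bytes_per_request, cpus, link_mb_per_s=8000):
    per_system = links * link_mb_per_s / bytes_per_request   # MegaRequests/s, whole system
    return per_system, per_system / cpus                     # and per CPU

print(mega_requests(links=1, bytes_per_request=20, cpus=2))   # ~(400.0, 200.0)
print(mega_requests(links=2, bytes_per_request=20, cpus=2))   # ~(800.0, 400.0)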
Optimal 4-way topology (not used in any boards currently, AFAIK):
O---O-- I/O
|.\../
|..X
|./..\
O---O-- I/O
Each memory request generates 38.67 bytes on average. If the two I/O links weren't needed and could instead be connected and used for coherency traffic, it would be 36 bytes for every request, so the extra 2.67 bytes on average are due to the asymmetry.
Bandwidth to overhead ratio:
5*8GB/s / 38.67 bytes = 1034.5 MegaRequests/s = 258.6 MegaRequests/s/CPU
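Same arithmetic as the 2-way sketch, standalone (5 coherent links is my count: 4 CPUs x 3 HT links each, minus the 2 links reserved for I/O, halved):

print(5 * 8000 / 38.67, 5 * 8000 / 38.67 / 4)   # ~1034.4 MegaRequests/s, ~258.6 per CPU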
4-way topology used in current boards:
O---O-- I/O
|......|
|......|
|......|
O---O-- I/O
Each memory request generates 44 bytes on average.
Bandwidth to overhead ratio:
4*8GB/s / 44 bytes = 727.3 MegaRequests/s = 181.8 MegaRequests/s/CPU
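And the same check for the plain square (4 coherent links, per the diagram):

print(4 * 8000 / 44, 4 * 8000 / 44 / 4)   # ~727.3 MegaRequests/s, ~181.8 per CPU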
Optimal 8-way topology:
+---O---O-- I/O
|.....|.....|
|....O....O
|.....|.\./.|
|.....|..X..|
|.....|./.\.|
|.....O...O
|.....|.....|
+---O---O-- I/O
Only two nodes, the "I/O" nodes, are 3 hops from each other. All other combinations are 2 hops or less.
Each memory request generates 91.6 bytes on average.
Bandwidth to overhead ratio:
11*8GB/s / 91.6 bytes = 960.7 MegaRequests/s = 120.1 MegaRequests/s/CPU
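Checking the 8-way figure the same way (11 coherent links: 8 CPUs x 3 HT links each, minus the 2 I/O links, halved):

print(11 * 8000 / 91.6, 11 * 8000 / 91.6 / 8)   # ~960.7 MegaRequests/s, ~120.1 per CPU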
With a NUMA-optimized OS and apps, 8-way presents no problem bandwidth-wise. Another issue is latency. A fetched cacheline can only be committed to cache and used once coherency has been resolved. That means waiting for the farthest-away CPU to respond. For 6 of the CPUs this is 2 hops away; for the two "I/O" CPUs it's 3 hops.
However, the biggest problem is with non-NUMA 8-way. In this case the HT links will need to transport 7 out of every 8 cachelines.
The total number of bytes, including cachelines, that need to be transported per request is 189.3 bytes on average.
Bandwidth to total request size ratio:
11*8GB/s / 189.3 bytes = 464.8 MegaRequests/s = 58.1 MegaRequests/s/CPU
= 3.46GB/s/CPU
So, in a non-NUMA 8-way Opteron, the theoretically usable max memory bandwidth per CPU is 3.46GB/s, which means almost half the bandwidth of dual-channel DDR400 goes unused.
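Both numbers can be reproduced as follows (the request rate follows directly from the figures above; treating the cacheline traffic as 7 of every 8 requests moving a 68-byte cacheline packet across the links, i.e. 59.5 bytes per request, is my assumption for how the 3.46GB/s was arrived at):

per_cpu = 11 * 8000 / 189.3 / 8                  # ~58.1 MegaRequests/s per CPU
print(per_cpu, per_cpu * (7 / 8) * 68 / 1000)    # ~58.1, ~3.46 GB/s of cacheline traffic per CPU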
On top of this comes latency that is much worse than the already poor latency of the NUMA case.
Going beyond 8-way would create lots of 3-hop "pairs" for each CPU added.
Under optimal NUMA conditions, a request in a 16-way Opteron would consume more than 220 bytes, which works out to only 52 MegaRequests/s/CPU.
For non-NUMA it becomes more than 360 bytes per request on average, and less than 1.9GB/s of usable memory bandwidth per CPU. Not to mention the terrible latency.
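For the NUMA-optimized 16-way case, that 52 works out as follows (23 coherent links is my assumption: 16 CPUs x 3 HT links each, minus the 2 I/O links, halved; the 220-byte figure is from the text):

print(23 * 8000 / 220 / 16)   # ~52.3 MegaRequests/s per CPU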
Edit: His diagrams didn't come over in preview very well so I added periods for empty spaces.
