SMP second processor idea

MadRat

Lifer
Oct 14, 1999
12,014
321
126
Intel has supported multiprocessing for a while now and is gaining a lot of headway with each successive generation, even bringing a form of it to a single core. AMD has been putting a lot of work into its next-generation multiprocessor platform, the Opteron (Hammer). This comes after a huge success with the Athlon MP.

One new idea coming out of the Hammer development is that one processor can forward information in its cache to another processor over a high-speed link based on HyperTransport (HT) technology. (Funny how Intel recycles the "HT" abbreviation for its Hyper-Threading technology...) Something around 20GB/second of throughput will be possible along the HT pathways in an 8-way Opteron system.

With the advent of HyperTransport in SMP systems, it would seem that multiple processors could benefit from a "dummy processor" containing only old cache lines already evicted by the real CPUs, acting as an L3 cache, only faster than what you'd normally expect from L3. Why faster? Because the entire "dummy processor" could be full-speed on-die cache of around 2MB, using about the same die area as a regular CPU core. Minimal logic could be built into this "dummy processor" so that it never actually does any work but merely acts as the keeper of old cache data. An 8-way system using one of these could be referred to as a "7+1-way system": 7 CPUs and 1 dummy processor.

Is this idea plausible as a replacement for conventional L3 designs, or would it be too complicated to implement?

 

Mday

Lifer
Oct 14, 1999
18,647
1
81
As we saw with the K6-III, L3 cache certainly can help... but there is no on-board L2 cache to be turned into L3 here. Sure, they could design for an L3 implementation on motherboards, but why bother?

The money is better spent on MORE L2 and faster system memory. We already have several levels of memory, including virtual memory (the HDD), so why complicate things? The system takes a hit every time it has to look for data (on top of the cost of actually accessing the slower source), so why add another place to look?

Of course there are intentional L3 implementations on the market, but those are mainly for servers, which require faster-than-system-memory speed and more capacity than the L2 implementation can provide. Some servers don't even need it.
 

MadRat

Lifer
Oct 14, 1999
12,014
321
126
Apple's machines are coming with an L3 design. The P4 was originally crippled because the released product omitted the external L3 cache it was once going to carry. Older processors had an L3 cache and it made a huge difference for them. The truth is that an L3 cache is worthwhile and people will pay for it if it's available. Plenty of Socket 7 motherboard makers opted to include rather than exclude the L3 cache, because leaving it out would have made their product perform very poorly in comparison to other products. But this argument has nothing to do with my idea.

I am talking about several CPUs in parallel, not just one.
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
The older processors had an L3 cache and it made huge differences for them. The truth is that an L3 cache is worthwhile and people will pay for it if it's available.

Exactly. I would love to be able to have 2MB of cache on my dual Athlon setup. I have a 600MHz Alpha that came with no L3 cache; once I added 2MB of L3, CPU-intensive operations like a Linux kernel compile took just half as long.

As for the dummy processor, I don't know enough about processor or HT design to know if it would be workable. Even though it would be faster than main memory, it would still be slower than local cache and would need locking to prevent races, so I'm not sure it's worth it. It would probably be more beneficial to just increase the amount of cache on the processors.
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
When both processors want to change the same data from two different threads or processes, you either have a lock you need to acquire (one gets the lock, the other sleeps or spins on the lock until it gets its turn) or you risk corrupting data because both processors 'race' to see who gets to the data first.
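The lock-versus-race tradeoff above can be sketched in a few lines (a minimal Python illustration, not kernel code): two threads do a read-modify-write on shared data, and the lock is what keeps the interleaving safe.

```python
import threading

counter = 0                      # shared data both "processors" want to change
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:               # one thread gets the lock; the other waits its turn
            counter += 1         # the read-modify-write is safe only under the lock

t1 = threading.Thread(target=add, args=(100_000,))
t2 = threading.Thread(target=add, args=(100_000,))
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)                   # 200000
```

Without the `with lock:` line, the two threads race on the load/store pair and the final count can come up short, which is exactly the corruption described above.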
 

MadRat

Lifer
Oct 14, 1999
12,014
321
126
Why don't they split the data into two new copies based on the original, let each change its copy as it sees fit, then compare the changes when they are done? Software has been doing this for twenty-five years.
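The "change privately, then compare" scheme described above is essentially optimistic concurrency. A hypothetical sketch (the `Cell` class and its version counter are illustrative, not anything AMD shipped): each worker modifies a private copy and commits only if nothing changed underneath it, retrying otherwise.

```python
class Cell:
    """A shared value plus a version counter for detecting concurrent writes."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def optimistic_update(cell, fn):
    while True:
        seen_version = cell.version
        new_value = fn(cell.value)        # work on a private result, no lock held
        if cell.version == seen_version:  # did anyone commit behind our back?
            cell.value = new_value
            cell.version += 1
            return new_value
        # version moved: someone else committed first, so redo the work

c = Cell(10)
optimistic_update(c, lambda v: v + 5)
print(c.value)  # 15
```

Note the catch: in real hardware the final check-and-commit must itself be atomic (e.g. a compare-and-swap instruction), which this single-threaded sketch elides; that hidden atomic step is why "just compare afterwards" doesn't remove the need for synchronization.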
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Some things need to happen in succession; you can't have both running at the same time, so a lock is required.

What you propose requires more code and memory, and sounds slower. Usually you try to stay in a critical section for under 10ms, so there's very little speed penalty from having one process spinlock until the data is available.

For data that is read a lot and written seldom, separate read and write locks can be used: you can have 30 concurrent reads, but if someone wants to update the data, all the readers lose their read lock until the writer is finished.
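The read/write lock described above can be sketched with a condition variable (a minimal, illustrative implementation; Python's standard library has no built-in reader-writer lock):

```python
import threading

class RWLock:
    """Many concurrent readers, or one exclusive writer (minimal sketch)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:              # readers wait out an active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()      # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:  # writer needs exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()          # wake both readers and writers
```

Any number of threads can hold the read side at once, but `acquire_write` blocks until every reader has released, matching the "readers lose out until the writer is finished" behavior above.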

Or maybe I'm misinterpreting what you're saying...
 

m0ti

Senior member
Jul 6, 2001
975
0
0
Data races are the programmer's responsibility. Writing code that doesn't use locks (or other synchronization methods) can cause data races even on a single-processor machine (on an OS that supports pre-emption, of course).

The bigger problem is maintaining cache consistency! Inconsistency arises if a processor reads data from the shared cache (storing it locally) and a write to that data then happens elsewhere; the processor proceeds to read the stale data from its own cache on subsequent reads.

This means that on a write to the shared cache (the dummy processor), it has to notify all the processors to invalidate that block in their respective local caches (unless the caches are completely exclusive).
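The invalidate-on-write rule described above can be modeled in a few lines (a toy write-invalidate sketch; real protocols such as AMD's MOESI track per-line states rather than whole caches):

```python
def write(block, value, writer, caches):
    """Write-invalidate: drop every other cache's copy, then update the writer's."""
    for cache in caches:
        if cache is not writer:
            cache.pop(block, None)     # invalidate any stale copy elsewhere
    writer[block] = value              # the writer now holds the only fresh copy

cpu0, cpu1, dummy = {}, {}, {}
caches = [cpu0, cpu1, dummy]

for c in caches:                       # everyone starts with a copy of block "A"
    c["A"] = 1

write("A", 2, cpu0, caches)            # cpu0 updates A; cpu1 and the dummy are invalidated
print(cpu0.get("A"), cpu1.get("A"), dummy.get("A"))   # 2 None None
```

After the write, a read of "A" on cpu1 misses and must re-fetch the fresh value, which is how the protocol prevents the stale-read problem described above.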

AMD has already demonstrated a fairly good ability to maintain cache consistency with its MP chipset, and I doubt that Hammer will be in a worse situation.

The whole architecture seems to smack of exclusivity. Thus the memory that a processor is in charge of stays under its control (i.e. it is responsible for all writes/invalidates, or snooping could be used). I'd imagine the "dummy processor" would be used as a sort of large prefetch/victim cache: on a miss, other processors would redirect their query to the dummy processor, which would transfer the data to the processor responsible for that piece of memory (i.e. load the data back into the normal cache hierarchy).
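The miss path described above — local cache first, then the shared dummy/victim cache, then main memory — might look like this (a hypothetical model of the lookup order, not AMD's actual design):

```python
def load(addr, local, dummy, memory):
    """Return (value, where-it-was-found) following the proposed lookup order."""
    if addr in local:
        return local[addr], "local hit"
    if addr in dummy:                  # an evicted line still parked in the dummy CPU
        local[addr] = dummy.pop(addr)  # refill the local cache from the victim cache
        return local[addr], "dummy hit"
    local[addr] = memory[addr]         # full miss: all the way out to DRAM
    return local[addr], "memory"

memory = {0x10: 42}
local, dummy = {}, {0x10: 42}          # the line was evicted into the dummy cache

value, where = load(0x10, local, dummy, memory)
print(value, where)                    # 42 dummy hit; a second load would hit locally
```

The value of the scheme hinges on the middle branch being much cheaper than the last one; if the dummy cache misses too, you've paid an extra hop for nothing, which is the indirection concern raised below.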
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Data races are the programmer's responsibility. Writing code that doesn't use locks (or other synchronization methods) can cause data races even on a single-processor machine (on an OS that supports pre-emption, of course).

Of course it is, but the 'shared' cache he's proposing would either add extra locking code to the kernel or require hardware locking (like the memory bus lock on an Intel SMP system), because things that would normally be in the local cache might now be shared between processors.

I'd imagine that the "dummy processor" would sort of be used as a large prefetch/victim cache. On a miss, other processors would redirect their query to the dummy processor which would transfer the data to the processor responsible for this piece of memory (i.e. load the data back into the normal cache system).

Doesn't that seem like an unnecessary level of indirection? With that miss path, the dummy processor either gives them the data, or replies with another miss and they have to go to main memory, or it fetches the data from main memory for them.
 

m0ti

Senior member
Jul 6, 2001
975
0
0
Of course it is, but the 'shared' cache he's proposing would either add extra locking code to the kernel or require hardware locking (like the memory bus lock on an Intel SMP system), because things that would normally be in the local cache might now be shared between processors.

This is a problem that exists anyway. As I pointed out, either they use some method to maintain cache consistency (with the dummy processor or without it), or they maintain exclusivity between the caches on the various processors. As I also pointed out, the MP chipset has a pretty good method of maintaining cache consistency without paying the performance hit of exclusivity; I'd expect the same in Hammer, and it's certainly implementable for a dummy processor as well. And exclusivity is always an option (though a poorer one, IMO).

Doesn't that seem like an unnecessary level of indirection? With that miss path, the dummy processor either gives them the data, or replies with another miss and they have to go to main memory, or it fetches the data from main memory for them.

Yes, it would count as another miss. The dummy processor would in fact be acting as a sort of shared L3 cache for the other processors, a separate level in the memory hierarchy.

I'm not saying this would be recommended. I think Hammer's architecture is great, and adding this sort of pseudo-L3 cache could benefit performance, but it would make memory management more difficult and be unnecessarily complex. If they wanted more performance out of the platform, they could probably get it by either increasing cache sizes (as has been pointed out) or increasing the bandwidth of the HyperTransport links interconnecting the processors.