I was thinking about all the latency trouble chips have with RAM the other day and came up with this crazy idea:
Currently the processor determines it needs some data, checks its own caches, potentially checks other processors for the latest copies, and finally goes to DRAM to get what it needs. This is quite time consuming even with all the smart prefetch logic implemented on modern processors.
Would it actually be useful to have this implementation:
Fully buffered RAM. It takes requests for data, puts them in the appropriate buffer and transmits "when the time is right".
The processor in question would have its prefetch algorithms optimized so that it would extract all the required RAM addresses from the instruction cache at the moment the cache is being filled. That way all the RAM addresses could be sorted and checked against the caches ahead of the code actually being executed. I suppose this functionality partly exists in today's prefetch implementations already.
Anyway, the addresses not already in the caches would be sent to the RAM modules (along with some priority data) for retrieval, which would in turn eventually fill the caches with all the potentially relevant data.
Since L1 I-caches are rather small (a few microseconds' worth of instructions), perhaps a processor could have a separate interface through which the OS would send it the code that will be executed in the next time slice. Analyzing this code would achieve the desired effect, plus the processor could break the instructions into micro-ops in advance and prepare the whole lot for insertion into the I-cache at the instant the code is scheduled by the OS.
The goal of this implementation is to have all the data in the caches by the time the code actually requires it. Ideally, a thread would request the data before being swapped out by the OS scheduler, so it would find everything already in the cache when it is scheduled for execution again.
Of course, the processor has to have caches large enough not to evict the prefetched data in favor of the currently running threads, which would defeat the purpose of such a prefetch scheme. It also makes sense to have shared L2/L3 caches, or at least some very fast method of moving data between caches.
I hope I was clear enough in my rambling. Could this approach pay off in any scenario, or are processors simply too fast nowadays for this method to have any tangible effect, either because the predecoded data set is too small or because processors already do all of this?