
Possible design of hyper-threading for VLIW CPUs

m0ti

Senior member
Hi,

Just had an idea which seemed fairly suitable for hyperthreading on VLIW CPUs, scalable as wide as you can get your bus. I'd be interested in any comments about it; if anyone is knowledgeable about how Itanium does (or will do?) this, please add your $0.02.

Ok, these are my requirements (for 2 threads):

double the number of registers (one set for each thread),
double the bus width,
+ some additional, fairly simple hardware.

I'm going to assume that each VLIW is 128 bits wide, containing four 32-bit instructions. Today, ILP is pretty much limited to a max of 4 instructions at one time (there are cases with higher ILP, though they're fairly rare); different threads, though, are clearly independent of one another (and where they're not, they synchronize nicely with test & set or similar mechanisms). And 4 instructions are still going to leave a lot of functional units free. The bus is 256 bits wide so that the VLIWs for both threads can be fetched at once.

I'm also assuming, for simplicity of implementation, that the no-op is 0x00000000.
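To make the assumed layout concrete, here is a minimal sketch of splitting one 256-bit fetch into the two threads' bundles. The function name `split_fetch`, the choice of putting thread 1 in the low 128 bits, and the slot ordering (slot 0 in the least-significant 32 bits) are all my assumptions for illustration; the post only fixes the widths.

```python
SLOT_BITS = 32                   # one instruction
SLOTS = 4                        # instructions per VLIW
BUNDLE_BITS = SLOT_BITS * SLOTS  # 128-bit VLIW

def split_fetch(word256):
    """Split one 256-bit fetch into (thread1_slots, thread2_slots).

    Assumed layout (not from the post): thread 1 occupies the low
    128 bits, and slot 0 is the least-significant 32 bits of a bundle.
    """
    bundle_mask = (1 << BUNDLE_BITS) - 1
    slot_mask = (1 << SLOT_BITS) - 1
    halves = (word256 & bundle_mask, (word256 >> BUNDLE_BITS) & bundle_mask)
    return tuple(
        [(half >> (SLOT_BITS * i)) & slot_mask for i in range(SLOTS)]
        for half in halves
    )
```

A slot that comes back as 0 is a no-op under the 0x00000000 assumption, so empty slots can be found with a plain comparison against zero.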

Initialization: The instructions are loaded into 128-bit registers (actually a couple of them).

Every no-op in the VLIW of thread 1 is replaced with the appropriate instruction from thread 2 (easy to do with vector operations), and the instructions used from thread 2 are set to no-op. (If thread 2's VLIW is then all no-ops, its next instruction is loaded for the next iteration of the pipeline, and for this iteration a flag bit is set.)
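The merge step can be sketched roughly as follows. This is only an illustration in software of what would be a few mask-and-select operations in hardware; the name `merge_bundles` is made up, and I'm assuming "appropriate instruction" means the instruction in the same slot position of thread 2's VLIW.

```python
NOP = 0x00000000  # assumed no-op encoding
SLOTS = 4         # four 32-bit instructions per 128-bit VLIW

def merge_bundles(primary, secondary):
    """Fill the primary thread's no-op slots from the secondary thread.

    Returns (merged, leftover, flag):
      merged   - VLIW to issue this iteration
      leftover - secondary VLIW with the borrowed slots set to no-op
      flag     - True when leftover is all no-ops, i.e. the secondary
                 thread's next VLIW can be loaded for the next iteration
    """
    assert len(primary) == SLOTS and len(secondary) == SLOTS
    merged, leftover = [], []
    for p, s in zip(primary, secondary):
        if p == NOP:
            merged.append(s)      # borrow the same-position slot (assumption)
            leftover.append(NOP)  # mark it as used
        else:
            merged.append(p)
            leftover.append(s)    # still pending for a later iteration
    flag = all(s == NOP for s in leftover)
    return merged, leftover, flag
```

For example, merging `[0x11, 0, 0x13, 0]` with `[0x21, 0x22, 0, 0x24]` issues `[0x11, 0x22, 0x13, 0x24]` and leaves `[0x21, 0, 0, 0]` behind for thread 2, so the flag stays clear.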

The new VLIW is executed. If an instruction that came from thread 2 cannot be executed (i.e. the VLIW from thread 1 took all of the functional units needed for that instruction), it waits for the rest of thread 2's VLIW to be executed, catching up in the appropriate stage of the pipeline. This would require room for saving the instruction for each functional unit or group of functional units.

If an exception has occurred for thread 1, or for thread 2 when the flag bit is set, then that thread's instructions already in the pipeline do not write back (they need to be flushed) and the exception is handled.

If the flag bit is not set, then writeback does not occur for thread 2 until the whole VLIW completes, at which time exceptions can be examined and handled.
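My reading of the two writeback rules above, written out as a small decision function for thread 2. The function name and the three outcome labels are invented for this sketch, and the "writeback" result for the flag-set, no-exception case is my inference rather than something stated explicitly.

```python
def thread2_action(flag_set, exception, vliw_complete):
    """Decide what happens to thread 2's in-flight results.

    flag_set      - thread 2's whole VLIW was absorbed this iteration
    exception     - an exception was raised by a thread 2 instruction
    vliw_complete - every instruction of thread 2's VLIW has executed
    """
    if flag_set:
        # Treated like thread 1: flush on exception, otherwise commit.
        return "flush_and_handle" if exception else "writeback"
    if not vliw_complete:
        # Part of thread 2's VLIW is still pending: hold results back.
        return "hold"
    # Whole VLIW done: exceptions can now be examined and handled.
    return "flush_and_handle" if exception else "writeback"
```

The point of holding results when the flag is clear is that a partially executed VLIW never becomes architecturally visible, so an exception in a later slot can still discard the whole thing.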


The 1s should be replaced with 2s and vice versa for alternating iterations of the pipeline.

This should increase IPC somewhat, especially if the two threads target different functional units. The overhead (in terms of latency) also isn't that high, and throughput shouldn't be affected at all.

 