
Anand Article: Intel's Hyper-Threading Technology: Free Performance?

It's been up for a couple of hours. I've been waiting for someone to comment. Do you have anything interesting to say about it?
 
pretty interesting..

too bad it doesn't solve the normal user's power-hungry needs.. we simply need more pipelines :-( it would be nice if they could make a 'generic' pipeline or something like that which would do both integer and FPU operations..

anywho, do you really need multiple threads in order to fill up multiple integer pipelines? ie, in his last example, which had two integer pipelines, with HyperThreading allowing one thread to use one and the other thread to use the second one.. wouldn't that happen anyway without hyperthreading? or does one thread HAVE to be run in order (from top to bottom so to speak)? I'm guessing that's where a lot of code optimization takes place anyway?

how many pipelines does the Athlon have again? I remember the number 3, but don't remember what it refers to.
 


<< too bad it doesn't solve the normal user's power-hungry needs.. we simply need more pipelines >>

On the contrary, adding more would achieve diminishing returns at a large hit to die area with x86 code. Because of the fewer logical registers and two-operand instruction format, x86's optimal superscalar width is 3, compared to RISC's 4....beyond that, there is little added instruction-level parallelism.



<< how many pipelines does the Athlon have again? I remember the number 3, but don't remember what it refers to. >>


The Athlon and P4 are both 3-way issue superscalar, ie they can fetch up to 3 uops (the "RISC-like" ops to which x86 instructions are decoded) per cycle into their reorder buffers. Optimally, you would want 3 copies of each execution unit so that there are no resource conflicts between out-of-order issued instructions. The Athlon has 3 integer units, 3 FP units (sort of...they fulfill different roles: FP add, FP multiply, and FP load/store), and 3 in-order load/store units. The P4 has 3 integer units (2 "double-pumped", one for slower ops), 2 FP units (FP add/mult, FP load/store), and 2 out-of-order load/store units. The Athlon can then issue 9 uops/cycle from its reorder buffers to any of its execution units, and can retire 3 uops/cycle in-order. The P4 can issue 6 uops/cycle from its reorder buffers to any of its execution units, and can retire 3 uops/cycle in-order.



<< anywho, do you really need multiple threads in order to fill up multiple integer pipelines? ie, in his last example, which had two integer pipelines, with HyperThreading allowing one thread to use one and the other thread to use the second one.. wouldn't that happen anyway without hyperthreading? or does one thread HAVE to be run in order (from top to bottom so to speak)? I'm guessing that's where a lot of code optimization takes place anyway? >>

Yes, SMT is necessary to use a single core with multiple threads simultaneously...in conventional superscalar, a single thread is executed at a time, and the scheduling of threads on the processor is left up to the operating system. In an out-of-order superscalar CPU, instructions from a single thread are fetched into reorder buffers, out of which they are issued to execution units. The order in which instructions are executed may be different from the order in the program code, provided that execution conflicts and data & conditional dependencies are checked. After execution, the instructions can be retired in-order (common to x86 CPUs) or out-of-order (common to high-end RISC).

How threads are scheduled on the CPU depends on the operating system....the scheduling algorithm is tailored to the type of computer the OS is designed for. When a new thread is scheduled, a context switch into the OS code is performed, and the OS' thread switch code is executed, which saves the state of the thread (the program counter, the general purpose registers, the stack, address registers, segment registers, etc....). A new thread is scheduled, its state variables are loaded, and execution begins where that thread left off.
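For anyone who wants the gist of that save/restore dance in code, here's a minimal sketch. The field names and CPU-state layout are made up for illustration; no real OS stores its context like this.

```python
# Minimal sketch of a thread switch with an invented CPU state layout.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    program_counter: int = 0
    registers: list = field(default_factory=lambda: [0] * 8)  # x86 has 8 GPRs
    stack_pointer: int = 0

def save_context(thread: ThreadContext, cpu: dict) -> None:
    """Save the running thread's state so it can resume later."""
    thread.program_counter = cpu["pc"]
    thread.registers = list(cpu["regs"])
    thread.stack_pointer = cpu["sp"]

def load_context(thread: ThreadContext, cpu: dict) -> None:
    """Load the next thread's saved state; it resumes where it left off."""
    cpu["pc"] = thread.program_counter
    cpu["regs"] = list(thread.registers)
    cpu["sp"] = thread.stack_pointer
```

A real context switch also deals with segment registers, FP state, and privilege transitions, but the save-then-load structure is the same.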

Here's a shameless plug for BurntKooshie's excellent article on the fundamentals of multithreading for anyone who is interested.
 
The Athlon and P4 are both 3-way issue superscalar, ie they can fetch up to 3 uops (the "RISC-like" ops to which x86 instructions are decoded) per cycle into their reorder buffers.

what is Superscalar??? they can fetch 3 micro-ops per cycle into their reorder buffers? ok, what is a reorder buffer 🙂

so basically it can take 3 micro-ops (does that mean it can decode x86 as fast?) and put them into a buffer that re-orders them (I guess this is where a branch prediction unit comes into play?) so that if one micro-op depends on the results from another micro-op, it will not be sent until the result is returned? I'm guessing here..

anywho, so basically the CPU can feed 3 out of 9 pipelines, yet without SMT it can only feed 1 out of 9 pipes per clock cycle?? this is most definitely wrong.. can you clarify? what is the point of 9 pipelines ( 3 integer, 3 FPU, and 3 load/store )?

Edit: btw according to that link you gave me, the processor CAN send more than one instruction down a pipe (the link refers to it as a unit) in a clock cycle, it just has to be from the same thread so to speak.

Edit # 2: For a long time, the secret to more performance was to execute more instructions per cycle, otherwise known as Instruction Level Parallelism (ILP), or decreasing the latency of instructions. To execute more instructions each cycle, more functional units (integer, floating point, load/store units, etc) have to be added on. In order to more consistently execute more instructions, a processing paradigm called out-of-order processing (OOP) can be used, and it has in fact become mainstream (notable exceptions are the UltraSparc and IA-64).

This paradigm arose because many instructions are dependent upon the outcome of other instructions, which have already been sent into the processing pipeline. To help alleviate this problem, a larger number of instructions are stored so as to be ready to execute immediately. The purpose is to find more instructions, which are not dependent upon each other. This area of storage is called the reorder buffer.
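The quoted paragraph can be sketched in a few lines: a toy reorder buffer that issues any instruction whose sources aren't produced by an older, still-pending instruction. This is a deliberate simplification that ignores register renaming and execution-resource limits.

```python
# Toy reorder buffer: an entry is (destination register, source registers).
# An instruction is "ready" if none of its sources come from an older,
# still-pending instruction in the buffer.
def ready_to_issue(buffer):
    pending_writes = set()
    ready = []
    for dest, sources in buffer:
        if not any(src in pending_writes for src in sources):
            ready.append((dest, sources))
        pending_writes.add(dest)
    return ready

buffer = [
    ("a", ["b", "c"]),  # a = b + c
    ("d", ["e", "a"]),  # d = e + a -- needs a, must wait
    ("f", ["g", "h"]),  # f = g + h -- independent, can go out of order
]
ready = ready_to_issue(buffer)
```

Here the third instruction issues ahead of the second, which is exactly the "find instructions which are not dependent upon each other" idea.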


woohoo I was somewhat correct.. I think.

anywho, the re-order buffer is used to pick out instructions (or micro ops) that are NOT dependent on each other, but this doesn't mention anything about the other instructions that depend on results.. is this trying to say that it looks for the 'first' instruction in a series of instructions (ie, the one that another micro-op or a series of micro-ops requires to be completed in order to begin their computations)?

The MAJC architecture from Sun Microsystems makes use of CMP. It allows one to four processors to share the same die, and for each to run separate threads. Each processor is limited to 4 functional units (each of which is able to execute both integer and floating point operations, making the MAJC architecture more flexible).

that's certainly interesting, esp. the part about the units being either Integer or FPU..

so CMP is the way that AMD is going with its Hammer series? I know they're going for SMP for sure (with HyperTransport links to each processor in order to relieve the memory bus of their communications like updates I'm guessing)..

There are problems with CMP, however. The traditional CMP chip sacrifices single-thread performance in order to expedite the completion of two or more threads. In this way, a CMP chip is comparatively less flexible for general use, because if there is only one thread, an entire half of the allotted resources are idle, and completely useless (just as adding another processor in while using a singly threaded program is useless in a traditional SMP system).

now wait a minute, it seems to me that there're always a couple of threads running anyway, so how often does this occur? I mean, say I was an average joe user, I didn't have Seti@Home or RC5 on the computer, but rather all the millions of 'neat' little programs that I found on the internet in my startup folder.. if my OS was capable of assigning different threads to different processors (SMP or CMP), would they ever really be doing nothing??
 
On the contrary, adding more would achieve diminishing returns at a large hit to die area with x86 code. Because of the fewer logical registers and two-operand instruction format, x86's optimal superscalar width is 3, compared to RISC's 4....beyond that, there is little added instruction-level parallelism.

according to AnandTech a lot of the time we (the gamer, or the typical user) are simply running mostly integer, OR floating point intensive apps.

that means, if you do SMT on, say, an Athlon while you're in Windows, you'd probably end up using, say, one floating-point unit and 2 or 3 integer units. the floating-point unit would be Winamp (of course 🙂 ) and the 2 or 3 integer units would be maybe a browser, or whatever other apps you're running.

there'd still be waste, just not as much, which is why pipelines that are sort of generic (integer OR floating point) would make MORE sense in an SMT processor.

btw, say I was running RC5 on an SMT machine.. wouldn't that slow something like RC5 down overall?

interestingly enough, this is the exact reason why it is impossible to achieve a 'full load' on today's desktop CPUs.
 


<< what is Superscalar??? they can fetch 3 micro-ops per cycle into their reorder buffers? ok, what is a reorder buffer 🙂 >>

Superscalar is just the method of exploiting instruction-level parallelism in code by executing multiple instructions at a time. Modern out-of-order superscalar processors can execute instructions out of the order that they are in the code...After instructions are fetched and decoded, they are put into reorder buffers, or "queues", out of which they can be issued out of order to the appropriate execution unit. There are three types of dependencies which prevent instructions from being executed out-of-order, in which case the sequential instructions must be executed in the order in which they are presented in the code:

Read-after-write:
a = b + c
d = e + a

Write-after-read:
d = e + a
a = b + c

Write-after-write:
c = a + b
c = d + e

As you may or may not know, CPUs most commonly perform arithmetic on registers, which are temporary locations of fast storage on the CPU, located in the register file and whose number is specified by the instruction set. RISC CPUs typically have 32 logical registers, vs. 8 for x86. The RISC philosophy states that memory contents must be loaded into registers before the arithmetic expression is evaluated, while x86 can perform arithmetic using registers and/or memory locations (not necessarily a good thing). With register renaming, on the other hand, there exists a larger number of physical registers, and the logical registers are mapped onto these (each left-hand side use of a logical register in an arithmetic expression gets a new physical register mapping). With register renaming, write-after-read and write-after-write dependencies are nullified, but read-after-write dependencies must still be followed.
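A tiny sketch makes the renaming trick concrete. This is a simplified model (no free-list recycling, no commit stage): every write to a logical register just grabs a fresh physical register, which kills the WAR and WAW hazards from the examples above while leaving RAW intact.

```python
# Sketch of register renaming: each write to a logical register gets a
# fresh physical register. WAR and WAW hazards disappear; RAW remains,
# because a source still reads the mapping left by the producing write.
def rename(instructions):
    logical = sorted({r for dest, srcs in instructions for r in [dest] + srcs})
    mapping = {r: f"p{i}" for i, r in enumerate(logical)}
    next_phys = len(logical)
    renamed = []
    for dest, srcs in instructions:
        phys_srcs = [mapping[s] for s in srcs]  # read current mappings (RAW kept)
        mapping[dest] = f"p{next_phys}"         # fresh register for the write
        next_phys += 1
        renamed.append((mapping[dest], phys_srcs))
    return renamed

# The write-after-write example from above: both writes to c land in
# different physical registers, so they no longer conflict.
waw = [("c", ["a", "b"]),   # c = a + b
       ("c", ["d", "e"])]   # c = d + e
renamed = rename(waw)
```

After renaming, the two formerly conflicting writes can execute in either order, since later readers of `c` are told which physical register to look at.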



<< I guess this is where a branch prediction unit comes into play? >>

Depends on the architecture....branch prediction and resolution is typically decoupled from the normal execution logic. On the P4, it occurs during a trace segment build.



<< so basically it can take 3 micro-ops (does that mean it can decode x86 as fast?) >>

This depends on the microarchitecture. Modern x86 processors with decoupled execution decode the more complex x86 instructions into the RISC-like micro-ops. In the case where an arithmetic instruction uses both registers and memory locations, it gets decoded into a load instruction and an arithmetic instruction using only registers. As a result, I believe the average x86 instruction gets decoded into 1.5 uops. The Athlon, with the traditional instruction cache, has 3 parallel decoders (taking two pipeline stages to complete); thus it can fetch and decode 3 x86 instructions/cycle. The P4 has a single decoder, but it is only used when it builds a trace cache segment (decoded uops are reused in the trace cache)...in normal operation, the P4 can fetch 3 uops out of the trace cache per cycle. The problem is that x86 decoding is very complex....it has to do the x86 -> uop decoding, and x86 instructions are variable length from 1 to 15 bytes (vs. fixed at 4 or 8 bytes for 32-bit or 64-bit RISC), which makes the program counter updating more complex. As a result, x86 decoders are large, hot, and slow...the large amount of logic probably has very high fan-in and fan-out, as well as long wire lengths, and is likely a major source of the critical path length. Thus, while the Athlon's 3 decoders are great for maintaining throughput, they can't be great for heat, die area, and clock speed.
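A rough sketch of that reg/mem split, purely for illustration: these are not Intel's or AMD's actual decode rules, and the `tmp` register is a hypothetical internal name.

```python
# Illustrative split of a reg/mem x86 instruction into two uops.
# "tmp" stands in for a hidden internal register; real decoders are
# far more involved than this.
def decode(instr):
    op, dest, src = instr
    if src.startswith("["):                # memory source operand
        return [("load", "tmp", src),      # load the memory value first
                (op, dest, "tmp")]         # then register-only arithmetic
    return [(op, dest, src)]               # register-only: a single uop

mem_form = decode(("add", "eax", "[esi]"))  # splits into two uops
reg_form = decode(("add", "eax", "ebx"))    # stays one uop
```

With roughly half of instructions needing the extra load uop, you land near the 1.5 uops per x86 instruction average mentioned above.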



<< anywho, so basically the CPU can feed 3 out of 9 pipelines, yet without SMT it can only feed 1 out of 9 pipes per clock cycle >>

No, the point is that with SMT, all the execution units can be shared among multiple threads. How this is accomplished again depends on the microarchitecture. At the basic level, the hardware changes that are necessary are multiple program counters (to keep track of the instruction address of each thread) and some way of identifying each instruction with each thread. The beauty of register renaming is that multiple physical register sets aren't necessary; the same register set can map logical registers for multiple threads.

How the CPU maintains issue and retire throughput for SMT is up to the architecture. The now defunct Alpha EV8 was going to be a 4-thread SMT, 8-way fetch/issue superscalar CPU. It was designed from the ground-up to support all four threads at once....compared to the EV7, the number of integer units was to be increased from 4 to 6, and the number of floating-point units from 2 to 4. The register files and caches had enough ports to handle the fetching and retiring of 4 active threads; it was estimated that for a single thread, the 8-way superscalar EV8 would have an average IPC (instructions/cycle) of between 2 - 2.5, so the 8-way CPU could just about accommodate the 8-10 instructions/cycle being fetched, executed, and retired for 4 threads.

The P4's SMT is initially going to support 2 threads, which is probably ideal. A study once found that for P3-era code, x86 CPUs typically achieve an IPC of around 1 - 1.2....since there's around 1.5 uops/x86 instruction, that's 1.5 - 1.8 uops/cycle on average. Typical integer code is comprised of roughly 50% load/store instructions, 35 - 40% arithmetic instructions, and 10-15% branch instructions. With the P4's two load/store units and two fast integer units, resource conflicts between the two active threads shouldn't be a problem. Floating-point code should cause more difficulty, since there's only one FP add/mult unit and one FP store unit that must be shared between two threads.
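The arithmetic behind that estimate, spelled out with the assumed figures quoted above (the IPC midpoint is my own choice for illustration):

```python
# Back-of-the-envelope demand estimate for two SMT threads on the P4,
# using the figures quoted above. All numbers are rough assumptions.
uops_per_x86 = 1.5      # average uops per x86 instruction
ipc_per_thread = 1.1    # midpoint of the 1.0 - 1.2 IPC estimate

demand = 2 * ipc_per_thread * uops_per_x86  # two threads' combined uops/cycle
# ~3.3 uops/cycle of demand vs. the P4's 6-uop issue width, so issue
# bandwidth shouldn't be the bottleneck for two typical integer threads.
```

The same arithmetic shows why FP code is tighter: two threads contending for a single FP add/mult unit leaves much less headroom than the integer side's duplicated units.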
 


<< Yes, SMT is necessary to use a single core with multiple threads simultaneously...in conventional superscalar, a single thread is executed at a time, and the scheduling of threads on the processor is left up to the operating system. >>



I have no idea what you just said but ok 🙂.
 


<< I have no idea what you just said but ok 🙂. >>



Interesting nonetheless, right? 🙂 Thanks for your time Sohcan. 🙂
 
<<Because of the fewer logical registers and two-operand instruction format, x86's optimal superscalar width is 3, compared to RISC's 4....beyond that, there is little added instruction-level parallelism.>>

Is it "optimal" because of an x86 limitation or because of traditional programming... i.e. if the programming industry decided parallelism was the norm, could they overcome this limit?

<<The Athlon can then issue 9 uops/cycle from its reorder buffers to any of its execution units, and can retire 3 uops/cycle in-order. The P4 can issue 6 uops/cycle from its reorder buffers to any of its execution units, and can retire 3 uops/cycle in-order.>>

Is the "retire 3 uops/cycle in-order" written in stone, or could some arbitrary number (i.e. retire x uops/cycle in-order, where x<>3) be chosen that was higher than three? I would guess if it's "3-way issue superscalar" then that is because it loads/retires instructions in multiples of threes. Just a guess. 😉

Not that this is related, but I wonder how Transmeta's design would benefit from SMT, with its code morphing. Something tells me that this design would be able to scale better than typical x86 designs using traditional SMP and SMT. Sohcan, you ever studied the Transmeta work?
 


<< Is it "optimal" because of an x86 limitation or because of traditional programming... i.e. if the programming industry decided parallelism was the norm, could they overcome this limit? >>

Code definitely is a factor...though note that compiler use of instruction-level parallelism (ILP) has been taking place since the Pentium era, it's thread-level parallelism (multithreaded programming) that is just starting to break into consumer programming. There are a number of established optimizations that can be performed by the compiler to improve instruction parallelism; some architecture independent (local-lookup optimizations, copy-propagation, moving loop-invariants), some architecture dependent (strength reduction, taking advantage of architectural quirks). There are obviously lots of situations where code must be executed sequentially...a load might have to be performed before an arithmetic operation, or a number of arithmetic operations might have to be performed sequentially for pointer following.

Multithreaded programming is much more difficult to conceptualize, since breaking a program into multiple threads and ensuring synchronization causes all sorts of headaches. Programming language design and libraries help a lot for multithreaded programming....Java was built with multithreading in mind and makes it easy to implement; too bad the language is inherently slow. 🙂

Aside from programming, the architecture and instruction set have a big impact on ILP. x86's big disadvantage is its lack of logical registers: 8 vs. traditional RISC's 32. This forces x86 processors to rely on memory more, and prevents them from using some cool compiler optimizations (loop unrolling is much more difficult). Since the memory hierarchy (caches + main memory) is slower than using registers, this limits the amount of parallelism that can be extracted from the code by a superscalar processor. Out-of-order execution helps, since non-dependent instructions in the buffer window can be executed out-of-order if an arithmetic instruction stalls because of a load. Faster average memory access times (due to faster caches and better cache hit-rates, an architectural issue) thus help ILP. Any number of other architectural issues affect ILP: execution resources, fetch and retire rate (to a certain degree), branch misprediction rate and penalty (mispredicted branches waste resources).
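To make the loop-unrolling point concrete, here's the classic transformation in miniature. Python has no registers, so this only shows the shape of the optimization: the unrolled version exposes four independent dependency chains a superscalar core can overlap, and each chain needs its own register, which is why x86's 8 logical registers limit how far a compiler can unroll.

```python
# Dot product, rolled vs. unrolled by 4. The four accumulators in the
# unrolled version are independent of each other, so their additions
# can proceed in parallel instead of forming one long serial chain.
def dot(a, b):
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]    # one long serial dependency chain
    return total

def dot_unrolled(a, b):
    s0 = s1 = s2 = s3 = 0       # four independent accumulators
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    for i in range(n, len(a)):  # leftover elements
        s0 += a[i] * b[i]
    return s0 + s1 + s2 + s3
```

A compiler doing this on x86 would need registers for the four sums plus the loop bookkeeping, which eats most of the 8 available; a 32-register RISC has far more room to keep unrolling.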



<< Is the "retire 3 uops/cycle in-order" written in stone, or could some arbitrary number (i.e. retire x uops/cycle in-order, where x<>3) be chosen that was higher than three? >>

I guess it could be designed to retire any number of instructions/cycle, though I don't think I've ever seen a processor that can retire more instructions/cycle than it can fetch.



<< Not that this is related, but I wonder how Transmeta's design would benefit from SMT, with its code morphism. Something tells me that his design would be able to scale better than typical x86 designs using traditional SMP and SMT >>

I don't know, I've never thought about that. Crusoe is a statically scheduled (in-order, determined by the code-morphing software) VLIW processor, so I don't know if it has register renaming. That could be a design hurdle for SMT, since Crusoe's aim is to be small and low-powered. Its implementation would probably depend on how well its execution resources are used, and if its fetch & execution width could accommodate SMT without a drastic increase in die-area.
 