HW2050Plus
Member
- Jan 12, 2011
- 168
- 0
- 0
ftp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf (Page 7)
In other words, you are exactly wrong.
It's especially funny because the very passage you quoted proves I am correct:
You should also have had a look at Figure 3.

A second goal was to ensure that when one logical processor is stalled the other logical processor could continue to make forward progress. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions. Independent forward progress was ensured by managing buffering queues such that no logical processor can use all the entries when two active software threads [2] were executing. This is accomplished by either partitioning or limiting the number of active entries each thread can have.
[2] Active software threads include the operating system idle loop because it runs a sequence of code that continuously checks the work queue(s). The operating system idle loop can consume considerable execution resources.
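To make the partitioning concrete, here is a minimal Python sketch (my own illustration, not Intel's actual hardware logic; the entry count of 32 and the fixed 50/50 split are assumptions): a buffering queue statically split between two logical processors, so one thread can never occupy all entries and starve the other.

# Minimal sketch of a statically partitioned buffering queue: neither
# logical processor can allocate beyond its own share of the entries.
class PartitionedQueue:
    def __init__(self, total_entries=32, threads=2):
        # Each logical processor gets a fixed share of the entries.
        self.limit = total_entries // threads
        self.used = [0] * threads

    def try_allocate(self, thread_id):
        """Allocate one entry for thread_id; fail if its share is full."""
        if self.used[thread_id] >= self.limit:
            return False               # this thread stalls here, but it
        self.used[thread_id] += 1      # cannot steal the other's entries
        return True

    def release(self, thread_id):
        self.used[thread_id] -= 1

q = PartitionedQueue()
# Even if thread 0 allocates until it stalls, thread 1 still progresses:
while q.try_allocate(0):
    pass
assert q.try_allocate(1)  # thread 1 still gets its own entries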
Obviously you understand neither this text nor my posts, but at least look at the figures (3 and 4), which are more instructive.
To explain the meaning of what you highlighted:
This means that the current results of the pipeline stages are stored in the mentioned buffering queues, so that when a thread switch occurs those buffered results can be reused to minimize the switching penalty (otherwise the slow/waiting thread would have to run through the whole pipeline again before it could continue). See also Figure 4 of the document to better understand the description you highlighted:

Independent forward progress was ensured by managing buffering queues such that no logical processor can use all the entries when two active software threads [2] were executing.
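And a toy Python model of the switching penalty itself (my own simplification, not from the paper; the pipeline depth of 20 stages and the number of switches are assumed values):

# Toy model: if in-flight results are kept in the buffering queues, the
# resumed thread continues at once; without them the whole pipeline
# would have to refill on every switch back.
PIPELINE_DEPTH = 20   # assumed depth in stages
SWITCHES = 1000       # assumed number of thread switches in the run

def cycles_lost(buffered: bool) -> int:
    # Each switch costs a full pipeline refill only if nothing was buffered.
    per_switch = 0 if buffered else PIPELINE_DEPTH
    return SWITCHES * per_switch

print(cycles_lost(True))   # 0 cycles lost to switching
print(cycles_lost(False))  # 20000 cycles lost to pipeline refills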
I'll make another attempt, this time from the perspective of two programs running in two threads:
Intel Hyperthreading:
priority/fast thread execution - slow thread execution
(priority/fast because it is the one which is allowed to execute)
(slow because it is the one which has to wait)
add rdx,r9 - waiting
add rdx,rax - waiting
imul rdx,rax - waiting
mov rax,rbx - waiting
mov rdx,r10 - waiting
shl rbx,3 - waiting
add rdx,r9 - waiting
add rdx,rax - waiting
imul rdx,rax - waiting
mov rax,rbx - waiting
mov rdx,r10 - waiting
shl rbx,3 - waiting
add rdx,r9 - waiting
add rdx,rax - waiting
imul rdx,rax - waiting
mov rax,rbx - waiting
mov rdx,r10 - waiting
shl rbx,3 - waiting
;Theoretically you could continue this endlessly, with the result that the
;fast/priority thread always executes and the slow thread never does
;(it is always waiting)
;but in real code something like the following occurs:
mov rax, qword ptr [xxxxxxxx] - waiting
* see below - now left is slow and right is priority
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rdx,r10
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rdx,r10
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rax, qword ptr [xxxxxxxx]
stall (L1 cache miss) - stall (L1 cache miss)
stall (L1 cache miss) - stall (L1 cache miss)
* see below - now again left is fast and right is slow
add rdx,r9 - waiting (+ stalling)
add rdx,rax - waiting (+ stalling)
imul rdx,rax - waiting (+ stalling)
mov rax,rbx - waiting (+ stalling)
mov rdx,r10 - waiting (+ stalling)
shl rbx,3 - waiting (+ stalling)
add rdx,r9 - waiting (+ stalling)
add rdx,rax - waiting (+ stalling)
imul rdx,rax - waiting (+ stalling)
mov rax,rbx - waiting
mov rdx,r10 - waiting
shl rbx,3 - waiting
add rdx,r9 - waiting
add rdx,rax - waiting
imul rdx,rax - waiting
mov rax,rbx - waiting
mov rdx,r10 - waiting
shl rbx,3 - waiting
As you can see, the two threads NEVER execute at the same time! That is simply not possible by design. And just count the instructions executed by the fast/priority thread versus the slow thread; then you know why I call them fast and slow.
I think the example above is quite good, because it also shows why, and from what, you get a performance benefit from Hyperthreading.
* However, as I already described, the priority in current Intel HT implementations alternates after each switch; that is why for most workloads the two threads appear to run at roughly the same speed. But this is just the fast/slow-thread behavior statistically averaging out.
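For the doubters, here is a toy Python model of exactly this argument (my own sketch, not a real pipeline; the 10% stall probability is an assumption): exactly one thread issues per cycle, and priority flips on every stall. Over a long run the per-thread counts come out near 50/50, even though in every single cycle only one thread executed.

import random

random.seed(0)
STALL_PROB = 0.1        # assumed chance that an instruction misses/stalls
executed = [0, 0]       # instructions executed per thread
prio = 0                # the thread currently allowed to execute

for _ in range(100_000):
    executed[prio] += 1
    if random.random() < STALL_PROB:
        prio = 1 - prio  # on a stall, the other thread becomes "fast"

# Roughly 50/50 over the whole run, yet never two threads in one cycle.
print(executed)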
In comparison, here is real symmetric multithreading (as in the UltraSPARC T1):
thread 1 - thread 2
add rdx,r9 - waiting
waiting - add rdx,r9
add rdx,rax - waiting
waiting - mov rdx,r9
imul rdx,rax - waiting
waiting - shl rdx,3
There (e.g. in the UltraSPARC T1) you do not have a priority/fast thread and a slow thread; all threads are equal.
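A minimal Python sketch of that T1-style scheme (my own toy model with made-up instruction lists): the issue slot rotates in strict round robin, so each thread gets exactly 1/N of the cycles by construction, not just on average.

# Fine-grained round robin: the issue slot rotates every cycle.
def round_robin_trace(programs, cycles):
    trace = []
    for cycle in range(cycles):
        tid = cycle % len(programs)                 # strict rotation
        insn = programs[tid][cycle // len(programs)]
        trace.append((cycle, tid, insn))
    return trace

programs = [["add rdx,r9", "add rdx,rax", "imul rdx,rax"],
            ["add rdx,r9", "mov rdx,r9", "shl rdx,3"]]
for cycle, tid, insn in round_robin_trace(programs, 6):
    print(f"cycle {cycle}: thread {tid} issues {insn}")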
Compare that with AMD Clustered Multithreading (CMT):
thread 1 - thread 2
add rdx,r9 - imul rbx,rax
add rdx,rax - add rbx,rcx
imul rdx,rax - xor rdx,rdx
mov rax,rbx - mov rax,rbx
mov rdx,r10 - mov rdx,r10
shl rbx,3 - shl rbx,3
add rdx,r9 - add rdx,r9
add rdx,rax - add rdx,rax
imul rdx,rax - imul rdx,rax
mov rax,rbx - mov rax,rbx
mov rdx,r10 - mov rdx,r10
shl rbx,3 - shl rbx,3
add rdx,r9 - add rdx,r9
add rdx,rax - add rdx,rax
imul rdx,rax - imul rdx,rax
mov rax,rbx - mov rax,rbx
mov rdx,r10 - mov rdx,r10
shl rbx,3 - shl rbx,3
mov eax, dword ptr [xxxxxxxx] - shl rbx,3
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rdx,r10
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rdx,r10
stall (L1 cache miss) - add rdx,r9
stall (L1 cache miss) - mov rax, qword ptr [xxxxxxxx]
stall (L1 cache miss) - stall (L1 cache miss)
stall (L1 cache miss) - stall (L1 cache miss)
add rdx,r9 - stall (L1 cache miss)
add rdx,rax - stall (L1 cache miss)
imul rdx,rax - stall (L1 cache miss)
mov rax,rbx - stall (L1 cache miss)
mov rdx,r10 - stall (L1 cache miss)
shl rbx,3 - shl rbx,3
add rdx,r9 - add rdx,r9
add rdx,rax - add rdx,rax
imul rdx,rax - imul rdx,rax
mov rax,rbx - mov rax,rbx
mov rdx,r10 - mov rdx,r10
shl rbx,3 - shl rbx,3
add rdx,r9 - add rdx,r9
add rdx,rax - add rdx,rax
imul rdx,rax - imul rdx,rax
mov rax,rbx - mov rax,rbx
mov rdx,r10 - mov rdx,r10
shl rbx,3 - shl rbx,3
As you can see, with CMT both threads can execute at the same time, and no thread is ever in a waiting state. That is why it is so fast, and why it is so close to a real core that AMD simply renamed the threads to cores.
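Once more a toy Python model of the CMT idea (my own sketch; the instruction lists and the 2-cycle stall are made up): each thread issues on its own cluster every cycle, and a stall only idles that thread's own cluster, never the other one.

# Each thread owns its own integer cluster; both issue in the same cycle.
def cmt_step(threads):
    """One cycle: every non-stalled thread issues on its own cluster."""
    issued = []
    for tid, st in enumerate(threads):
        if st["stall"] > 0:
            st["stall"] -= 1                 # only this cluster idles
            issued.append((tid, "stall (cache miss)"))
        else:
            insn = st["code"][st["pc"] % len(st["code"])]
            st["pc"] += 1
            issued.append((tid, insn))
    return issued

threads = [
    {"code": ["add rdx,r9", "imul rdx,rax", "mov rdx,r10"], "pc": 0, "stall": 0},
    {"code": ["shl rbx,3", "add rdx,rax", "mov rax,rbx"], "pc": 0, "stall": 2},
]
for cycle in range(5):
    print(cycle, cmt_step(threads))  # both columns advance every cycle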
Now you should know everything you need to know about HT and CMT (and even symmetric MT).
"It sounds like you've never used a Hyperthreading processor; regardless of time frame, each thread gets roughly half and there is no fast thread or slow thread."
See above and/or my previous posts; there you also find the explanation of why, from a macro view, it appears that each thread gets roughly half. I also showed above a technique, symmetric MT, which ensures that they get exactly half (not just roughly, and not only averaged over a long period).
"The thing that some keep missing is that HT is -5 to 30%, averaging around 20% of 'X' ..."
This 20% is your own estimate, but let's assume it.
"... while CMT is 80% of 'Y'."
This 80% is likewise only an estimate (from AMD), but let's assume it too.
"What if 1.2X > 1.8Y?"
Then the CPU with 1.2X would be faster.
"1.2X Westmere already beats 2.0Y of Magny Cours."
You are more clever than this. No mention that Magny-Cours is an especially handicapped part with low clocks, glued together from two dies. And besides, your statement is simply wrong:
SPECint_rate2006:
Magny-Cours, 24 cores: 392
Intel, 24 cores: 548
That is 548/392 ≈ 1.40, i.e. about 40% faster, and that already includes the multi-die handicap for AMD. Without that handicap the difference would be even smaller.
