Pulled from Ace's Hardware; WaltC is the author.
"> Last, I want to briefly address the topic in my heading--multithreaded
> software. Recall that multithreaded software long predates
> in terms of concept and execution the HT P4 (as the P4 itself
> long predates the HT P4.) Its best use to date has been,
> of course, in dual and multi-cpu systems which have the literal
> capacity to execute multithreads simultaneously. But I can
> dimly recall a few pieces of multithreaded software running
> on my old and ancient PII systems--dimly, at least that far
> back...
>
> P4 HT simply won't, because it cannot, do SMT like SMP systems
> can do SMT. Despite the marketing aphorisms, P4 HT has several
> bugaboos which are P4-specific. It's in reality of course
> a single core cpu, and what it does with multithreaded software
> through HT is to essentially accelerate the multitasking
> (time share) of multiple threads in the single core, thereby
> increasing per-clock work efficiency of the P4 above what
> the P4 normally achieves when running a single thread at the
> same clock speed. At no time does the HT P4 ever actually
> run more than a single thread at a time, but since the
> HT circuitry increases the P4's per-clock working efficiency
> in the presence of multithreaded software, the performance
> within the multithreaded application improves as the internal
> per-clock thread multitasking performance of the cpu improves.
> The extra theoretical per-clock efficiency of the HT circuitry
> is of no use for single threaded software, since there are
> no multiple threads within the software for the HT cpu circuitry
> to (basically) "accelerate the multitasking of those threads
> per clock." I liken the distinction to the HT circuitry being
> piggybacked onto the normal P4, as opposed to the HT circuitry
> being basic to the P4's architecture in some fundamental
> fashion. The BIOS control for HT on/off, a very clumsy way
> to handle the condition, imo, is probably the only possible
> way to have a single-core cpu effectively fool the OS into
> seeing a second cpu where none exists in order to accelerate
> per-clock multithreaded performance. (There are other considerations
> which I consider peripheral to the discussion here that P4
> HT circuitry enablement creates, such as higher power consumption
> and heat dissipation demands, etc. Also, don't confuse a
> "logical" cpu as reported by the OS with a real physical
> additional cpu, any more than you would confuse a physical
> hard drive in your system with a logical drive partition
> placed on it.)
>
> Here we can understand the crux of the most serious of the
> P4's HT problems: inconsistency and unpredictability. Because
> we are not dealing with two separate cpu cores but rather
> with a single core which by definition is incapable of executing
> more than a single thread at a time, HT performance increases
> when running multithreaded software are wildly inconsistent
> to the point where they can range from as much as a 50% increase
> in rare cases of heavily P4 HT-optimized software to an actual
> *minus 10-15%* in the performance of other software when
> compared with running that software on the same cpu with
> HT turned off. In fact, as you may have noted recently, Intel
> is now *officially* advocating that HT be *turned off* in
> some cases as it is seen to be a drag on even non-HT performance,
> sometimes. This is all happening because you in effect have
> one cpu core which is trying to pretend it is two cores--but
> only sometimes--and the situation can as easily provoke inefficiency
> in the core as it can provoke the desired efficiency increases
> you look for in processing...
>
> In short, two physical cpu
> cores are always much better than a single physical core
> & a logical core.
>
> So let's consider the case of the A64 and a multithreaded
> software application. It'll run the multithreaded software
> exactly like the HT-enabled P4 runs it, by multitasking the
> threads. The difference is that the A64's fundamental architecture
> is designed to run everything at maximum per-clock efficiency
> all the time and doesn't conditionally and functionally speed
> up per-clock multithreaded multitasking or slow down per
> clock while running single threads exclusively, as happens
> in the P4 HT-enabled cpu--sometimes (as sometimes it's best
> to just turn P4-HT off)...
>
> If a specific application is
> heavily optimized for the A64, for instance, the A64 will run
> that application faster than normal just as a heavily P4-HT
> optimized multithreaded application will run faster than
> normal on the P4 with HT enabled. How "much" faster for
> either depends upon the depth and quality of the respective
> optimizations, of course. Look at the spread in MHz between
> a 3.4GHz HT P4 and a 2.2GHz A64--it's darn near 50%, isn't
> it? Depending on the cache config for these cpus, of course,
> which can vary in either case, these cpus are considered
> very close to each other in IA-32 x86 general application
> performance (with, as you note, Athlon64 winning most contests--except
> those running heavily P4-HT optimized software and not offering
> a heavily A64-optimized version of the test or bench to compare.)
> This tells us that basically the A64 is ~50% faster than
> the HT P4 *per clock.* When you consider that the A64 will
> make further performance gains running in 64-bit mode, well,
> the choice is pretty clear to me.
>
> Last but not least, none of this is lost on Intel which, with
> the Dothan cores and their upcoming dual core cpus, seems
> pretty clearly to have consigned P4 HT to the ash-can of
> history. AMD actually announced a dual-core cpu direction
> years ago when it announced K8 and for a while there was a
> lot of speculation that K8 would debut as a dual-core cpu.
> Unlike Intel, though, AMD seems to have done a lot more
> preparatory work in setting up the kinds of system buses and
> other things amenable to a dual-core cpu reaching its performance
> potential. Dual cores from AMD and Intel are just far better
> SMT strategies than something like P4 HT ever was, because
> of one very important thing they bring to the table that
> P4 HT never did: not only much better multithreaded performance,
> of course, but mainly dual-core cpus will introduce a consistency
> and predictability into multithreaded software performance
> that P4 HT simply failed to do. That's why P4 HT was never
> the catalyst for smt software development some thought it
> would be--the performance potential is simply too unpredictable
> for many software firms to justify its sometimes much higher
> development costs. SMP, really, has always been a far greater
> catalyst for smt software development than P4 HT ever was,
> imo. It's certain, though, I think, that when dual-core
> cpus become commodity cpus that general smt software development
> will soon after accelerate like a rocket...

"
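
A quick sanity check on the per-clock claim in the quote (3.4GHz HT P4 vs. 2.2GHz A64), as a rough sketch only. It assumes, as the quote does, that the two chips land about even in general application performance; the function name per_clock_ratio and the equal-performance defaults are mine, not anything from WaltC's post.

```python
def per_clock_ratio(clock_a_ghz, clock_b_ghz, perf_a=1.0, perf_b=1.0):
    """Return how much work per clock CPU B does relative to CPU A.

    perf_a and perf_b are overall performance scores; they default to
    equal, which is the assumption taken from the quoted post.
    """
    return (perf_b / clock_b_ghz) / (perf_a / clock_a_ghz)


p4_clock_ghz = 3.4   # HT P4 clock from the quote
a64_clock_ghz = 2.2  # A64 clock from the quote

# How much higher the P4 is clocked than the A64, as a fraction.
clock_gap = p4_clock_ghz / a64_clock_ghz - 1

# Work per clock of the A64 relative to the P4, assuming equal overall performance.
a64_vs_p4 = per_clock_ratio(p4_clock_ghz, a64_clock_ghz)

print(f"P4 clock advantage: {clock_gap:.0%}")
print(f"A64 per-clock advantage at equal performance: {a64_vs_p4 - 1:.0%}")
```

Both numbers come out around 55%, so "darn near 50%" is in the right ballpark, if a little low. If you plug in real benchmark scores for perf_a and perf_b instead of assuming a tie, the per-clock figure moves accordingly.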