This gives the impression of multi-threaded hardware. It is not truly multi-threaded, as that would involve being able to process 2 or more threads concurrently (i.e. at the same time), while hyperthreading can still only actually process one thread at any given time. AMD's multi-core system would be truly multi-threaded and not hyperthreaded. Hyperthreading was simply a way to increase the performance of the P4 by utilizing more of the wasted clock cycles.
As I recall, the SMT implementation on the P4 is perfectly capable of processing multiple threads concurrently. The only limitation to this I can see is the single-decoder front-end, which can decode only one x86 instruction per clock. However, I don't think this limits it to only one thread at any given time, as the majority of the time instructions are issued from the trace cache, not decoded. The trace cache can issue up to 3 micro-ops every cycle (well, 6 micro-ops every 2 cycles), and those 3 micro-ops can come from either of the two running threads.
AMD's CPUs have tended to be more efficient than Intel's P4 line in terms of utilizing their clock cycles. This is partly due to the pipeline structure, and partly due to allowing more time for each clock cycle to complete (i.e. a faster Hz rating means there is less time available for a clock cycle to complete, and with I/O (i.e. reading from memory, or hard drive, etc.) taking a certain amount of time, multiple clock cycles are wasted while waiting for that I/O to complete, whereas with a slightly longer clock cycle, more I/O has a chance to complete in fewer clock cycles, thus not wasting as many cycles as the faster-clocked system might). Hyperthreading was a way to make the P4 more efficient in its operations, and would not necessarily make ALL CPUs more efficient.
While this is true, it is true of all processors, not just the P4 or Athlon. A 2 GHz Athlon would be inherently "less efficient" per clock cycle than a 1.8 GHz Athlon. As you'll notice from the scaling of the Athlon and P4 (Anand's review of the 3.2 P4 showed a 59.65% scaling of the P4 from 3.0 to 3.2 GHz and a 55.93% scaling of the Athlon Barton from 1.83 to 2.20 GHz), this is a problem for all processors.
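To put rough numbers on the point above, here is a small Python sketch. It rests on assumptions of mine, not on anything from the review: I am reading "scaling" as the fraction of a clock increase that shows up as extra performance, and I am modelling run time as cycle-bound work plus a fixed memory wait that does not shrink with clock speed. The function names and the 100 ns latency are made up for illustration.

```python
def stall_cycles(latency_ns, clock_ghz):
    """Clock cycles that tick by while waiting on a fixed-latency access."""
    return latency_ns * clock_ghz


def run_time_ns(core_cycles, mem_time_ns, clock_ghz):
    """Toy model: cycle-bound work speeds up with clock, memory waits do not."""
    return core_cycles / clock_ghz + mem_time_ns


def scaling(core_cycles, mem_time_ns, f_low, f_high):
    """Fraction of the clock increase that turns into extra performance
    (assumed definition of "scaling")."""
    speedup = (run_time_ns(core_cycles, mem_time_ns, f_low)
               / run_time_ns(core_cycles, mem_time_ns, f_high))
    return (speedup - 1.0) / (f_high / f_low - 1.0)


def implied_gain(scaling_fraction, f_low, f_high):
    """Performance gain implied by a quoted scaling figure, same definition."""
    return scaling_fraction * (f_high / f_low - 1.0)


# Hypothetical 100 ns access: more cycles are lost at the higher clock.
print(stall_cycles(100, 2.0), stall_cycles(100, 3.0))

# With no memory wait the model scales perfectly; with some, it does not.
print(scaling(1000, 0, 3.0, 3.2), scaling(1000, 200, 3.0, 3.2))

# Gains implied by the figures quoted above, under the assumed definition.
print(implied_gain(0.5965, 3.0, 3.2))    # P4, 3.0 -> 3.2 GHz
print(implied_gain(0.5593, 1.83, 2.20))  # Athlon Barton, 1.83 -> 2.20 GHz
```

Under that reading, the quoted figures work out to roughly a 4% gain for the P4 and roughly an 11% gain for the Barton, and in both cases only a bit more than half of the clock increase turns into performance.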
However, this isn't necessarily the only impact SMT has. Often, even without I/O or memory latency problems, not all resources on the MPU are utilized. Due to things such as data dependencies and instruction decode limitations, enough ILP cannot always be extracted to max out the processor's parallel execution resources (in modern superscalar MPUs). SMT addresses this as well.
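A toy illustration of the issue-slot argument, in Python. This is only a sketch under loud assumptions: a 3-wide machine, a made-up list per thread saying how many independent micro-ops it has ready each cycle, no carry-over of unissued work, and a fixed priority between threads. It is not a model of how the P4 or any real core schedules.

```python
def utilization(ready_by_thread, width=3):
    """Share of issue slots filled when the listed threads share one core.

    ready_by_thread: one list per thread, giving the number of independent
    micro-ops that thread could issue in each cycle (hypothetical data).
    """
    cycles = len(ready_by_thread[0])
    issued = 0
    for cycle in range(cycles):
        free = width
        # Earlier threads in the list get first pick of the slots.
        for thread in ready_by_thread:
            take = min(thread[cycle], free)
            issued += take
            free -= take
    return issued / (width * cycles)


# Made-up ILP profiles: zeros are stall cycles, small numbers are
# dependency-limited cycles.
thread_a = [1, 2, 0, 1, 3, 0]
thread_b = [2, 0, 1, 1, 1, 2]

print(utilization([thread_a]))            # thread A alone
print(utilization([thread_b]))            # thread B alone
print(utilization([thread_a, thread_b]))  # both sharing the issue slots
```

Each thread alone fills well under half of the slots in this example, while the two together fill most of them, because one thread's ready micro-ops land in slots the other leaves empty. The numbers mean nothing beyond showing the mechanism.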
For the Athlon, SMT could be very helpful, perhaps even more so than for the P4, as the Athlon has many more parallel execution resources than the P4 does (3-6 issue front-end and 3-way decoder vs 3-issue front-end and single decoder, 9 parallel issue ports and execution units vs 6 issue ports and 7 execution units). Assuming (and not accurately) that the Athlon is achieving a 33% higher clock-normalized performance, that still doesn't mean it's fully utilizing the 33% more execution resources it has compared to the P4.