The web is full of reports from circa 2009 that describe poor HT performance. More recent info is harder to come by.
If you think about the way HT works, you're basically getting "free" performance when the CPU would otherwise be sitting there doing nothing. Any app that can completely load up a physical CPU core will not benefit from HT, because there is no free time to begin with. There is some overhead associated with HT, so clearly there will be some cases of worse performance with HT enabled. On average, though, HT does help, sometimes quite a bit.
I'm pretty sure LinPack runs better with HT off...
http://suryarpraveen.wordpress.com/...-effects-of-hyper-threading-software-updates/
http://semiaccurate.com/2012/04/25/does-disabling-hyper-threading-increase-performance/
This is about half the story. The key point to realize is that there are many, many resources inside a processor core that cannot be distilled into a neat little percent-utilization metric. You've got a ton of pipeline stages to try to keep full. The classic fetch-decode-execute-retire pipeline still applies, but remember that in a modern architecture, each one of these steps is rather complex in and of itself. What HT essentially does is run two parallel pipelines through each core. The heavier elements (in terms of transistor budget) are shared and the lighter elements are duplicated.
For dense floating-point math kernels with very predictable memory access (such as Linpack), your memory prefetchers, cache predictors, and decode blocks are going to be operating at full efficiency and will keep the floating-point units pretty much constantly busy with only one pipeline. Obviously, in this case doubling the number of threads is just going to create contention and lower overall performance. (Side note: this is why every reasonable dense FP code lets you, the user, choose the number of threads to use.)
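To make that side note concrete: real dense-FP libraries expose the knob in C/Fortran (OpenMP builds, for example, honor the standard `OMP_NUM_THREADS` environment variable), but the structure is simply "the caller picks the thread count, the kernel splits work across exactly that many threads." Here's a toy Python sketch of that shape (the `dot_product` helper is hypothetical, and CPython's GIL means this won't actually scale like a native kernel; it only illustrates the interface):

```python
from concurrent.futures import ThreadPoolExecutor

def dot_product(a, b, num_threads=1):
    """Toy dense-FP kernel: the caller, not the library, picks the thread count."""
    n = len(a)
    chunk = (n + num_threads - 1) // num_threads  # split the work evenly

    def partial(start):
        # Each worker handles one contiguous slice: the kind of predictable,
        # prefetch-friendly access pattern dense FP kernels rely on.
        return sum(a[i] * b[i] for i in range(start, min(start + chunk, n)))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(partial, range(0, n, chunk)))

a = [1.0] * 1000
b = [2.0] * 1000
print(dot_product(a, b, num_threads=2))  # → 2000.0, same answer for any thread count
```

The point is that the thread count is a tuning parameter, not something the library guesses: on an HT machine the user can benchmark with the physical core count versus the logical core count and keep whichever wins.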
However, for heavily branching code (server business logic) or code with poor spatial locality in its memory accesses (video encoding), your ALUs and FPUs are going to spend a ton of time waiting on memory access, be it for instructions or data. This is where HT comes into play. While one thread is stalled out waiting on memory, the CPU can feed the ALUs and FPUs out of the other thread's instruction stream, and vice versa.
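A rough way to see the access-pattern difference (a hypothetical micro-benchmark, not from any of the linked articles): walk the same amount of data once sequentially and once by pointer-chasing through a random permutation. The arithmetic per step is identical; only the memory access pattern changes, and that pattern is what decides whether the core has stall cycles for an HT sibling to fill. Python's object indirection blunts the effect compared to C, but the structural contrast between the two loops is the point:

```python
import random
import time

N = 200_000
data = list(range(N))

def walk_sequential():
    # Predictable, ascending addresses: a hardware prefetcher can stay ahead.
    total = 0
    for i in range(N):
        total += data[i]
    return total

# Build a single-cycle random permutation so the chase visits every index once.
order = list(range(N))
random.shuffle(order)
perm = [0] * N
for a, b in zip(order, order[1:] + order[:1]):
    perm[a] = b

def walk_chase():
    # Each load depends on the previous one, so the core must wait for memory
    # at every step; that waiting is exactly the idle time HT can hand to the
    # other thread's instruction stream.
    total, i = 0, 0
    for _ in range(N):
        total += i
        i = perm[i]
    return total

for fn in (walk_sequential, walk_chase):
    t0 = time.perf_counter()
    result = fn()
    print(fn.__name__, result, f"{time.perf_counter() - t0:.3f}s")
```

Both walks sum the same indices and return the same total; any timing gap comes purely from the dependent, cache-unfriendly loads in the chase version.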