I assume the code fits entirely within the L1 caches and never stresses the core interconnect, the L2 and L3 caches, or the memory subsystem. Because it fits so easily in cache, there are almost no stalls for Hyper-Threading to hide, and the two threads just contend for the same execution resources.
Hyper-threading usually sees more of a gain when there are cache misses or branch mispredictions: one thread is fishing in the L2/L3 caches, waiting for data from memory, or having its pipeline flushed, and the other thread has the opportunity to take over the execution resources in the meantime.
The other case is when the current thread isn't using all of the core's physical resources (ALUs, load/store units, etc.), so the hyper-thread can use them in parallel.
The >30% gain from the benchmark is ridiculously good performance from hyper-threading. I'm not overly familiar with the benchmark or with every detail of the SB (Sandy Bridge) architecture, so it's possible that this particular benchmark happens to work well with hyper-threading, or that the SB design leaves enough execution resources idle under a single thread, because the benchmark can't make full use of the hardware on its own, for the second hyper-thread to soak them up.
