POWER8 SMT8 far surpasses the claims that IBM made!?

MisterLilBig

Senior member
Apr 15, 2014
291
0
76
Reading the new AnandTech article, the Xeon E7 8800 review, I found something interesting about the POWER8 that was not mentioned: it surpasses the performance claims IBM gave for SMT8.

From the article:

"According to IBM,
2-threads delivers about 45% performance more than one
4-threads deliver yet another 30% boost
the last 4-threads deliver about 7%"

That would be an 82% increase.


But the POWER8 3.4GHz using 8 threads gets a 120.35% increase over 1 thread in the LZMA per-core performance compression test, and a 144.50% increase in the decompression test.
And the POWER8 3.7GHz using 8 threads gets a 115.65% increase over 1 thread in the LZMA per-core performance compression test, and a 154.85% increase in the decompression test.

Meanwhile, the Xeon E7 3.3GHz Haswell gets a 42% increase in the compression test and a 39% increase in the decompression test.


IBM is quite accurate on the 2-thread prediction, but the 8-thread prediction is way off! We are talking about gains that are 47%~76% and 41%~89% above the claimed gain, for the 3.4GHz and 3.7GHz parts respectively!
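To make the arithmetic behind those ranges checkable, here is a minimal Python sketch. It assumes the additive reading of IBM's claim used in this post (45 + 30 + 7 = 82%) and uses only the percentages quoted above; the function name `excess_over_claim` is made up for illustration.

```python
# Claimed SMT8 gain under the additive reading used in this post (an assumption
# about how to combine IBM's three figures): 45 + 30 + 7.
CLAIMED_GAIN_PCT = 45 + 30 + 7  # 82

# Measured 8-thread gains over 1 thread (percent), as quoted in the post.
measured = {
    ("POWER8 3.4GHz", "compression"): 120.35,
    ("POWER8 3.4GHz", "decompression"): 144.50,
    ("POWER8 3.7GHz", "compression"): 115.65,
    ("POWER8 3.7GHz", "decompression"): 154.85,
}


def excess_over_claim(measured_pct, claimed_pct=CLAIMED_GAIN_PCT):
    """Return how far the measured gain exceeds the claimed gain, as a percent of the claim."""
    return (measured_pct / claimed_pct - 1.0) * 100.0


for (chip, test), gain in measured.items():
    print(f"{chip} {test}: {excess_over_claim(gain):.1f}% above the claimed gain")
```

Rounded to whole percents, this gives the 47%~76% and 41%~89% figures above. It says nothing about whether the additive reading is the right one.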


AnandTech did the majority of the work here; I am only pointing out the numbers. Now, my questions.

I didn't think that added threads could surpass the performance of the actual core/single thread times 2, but now we see that it totally can.
Am I missing something?
Like, this should be a big deal, no?
It's weird, right?
Why hasn't anyone else done this?
It should be cheaper, right?

How much space does SMT actually take? I imagine it's a percentage based on the actual core size, so it would depend, but with this much benefit, wouldn't this be worth it?


(Please no Intel vs Whoever.....stuff)
 

Ajay

Lifer
Jan 8, 2001
16,094
8,112
136
Interesting - now I really need to read that article.

One thing though, the market has spoken and they prefer Intel's solution for better than 95 out of 100 workloads.

By the way, 1.00 * 1.45 * 1.30 * 1.07 = 2.02 (or a little more than a 100% increase in performance). Still, clearly on some workloads the result is even higher than the average.
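A minimal Python sketch of the two ways to combine IBM's three figures: summing them, or compounding them as in the product above. The function names are hypothetical; the only inputs are the 45%, 30% and 7% steps from the article quote.

```python
from math import prod

# IBM's three quoted steps, as fractions: 1 -> 2 threads, 2 -> 4, 4 -> 8.
STEPS = [0.45, 0.30, 0.07]


def additive_gain(steps):
    """Total gain if every step is a percentage of single-thread performance."""
    return sum(steps)


def compounded_gain(steps):
    """Total gain if every step is a percentage of the previous configuration."""
    return prod(1.0 + s for s in steps) - 1.0


print(f"additive:   +{additive_gain(STEPS) * 100:.0f}%")
print(f"compounded: +{compounded_gain(STEPS) * 100:.0f}%")
```

The additive reading gives +82% and the compounded reading gives roughly +102%, which is the 2.02x figure.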
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
I didn't think that added threads could surpass the performance of the actual core/single thread times 2, but now we see that it totally can.

You bet it can, but the single thread IPC has to be pretty awful to get the most benefit. Something like memory bound or very branchy code (or weaker cores ;))
 

MisterLilBig

Senior member
Apr 15, 2014
291
0
76
One thing though, the market has spoken and they prefer Intel's solution for better than 95 out of 100 workloads.

Unrelated.

By the way, 1.00 * 1.45 * 1.30 * 1.07 = 2.02 (or a little more than a 100% increase in performance). Still, clearly on some workloads the result is even higher than the average.

I saw it as:
2 thread = +45% of 1 thread
4 threads = +75% of 1 thread
8 threads = +82% of 1 thread

Is there an official way of doing this that you can point me to?
 

mavere

Member
Mar 2, 2005
190
4
81
It's been a while since I looked at Power8 docs, but IIRC, the chip was explicitly designed with the assumption that SMT *will* be used. To phrase it differently, a single thread is physically unable to utilize an entire core.

I think the assumed minimum is SMT2, so I'd be more interested to see how real-world usage scales between SMT2 and SMT8.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,112
136
I saw it as:
2 thread = +45% of 1 thread
4 threads = +75% of 1 thread
8 threads = +82% of 1 thread

Is there an official way of doing this that you can point me to?

It is basic word problem analysis:
4-threads deliver yet another 30% boost

"yet another X% boost" with a percentage problem is typically multiplicative. Sadly, that's not the clearest way of wording the results - so no one can be 100% certain.

From the same AT page:
"So in total, the 8-way SMT doubles the performance of this massive core."
So it looks like I am correct.
 

DrMrLordX

Lifer
Apr 27, 2000
22,617
12,535
136
POWER8 was also looking pretty good in the SAP benchmarks that were posted in the Xeon E7 thread on these-here forums. Let's face it, that is a major competitor for 18C Xeon. I still want to see whole-system power numbers. The AT article disappointed me by not providing any such data for the IBM systems.
 
Mar 10, 2006
11,715
2,012
126
POWER8 was also looking pretty good in the SAP benchmarks that were posted in the Xeon E7 thread on these-here forums. Let's face it, that is a major competitor for 18C Xeon. I still want to see whole-system power numbers. The AT article disappointed me by not providing any such data for the IBM systems.

POWER has generally provided very good performance in these sorts of benchmarks. Performance isn't IBM's problem here...
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
It is basic word problem analysis:


"yet another X% boost" with a percentage problem is typically multiplicative. Sadly, that's not the clearest way of wording the results - so no one can be 100% certain.

**From the same AT page:
So it looks like I am correct.

My immediate interpretation of IBM's intentional word choice is in line with Ajay's.

"yet another" implies a relative comparison to the base, which is defined by the preceding line of text.

We can corroborate this with the observation made by the OP - the last thing the IBM marketing team would do is forget to mention how awesome their multi-billion dollar (to develop) processor performs.

So, Ockham's razor applied here would lead one to conclude the OP made an honest misinterpretation of IBM's marketing claims. Easy enough, happens to everyone at some point.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
http://www.anandtech.com/show/9193/the-xeon-e78800-v3-review/11

7-zip is a problematic benchmark but it helps show ILP and basically how well the CPU deals with latency and more complex code.

61428.png


66524.png


It's quite interesting to see just how BAD Power does in a benchmark like this.

Using Avoton 1687 * (3.4/2.4) = 2390

POWER8 is only 20% better than Avoton clock for clock in this latency and memory/instruction sensitive test.
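The clock normalization above can be sketched in a few lines of Python. This assumes, as the post does, that the benchmark score scales linearly with clock speed, which is only a rough approximation for a memory-sensitive test; `scale_to_clock` is a made-up helper name.

```python
def scale_to_clock(score, from_ghz, to_ghz):
    """Scale a benchmark score to another clock, assuming linear scaling with frequency."""
    return score * (to_ghz / from_ghz)


# Avoton score of 1687 at 2.4 GHz, projected to POWER8's 3.4 GHz.
avoton_at_3_4 = scale_to_clock(1687, from_ghz=2.4, to_ghz=3.4)
print(f"Avoton projected to 3.4 GHz: {avoton_at_3_4:.0f}")
```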

61427.png


Here Power also does poorly though with SMT8 it solidly defeats Xeon.

Interesting to see the design focus of Power.

61409.png


Looks like Power is more of a brute-force, massive-throughput monster, while Core is a weaker but more complex and detailed design.

Note that Core will very likely have the perf/W edge in that benchmark. The 10C POWER8 chips look to use 190W at 3.4 GHz and 247W at 3.9 GHz. It's probably close to 300W for a 10C chip at 4.2 GHz, and the POWER8 DRAM buffers use more power too. The E7-8890 v3 is 165W.

Tyan_Power8_575px.png


IMO Power looks good, but at the rate Intel is churning out advances in servers, it looks like 14 nm Xeons will solidly take back the lead.

For 8x CPU

E7-8870 -> E7 8890 v2 : +85% (32 nm -> 22 nm)
E7-8890 v2 -> E7 8890 v3 : +23%
E7-8890 v3 -> E7 8890 v4 : ? (22nm -> 14 nm)
36% will bring it on par with 8x POWER8 (though there are 12C POWER8 chips).
 

DrMrLordX

Lifer
Apr 27, 2000
22,617
12,535
136
Still not bad for a uarch that launched in 2014. If the ORNL supercomputer (Summit) is any indicator, POWER9 should show up by 2017. Will it be Skylake-E by then?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,633
2,432
136
It's been a while since I looked at Power8 docs, but IIRC, the chip was explicitly designed with the assumption that SMT *will* be used. To phrase it differently, a single thread is physically unable to utilize an entire core.

Not quite. The design philosophy is based on catering to two different kinds of workloads:

1. Old, single threaded software that cannot be easily or economically parallelized and that is extremely performance-sensitive. They want to eke out every last bit of performance for this kind of software, even well past the "knee of the curve", where additional improvements to the core give rapidly diminishing returns.

2. Well parallelized software that doesn't need such dramatic single-core throughput, and is better served by assigning the resources in a more balanced manner.

They have as many execution resources as they do to serve the first category. Those last few issue ports or pipelines buy only a few percent each of performance for a single thread, but the people purchasing these systems are willing to pay an ASP in the millions to have that. It's cheaper than rewriting the software, after all.

The wide SMT capability exists so that those execution resources actually have something to do in normal code, and it recovers some perf/watt from the otherwise unbalanced monstrosity.
 

samboy

Senior member
Aug 17, 2002
223
94
101
I would expect the SMT benefit to be more for the POWER8 since their memory cache system is slower than the Xeon one.

That is, while one thread waits on a memory fetch (which will take longer on POWER8 than on Xeon), another thread can make productive use of the CPU.
 

386DX

Member
Feb 11, 2010
197
0
0
Interesting - now I really need to read that article.

One thing though, the market has spoken and they prefer Intel's solution for better than 95 out of 100 workloads.

By the way, 1.00 * 1.45 * 1.30 * 1.07 = 2.02 (or a little more than a 100% increase in performance). Still, clearly on some workloads the result is even higher than the average.

The wording is really strange. At first I thought the calculation was the way you have it, but then it didn't make much sense: if the last 4 threads only increase performance by 7%, that's less than 2% a thread. It wouldn't be worth the extra complexity to implement SMT8 for that performance gain. So now I'm thinking the way it's calculated would be like this:

1 thread = 100%
2 thread = +45%
3 thread = +30%
4 thread = +30%
5 thread = +7%
6 thread = +7%
7 thread = +7%
8 thread = +7%

That gives you an overall boost of approx. 133% with 8 threads, which is about what the average in the benchmark was showing.
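A quick Python check of that per-thread reading. The list simply transcribes the table above, and treating each figure as a percentage of single-thread performance is the assumption of this post, not something IBM states.

```python
# Gain contributed by each additional thread (threads 2 through 8), in percent
# of single-thread performance, per the table above.
PER_THREAD_GAIN_PCT = [45, 30, 30, 7, 7, 7, 7]

total_boost_pct = sum(PER_THREAD_GAIN_PCT)
relative_performance_pct = 100 + total_boost_pct

print(f"boost over 1 thread: +{total_boost_pct}%")
print(f"8-thread performance: {relative_performance_pct}% of 1 thread")
```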
 

ThatsABigOne

Diamond Member
Nov 8, 2010
4,422
23
81
Not x86. Not interested.

I kid I kid. Power8 shows to be a very good cpu with a really good SMT implementation.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,112
136
The wording is really strange. At first I thought the calculation was the way you have it, but then it didn't make much sense: if the last 4 threads only increase performance by 7%, that's less than 2% a thread. It wouldn't be worth the extra complexity to implement SMT8 for that performance gain. So now I'm thinking the way it's calculated would be like this:

1 thread = 100%
2 thread = +45%
3 thread = +30%
4 thread = +30%
5 thread = +7%
6 thread = +7%
7 thread = +7%
8 thread = +7%

That gives you an overall boost of approx. 133% with 8 threads, which is about what the average in the benchmark was showing.

While it is possible to have a workload that behaves somewhat similarly to what you describe - that's not the typical behavior. There was a paper published by DEC showing the simulated SPEC Int performance as a function of the number of hardware threads (with a fixed additional xtor count for each additional thread). I can't find that paper right now, though I'm sure it's on the web somewhere. In any case, that fundamental behavior is something like a*e^(-kN) where N = the number of threads and a & k are constants - so basically an inverse exponential.

The point was in determining the value of adding to the size of the die to improve performance in multi-threaded workloads. That study showed that 4 threads was optimal in terms of performance/die cost. Obviously, IBM decided to extract more performance at the cost of a larger die and higher power usage by implementing the extra hardware (xtors) needed to support 8 hardware threads.
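A rough Python sketch of the diminishing-returns shape described above. It assumes that a*e^(-kN) describes the marginal gain from the N-th hardware thread; that is one possible reading of the formula, not something taken from the DEC paper. The constants `A` and `K` are purely illustrative and were not fitted to any data.

```python
import math

# Hypothetical constants, chosen only to show the shape of the curve.
A = 1.0
K = 0.5


def marginal_gain(n, a=A, k=K):
    """Assumed extra throughput from the n-th thread, as a fraction of one thread."""
    return a * math.exp(-k * n)


def relative_throughput(n_threads, a=A, k=K):
    """Throughput relative to a single thread under the assumed model."""
    return 1.0 + sum(marginal_gain(n, a, k) for n in range(2, n_threads + 1))


for n in (1, 2, 4, 8):
    print(f"{n} thread(s): {relative_throughput(n):.2f}x")
```

Under this model each added thread contributes less than the one before it, so throughput flattens as N grows while the die cost of every extra thread stays fixed. That is the trade-off behind the 4-thread optimum mentioned above.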