Tech Report Review on Shanghai

Cookie Monster

Diamond Member
May 7, 2005
5,161
32
86

Conclusion:
The Shanghai Opterons' higher clock speeds, larger and quicker L3 cache, and improved memory subsystem are just what the doctor ordered for AMD's quad-core CPU architecture. These changes, along with lower power consumption both at idle and under load, go a long way toward alleviating the weaknesses of the 65nm Barcelona Opterons. The Opteron 2384's ability to outperform the Xeon E5450 in SPECjbb is dramatic proof of Shanghai's potency. Similar server-class workloads are likely to benefit from Shanghai as well, so long as they are properly NUMA-aware. Both in SPECjbb and in the more difficult case (for the Opteron) of the Cinema 4D renderer, we found our Opteron 2384-based system to be quantifiably superior in power-efficient performance to Xeon systems that employ FB-DIMMs.

The new Opterons are clearly more competitive now, but they were still somewhat slower overall in the HPC- and workstation-oriented applications we tested, with the lone exception of MyriMatch. In many cases, Shanghai at 2.7GHz was slightly behind the Xeon L5430 at 2.66GHz. The Opteron does best when it's able to take advantage of its superior system architecture and native quad-core design, and it suffers most by comparison in applications that are more purely compute-bound, where the Xeons generally have both the IPC and clock frequency edge.

We should say a word here about Intel's San Clemente platform, which we paired with its low-voltage Xeons. It's a shame this platform isn't more of a mainstream affair, and that its memory controller is limited to only six DIMMs. Even with that limitation, San Clemente may be Intel's best 2P server platform. In concert with the Xeon L5430, it's even more power efficient than this first wave of Shanghai Opterons, and in several cases, the lower latency of DDR2 memory seemed to translate into a performance advantage over the Bensley platform in our tests. For servers that don't require large amounts of RAM, there's no better choice.

AMD argues that it has a window of opportunity at present, while its Shanghai Opterons are facing off in mainstream servers versus current Xeons. I would tentatively agree. For the right sort of application, an Opteron 2384-based system offers competitive performance and lower power draw than a Xeon E5450 system based on the Bensley platform. The Xeon lineup has other options with consistently higher performance or lower power consumption, but the Shanghai Opterons match up well against Intel's mainstream server offerings. (Workstations and HPC, of course, are another story.) If AMD can deliver on its plans for HyperTransport 3-enabled Opterons early next year, along with low-power HE and high-performance SE models, it may have a little time to regain lost ground in the server space before 2P versions of Nehalem arrive and the window slams shut.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
I really like the threading speedup data generated by this review. I tossed the Euler3D data into a spreadsheet so I could do some analyses.

Euler3D Benchmark Scaling

I suspect many folks are not accustomed to analyzing thread-scaling data, so I overlaid a couple of extra lines with some text to aid in understanding what the viewer is looking at.

Using the Almasi/Gottlieb-modified Amdahl's law, we can do a best fit to the data and extract the apparent interprocessor communication impact for this specific code (Euler3D). Now, the exact IPC model I used here is no doubt incorrect, as it will be missing cross-terms, but it should be correct to second order (meaning the zeroth-, first- and second-order IPC terms should be correctly represented in the model).
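Something like this minimal sketch, for anyone who wants to reproduce the fit. The functional form is my assumption (a linear broadcast term and a quadratic point-to-point term, per the caveat above), and the data points are made up for illustration, not the actual Euler3D numbers:

```python
# Modified-Amdahl speedup with interprocessor-communication (IPC) penalties.
# Assumed form: broadcast traffic grows ~(n-1), point-to-point traffic grows
# ~n(n-1); cross-terms are ignored.
import numpy as np
from scipy.optimize import curve_fit

def speedup(n, P, bcast, p2p):
    """P = parallel fraction, bcast/p2p = per-unit communication costs."""
    return 1.0 / ((1.0 - P) + P / n + bcast * (n - 1) + p2p * n * (n - 1))

threads = np.array([1.0, 2.0, 4.0, 8.0])
measured = np.array([1.0, 1.85, 3.20, 5.10])   # illustrative, read off a graph

(P, bcast, p2p), _ = curve_fit(speedup, threads, measured,
                               p0=[0.99, 0.01, 0.001], maxfev=10000)
print(f"parallel fraction = {P:.3f}, broadcast IPC = {bcast:.4f}, "
      f"pt-to-pt IPC = {p2p:.5f}")
```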

Also understand that the way scaling graphs are generated involves normalizing each result to the single-threaded performance of the CPU in question, so IPC (instructions per clock) and clock speed are already normalized out. You can readily see this in the fact that both the 2.66GHz Xeon Penryn and the 3.2GHz Skulltrail Penryn system land on near-identical scaling points on the graph (pink squares and light-blue circles).

As such, we should understand that the scaling improvement of Shanghai over Barcelona in this application is not due to the clock speed disparity (2.7GHz for Shanghai vs. 2.3GHz for Barcelona) but rather is directly attributable to improvements in the cache and memory subsystems.

Remember, these data are from dual-socket systems, except for the Core i7 965.

A few things strike me right off the bat. The Shanghai Opteron 2384 (2.7GHz, the green diamonds in the graph) has pretty impressive scaling relative to the Intel architectures (both Penryn and Nehalem).

And we see a nice scaling improvement in Shanghai over the Barcelona Opteron 2356 (2.3GHz, purple circles on the graph), which lets us quantify how much Shanghai improved this metric over Barcelona.

For Shanghai we get a broadcast IPC of 0.03 (lower is better) and a point-to-point IPC of 0.0007 (lower is better); whereas for Barcelona we get a broadcast IPC of 0.045 (Shanghai's is 33% lower) and a pt-to-pt IPC of 0.0007 (the same as Shanghai's).

Pretty impressive that the cache and memory subsystem have been improved such that the scaling losses due to interprocessor communication are reduced by 33%.
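To put those numbers side by side, here's a quick back-of-the-envelope comparison of the total communication overhead term at 8 threads, using the same assumed functional form as above (illustrative only):

```python
# Compare the assumed IPC overhead term, bcast*(n-1) + p2p*n*(n-1), at n = 8
# threads for the two fitted parameter sets quoted above.
n = 8
for name, bcast in [("Shanghai", 0.030), ("Barcelona", 0.045)]:
    overhead = bcast * (n - 1) + 0.0007 * n * (n - 1)
    print(f"{name}: overhead term = {overhead:.3f}")
# Shanghai ~0.249 vs. Barcelona ~0.354: the broadcast term dominates,
# so the reduction there is where the scaling win comes from.
```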

On the flip side, I am struck by the seemingly poor showing of Nehalem. Sure, its absolute performance is blistering; the raw numbers blow everything else out of the water. But looking at the thread-scaling data, it turns in the worst results of the bunch.

For Penryn, with its FSB, we expect the lowered performance, but Nehalem really suffers from interprocessor communication congestion in this test, to a degree that both Penryn and the Opterons avoid.

Penryn (Xeon and Skulltrail) suffers a broadcast IPC of 0.06 (2x that of Shanghai) and a point-to-point IPC of 0.008 (more than 10x that of Shanghai). The interprocessor communications for this application are really slow on Penryn compared to the Opterons.

But Nehalem really has trouble scaling this app, and the only culprit we can point a finger at is the cache and memory subsystem, as the 2- and 4-thread runs show equally challenged scaling results. The broadcast IPC is 0.14 (more than 2x that of Penryn). No telling if this was just a badly configured i7 system or if the total RAM in the system was the culprit here... but the data currently really do not show the Nehalem architecture's scaling in a good light.

Nehalem's raw performance makes up for the lack of scaling, but scaling favors the chip with lower raw performance: more chips communicating more efficiently will eventually deliver enough raw performance to overcome the higher-performing, lower-chip-count systems. Can't wait to see those Gainestown results in 3-4 months.
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
If you were referring to this (http://www.techreport.com/articles.x/15905/9): once you take Hyperthreading into account, the scaling looks bad because those are "logical cores" ("virtual cores"). Each logical core or thread is much slower than a real physical core, which affected the scaling in your graph.

However, if you count only the physical cores, Nehalem approaches near-perfect scaling at 4 cores without HT... and if you count 4 physical cores with HT, that's way past perfect scaling.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: BlueBlazer
If you were referring to this (http://www.techreport.com/articles.x/15905/9): once you take Hyperthreading into account, the scaling looks bad because those are "logical cores" ("virtual cores"). Each logical core or thread is much slower than a real physical core, which affected the scaling in your graph.

However, if you count only the physical cores, Nehalem approaches near-perfect scaling at 4 cores without HT... and if you count 4 physical cores with HT, that's way past perfect scaling.

I don't think you quite understand what I am talking about. This is not about "do logical cores scale as well as physical cores?"

As you can see in the Euler3D benchmark, for 2 and 4 threads the i7 scales markedly less efficiently than the Opterons and Penryn. There is no conflation of logical and physical cores in those benchmark scores. This is self-evident.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
I threw the MyriMatch benchmark scaling data into the spreadsheet as well.

MyriMatch benchmark scaling

The MyriMatch code is apparently well optimized and is 99% parallelized (it's rare for code to be so highly parallelized).

Interestingly, unlike with the Euler3D code, the Shanghai and Barcelona Opterons have identical scaling results here. So close to identical, in fact, that you can't make out the Opteron 2384 data points in the graph; they are hidden behind the Opteron 2356 data.

Likewise for the two Penryn systems: near-identical scaling in this benchmark, albeit negatively impacted (presumably by FSB contention) relative to the Opteron systems. In this case the interprocessor communication (IPC) time is roughly 5x slower for broadcast and 3x slower for point-to-point relative to the same metrics for the Opterons.

Nehalem again churns out the worst scaling results. Here we see the virtual cores performing just as well as the physical cores (evident from the seamless scaling function in blue that best fits the yellow triangle data points). In this specific application the i7 system experiences a broadcast IPC which consumes a striking 120x more time than on the Opterons, and the point-to-point IPC is 15x more time-consuming. These results hold true whether you consider only the 1-, 2-, and 4-thread data (physical cores) or include the 6- and 8-thread scaling data, which relies on the virtual cores.

Without having a system to play with, it is hard to know which particular component of the cache and memory subsystem is causing this bottleneck, be it L3$ clock speed or latency, RAM latency, etc.
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
MyriMatch proteomics

Core i7 965 Extreme No HT
1 thread = 306
2 threads = 160, speedup = 1.91, scaling per core = 95.6%
4 threads = 82, speedup = 3.73, scaling per core = 93.3%
(beyond 4 threads is moot since there are only 4 cores; scores remain the same)

Core i7 965 Extreme HT
1 thread = 307
2 threads = 162, speedup = 1.90, scaling per logical core = 94.7%, scaling per physical core = 94.7%
4 threads = 95, speedup = 3.23, scaling per logical core = 80.7%, scaling per physical core = 80.7% (attribute this anomaly to the thread assignment)
8 threads = 60, speedup = 5.12, scaling per logical core = 64%, scaling per physical core = 128% (maximum physical 4 cores)

The hyperthreading helped boost the scores beyond 4 threads. However, logical cores are no match for real physical cores.
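For anyone who wants to check the arithmetic, here's a quick sketch of it in Python (MyriMatch reports times, so lower is better and speedup = t1 / tn):

```python
# Speedup and per-core scaling from the MyriMatch times above (i7 965, HT off).
times = {1: 306, 2: 160, 4: 82}   # threads -> seconds
t1 = times[1]
for n in (2, 4):
    speedup = t1 / times[n]
    print(f"{n} threads: speedup = {speedup:.2f}, "
          f"scaling per core = {speedup / n:.1%}")
```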
 

BlueBlazer

Senior member
Nov 25, 2008
555
0
76
STARS Euler3d computational fluid dynamics

Core i7 965 Extreme No HT
1 thread = 1.43
2 threads = 2.57, speedup = 1.80, scaling per core = 90%
4 threads = 4.49, speedup = 3.14, scaling per core = 78.5%
(beyond 4 threads is moot since there are only 4 cores; scores degrade)

Core i7 965 Extreme HT
1 thread = 1.44
2 threads = 2.44, speedup = 1.69, scaling per logical core = 84.7%, scaling per physical core = 84.7%
4 threads = 3.29, speedup = 2.28, scaling per logical core = 57%, scaling per physical core = 57%
8 threads = 5.09, speedup = 3.53, scaling per logical core = 44.2%, scaling per physical core = 88.2% (maximum physical 4 cores)

Opteron 2384 reached 81.6% scaling per core for a total of 4 cores, much better than Core i7 with no HT, showing the Opterons have better scaling at 4 cores. However, with HT, Core i7 gets past this when comparing physical cores.

Opteron 2384 reached 64% scaling per core for a total of 8 cores, again much better than Core i7, due to having more physical cores.
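Same arithmetic as in my MyriMatch post, except Euler3D reports a rate (higher is better), so the ratio flips to speedup = score_n / score_1. A quick sketch:

```python
# Speedup and per-core scaling from the Euler3D scores above (i7 965, HT off).
scores = {1: 1.43, 2: 2.57, 4: 4.49}   # threads -> score (higher is better)
s1 = scores[1]
for n in (2, 4):
    speedup = scores[n] / s1
    print(f"{n} threads: speedup = {speedup:.2f}, "
          f"scaling per core = {speedup / n:.1%}")
```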
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: Cookie Monster
Could it be the relatively small size of the L2$?

100%. In fact, the difference could go one step further and be well explained by the difference between an exclusive and an inclusive shared L3$.

Point-to-point IPC is going to depend on getting info from the other threads, info which resides in the L1$ and L2$ of the cores those threads are running on in an Opteron system, but which is also present in the L3$ on the i7.

So we could very well be seeing this difference (inclusive vs. exclusive) playing out as the snoop filters battle it out over locking contention and all that good stuff.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: BlueBlazer
8 threads = 5.09, speedup = 3.53, scaling per logical core = 44.2%, scaling per physical core = 88.2% (maximum physical 4 cores)

Opteron 2384 reached 81.6% scaling per core for total of 4 cores, much better than Core i7 no HT, showing Opterons have better scaling at 4 cores. However with HT, Core i7 gets past this when comparing physical cores.

Opteron 2384 reached 64% scaling per core for total of 8 cores, again much better than Core i7 due to more physical cores.

I recommend doing some reading up on Amdahl's law. Gene Amdahl is regarded as having known a thing or two about multi-core, multi-threaded systems... since the sixties.

To analyze scaling data you must break out the parallelized code component first, as it is hardware independent. The second thing one typically does is break out the interprocessor communication components, as these explain (are the root cause of) why scaling results in real-world applications fall short of the limit imposed by Amdahl's law.

Interprocessor communication (dubbed IPC by computer scientists, easily confused with the other IPC, i.e. instructions per cycle) is what makes scaling less than perfect (where perfect is not a scaling of 1 but rather the scaling limit imposed by the fraction of parallelized codepaths in the application of interest). Also read up on the modifications to Amdahl's law made by Almasi and Gottlieb to see the most commonly employed methods of accounting for IPC in scaling limitations; I used the broadcast and point-to-point terms in my modeling here.
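For reference, plain Amdahl and the communication-extended form I fit look roughly like this (the exact shape of the IPC terms is my simplification of the Almasi/Gottlieb treatment, so treat it as a sketch):

```latex
% Amdahl's law: f = parallel fraction, n = thread count
S(n) = \frac{1}{(1 - f) + f/n}

% Extended with IPC cost terms: c_b = broadcast (one-to-all, ~linear in n),
% c_p = point-to-point (all-to-all, ~quadratic in n)
S(n) = \frac{1}{(1 - f) + f/n + c_b (n - 1) + c_p \, n (n - 1)}
```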

Now on to the data... there are simple ways to tell whether virtual cores perform comparably to, or measurably worse than, equivalent physical cores. It merely requires scaling data that spans the range of physical and virtual cores.

The MyriMatch data provides a textbook example of this. Note the yellow line in this graph of the scaling data.

Overlaid on the yellow data are the modeled results for an 8-core system (real cores; no logical or hyperthreading labels are needed here), in blue. Notice how little difference there is between the solid blue line and the yellow data. That blue line would be the same whether we had data for just 1, 2, and 4 threads on an i7 or the additional data for 6 and 8 threads.

The fact that the scaling results for 6 and 8 threads fall perfectly in line with the scaling projection from the "real" cores at 1, 2, and 4 threads is proof enough that an i7 appears to be an 8-core processor for all intents and purposes in this specific application. If it didn't, there would be a noticeable and obvious discontinuity (a kink, in other words) in the scaling data between the first three data points (1, 2, 4) and the last two (6, 8).
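If you want to run that kink test yourself, here's a minimal sketch: fit the model to the physical-core points only, extrapolate to 6 and 8 threads, and compare against the measured HT points (the functional form and the numbers are illustrative assumptions, not the article's data):

```python
# Fit the modified-Amdahl model to 1/2/4-thread (physical core) data, then
# check whether the 6/8-thread (HT) points fall on the extrapolated curve.
import numpy as np
from scipy.optimize import curve_fit

def speedup(n, P, bcast, p2p):
    return 1.0 / ((1.0 - P) + P / n + bcast * (n - 1) + p2p * n * (n - 1))

phys_n = np.array([1.0, 2.0, 4.0])
phys_s = np.array([1.0, 1.90, 3.40])            # illustrative speedups
params, _ = curve_fit(speedup, phys_n, phys_s,
                      p0=[0.99, 0.01, 0.001], maxfev=10000)

for n, s_meas in [(6, 4.4), (8, 5.1)]:          # illustrative HT points
    print(f"{n} threads: predicted {speedup(n, *params):.2f}, "
          f"measured {s_meas:.2f}")
# If measured tracks predicted, the HT threads behave like real cores for
# this workload; a systematic shortfall at 6/8 threads would be the kink.
```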

The less-than-stellar i7 scaling in the MyriMatch data need not be blamed on hyperthreading; in fact, it is completely incorrect to do so. The Almasi/Gottlieb-modified Amdahl equation readily explains the scaling performance of the i7 from 2 threads to 8. What we don't know is what it is about the i7 system used to generate this data that caused such unsatisfactory scaling results. Was the DDR3 latency abysmal? Was the L3$ speed a bottleneck?

In other words, can it be improved upon by the end user configuring the system more judiciously for the app of interest, or is this scaling performance indicative of hard-coded, unchangeable attributes of the architecture (snoop filters, cache locking, cache latency and ports, etc.)? These are the kinds of questions people who build systems that need to scale over multiple sockets are asking themselves; it is how we analyze these things.

edit: fixed quote block.
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Could the server Nehalem CPU have more L2$ and/or faster L3$, or are those changes too large to make this late in the product development process?
 

PlasmaBomb

Lifer
Nov 19, 2004
11,815
2
81
Originally posted by: magreen
Could the server Nehalem CPU have more L2$ and/or faster L3$, or are those changes too large to make this late in the product development process?

Significant changes will likely wait till the 32nm refresh.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: magreen
Could the server Nehalem CPU have more L2$ and/or faster L3$, or are those changes too large to make this late in the product development process?

More L3$ is my understanding, another 8MB.

Faster could mean lower latency and/or higher clock speed. Lowering latency would mean a redesign, which is unlikely. Higher clock speed is feasible, provided there is TDP budget for it.

Currently, i7s have their L3$ clocked at 2.66GHz, IIRC. It would be nice to run the scaling test with that increased to 3GHz or decreased to 2.33GHz and measure the impact on scaling. Then we'd start to have some answers regarding the observed scaling at stock.
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
OK, so faster L3$ doesn't seem possible (I did mean lower latency).

And since lower-latency L3$ is out, what about a larger L2$? From the clues we're getting, it looks like one of these two areas is the culprit; a larger L3$ might not change anything.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: magreen
OK, so faster L3$ doesn't seem possible (I did mean lower latency).

And since lower-latency L3$ is out, what about a larger L2$? From the clues we're getting, it looks like one of these two areas is the culprit; a larger L3$ might not change anything.

L2$ impacts an individual core's performance; if it is too small, it could also make hyperthreading performance suffer, since both threads on one core get half the L2$. So I expect the L2$ to be holding back single-threaded performance, if anything, but not necessarily the scaling of that performance.

Consider that interprocessor communication (or in this case thread-to-thread communication) is what keeps the scaling from being exactly what Amdahl's law predicts. So we should look to whatever shared hardware is responsible for communicating data between the cores. To me that points to the shared L3$, the IMC, the X58 chipset, and the DDR3.

If anything, I'd expect a larger L3$ on Beckton to weigh in with slightly worse latency (by intent), just as the argument thus far about Nehalem's really small L2$ has been that making it any larger would require increasing its latency.

There are too many possible points of weakness to contemplate narrowing them down at the moment. Until we see some scaling data with varied L3$ clock speeds, we won't really be able to rule it out or in, IMO.
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Got it. L2$ doesn't sound like the culprit, according to what you're saying.

I guess we have to be patient and wait for more info on what is causing the interthread communication slowdown: slow L3$, too little L3$, or something else such as what you mentioned above. Thanks for the info. :beer:
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Originally posted by: BlueBlazer
MyriMatch proteomics

Core i7 965 Extreme No HT
1 thread = 306
2 threads = 160, speedup = 1.91, scaling per core = 95.6%
4 threads = 82, speedup = 3.73, scaling per core = 93.3%
(beyond 4 threads is moot since there are only 4 cores; scores remain the same)

Core i7 965 Extreme HT
1 thread = 307
2 threads = 162, speedup = 1.90, scaling per logical core = 94.7%, scaling per physical core = 94.7%
4 threads = 95, speedup = 3.23, scaling per logical core = 80.7%, scaling per physical core = 80.7% (attribute this anomaly to the thread assignment)
8 threads = 60, speedup = 5.12, scaling per logical core = 64%, scaling per physical core = 128% (maximum physical 4 cores)

The hyperthreading helped boost the scores beyond 4 threads. However, logical cores are no match for real physical cores.

Idontcare, it must be the hyperthreading that slows it down. It's not slowing things down in the usual sense, because the application scales to 8 threads and a 1P Core i7 has 8-thread support.

Look at BlueBlazer's data again. With Hyperthreading off, the 4-thread performance is higher than the 4-thread performance with Hyperthreading on.

What's likely happening is that with 4 threads and hyperthreading on, each core has almost half the resources it has with hyperthreading off. The chip may effectively have 8 hardware threads taking up resources during the 4-thread test, thereby robbing the cores of resources.

Originally posted by: BlueBlazer
Opteron 2384 reached 81.6% scaling per core for a total of 4 cores, much better than Core i7 with no HT, showing the Opterons have better scaling at 4 cores. However, with HT, Core i7 gets past this when comparing physical cores.

Much better?? 78.5% vs. 81.6% isn't a big difference.

We need the Xeon versions of Nehalem to see how good the SMT implementation really is. It could be that the extra bandwidth and fewer caveats (no replay feature) help SMT, but the Core architecture is in general a substantially higher-performing core than Netburst. That might mean the benefits won't be as big.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: IntelUser2000
Idontcare, it must be the hyperthreading that slows it down. It's not slowing things down in the usual sense, because the application scales to 8 threads and a 1P Core i7 has 8-thread support.

Look at BlueBlazer's data again. With Hyperthreading off, the 4-thread performance is higher than the 4-thread performance with Hyperthreading on.

What's likely happening is that with 4 threads and hyperthreading on, each core has almost half the resources it has with hyperthreading off. The chip may effectively have 8 hardware threads taking up resources during the 4-thread test, thereby robbing the cores of resources.

Until the test is run in which threads are locked to cores, we can't draw any conclusions about why 4 threads on the 8-thread-enabled CPU were slower than 4 threads on a 4-thread-enabled CPU.

Yes, your assertion is plausible, of course; this is the potential downside of sharing resources when implementing SMT.

But such a slowdown is also expected from thread migration and cache thrashing... I'd expect these results even if the CPU had 8 full cores. The 4-thread results would be worse if the system had 16 cores, and worse still if it had 32 cores but only four active threads.

4 threads migrating between cores and caches every 100ms on an 8-core (or 8-thread) system will impart a slowdown.

I don't have any issue accepting that SMT will not enable every 8-thread application to speed up as much as it would on "true" 8-core hardware... but the scaling data published to date (Euler3D and MyriMatch) show precisely the scaling results one would expect from a "true" 8-core system with lackluster interprocessor communications.

For these two applications there is absolutely no need to invoke conclusions of an inadequate SMT implementation (the shared resources do not appear to bottleneck or limit these two specific apps). However, the rather poor interprocessor communications that fall out of analyzing the scaling data do suggest that something in the memory subsystem (any of the L1-L3 caches or the RAM itself) is severely hampering scaling performance (not to be confused with absolute raw performance) in these two apps.

To dissect the scaling results any further requires access to the hardware and the software, so that the test can be run in which threads are affinity-locked with HT on: run 4 threads with each thread on a unique core in one test, then repeat the test but force 2 threads onto each of two unique cores (so four threads on two cores). A sketch of how you'd set that up follows below.

This will eliminate the performance degradation that comes with thread migration while simultaneously testing the efficacy of SMT and L1$/L2$ on the Nehalem cores.
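Here's a minimal sketch of that experiment, assuming a Linux box and Python (os.sched_setaffinity is Linux-only, and the HT sibling numbering below is an assumption; check /proc/cpuinfo on the actual machine):

```python
# Pin CPU-bound workers to explicit cores to separate thread-migration
# effects from SMT resource sharing. Processes (not threads) are used so
# the Python GIL does not serialize the work.
import os, time
from multiprocessing import Process

def worker(cpu):
    os.sched_setaffinity(0, {cpu})           # pin this process to one core
    sum(i * i for i in range(20_000_000))    # placeholder CPU-bound workload

if __name__ == "__main__":
    # Test A: 4 workers spread across 4 distinct physical cores.
    # Test B: 4 workers packed onto 2 physical cores, assuming HT siblings
    # are numbered (0,4), (1,5), ... on this system.
    for label, cpus in [("spread", [0, 1, 2, 3]), ("packed", [0, 4, 1, 5])]:
        procs = [Process(target=worker, args=(c,)) for c in cpus]
        t0 = time.perf_counter()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(f"{label}: {time.perf_counter() - t0:.2f}s")
```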
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
The problem with your assessment is that without Hyperthreading, the scaling is no worse than Core 2's. You did compare it to Core 2.

And beyond that, the Skulltrail system doesn't have bad scaling either, only worse single-threaded performance.

If Tech Report only showed i7 without HT, you wouldn't be saying this.