[hexus.net] AMD claims it will power another gaming device


Vesku

Diamond Member
Aug 25, 2005
What game code can't be made parallel? Or, put more simply, what game code must remain serialized?

Most games rely on a "world" concept. In its simplest form, the main program is a loop in which everything else occurs, and each pass of the loop advances a fixed segment of game time. Most games use some version of this structure, and it is usually the main limit on how parallelized a game can be.
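A minimal sketch of that structure, assuming a fixed 60 Hz timestep (all function names here are illustrative stand-ins, not from any real engine):

```cpp
#include <chrono>

// Illustrative stubs; a real game would read input, step physics/AI, and draw.
void read_input() {}
void update_world(double /*dt*/) {}   // advances the "world" one time segment
void render() {}

int main() {
    using clock = std::chrono::steady_clock;
    const double dt = 1.0 / 60.0;     // one fixed segment of game time
    double accumulator = 0.0;
    auto previous = clock::now();

    for (int frame = 0; frame < 600; ++frame) {   // the main loop everything hangs off
        auto now = clock::now();
        accumulator += std::chrono::duration<double>(now - previous).count();
        previous = now;

        read_input();
        while (accumulator >= dt) {   // ticks are inherently ordered:
            update_world(dt);         // tick N must complete before tick N+1,
            accumulator -= dt;        // which is the serialization described above
        }
        render();
    }
}
```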
 

Cerb

Elite Member
Aug 26, 2000
Digital Foundry, in their latest article, doesn't sound too impressed by the Jaguar CPU:
http://www.eurogamer.net/articles/digitalfoundry-2014-xbox-one-vs-ps4-year-one

They basically blame it for the lack of innovation in next-gen games, so I think you would find a lot of people considering Jaguar weak, especially when combined with such a strong GPU in the PS4.
As opposed to the great innovations of last gen, like bad textures from day one until EOL, short draw and AI distances, and AI far dumber than what we had in PC games made for Pentium III class CPUs? By and large, publishers making AAA games don't care about gameplay, and they try to justify not being able to read the market and plan with excuses about the hardware (900p being a great example of failure to plan, and/or of no consideration for IQ); but those are some big frilly formerly-owned-by-Sir-Elton-John rose-colored glasses they've got there.
 

Tuna-Fish

Golden Member
Mar 4, 2011
What game code can't be made parallel? Or, put more simply, what game code must remain serialized?

Nothing must be serialized. However, designing complex parallel systems is significantly more expensive and time-consuming than building the same systems serially. For some applications the added expense can be justified; for many others it can't.

There is some easily extractable parallelism in the form of separating out audio, rendering, and the like. I expect that this console generation these splits will become standard, and that the third-party game engines will get very good at rendering in parallel, since the expense can be justified when it is recouped across many sold games.

However, the typical game developer won't make their own game code parallel, simply because the profit isn't there.
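A sketch of that "easily extractable" split, with audio and rendering pulled onto their own threads while game logic stays serial (a hypothetical illustration; all names and timings are made up):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> running{true};

// Each subsystem spins independently and only consumes snapshots of game
// state, so the coordination cost stays low: the "cheap" parallelism above.
void audio_loop() {
    while (running)                                   // mix + submit audio buffers
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

void render_loop() {
    while (running)                                   // build + submit a frame
        std::this_thread::sleep_for(std::chrono::milliseconds(16));
}

int main() {
    std::thread audio(audio_loop);
    std::thread render(render_loop);

    for (int tick = 0; tick < 100; ++tick) {          // game logic remains a serial loop
        std::this_thread::sleep_for(std::chrono::milliseconds(16));
    }

    running = false;                                  // signal subsystems to stop
    audio.join();
    render.join();
}
```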
 

Dribble

Platinum Member
Aug 9, 2005
Most games rely on a "world" concept. In its simplest form, the main program is a loop in which everything else occurs, and each pass of the loop advances a fixed segment of game time. Most games use some version of this structure, and it is usually the main limit on how parallelized a game can be.

It depends how much you do in the game loop. If you can split things out so that all the operations (rendering, sound, physics, AI, etc.) happen independently, there's so little left in the main loop that one thread is fine. Most of those other operations are parallelizable, but coding that is non-trivial, so the usual solution is to buy a third-party toolkit that sets it all up for you: hence you use Unreal Engine to render, PhysX for physics, and so on. Even then it adds significant complexity; more bugs mean a longer dev cycle, slower time to market, etc. Often it's better not to bother and to limit the game (worse graphics, physics, AI) so it can release in a reasonable time instead.
 

Headfoot

Diamond Member
Feb 28, 2008
Not to mention the new consoles went multi-core AND added dedicated silicon... they have hardware audio, for example, which reduces the need for extra cores since much of the audio processing can already be offloaded.
 

itsmydamnation

Diamond Member
Feb 6, 2011
Digital Foundry, in their latest article, doesn't sound too impressed by the Jaguar CPU:
http://www.eurogamer.net/articles/digitalfoundry-2014-xbox-one-vs-ps4-year-one

They basically blame it for the lack of innovation in next-gen games, so I think you would find a lot of people considering Jaguar weak, especially when combined with such a strong GPU in the PS4.

Again, this is stupid rubbish based on peak numbers, not on time to first pixel (a significant reason for the PS4's design, according to Mark Cerny), and not on the cost of debugging or updating code. The reason for the lack of innovation is that for the last 8 years we have had consoles so bad at anything logic-heavy that they simply couldn't; game engines are iterative, and it's going to take time to build up that logic.

Again, let me quote someone who has actually written a significant amount of code for both last-gen and next-gen consoles:

https://forum.beyond3d.com/threads/...hnical-discussion.47227/page-322#post-1386073

I wouldn't say that. The jump is significant when you take into account the code optimization and maintenance cost.

The PPC in-order CPU bottlenecks have been talked to death, but it's always good to look back and see how modern CPUs (including Jaguar) make our lives much easier.

The 3 biggest time sinks when programming older in-order PPC CPUs:

1. Lack of store forwarding hardware
Store buffer stalls (also known as load-hit-store stalls or LHS stalls) are the biggest pain in the ass when programming a CPU without store forwarding hardware. The stall is caused by the fact that memory writes must be delayed by ~40 cycles (because of buffering & possible branch misses). If the CPU reads a memory location that was just written to, it stalls for up to 40 cycles. These stalls are everywhere. C/C++ compilers are notorious for pushing things to the stack and reading the data back from the stack right after that (register spilling, function calls pushing parameters to the stack, read & modify on class variables in loops, etc.). Normally you just want to hand-optimize the most time-critical part of your program (expert programmers are good at this), but LHS stalls affect every single piece of code, so you must teach every single junior programmer techniques to avoid them... or you are in trouble. LHS stalls are a huge time sink.

Modern CPUs have robust store forwarding hardware. You no longer need to worry about this at all. Result: Lots of saved programmer time.
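A minimal sketch of the kind of code that triggered these stalls, and the hand fix every programmer had to learn (illustrative code, not from the quoted post):

```cpp
struct Stats {
    int total = 0;

    // LHS-prone: every iteration read-modify-writes the member through memory.
    // On an in-order PPC core without store forwarding, each 'total += v[i]'
    // loads an address that was just stored to: up to a ~40-cycle stall per pass.
    void accumulate_naive(const int* v, int n) {
        for (int i = 0; i < n; ++i)
            total += v[i];
    }

    // Hand-optimized: accumulate in a local (kept in a register), store once.
    // On a modern CPU with store forwarding, both versions run fine.
    void accumulate_fixed(const int* v, int n) {
        int local = total;
        for (int i = 0; i < n; ++i)
            local += v[i];
        total = local;
    }
};
```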

2. Cache misses and lack of automatic data pre-fetching
The second largest time sink in writing code for older CPUs. Caches have been the most important thing for CPU performance for a long, long time. If the data you access is not in the L2 cache (a cache miss), you have to wait for up to 600 cycles. On old in-order CPUs, the CPU does nothing during this time (you lose up to 600 instruction slots). Modern out-of-order CPUs can reorder some instructions to partially hide the memory stalls. Modern CPUs also have automatic data cache pre-fetchers that find patterns in your load addresses and automatically load the cache lines you are likely to access before you need them. Unfortunately the old PPC cores didn't have automatic data pre-fetching hardware. You had to manually pre-fetch data, even for linear accesses (going through an array, for example). Again, every programmer must know this and add the manual cache pre-fetch instructions to their code to avoid the up-to-600-cycle stalls.

Modern CPUs have robust, fully automatic data prefetching hardware that almost always does a better job than a human, with no extra coding & maintenance cost. Modern CPUs also have larger caches (Jaguar has 2 MB per 4-core cluster) that are faster (lower load-to-use latency).
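A sketch of the manual pre-fetching described above, using the GCC/Clang builtin (illustrative code, not from the quoted post; the prefetch distance is a guess, and on a modern CPU the hardware prefetcher makes the hint unnecessary for a linear walk like this):

```cpp
// Sums an array the way you'd have written it for an old in-order PPC core:
// explicitly pulling cache lines ahead of the loads that will need them.
float sum(const float* data, int n) {
    float total = 0.0f;
    for (int i = 0; i < n; ++i) {
        __builtin_prefetch(&data[i + 64]);  // hint only; prefetching past the
                                            // end of the array cannot fault
        total += data[i];
    }
    return total;
}
```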

3. Lack of out-of-order execution, register renaming + long pipeline
A long pipeline means that instructions have long latencies. Without register renaming, the same register cannot be reused if it is already in use by some instruction in the pipeline. Without out-of-order execution this results in lots of stalls. The main way to avoid these stalls is to unroll all tight loops. Unfortunately, unrolling often needs to be done manually, which takes time and leads to code that is hard to maintain and modify.

Modern Intel CPUs (and AMD Jaguar) have relatively short pipelines (and loop caches). All modern CPUs have out-of-order execution and register renaming. On these CPUs, loop unrolling often actually degrades performance instead of improving it (because of the extra instruction footprint). So the choice is clear: keep the code clean and let the compiler write a proper loop. Save a lot of time now, and even more time when you need to modify the existing code.
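A sketch of the contrast (illustrative code, not from the quoted post): the hand-unrolled form with independent accumulators is what the old in-order cores needed; the clean loop is what you would keep today.

```cpp
// Clean form: let the compiler and the out-of-order core do the work.
float dot(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Hand-unrolled x4 with separate accumulators to break the loop-carried
// dependency chain on a long-pipeline, in-order core (assumes n % 4 == 0
// for brevity). On a modern CPU the extra instruction footprint can make
// this slower than the clean loop above.
float dot_unrolled(const float* a, const float* b, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```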

Answer

Jaguar is a fully modern out-of-order CPU. It has good caches, good pre-fetchers and a fantastic branch predictor (which AMD later adopted for Steamroller, according to Real World Tech: http://www.realworldtech.com/jaguar/). With Jaguar, coders can focus on optimizing things that actually matter, instead of writing boilerplate "robot" optimizations across the vast code base.

Jaguar pushes through the huge majority of the old C/C++ code hand-optimized for PPC without breaking a sweat. You can actually remove some of the old optimizations and make it even faster. Obviously, in vector processing loops you need to port the VMX128 intrinsics to AVX (they wouldn't even compile otherwise), but that's less than 1% of the code base. It's not that hard to port, really, since the AVX instruction set is more robust (mostly it's a 1:1 mapping, and sometimes a single AVX instruction replaces two VMX128 instructions).

You asked me about the FP execution units. All I can say is that I am very happy that the Jaguar FP/SIMD execution units have super low latency. Most of the important instructions have just one or two cycles of latency. That's awesome compared to those old CPUs that had 12+ cycles of latency for most SIMD ALU operations. If you are interested in Jaguar, the AMD 16h Optimization Guide is freely available (download it from the AMD website). It includes an Excel sheet that lists all instruction latencies/throughputs. It's a good read if you are interested in comparing Jaguar's low-level SIMD performance to other architectures.

https://forum.beyond3d.com/threads/...hnical-discussion.47227/page-322#post-1386081

Agreed. If you have a super-optimized, FMA-heavy vector crunching loop (heavily unrolled, of course, to utilize all 128 VMX registers) you will reach similar throughput on the XCPU (the whole CPU). In general, however, it's very hard to even reach an FMA utilization rate of 50% (pure linear algebra does, of course). The XCPU had a vector unit that was way better than any x86 CPU released during the last decade (and the VMX128 instruction set was awesome, except that it lacked int mul). But SSE3 -> AVX is a huge jump. And Jaguar's unit in particular is nice, because the latencies are so low (and even the int mul is fast). On the old PPC cores you had to move data between vector<->scalar registers through memory (LHS stalls everywhere); on modern PC CPUs you have direct instructions for this (1 cycle vs ~40 cycles). This, combined with the low-latency vector pipeline, allows you to use vector instructions pretty much everywhere. On the XCPU you had to separate your vector code into long unrolled loops and be extra careful that all instructions touching the data inside the loop were vector instructions (or pay the heavy LHS stall costs). That pretty much limited vector instructions to special cases.

Microsoft had a good presentation about AVX2/FMA integration in Visual Studio. On their first try, FMA support reduced average performance, because FMA has a higher latency than mul. For example, if you do two adds and a mul based on the two add results, the total latency is add + mul (as the two adds will execute simultaneously). If you replace this with add + FMA, the latency will be add + FMA (since the FMA requires the add result before it can start, they can't execute simultaneously). This is a general problem for instructions that require more inputs: the more inputs you need, the harder it is to execute other instructions concurrently.
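A concrete illustration of that dependency-chain argument (not the exact case from the presentation; `std::fma` stands in for a compiler-generated hardware FMA, and the timings are illustrative):

```cpp
#include <cmath>

// Goal in both cases: r = a*b + (c + d).

// Without FMA: the mul and the add are independent, so they execute
// concurrently; the critical path is max(mul, add) + add.
double mul_then_add(double a, double b, double c, double d) {
    double t0 = a * b;
    double t1 = c + d;        // runs in parallel with the mul
    return t0 + t1;
}

// With FMA: fma(a, b, t1) cannot start until the add produces t1, so the
// critical path is add + fma. That is longer when FMA latency exceeds mul
// latency, which is why naive FMA substitution initially hurt performance.
double add_then_fma(double a, double b, double c, double d) {
    double t1 = c + d;
    return std::fma(a, b, t1);
}
```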
 

RussianSensation

Elite Member
Sep 5, 2003
As opposed to the great innovations of last gen, like bad textures from day one until EOL, short draw and AI distances, and AI far dumber than what we had in PC games made for Pentium III class CPUs? By and large, publishers making AAA games don't care about gameplay, and they try to justify not being able to read the market and plan with excuses about the hardware (900p being a great example of failure to plan, and/or of no consideration for IQ); but those are some big frilly formerly-owned-by-Sir-Elton-John rose-colored glasses they've got there.

Ya, that article is a total disappointment. The author hardly explains what next-generation gameplay innovation is or should be like, and cites one example, BF4 having 64-player MP, as "innovation." Having 64 players in MP is not innovation; it just makes the game feel bigger. Has he not played Braid, World of Goo or Limbo? You don't need a 5960X and 980 Quad-SLI to make innovative games. Besides, some of the best games in the world are those which stick to old-school fundamentals. Would we want the next Zelda, Mario Kart or Uncharted game to be way different? Probably not, because fans of those series exist precisely because they love that type of gameplay. Why drastically change something that works well? IMO, things like health regeneration, save-anywhere, QTEs and too many cut scenes have made modern PC/console gaming worse than the gaming of the 80s, 90s and early 2000s. It's now almost impossible to die in a game, and there is hardly any challenge left in SP games.

What about AI and physics? Why have so few games surpassed Crysis 1's physics/destruction and FEAR's AI despite nearly a decade passing since those games were made? Hardware is far more powerful, but physics and AI haven't moved much. It probably costs a lot more money to advance AI and physics, and those aspects are far harder to portray and sell in ads/trailers, while you can create massive hype with "next-gen" graphics. :D

His article would be a lot stronger if he noted that despite the new consoles being much more affordable and more popular than the Xbox 360/PS3, we haven't seen many new IPs on either console. Developers are not taking many risks this generation, despite the early start of the PS4/XB1 being the most successful of all time (i.e., outselling the PS3/360 by nearly 2:1 so far).

Based on this rough translation of a report at Expreview:

"AMD executives have repeatedly confirmed in interviews that they have two custom processor orders, and industry analysis says one of them is the processor for a new Nintendo console, implying that a new-generation Nintendo machine exists. AMD has said the new console orders will start generating revenue in 2016; taking into account that the processor has to be produced first, this means Nintendo's console will probably be available later in 2016, almost four years after the Wii U's release at the end of November 2012."
http://www.expreview.com/37883.html
 

Blitzvogel

Platinum Member
Oct 17, 2010
I wonder if AMD's Zen core could be to them what the Core M is for Intel. It seems to me that AMD wants to merge its x86 CPU architectures into one line that can be deployed across many platforms, including in an APU for Nintendo. Hopefully it's at least PS4 class, if not well beyond.
 

Headfoot

Diamond Member
Feb 28, 2008
No it would not, because games have to run under an OS which uses processing cycles, and multi-core is perfect for this. It also depends on the game code. There are a lot of background tasks going on, especially with some games; we are not talking about an unconnected box here.

Wrong.

It's trivially easy to serialize parallel code; in fact, your operating system does this constantly, and every operating system since the first one ever written has done so. Parallelizing serial code is difficult and non-trivial.

An incredibly fast single core (e.g. a 15GHz Haswell) would dominate multi-core in any task where Amdahl's law controls (i.e. anything not embarrassingly parallel). Of course, you can't make a single core that fast, or they would have.
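For reference, Amdahl's law makes that bound precise: with a parallel fraction $p$ of the work and $n$ cores,

```latex
S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

As a hypothetical illustration, a workload that is 90% parallelizable gets S(8) = 1/(0.1 + 0.9/8) ≈ 4.7 on eight cores and can never exceed 10x no matter how many cores you add, so a single core ten times faster would beat any core count.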
 

Vesku

Diamond Member
Aug 25, 2005
It depends how much you do in the game loop. If you can split things out so that all the operations (rendering, sound, physics, AI, etc.) happen independently, there's so little left in the main loop that one thread is fine. Most of those other operations are parallelizable, but coding that is non-trivial, so the usual solution is to buy a third-party toolkit that sets it all up for you: hence you use Unreal Engine to render, PhysX for physics, and so on. Even then it adds significant complexity; more bugs mean a longer dev cycle, slower time to market, etc. Often it's better not to bother and to limit the game (worse graphics, physics, AI) so it can release in a reasonable time instead.

It is that need for everything to run within that game "world", a serial process in which the threads can potentially interact with each other or with a shared resource, that makes the complexity go up drastically as you add more threads. The tasks that are 'embarrassingly' parallel are those that can be split into walled-off chunks with essentially zero concern for anything being calculated in the other chunks.
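A sketch of such a walled-off split (a hypothetical illustration): each thread owns a disjoint slice of the data and never looks at anyone else's chunk, so no locks or ordering are needed.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Brighten an image in parallel: an 'embarrassingly parallel' task, because
// pixel i never depends on pixel j.
void brighten(std::vector<float>& pixels, unsigned workers) {
    std::vector<std::thread> pool;
    const std::size_t chunk = pixels.size() / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end   = (w + 1 == workers) ? pixels.size() : begin + chunk;
        pool.emplace_back([&pixels, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                pixels[i] *= 1.1f;          // touches only this thread's slice
        });
    }
    for (auto& t : pool) t.join();
}

int main() {
    std::vector<float> image(1 << 20, 0.5f);
    brighten(image, 4);
}
```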
 

Dribble

Platinum Member
Aug 9, 2005
It is that need for everything to run within that game "world", a serial process in which the threads can potentially interact with each other or with a shared resource, that makes the complexity go up drastically as you add more threads. The tasks that are 'embarrassingly' parallel are those that can be split into walled-off chunks with essentially zero concern for anything being calculated in the other chunks.

The world is not a thread, it's data: objects. Those objects use locking so they can work in a multi-threaded environment, and processing on those objects can be done in any thread. The actual main-loop thread (i.e. processing, not data) that syncs everything together can be pretty lightweight.
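A minimal sketch of that arrangement (a hypothetical illustration): the world is data with per-object locks, workers on any thread can process the objects, and the main loop only coordinates.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// World = data. Each object carries its own lock so systems on any thread
// can mutate it safely; the main loop just dispatches and joins.
struct Entity {
    std::mutex m;
    float x = 0.0f, vx = 1.0f;

    void move(float dt) { std::lock_guard<std::mutex> l(m); x += vx * dt; }
    void drag(float k)  { std::lock_guard<std::mutex> l(m); vx *= k; }
};

int main() {
    std::vector<Entity> world(1000);

    for (int tick = 0; tick < 60; ++tick) {           // lightweight sync loop
        // Two systems touch the same objects concurrently; the per-object
        // locks keep each update atomic without serializing whole systems.
        std::thread physics([&] { for (auto& e : world) e.move(1.0f / 60.0f); });
        std::thread damping([&] { for (auto& e : world) e.drag(0.99f); });
        physics.join();
        damping.join();
    }
}
```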
 

MisterLilBig

Senior member
Apr 15, 2014
Parallel programming is nothing new in the gaming world; the PS3 and X360 had many threads and many cores too. It has been done, it is being done, and it will continue to be done.
If you google "parallel game loops", the first result is an Intel blog.
In any case, I think they should have gone with 12 cores.

(There's a nice talk on YouTube by Mike Acton, Engine Director at Insomniac Games, about data-oriented design and C++, from CppCon 2014.)



There's too much news of no news about Nintendo right now; I am sure Miyamoto and Nintendo have been "working" on new hardware designs since the moment the Wii U was done.
 

ShintaiDK

Lifer
Apr 22, 2012
Parallel programming is nothing new in the gaming world; the PS3 and X360 had many threads and many cores too. It has been done, it is being done, and it will continue to be done.
If you google "parallel game loops", the first result is an Intel blog.
In any case, I think they should have gone with 12 cores.

(There's a nice talk on YouTube by Mike Acton, Engine Director at Insomniac Games, about data-oriented design and C++, from CppCon 2014.)



There's too much news of no news about Nintendo right now; I am sure Miyamoto and Nintendo have been "working" on new hardware designs since the moment the Wii U was done.

So after 10 years of multicore coding (Xbox 360: 6 threads, PS3: 8 threads), how do you think it's going? Not to mention all the current console releases.

We got one game that is relatively well multithreaded, and that's BF4. But it's far from scaling well, and it largely lives on the heavy multiplayer part.
 

itsmydamnation

Diamond Member
Feb 6, 2011
So after 10 years of multicore coding (Xbox 360: 6 threads, PS3: 8 threads), how do you think it's going? Not to mention all the current console releases.

We got one game that is relatively well multithreaded, and that's BF4. But it's far from scaling well, and it largely lives on the heavy multiplayer part.

And yet we are talking about "game devices", and they seem to have no problems...

So what's at fault on the PC side? ;)
 

NTMBK

Lifer
Nov 14, 2011
So after 10 years of multicore coding (Xbox 360: 6 threads, PS3: 8 threads), how do you think it's going? Not to mention all the current console releases.

We got one game that is relatively well multithreaded, and that's BF4. But it's far from scaling well, and it largely lives on the heavy multiplayer part.

The PS3 had 1 CPU core plus a bunch of obtuse and awkward-to-utilise coprocessors.
 

ShintaiDK

Lifer
Apr 22, 2012
And yet we are talking about "game devices", and they seem to have no problems...

So what's at fault on the PC side? ;)

Considering how low the bar is set, they had better not have problems. Yet they have plenty of problems even reaching that low bar.
 

SlowSpyder

Lifer
Jan 12, 2005
So after 10 years of multicore coding (Xbox 360: 6 threads, PS3: 8 threads), how do you think it's going? Not to mention all the current console releases.

We got one game that is relatively well multithreaded, and that's BF4. But it's far from scaling well, and it largely lives on the heavy multiplayer part.


Were any games ever actually built to take advantage of six threads on the Xbox 360? I am not overly familiar with the Xbox 360's CPU architecture, but from what I've read, the CPU was capable of running multiple threads per core but did it very poorly due to hardware limitations. It seems like it would have been a bad choice of hardware on which to really start building truly multithreaded games. With the current consoles, multithreaded game engines will become much more mature and polished.
 

ShintaiDK

Lifer
Apr 22, 2012
Were any games ever actually built to take advantage of six threads on the Xbox 360? I am not overly familiar with the Xbox 360's CPU architecture, but from what I've read, the CPU was capable of running multiple threads per core but did it very poorly due to hardware limitations. It seems like it would have been a bad choice of hardware on which to really start building truly multithreaded games. With the current consoles, multithreaded game engines will become much more mature and polished.

If they failed to code for the SMT triple core, why do you think it has somehow radically changed now just because of some new cores?

Do any of the current console ports spark your faith?
 

SlowSpyder

Lifer
Jan 12, 2005
If they failed to code for the SMT triple core, why do you think it has somehow radically changed now just because of some new cores?

Do any of the current console ports spark your faith?


For the reason I said: from what I understand, the Xbox 360's ability to execute multiple threads on a core was severely lacking, even if it was technically able to do so.

I'm not familiar enough with the game engines to answer your question. But I don't consider it 'faith' to expect games to become more and more multithreaded as time goes on, especially given the hardware in the consoles. To take something on faith would be to accept it without really thinking about it, regardless of evidence, because that is how you want it to be and what you want to believe. I would say evidence and history point toward a more multithreaded future. I don't see games two or three years from now running on one or two 1.6-1.75GHz Jaguar cores as the norm for a 'next-gen' experience, do you?
 

ShintaiDK

Lifer
Apr 22, 2012
For the reason I said: from what I understand, the Xbox 360's ability to execute multiple threads on a core was severely lacking, even if it was technically able to do so.

I'm not familiar enough with the game engines to answer your question. But I don't consider it 'faith' to expect games to become more and more multithreaded as time goes on, especially given the hardware in the consoles. To take something on faith would be to accept it without really thinking about it, regardless of evidence, because that is how you want it to be and what you want to believe. I would say evidence and history point toward a more multithreaded future. I don't see games two or three years from now running on one or two 1.6-1.75GHz Jaguar cores as the norm for a 'next-gen' experience, do you?

But the Xbox 360 still had 3 cores.

Some code just can't be parallelized in a meaningful way. It's Mitosis all over again if you try.

I honestly expect exactly the same situation we have today with console ports in 2-3 years. The issue is scaling: it doesn't help much to have twice the cores if the code can't scale in any meaningful way.

Just look at BF4, the multithreading marvel.
[Chart: Battlefield 4 Naval Strike CPU benchmark from gamegpu.ru]

The 8350 is 25% faster than the 4300. The 4770K is 36% faster than the 4330.

Even if we assume core-based scaling, a PS4 would only have to run at 2.2GHz with half the cores to deliver the same.
 

monstercameron

Diamond Member
Feb 12, 2013
But the Xbox 360 still had 3 cores.

Some code just can't be parallelized in a meaningful way. It's Mitosis all over again if you try.

I honestly expect exactly the same situation we have today with console ports in 2-3 years. The issue is scaling: it doesn't help much to have twice the cores if the code can't scale in any meaningful way.

Just look at BF4, the multithreading marvel.
[Chart: Battlefield 4 Naval Strike CPU benchmark from gamegpu.ru]

The 8350 is 25% faster than the 4300. The 4770K is 36% faster than the 4330.

Even if we assume core-based scaling, a PS4 would only have to run at 2.2GHz with half the cores to deliver the same.


Aside from absolute CPU performance, the console APUs do have a major advantage in lower CPU-GPU I/O latency.
 

RussianSensation

Elite Member
Sep 5, 2003
^There are plenty of games where a quad-core is significantly faster than a dual-core/i3, despite your implying that BF4 is really the only game that scales well with quads.

Try playing Crysis 3 or Ryse: Son of Rome on a G3258 at 4.5GHz. Going forward, dual cores are dead. If developers never try to learn how to write multi-threaded code, we'll eventually hit a brick wall. We need multi-threaded CPUs and games for advancements in AI/graphics to take place.

We already have enough modern titles showing that a dual-core or an i3 severely bottlenecks modern GPUs. While scaling is nowhere near commensurate with the increase in the number of threads/cores, every little bit helps.

[Charts: CPU scaling benchmarks from gamegpu.ru for Watch Dogs: Bad Blood, Kingdom Come: Deliverance (alpha), Dragon Age: Inquisition, and Ryse: Son of Rome]


Since the PS3 was really a single-core CPU with 7 SPEs, it was hard for developers to target the 3 cores of the Xbox 360 when most games were made for a combination of the PS3 and 360. Now that we have a new generation of consoles with 6 available cores, we should start to see improved multi-core CPU scaling, but it will also depend on games adopting newer/more advanced engines that take advantage of multi-threading.
 

SlowSpyder

Lifer
Jan 12, 2005
But the Xbox 360 still had 3 cores.

Some code just can't be parallelized in a meaningful way. It's Mitosis all over again if you try.

I honestly expect exactly the same situation we have today with console ports in 2-3 years. The issue is scaling: it doesn't help much to have twice the cores if the code can't scale in any meaningful way.

Just look at BF4, the multithreading marvel.
[Chart: Battlefield 4 Naval Strike CPU benchmark from gamegpu.ru]

The 8350 is 25% faster than the 4300. The 4770K is 36% faster than the 4330.

Even if we assume core-based scaling, a PS4 would only have to run at 2.2GHz with half the cores to deliver the same.


BF4 is a step in that direction, not the end-game pinnacle of multithreaded engines. I think this generation is unique, much different from the PS3/Xbox 360 generation; those were really odd hardware implementations if creating multithreaded games was the goal. With today's consoles sharing the same number of cores and the same CPU architecture, I think things will happen at a quicker pace.