Ashes of the Singularity User Benchmarks Thread


Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
Latency stalls are no problem for a GPU, as there are thousands or tens of thousands of threads in flight, so if one stalls, another will simply take its place, the workload being "embarrassingly parallel."

Latency is a bigger problem for CPUs with their more serial workloads.

What you are describing is the domain of the GCN architecture. Maxwell has an in-order pipeline with in-order execution.

GCN has an in-order pipeline with out-of-order execution. With GCN you will not get held up by a stalled process, but that is exactly the case with Maxwell GPUs. That is why Oxide did not implement asynchronous compute code for Maxwell GPUs in their game.
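
For anyone wondering what "async compute" actually looks like from the API side, here's a very rough D3D12 sketch (illustration only, not Oxide's code; the function name and structure are mine). The app simply creates a compute queue alongside the normal graphics queue; whether work on the two queues actually overlaps is up to the GPU and driver, which is exactly where the Maxwell vs GCN argument comes in.

Code:
// Rough sketch, not Oxide's code: a D3D12 app exposes async compute by
// submitting work on a separate compute queue next to the graphics queue.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Assumes 'device' was already created with D3D12CreateDevice().
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;       // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    // Command lists submitted to computeQueue *may* run concurrently with
    // graphics on gfxQueue; whether they really do is up to the GPU/driver.
}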
 

casiofx

Senior member
Mar 24, 2015
369
36
61
AMD screwed Fury X owners with the Fury, and screwed everyone with the Fury Nano's price.

Now you're being a complete troll, blaming Nvidia because AIB partners use better components. So why are the 290X aftermarket coolers better?

Gimp Kepler? Stop using Reddit/4chan for sources.

[Chart: performance per dollar]


If $20 is atrocious, there is nothing we can do to help you.

The Fury X "overclocker's dream" is just a dream. That's perhaps the most hilarious stunt of 2015.

The media is covering this worldwide? You mean shills quoting AMD's PR? Looks like Nvidia should stop sending review samples to them? No? :whiste:

Hypocritical.
Double standards.
Rekt.



You and all your AMD kind said who needs GPU PhysX if it can be done on the CPU.

Hypocritical.
Double standards.
Rekt.
Is this sarcasm?

I can't take your statement seriously.

The Fury X has a higher shader count, 4096 vs 3584. For $100 more, the extra ~14% shader count and AIO cooling with a manufacturer's warranty is worth it to some buyers.

Nvidia charged $350 more for the Titan X with just 9.1% more CUDA cores and lousier cooling.
 

antihelten

Golden Member
Feb 2, 2012
1,764
274
126
You must not know much about CPUs. Modern CPUs have an entire arsenal of latency-reducing technologies on hand: branch predictors, on-die memory controllers, multi-level cache hierarchies, massive caches (up to 20 MB now), SMT, registers, and God knows what else.

Like I said, latency is a bigger deal for CPUs because of their workload, so engineers have come up with all kinds of ways to reduce it as much as possible over the years.

And you must not know much about computers.

You do understand that a CPU and a GPU are two separate entities that do not sit on the same die (except for iGPUs), and as such communication between them (as you would have with software-based scheduling) has to pass through the PCIe bus, with all the latency penalties that this might incur?
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
You and all your AMD kind said who needs GPU PhysX if it can be done on the CPU.

Hypocritical.
Double standards.
Rekt.

Good luck finding where I said anything like that. Besides, GPU PhysX is proprietary special effects. Too bad nVidia got a hold of it. There was potential.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
What you are describing is the domain of the GCN architecture. Maxwell has an in-order pipeline with in-order execution.

GCN has an in-order pipeline with out-of-order execution. With GCN you will not get held up by a stalled process, but that is exactly the case with Maxwell GPUs. That is why Oxide did not implement asynchronous compute code for Maxwell GPUs in their game.

This was already covered in the thread several pages back.

All modern GPUs are in-order and do not use OoO execution for processing. AMD's ACEs use OoO to check for errors on completed tasks, from what I have been able to discern.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
Simple answer: no.

You are wrong. The ACEs are not only for correction, but also for scheduling the work within the pipeline.
And yes, GCN is an OoO execution architecture. It cannot switch context in the pipeline (yet), but it can switch context WITHIN the pipeline. Because Maxwell has 1 ACE it cannot do this: graphics has to end before compute can be done.

Nvidia designed a really simple architecture to program for. GCN starts to fly IF you understand what you are doing and what you have in the hardware, and program for it without any layer of abstraction.
 

Keysplayr

Elite Member
Jan 16, 2003
21,219
55
91
You still didn't answer my question about market share so now would be a good time to repeat it I guess. :)



Do you have a breakdown of Nvidia's sales by graphics card? I'm sure that most sales are slow pre-Maxwell cards that are sold in OEM machines from Dell and HP but still count as discrete cards. So even if Nvidia has 80% discrete market share, that number means nothing if 80% of the cards they sell are not faster than APUs.

Market share is market share, buddy. I don't care what is faster than what, and neither should anyone else.
You're asking me for a breakdown, then you say "I'm sure that most sales are slow pre-Maxwell cards". Where is your breakdown? Hehe.

And APUs are not dGPUs, so why even bring them up? Shrugs.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
And you must not know much about computers.

You do understand that a CPU and a GPU are two separate entities that do not sit on the same die (except for iGPUs), and as such communication between them (as you would have with software-based scheduling) has to pass through the PCIe bus, with all the latency penalties that this might incur?

Which is why CPUs and GPUs have useful things called registers, cache, RAM and VRAM that hold data to dramatically reduce the latency of intercommunication.

It's funny that you seem to think that hardware schedulers are so critical that performance would just nosedive without them. But what have we been using all this time for compute shaders in DX11?

Hint, it starts with a C :whiste:
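
For comparison, compute in DX11 looks roughly like this (illustrative snippet only, the names are mine): everything funnels through the single immediate context, so it's the driver on the CPU that decides how compute work is ordered relative to graphics.

Code:
// Illustrative D3D11 snippet (not from any particular game): compute work
// is recorded on the one immediate context, so the driver/CPU handles the
// scheduling rather than extra hardware queues.
#include <d3d11.h>

void RunComputePass(ID3D11DeviceContext* ctx,        // the immediate context
                    ID3D11ComputeShader* shader,
                    ID3D11UnorderedAccessView* outputUAV,
                    UINT groupsX, UINT groupsY)
{
    ctx->CSSetShader(shader, nullptr, 0);
    ctx->CSSetUnorderedAccessViews(0, 1, &outputUAV, nullptr);
    ctx->Dispatch(groupsX, groupsY, 1);               // executes in submission order,
                                                      // behind any prior graphics work
}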
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Simple answer: no.

You are wrong. The ACEs are not only for correction, but also for scheduling the work within the pipeline.

Way to completely misinterpret what I said. I never said that ACEs are only for correction. I said they only use OoO for correction.

And this comes straight from AMD themselves.

And yes, GCN is an OoO execution architecture. It cannot switch context in the pipeline (yet), but it can switch context WITHIN the pipeline. Because Maxwell has 1 ACE it cannot do this: graphics has to end before compute can be done.

Nvidia designed a really simple architecture to program for. GCN starts to fly IF you understand what you are doing and what you have in the hardware, and program for it without any layer of abstraction.

If you want to believe GCN is an OoO architecture, then go ahead.. Makes no difference to me.. :colbert:
 
Feb 19, 2009
10,457
10
76
GCN is a mix of in-order and out-of-order. The CP (graphics or compute; it requires a context switch, but AMD labels this a "fast context switch", most likely due to a hardware implementation vs software) is in-order; the ACEs are out-of-order, but only for compute tasks.

AMD's video actually explained that a while ago, as does their programming documentation.
 

antihelten

Golden Member
Feb 2, 2012
1,764
274
126
Which is why CPUs and GPUs have useful things called registers, cache, RAM and VRAM that hold data to dramatically reduce the latency of intercommunication.

It's funny that you seem to think that hardware schedulers are so critical that performance would just nosedive without them. But what have we been using all this time for compute shaders in DX11?

Hint, it starts with a C :whiste:

None of the things you mentioned reduce the communication latency between the CPU and GPU in the slightest; they only affect intra-CPU and intra-GPU communication, not communication between them. At most you can try to use them to reduce the amount of communication, but that only gets you so far (since some communication cannot be avoided).

Sigh. I never said performance would nosedive without hardware schedulers; you might want to pay attention.

I'm merely contending that without hardware schedulers, chances are that Nvidia will not see the performance improvements that can be attained with async compute.
 
Last edited:

.vodka

Golden Member
Dec 5, 2014
1,203
1,538
136
Interestingly, and as a side note, ATI/AMD did software scheduling on TeraScale 1-3 (HD 2000-6000) and went for a hardware implementation on GCN. nV did hardware scheduling up to Fermi; Kepler and later use a software solution like AMD's old VLIW GPUs.

Terascale vs Tesla/Fermi was AMD's trophy (by a big difference, too) on perf/W and perf/mm² throughout all these generations, and the positions inverted later. Yes, GCN ended up matching/beating Kepler and competing well enough with Maxwell through driver work, but it wasn't like that at first on each new generation. Power efficiency is still nV's.



There is clearly a tradeoff, although I'm not convinced it matters too much in this case... a software scheduler could do all the tricks GCN's hardware scheduler does at the cost of CPU time, yes, yet after all it's just software, and that can change or be improved. A driver update could solve nV's deficiency on the matter.

Now, if there's something inherently wrong with the hardware that doesn't lend itself to a software patch, well... game over for AC until Pascal.



If only nV could make a statement so we can stop speculating in the dark... It's nice to see their seemingly flawless PR machine has found something they can't cope with as well as they usually do: be it the 3.5GB fiasco, bumpgate, the driver that turned off fans, mobile overclocking as a bug (later a feature, then a bug again), etc. These were all handled more gracefully IIRC. This isn't nV's usual style, it's concerning.

It's also nice to see AMD taking advantage of the situation and revealing their chess moves leading up to what seems to be a nice 2015-2016 for them (until Pascal is released). Maybe they aren't dead yet, maybe there's still some competent people left in there too.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
If only nV could make a statement so we can stop speculating in the dark... It's nice to see their seemingly flawless PR machine has found something they can't cope with as well as they usually do: be it the 3.5GB fiasco, bumpgate, the driver that turned off fans, mobile overclocking as a bug (later a feature, then a bug again), etc. These were all handled more gracefully IIRC. This isn't nV's usual style, it's concerning.
Well, they made this mistake right at the beginning when they stated that it was (or that there was?) a bug on Oxide's side, so of course now they are more careful about statements.

It could also be that they are working hard on a driver and are close to a solution, so instead of making a statement that nobody will believe and everybody will call PR bull, their answer will be more "in your face": running the bench faster than the fastest GCN card (or very close to it).
 

dogen1

Senior member
Oct 14, 2014
739
40
91
It could also be that they are working hard on a driver and are close to a solution, so instead of making a statement that nobody will believe and everybody will call PR bull, their answer will be more "in your face": running the bench faster than the fastest GCN card (or very close to it).

They already do though.
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
NVidia doesn't need to be defended.

That's all you've been doing wrt AC for the last week, isn't it?

NVidia designs GPUs that perform well for the here and now, not 4 or 5 years down the road. That tactic has cost AMD a lot of market share.

Ya, so more evidence that you think GPUs should focus on maximizing short-term performance and not worry about any next-generation games or next-generation VRAM requirements. Then why do you care so much about how Maxwell will perform in DX12, or about its ACE functionality? It contradicts your statement that you don't think GPUs should be forward-thinking in their design.

Sure, the ACEs are now going to be very useful, but how long have they been sitting there wasted and taking up die space? Apparently for years..

This was covered years ago -- AMD cannot afford to spend billions of dollars redesigning brand-new GPU architectures the way NV can, given AMD's financial position and the fact that their R&D also has to finance CPUs and APUs. NV can funnel 90%+ of their R&D into graphics ONLY.

Therefore, AMD needed to design a GPU architecture that was flexible and forward-looking when they were replacing VLIW. That's why GCN was designed from the start to be that way. When the HD7970 launched, all of that was covered in great detail. Back then, I still remember you had GTX580 SLI and you upgraded to GTX770 4GB SLI. In the same period, HD7970 CF destroyed the 580s and kept up with the 770s, but NV had to spend a lot of $ on Kepler. Then NV moved to Maxwell and you got 970 SLI and then 980 SLI, but AMD simply enlarged the HD7970 with key changes into the R9 290X. Right now the Fury X is more or less just an enlarged HD7970, and in 5 years NV has already gone through 3 separate architectures just to stay ahead. If AMD had almost no debt, were primarily focused on graphics, had a lot more cash in the bank, and could design new GPU architectures every 2 years like NV, I am sure they would. I don't think you are looking at it from a realistic point of view.

But that's why I keep asking: why do you in particular care about DX12 and AC? It's not as if you'll buy an AMD GPU, and it's not as if you won't upgrade to 8GB+ HBM2 Pascal cards when they are out. Therefore, for you specifically, I am not seeing how it even matters, and yet you seem to have a lot of interest in defending Maxwell's AC implementation, much like to this day you defend Fermi's and Kepler's poor DirectCompute performance. That's why it somewhat comes off like PR damage control for NV or something along those lines. Since you will have upgraded your 980s to Pascal anyway, who cares if the 980 hypothetically loses to a 390X/Fury in DX12? It doesn't matter to you.

AMD's long term strategy was brilliant in many ways, but it cost them dearly as well.

Even when AMD had the HD4000-7000 series and massive leads in nearly every metric vs. NV, AMD's GPU division was hardly gaining market share, and in the rare cases where they did gain share (the HD5850/5870 six-month period), it was a long-term loss-leader strategy with low prices, and frankly by the end of the Fermi generation NV had regained market share. In other words, NONE of AMD's previous price/performance strategies worked to make $. Having 50-60% market share and making $0 or losing $ is akin to having 50-60% of "empty market share." In business terms, that's basically worthless market share. It's like Android having almost 90% market share worldwide while Apple makes 90% of the profits.

Agreed, although I disagree with you about GW. You severely overestimate the impact of GW on games. Time and time again reality has shown us that it simply does not matter.

No, time and time again when checking many reviews, it has been shown that the most poorly optimized and broken PC games released in the last 2 years have been GW titles.

HardOCP recently tested the Witcher 3 and the Radeons performed very well. The performance penalty for enabling hairworks was very close between them even..

That's BS, especially now that HardOCP has shown its true face. We have known for a fact that HairWorks has a bigger impact on AMD's cards, for three reasons:

1) AMD implemented a driver optimization to limit the tessellation factor, since the performance hit was much greater on AMD hardware, which cannot handle excessive tessellation factors;

2) Actual user experience. I trust that far more than any review HardOCP does.

3) Third-party reviews from sites other than HardOCP. Why do you think so many sites turned off HairWorks in TW3? If the performance hit were the same, it wouldn't be unfair to test with HW on. The reason some sites turned it off is that it unfairly penalized AMD's cards: NV used excessive tessellation, knowing full well that AMD's cards suffer at high tessellation factors.

Good thing there are objective professional sites we can rely upon to tell us the truth:

[Chart: The Witcher 3 HairWorks on/off benchmark]


"With HairWorks disabled the minimum frame rate of the R9 290 is 3x greater, while the GTX 780 saw a 2.5x increase and the GTX 980 a 2.2x increase. The average frame rate of the GTX 980 was boosted by 20fps, that's 36% more performance. The GTX 780 saw a 36% increase in average frame rate which was much needed going from just 36fps at 1080p to a much smoother 49fps. The R9 290X enjoyed a massive 75% performance jump with HairWorks disabled, making it 17% slower than the GTX 980 -- it all started to make sense then. We should reiterate that besides disabling HairWorks, all other settings were left at the Ultra preset. The issue with HairWorks is the minimum frame rate dips."

TressFX seems far more efficient than HairWorks as well (or alternatively it doesn't use worthless tessellation factors to kill performance).

[Chart: TressFX on/off benchmark]


"With TressFX disabled the R9 290's minimum frame is 1.5x greater while the GTX 780 and GTX 980 saw a 1.6x increase -- roughly half the impact we saw HairWorks have in The Witcher 3. We realize you can't directly compare the two but it's interesting nonetheless. The average frame rate of the R9 290X was 49% faster with TressFX disabled, that is certainly a significant performance gain, but not quite the 75% gap we saw when disabling HairWorks on The Witcher."

http://www.techspot.com/review/1006-the-witcher-3-benchmarks/page6.html

You and all your AMD kind said who needs GPU PhysX if it can be done on the CPU.

Considering I've owned many AMD/ATI/NV cards, currently have 2 rigs with both AMD and NV cards, and keep recommending good NV cards to gamers, you trying to label me as "your AMD kind" is just another baseless post like the ones I hear every week. The difference with NV loyalists is that they won't recommend an AMD card no matter what. Like you pointing out worthless benchmarks of the "only $20 more expensive" EVGA 950 card while ignoring that for barely more the R9 290 stomps all over it, as does the R9 280X. The 950 is an overpriced turd no matter how much you defend it. The 960 with a free MGS V game makes the 950 irrelevant even without touching the R9 200/300 cards. Funny how you didn't even mention the 950 being dead on arrival due to its 2GB of VRAM. Nice to know you have no problem defending 2GB cards and think gamers should flush $150-160 down the toilet, though.

------------------

And yes, I am all for physics being done on the CPU if it cannot be done in a brand-agnostic way. If NV released stand-alone AGEIA PhysX cards that I could buy for $100-150 and they worked even if I had an AMD/NV/Intel GPU in my rig, I'd consider buying one, provided games looked way better with PhysX on. But instead, NV locked this feature down and thus wiped out any advancements and the long-term potential AGEIA had.

======

Anyway, all of this is getting off-topic.
 
Last edited:

Shivansps

Diamond Member
Sep 11, 2013
3,918
1,570
136
Mahigan at overclock.net keeps ignoring me when I mention the possibility of using DX12's asymmetric multi-GPU capability for something as simple as sending the compute tasks to a secondary DX12 device. I find that very, very suspicious.
 
Last edited:

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
Mahigan at overclock.net keeps ignoring me when I mention the possibility of using DX12 multi-adapter capability for something as simple as sending the compute tasks to a secondary DX12 device. I find that very, very suspicious.

How would that work? Do you mean having a slave NV/AMD card handle AC functions for your primary card, a la a PhysX slave card? That would actually be pretty cool. So, hypothetically, running a 980Ti with a 750/750Ti/950/960 as a slave AC card? Even if that were possible, though, I think at that point it's better to sell the 980Ti and get Pascal. Also, this solution wouldn't work for gamers on a budget buying $150-300 cards, who wouldn't have additional funds for a second adapter.
 

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
Mahigan at overclock.net keeps ignoring me when I mention the possibility of using DX12's asymmetric multi-GPU capability for something as simple as sending the compute tasks to a secondary DX12 device. I find that very, very suspicious.
Because your question doesn't really make sense.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
How would that work? Do you mean having a slave NV/AMD card handle AC functions for your primary card, a la a PhysX slave card? That would actually be pretty cool. So, hypothetically, running a 980Ti with a 750/750Ti/950/960 as a slave AC card? Even if that were possible, though, I think at that point it's better to sell the 980Ti and get Pascal. Also, this solution wouldn't work for gamers on a budget buying $150-300 cards, who wouldn't have additional funds for a second adapter.

It already works, and Oxide already uses it. Dan Baker talks about it at 2:30 in the video; at 3:55 he shows the screen and you can see the different objects that are rendered on the different cards.
https://www.youtube.com/watch?feature=player_detailpage&v=9cvmDjVYSNk#t=152
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
None of the things you mentioned reduce the communication latency between the CPU and GPU in the slightest; they only affect intra-CPU and intra-GPU communication, not communication between them.

Not true. With larger caches, VRAM, registers, etc., CPUs and GPUs can hold more data locally, which reduces the latency penalty dramatically.

Let me give you an example. VRAM stores only graphics-related data, like textures, shaders, meshes, etc. During startup, the game loads some of its data into VRAM from system memory and storage. And when you load a saved game or start a new game, more data is loaded into VRAM during the loading screen while shaders are compiled.

So basically, a well-programmed game (or any software) is already loading its working set into memory for fast access, rather than pulling it from storage when it needs it.

And during gameplay, there is a lot of shuffling of data between VRAM and system RAM, and between the CPU and GPU. Having access to larger memory pools reduces this shuffling, and thus minimizes the access latency that can cause stuttering or frame drops.

Yet with all of this shuffling of data, we are still able to get very high performance in PC games compared to consoles like the PS4, which has much lower theoretical latency due to the CPU and GPU being integrated on the same die and having hUMA.

Even low-end gaming PCs are outperforming the consoles despite being handicapped by an API with much higher overhead and by a NUMA-style split memory architecture.
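
To illustrate the "keep the working set in VRAM" point, here's a rough D3D12 sketch (my own example, not from any engine): allocate a buffer in the default heap, i.e. GPU-local memory, at load time so draw-time reads don't have to go back over PCIe. The actual upload from system RAM still needs an upload heap and a copy, which I've left out.

Code:
// Rough sketch (my own example): allocate a buffer in GPU-local memory
// (default heap = VRAM on a discrete card) so per-frame work reads local
// memory instead of crossing PCIe. Filling it still requires an upload
// heap + copy, omitted here.
#include <windows.h>
#include <d3d12.h>

HRESULT CreateVramBuffer(ID3D12Device* device, UINT64 sizeInBytes,
                         ID3D12Resource** outBuffer)
{
    D3D12_HEAP_PROPERTIES heapProps = {};
    heapProps.Type = D3D12_HEAP_TYPE_DEFAULT;          // GPU-local memory

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = sizeInBytes;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.Format           = DXGI_FORMAT_UNKNOWN;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    return device->CreateCommittedResource(
        &heapProps, D3D12_HEAP_FLAG_NONE, &desc,
        D3D12_RESOURCE_STATE_COMMON, nullptr,
        IID_PPV_ARGS(outBuffer));
}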
 
Feb 19, 2009
10,457
10
76
It's possible with multi-adapter and an iGPU with hardware support for async compute; in theory they can do almost anything with that feature of DX12. They can break the screen up into quadrants and send one of them to the iGPU, or they can send compute work only.

If the iGPU is powerful enough, the extra latency should be minimal.

This feature is one of the reasons I am excited about future gaming on DX12. I hate the thought of my iGPU doing jack all while gaming, so if it gets put to use, that's great.
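
In case anyone wants to see what the multi-adapter side of that looks like, here's a very rough sketch (my own, not Oxide's code): enumerate a second adapter (e.g. the iGPU), create a D3D12 device on it, and give it its own compute queue. Sharing results back to the main card (cross-adapter heaps, fences) is the hard part and is left out here.

Code:
// Very rough multi-adapter sketch (my own example): bring up a second
// D3D12 device on another adapter (e.g. the iGPU) with its own compute
// queue. Cross-adapter resource sharing/synchronization is omitted.
#include <windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateSecondaryComputeDevice(ComPtr<ID3D12Device>& secondaryDevice,
                                  ComPtr<ID3D12CommandQueue>& secondaryComputeQueue)
{
    ComPtr<IDXGIFactory4> factory;
    if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory))))
        return;

    // Adapter 0 is usually the primary card; try adapter 1 (often the iGPU).
    ComPtr<IDXGIAdapter1> adapter;
    if (factory->EnumAdapters1(1, &adapter) == DXGI_ERROR_NOT_FOUND)
        return;                                        // no second adapter present

    if (FAILED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&secondaryDevice))))
        return;

    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    secondaryDevice->CreateCommandQueue(&desc, IID_PPV_ARGS(&secondaryComputeQueue));
}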
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
The most talked-about feature of DX12 up until now has been low CPU overhead. Now we're going to take something that can be done faster on the GPU and shift it back onto the CPU. This not only negates the performance advantage DX12 gets from doing it in hardware, lower driver overhead, etc., but also takes up the very CPU cycles (magic CPU cycles that won't add latency) that DX12 frees up. And on top of that, people are calling it "supporting async compute" in the feature list? Sorry, but it's verbal acrobatics to try to claim this is DX12 async compute support in the GPU.