Nvidia: Not Enough Money in a PS4 GPU for us to bother

BallaTheFeared · Mar 18, 2013

Will Robinson said:
Oh I see, an NVDA-dominated market would be better for consumers then... I get it.

Kind of like Intel without AMD, huh? :whistle:

Intel hasn't had to compete with AMD at the top end since Core 2. How have prices been?

Heck, they gave us $220 unlocked chips because they took out BCLK overclocking with SB. Prior to that you had to spend $1000 for an unlocked Intel chip.

Evidently, even with SB destroying Thuban, Intel still felt the need to give buyers in my bracket an incentive to buy and recommend.

Will Robinson · Mar 18, 2013

Yeah, you used to see those prices for the AMD FX-60/FX-62 as well (if I recall), until Conroe and its successors swept in and destroyed their value.
Unlocking the chips has been very cool... note I use Intel CPUs exclusively.
AMD has some bright people. I hope they prosper from the console contracts, since that helps fund better GPUs.

VulgarDisplay · Mar 18, 2013

3DVagabond said:
There would be no money made by anyone except for the BoP OEMs. If the parts cost $200 they'd still have to make the product, package, ship, distribute, etc. Keep in mind that each of those steps has multiple costs attached to it. The manufacturer has a markup, distributors have a markup, retailers have a markup. Then there's support to the developers and after-sales support. There was the cost of developing the product and then marketing it. $100 per item isn't going to cover all of that.

I was saying that everything on Microsoft's end only cost them $100: case, PSU, and packaging. Either way, they'd take $20 profit per console at the start, because it would be more than they made on the Xbox 360 and PS3 at launch.

3DVagabond · Mar 18, 2013

VulgarDisplay said:
I was saying that everything on Microsoft's end only cost them $100: case, PSU, and packaging. Either way, they'd take $20 profit per console at the start, because it would be more than they made on the Xbox 360 and PS3 at launch.

This is what you said:

VulgarDisplay said:
Even at $100 to AMD, Microsoft and Sony could feasibly get the rest of the parts for a single console for around $100, and still have a profit of $100 per console with an MSRP of $300. They will not be selling these consoles at a loss, which is huge for them, because they lost money on every console sold in the beginning.

$100 to AMD + $100 for the rest of the parts - sell it for $300 = $100 profit.

Now, if you are saying total expense to Microsoft/Sony was $200, then $100 for Sony/Microsoft, distributor, and retailer profit is possibly close to correct. A retailer making 10% gross profit at full retail, for example, would be typical. Retailer margins on Apple products are typically in the low single digits (3% to 5%), but they are the biggest "pimps" in the industry.

For the $200 total cost though, AMD isn't getting anywhere near $100.
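
A rough sketch of that arithmetic, assuming a $300 MSRP, $200 in total parts cost and a 10% retailer gross margin taken on the retail price (the function name and the integer-percent convention are just for illustration):

```python
def console_margin(msrp, parts_cost, retailer_margin_pct):
    """Split the MSRP into the retailer's cut and what is left over.

    Amounts are in whole dollars; the margin is a percentage of the
    retail price (an assumption, margins can be quoted other ways).
    """
    retailer_cut = msrp * retailer_margin_pct / 100
    remaining = msrp - parts_cost - retailer_cut
    return retailer_cut, remaining


retailer_cut, remaining = console_margin(300, 200, 10)
print(f"Retailer keeps ${retailer_cut:.2f}")
print(f"Left for assembly, shipping, distribution, support and profit: ${remaining:.2f}")
```

On those assumptions the retailer alone takes $30 of the $100 gap, leaving $70 for everything else in the chain.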

cplusplus · Mar 18, 2013

3DVagabond said:
This is what you said:

$100 to AMD + $100 for the rest of the parts - sell it for $300 = $100 profit.

Now, if you are saying total expense to Microsoft/Sony was $200, then $100 for Sony/Microsoft, distributor, and retailer profit is possibly close to correct. A retailer making 10% gross profit at full retail, for example, would be typical. Retailer margins on Apple products are typically in the low single digits (3% to 5%), but they are the biggest "pimps" in the industry.

For the $200 total cost though, AMD isn't getting anywhere near $100.

IIRC, consoles don't get 10% profit at retail. I think it is (or at least was) about 5-8%, and the profit, just like for the console makers, is on the accessories. My sister used to work at CompUSA, where their employee discount was cost + 10%, and I remember that the discount for consoles wasn't really worth it (and this was back in 2006-7, when the consoles were $400 and $500).
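
To see why a cost + 10% discount does nothing on a console, here is a quick sketch. It assumes the retailer margin is expressed as a percentage of the shelf price, and the function name is made up for the example:

```python
def employee_price(retail_price, retailer_margin_pct, markup_pct=10):
    """Price under a "cost + markup" employee discount.

    Assumes store cost = retail price minus the retailer's margin,
    with the margin given as a percentage of the retail price.
    """
    store_cost = retail_price * (100 - retailer_margin_pct) / 100
    return store_cost * (100 + markup_pct) / 100


for retail in (400, 500):
    for margin in (5, 8):
        price = employee_price(retail, margin)
        verdict = "worse than" if price > retail else "better than"
        print(f"${retail} console, {margin}% margin: "
              f"employee pays ${price:.2f} ({verdict} shelf price)")
```

With margins that thin, cost + 10% lands above the shelf price; on these assumptions the discount only starts to pay off once the retailer's margin is above roughly 9%.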

And most of the rough BOMs for the PS4 come out around $450-500 right now, and that's without the new PS Eye in it. If they include a new Kinect in each Xbox (which it looks like they're going to do), I don't see them having a BOM of less than $300 (and it'll probably be closer to $400).

BFG10K · Mar 18, 2013

BallaTheFeared said:
Intel hasn't had to compete with AMD at the top end since Core 2. How have prices been?

Not in terms of top performance maybe, but AMD certainly competes in other areas, such as perf/$ and onboard graphics.

As far as graphics go, the 4xxx series was directly responsible for nVidia's large price cut on the 260/280: http://www.dailytech.com/NVIDIA+Takes+ATI+to+the+Mattresses+with+Lower+Pricing/article12360.htm

Make no mistake, if AMD goes out of business (or ceases to be relevant), everyone will be screwed, regardless of which camp they belong to.

NTMBK · Mar 18, 2013

Keysplayr said:
Will I be playing RE7 and saying to myself, "Wow, I'm so glad this console has the CPU and GPU in a single die, if it hadn't I might not be able to play this game the exact same way"?

Address this ^ .

Why not address the case I brought up earlier? GPU accelerated physics being able to actually affect the world simulation, instead of just floating around on top of it.

NTMBK · Mar 18, 2013

sontin said:
The graphics pipeline is straightforward. There is no reason to communicate with the CPU again after the processing. All the shader compiling for the GPU happens before the GPU sees the data.

For graphics workloads an APU is not better than a discrete GPU.

For a current graphics workload, which has been designed for a discrete GPU model, sure. But how about using the GPU to calculate occlusion culling, then passing that result off to the CPU to determine which polygons to render on the GPU? Same for frustum culling. How about doing asset decompression in shared memory using the GPU? There are plenty of tasks in the current graphics pipeline which are done better with a GPU, but currently get executed on CPU because of the communication overhead.
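
For anyone who hasn't written one, frustum culling boils down to the same tiny test repeated for every object, which is why it maps well onto a GPU. A minimal sketch in plain Python (the Plane type, the bounding-sphere representation and the box-shaped "frustum" are all assumptions made to keep the example short, not how any particular engine does it):

```python
from dataclasses import dataclass


@dataclass
class Plane:
    # Plane equation nx*x + ny*y + nz*z + d = 0, with the normal
    # (assumed unit length) pointing towards the inside of the frustum.
    nx: float
    ny: float
    nz: float
    d: float


def sphere_visible(planes, center, radius):
    """Return False only if the bounding sphere is fully outside a plane."""
    cx, cy, cz = center
    for p in planes:
        signed_distance = p.nx * cx + p.ny * cy + p.nz * cz + p.d
        if signed_distance < -radius:
            return False  # entirely behind this plane: cull it
    return True


# Hypothetical "frustum": the box [-1, 1]^3 described by six inward-facing planes.
box = [
    Plane(1, 0, 0, 1), Plane(-1, 0, 0, 1),
    Plane(0, 1, 0, 1), Plane(0, -1, 0, 1),
    Plane(0, 0, 1, 1), Plane(0, 0, -1, 1),
]

objects = {
    "inside": ((0.0, 0.0, 0.0), 0.5),
    "straddling": ((1.5, 0.0, 0.0), 1.0),
    "outside": ((5.0, 0.0, 0.0), 1.0),
}
visible = [name for name, (c, r) in objects.items() if sphere_visible(box, c, r)]
print(visible)
```

Every object is tested independently, so the loop over objects is trivially parallel. The awkward part on a discrete card is getting the resulting visibility list back to the CPU in time to decide what to draw.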

BallaTheFeared · Mar 18, 2013

BFG10K said:
Not in terms of top performance maybe, but AMD certainly competes in other areas, such as perf/$ and onboard graphics.

As far as graphics go, the 4xxx series was directly responsible for nVidia’s large price cut on the 260/280: http://www.dailytech.com/NVIDIA+Takes+ATI+to+the+Mattresses+with+Lower+Pricing/article12360.htm

Make no mistake, if AMD goes out of business (or ceases to be relevant), everyone will be screwed, regardless of which camp they belong to.

Nvidia is already doing to AMD what Intel did to AMD. The GTX Titan is their 3960X, and there is no response from AMD, nor is there likely to be one in the foreseeable future.

This is exactly what Nvidia always wanted: the $1000 "Extreme Edition", with their mid-range taking the bulk of the sales from $250 to $500+.

Kepler completely rewrote the book on AMD; they have virtually nothing to hang their hat on right now. It's just a simple fight for survival and trying to get to the next node with something better at this point, IMO.

The problem with saying everyone would be screwed is that Intel shows us that dominant companies still need to create compelling products for existing customers to upgrade to.

Just look at how Ivy Bridge went over in our community with Sandy users: like a lead balloon...

If Nvidia were the only real player at $200+ on 20nm with Maxwell, what reason would people have to buy the cards if they weren't offering compelling perf/W and perf/$ increases? None, and Nvidia would be stuck with stockpiles of product they couldn't move.

sontin · Mar 18, 2013

NTMBK said:
For a current graphics workload, which has been designed for a discrete GPU model, sure. But how about using the GPU to calculate occlusion culling, then passing that result off to the CPU to determine which polygons to render on the GPU? Same for frustum culling.
How about doing asset decompression in shared memory using the GPU? There are plenty of tasks in the current graphics pipeline which are done better with a GPU, but currently get executed on CPU because of the communication overhead.

Every round trip to the CPU introduces latency, so you don't want that every frame. To minimize it you need a communication path with very low latency.

The future for graphics workloads is to let the GPU do more work and communicate less with the CPU.

Everything you wrote should be done on the GPU.

NTMBK · Mar 18, 2013

sontin said:
Every round trip to the CPU introduces latency, so you don't want that every frame. To minimize it you need a communication path with very low latency.

The future for graphics workloads is to let the GPU do more work and communicate less with the CPU.

Everything you wrote should be done on the GPU.

It should be, but it isn't currently. Because there is no low-latency communication between the CPU and GPU, parts of the pipeline which are highly parallel vector maths problems still get executed on the CPU. An APU gets you that very low-latency communication. So... thanks for agreeing with me, I guess?

Keysplayr · Mar 18, 2013

NTMBK said:
Why not address the case I brought up earlier? GPU accelerated physics being able to actually affect the world simulation, instead of just floating around on top of it.

What about it? That it can run on a single die GPU/CPU but not a CPU and PCIe GPU? This simply isn't true if that's what you're hinting at.

Keysplayr · Mar 18, 2013

NTMBK said:
For a current graphics workload, which has been designed for a discrete GPU model, sure. But how about using the GPU to calculate occlusion culling, then passing that result off to the CPU to determine which polygons to render on the GPU? Same for frustum culling. How about doing asset decompression in shared memory using the GPU? There are plenty of tasks in the current graphics pipeline which are done better with a GPU, but currently get executed on CPU because of the communication overhead.

Give us numbers: the differences in performance, or efficiency. What you're saying might make more sense if you had an example. Other than that, IMHO, the difference between the two solutions is minimal. Are there any benchmarks showing improvements using an AMD A4, A6, A8, etc. over, say, a Bulldozer and a discrete AMD GPU?

itsmydamnation · Mar 18, 2013

sontin said:
The graphics pipeline is straightforward. There is no reason to communicate with the CPU again after the processing. All the shader compiling for the GPU happens before the GPU sees the data.

For graphics workloads an APU is not better than a discrete GPU.

Because I'm a complete sucker for pain...

Here is some HLSL, a screen-space global illumination shader to be exact.

Code:

// Screen Space Indirect Illumination
// Created and Implemented by Tomerk
// Optimized by Ethatron

// ---------------------------------------
// TWEAKABLE VARIABLES.


#undef     TEST_MODE
// Toggle for test-mode. If enabled, you can see the raw ssgi
//"#define ..." is enabled, and "#undef ..." is disabled.

#define N_SAMPLES    9
// number of samples, currently do not change.

iface float giRadiusMultiplier
< string help = "Linearly multiplies the radius of the II/AO Sampling"; >
     = 10;

iface float iiStrengthMultiplier
< string help = "Linearly multiplies the strength of the II"; >
     = 3.0;

iface float aoStrengthMultiplier
< string help = "Linearly multiplies the strength of the AO"; >
     = 1.0;

iface float giClamp
< string help = "The maximum strength of the AO, 0 is max strength, 1 is weakest"; >
     = 0.0;

iface float ThicknessModel
< string help = "Units in space the AO assumes objects' thicknesses are"; >
     = 100;


// END OF TWEAKABLE VARIABLES.
// ---------------------------------------

#include "includes/Random.hlsl"
#include "includes/Resolution.hlsl"
#include "includes/Depth.hlsl"
#include "includes/Fog.hlsl"
#include "includes/Position.hlsl"

texture2D obge_LastRendertarget0_EFFECTPASS;
texture2D obge_PrevRendertarget0_EFFECTPASS;

sampler2D FrmeSamplerL = sampler_state {
    Texture = <obge_PrevRendertarget0_EFFECTPASS>;

    AddressU = Mirror;
    AddressV = Mirror;

    MINFILTER = LINEAR;
    MAGFILTER = LINEAR;
    MIPFILTER = NONE;
};

sampler2D PassSamplerL = sampler_state {
    texture = <obge_LastRendertarget0_EFFECTPASS>;

    AddressU = CLAMP;
    AddressV = CLAMP;

    MINFILTER = LINEAR;
    MAGFILTER = LINEAR;
};

struct VSOUT
{
    float4 vertPos : POSITION;
    float2 UVCoord : TEXCOORD0;
};

struct VSIN
{
    float4 vertPos : POSITION0;
    float2 UVCoord : TEXCOORD0;
};

VSOUT FrameVS(VSIN IN)
{
    VSOUT OUT = (VSOUT)0.0f;    // initialize to zero, avoid complaints.

    OUT.vertPos = IN.vertPos;
    OUT.UVCoord = IN.UVCoord;

    return OUT;
}

static const float2 sample_offset[N_SAMPLES] =
{
//#if N_SAMPLES >= 9
    float2(-0.1376476f,  0.2842022f ),
    float2(-0.626618f ,  0.4594115f ),
    float2(-0.8903138f, -0.05865424f),
    float2( 0.2871419f,  0.8511679f ),
    float2(-0.1525251f, -0.3870117f ),
    float2( 0.6978705f, -0.2176773f ),
    float2( 0.7343006f,  0.3774331f ),
    float2( 0.1408805f, -0.88915f   ),
    float2(-0.6642616f, -0.543601f  )
//#endif
};

static const float sample_radius[N_SAMPLES] =
{
//#if N_SAMPLES >= 9
    0.948832,
    0.629516,
    0.451554,
    0.439389,
    0.909372,
    0.682344,
    0.5642,
    0.4353,
    0.5130
//#endif
};

float4 Illumination(VSOUT IN) : COLOR0 {
    float depth = LinearDepth(IN.UVCoord);

    [branch]
    if (depth >= 0.99)
        return float4(0.0, 0.0, 0.0, 1.0);

    float3 pos = EyePosition(IN.UVCoord, depth);
    float3 dx = ddx(pos);
    float3 dy = ddy(pos);
    float3 norm = normalize(cross(dx, dy));
    norm.y *= -1;

    float sample_depth;

    float4 gi = float4(0, 0, 0, 0);
    float is = 0, as = 0;

    float2 rand_vec = rand_2_10(IN.UVCoord);
    float2 sample_vec_divisor = g_InvFocalLen * depth * rangeZ / (giRadiusMultiplier * 5000 * rcpres);
    float2 sample_center = IN.UVCoord + norm.xy / sample_vec_divisor * float2(1, aspect);
    float  ii_sample_center_depth = depth * rangeZ + norm.z * giRadiusMultiplier * 20;
    float  ao_sample_center_depth = depth * rangeZ + norm.z * giRadiusMultiplier *  5;

    [loop]
    for (int i = 0; i < N_SAMPLES; i++) {
        float2 sample_vec = reflect(sample_offset[i], rand_vec) / sample_vec_divisor;
        float2 sample_coords = sample_center + sample_vec * float2(1, aspect);
        float  sample_depth = rangeZ * LinearDepth(sample_coords);

        float ii_curr_sample_radius = sample_radius[i] * giRadiusMultiplier * 20;
        float ao_curr_sample_radius = sample_radius[i] * giRadiusMultiplier *  5;

        gi.a += clamp(0, ao_sample_center_depth + ao_curr_sample_radius - sample_depth                 , 2 * ao_curr_sample_radius);
        gi.a -= clamp(0, ao_sample_center_depth + ao_curr_sample_radius - sample_depth - ThicknessModel, 2 * ao_curr_sample_radius);

        if ((sample_depth < ii_sample_center_depth + ii_curr_sample_radius) &&
            (sample_depth > ii_sample_center_depth - ii_curr_sample_radius)) {
            float3 sample_pos = EyePosition(sample_coords, sample_depth);
            float3 sample_dx = ddx(sample_pos);
            float3 sample_dy = ddy(sample_pos);
            float3 sample_norm = normalize(cross(sample_dx, sample_dy));

            sample_norm.y *= -1;

            float3 unit_vector = normalize(pos - sample_pos);

            gi.rgb += tex2D(FrmeSamplerL, sample_coords).rgb;
            //* saturate(dot(norm, unit_vector))
            //* saturate(dot(sample_norm, unit_vector));
        }

        is += 1;
        as += 2 * ao_curr_sample_radius;
    }

    gi.rgb /= is * 5;
    gi.a   /= as;

    gi *= FogDecay(depth);

    gi.rgb = 0.0 + gi.rgb * iiStrengthMultiplier;
    gi.a   = 1.0 - gi.a   * aoStrengthMultiplier;

    return gi;
}


float4 BlurNCombine(VSOUT IN) : COLOR0 {
    float3 color = tex2D(FrmeSamplerL, IN.UVCoord).rgb;
    float4 gi = tex2D(PassSamplerL, IN.UVCoord) * 4;

    gi += tex2D(PassSamplerL, IN.UVCoord + float2( rcpres.x, 0)) * 2;
    gi += tex2D(PassSamplerL, IN.UVCoord + float2(-rcpres.x, 0)) * 2;
    gi += tex2D(PassSamplerL, IN.UVCoord + float2(0,  rcpres.y)) * 2;
    gi += tex2D(PassSamplerL, IN.UVCoord + float2(0, -rcpres.y)) * 2;

    gi += tex2D(PassSamplerL, IN.UVCoord + rcpres                );
    gi += tex2D(PassSamplerL, IN.UVCoord - rcpres                );
    gi += tex2D(PassSamplerL, IN.UVCoord + rcpres * float2(1, -1));
    gi += tex2D(PassSamplerL, IN.UVCoord - rcpres * float2(1, -1));

    gi /= 16;

#ifdef    TEST_MODE
    return gi;
#endif

    return float4((color + gi.rgb) * gi.a, 1);
}

technique main
<
    int group = EFFECTGROUP_PRE;
    int fxclass = EFFECTCLASS_LIGHT;
    int conditions = EFFECTCOND_ZBUFFER | EFFECTCOND_ACHANNEL;
>
{
    pass {
        VertexShader = compile vs_3_0 FrameVS();
        PixelShader  = compile ps_3_0 Illumination();
    }

    pass {
        VertexShader = compile vs_3_0 FrameVS();
        PixelShader  = compile ps_3_0 BlurNCombine();
    }
}

Here is what that shader looks like after it's been compiled.

Code:

[Binary dump of the compiled effect omitted: it is unreadable apart from embedded strings such as the parameter and sampler names (IIRadiusMultiplier, IIStrengthMultiplier, FrmeSamplerL, PassSamplerL), the entry points (FrameVS, Illumination, BlurNCombine), the targets vs_3_0 and ps_3_0, and the compiler banner "Microsoft (R) HLSL Shader Compiler 9.29.952.3111".]

This is completely irrelevant to my point; I just put it in here so you can see I AM NOT TALKING ABOUT SHADER COMPILING!!!!!!!!!

Now here is how long it takes to EXECUTE the shader on a 6970 @ 1000/5600 running 3840x1024, for EVERY FRAME:

http://www.users.on.net/~rastus/oblivion/files/oblivion SSGI time resize.png
http://www.users.on.net/~rastus/oblivion/files/oblivion SSGI resize.png

That's right, 42 milliseconds per frame. That's what I'm talking about speeding up with an APU: actual execution. Now how will an APU do this?

1. The PS4/XBOX3 APUs have unified memory.
2. The PS4/XBOX3 APUs have a unified memory address space.
3. The PS4/XBOX3 APUs still require the CPU to control every call made to the GPU.

So now there are two ways to do this:

1. The CPU's predictors and prefetchers can be used to fetch data for branch-heavy shaders to reduce stalls. As I said before (which you promptly ignored), there are already published research papers detailing methods for this (http://www4.ncsu.edu/~yyang14/hpca2012.pdf).

2. The CPU can fetch "GPU" data and work on it directly, as there is no longer any GPU data or CPU data, only APU data. This is the one where a high-performance APU could do things a much more powerful discrete GPU can't, because GPUs don't predict or prefetch; they switch "threads" until the data becomes available, and if there are no threads left to work on, they stall. A dev on an APU with a unified address space can write shaders that complete the non-branchy, good-locality operations on the GPU, then write C/assembler (or whatever) code to complete the branchy, poor-locality operations on the CPU.

The exact same is true for "CPU" data: if it is easily vectorised, isn't branchy and has good data locality, execute it on the GPU shaders. The end result is that the dev can now do something else with that CPU or GPU time which would otherwise be a stall or an inefficient use of execution resources. This pushes up average utilisation across the APU.

It might shock you, but even without a single address space, Cell already did this for occlusion culling because RSX was very slow at it.

On the PS4/XBOX3 there is no GPU or CPU data, only APU data. Execute it where you see fit.
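
To put the 42 ms figure above in perspective, here is the back-of-the-envelope arithmetic. It assumes the 42 ms is the cost of this effect alone and uses N_SAMPLES = 9 from the shader source; the variable names are just for the example:

```python
WIDTH, HEIGHT = 3840, 1024   # resolution quoted above
N_SAMPLES = 9                # loop count in the Illumination pass
SHADER_MS = 42.0             # measured execution time per frame

pixels = WIDTH * HEIGHT
# Upper bound: pixels with depth >= 0.99 return before the loop runs.
max_loop_iterations = pixels * N_SAMPLES
fps_ceiling = 1000.0 / SHADER_MS
ns_per_pixel = SHADER_MS * 1e6 / pixels

print(f"{pixels:,} pixels per frame")
print(f"up to {max_loop_iterations:,} sample-loop iterations per frame")
print(f"frame rate ceiling from this shader alone: {fps_ceiling:.1f} fps")
print(f"average cost: {ns_per_pixel:.1f} ns per pixel")
```

So on that reading, a single post-process effect caps the game at under 24 fps before anything else is drawn.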

sontin · Mar 18, 2013

NTMBK said:
It should be, but it isn't currently. Because there is no low-latency communication between the CPU and GPU, parts of the pipeline which are highly parallel vector maths problems still get executed on the CPU. An APU gets you that very low-latency communication. So... thanks for agreeing with me, I guess?

Current APUs from AMD are not able to communicate on-chip except through a very small register memory.
For data transfer the CPU and GPU need to go over the off-chip memory bus, and that is not low latency.

The whole thing only makes sense when both functional units can communicate over a very fast, low-latency on-chip cache, like the compute units of Kepler and GCN.

NTMBK · Mar 18, 2013

Keysplayr said:
Give us numbers: the differences in performance, or efficiency. What you're saying might make more sense if you had an example. Other than that, IMHO, the difference between the two solutions is minimal. Are there any benchmarks showing improvements using an AMD A4, A6, A8, etc. over, say, a Bulldozer and a discrete AMD GPU?

Of course there aren't those examples, because software for PCs is written for a classic discrete GPU + CPU setup. Devs aren't going to write optimised algorithms for APUs when they are a small subset of the market, and generally ones who aren't that serious about gaming. I'm talking about the PS4, where every single game will be written for that fixed platform, and as such it makes sense to implement algorithms which are only feasible on an APU setup.

itsmydamnation · Mar 18, 2013

sontin said:
Current APUs from AMD are not able to communicate on-chip except through a very small register memory.
For data transfer the CPU and GPU need to go over the off-chip memory bus, and that is not low latency.

The whole thing only makes sense when both functional units can communicate over a very fast, low-latency on-chip cache, like the compute units of Kepler and GCN.

You are completely wrong here: there is a 128-bit coherent link on Zacate and Llano. There are quite a lot of limitations around access, but it was the first step for HSA. PS4/XBOX3 are further along the HSA path.

sontin · Mar 18, 2013

itsmydamnation said:
You are completely wrong here: there is a 128-bit coherent link on Zacate and Llano. There are quite a lot of limitations around access, but it was the first step for HSA. PS4/XBOX3 are further along the HSA path.

It's off-chip memory, and it's only 64 bits wide for the GPU and the CPU.
In your documentation they have a model with a shared L3 cache. There is a reason why you don't want to use the off-chip memory for this kind of work.

itsmydamnation · Mar 18, 2013

sontin said:
It's off-chip memory, and it's only 64 bits wide for the GPU and the CPU.
In your documentation they have a model with a shared L3 cache. There is a reason why you don't want to use the off-chip memory for this kind of work.

Explain Zero Copy then?

You have consistently shown you don't understand how even Llano works, let alone future HSA topologies. If you look at the AFDS presentation you can see that both the CPU and GPU can communicate without going off chip. Slide 35 sums it up nicely.

http://www.realworldtech.com/fusion-llano/2/

The Fusion Control Link (or Onion) is a 128-bit (16B) bi-directional bus that feeds into a memory ordering queue shared with the coherent requests from each of the 4 cores. Onion runs at up to 650MHz for notebook variants of Llano (10.4GB/s read + 10.4GB/s write) and 492MHz for Zacate. An arbiter in the IFQ is responsible for selecting coherent requests (based on memory ordering) to send to the memory controller. Desktop versions of Llano will probably run Garlic and Onion faster still, given the extra power budget.
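As a rough sanity check on the figures quoted above, the per-direction bandwidth follows from bus width times clock. This is only a sketch: it assumes one 16-byte transfer per clock in each direction and decimal gigabytes, and the function name `bus_bandwidth_gb_s` is made up for illustration.

```python
def bus_bandwidth_gb_s(width_bits: int, clock_mhz: float) -> float:
    """Peak one-way bandwidth in GB/s (decimal), assuming one transfer per clock."""
    bytes_per_transfer = width_bits // 8          # 128 bits -> 16 bytes
    bytes_per_second = bytes_per_transfer * clock_mhz * 1e6
    return bytes_per_second / 1e9


# Onion figures from the quoted text: 128-bit bus, 650 MHz (Llano notebook), 492 MHz (Zacate)
llano_onion = bus_bandwidth_gb_s(128, 650)
zacate_onion = bus_bandwidth_gb_s(128, 492)

print(f"Llano Onion:  {llano_onion:.2f} GB/s each way")
print(f"Zacate Onion: {zacate_onion:.2f} GB/s each way")
```

Under that assumption the Llano number lines up with the 10.4GB/s read + 10.4GB/s write quoted above, and Zacate would land at roughly 7.9 GB/s each way.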

http://amddevcentral.com/afds/assets/presentations/1004_final.pdf

Llano introduces two new buses for the GPU to access memory:

AMD Fusion Compute Link (ONION):
This bus is used by the GPU when it needs to snoop the CPU cache, so it is a coherent bus.
This is used for cacheable system memory.

itsmydamnation · Mar 18, 2013

how about a whole lot of these apples;
http://developer.amd.com/wordpress/media/2012/10/2011CedecTakahiroFinal.pdf

Lonbjerg · Mar 18, 2013

Funny... people talk about profits... when everyone knows the consoles are sold at a loss... but they earn that money back on higher game prices.

Every time a new console comes out, it appears the community gets collective memory loss, and the same failed, used arguments as when the last-generation consoles came out are regurgitated over and over again.

People were all up in arms about the RSX... then this happened:

http://forums.anandtech.com/showthread.php?t=1685433

Arachnotronic · Mar 18, 2013

Will Robinson said:
Oh I see, an NVDA-dominated market would be better for consumers then... I get it.

Kind of like Intel without AMD huh?:whiste:

Intel has not needed to compete with AMD for years, and yet prices on high end SKUs have been flat to down, with increasing performance, higher integration, and decreasing power envelopes. Guys like Intel/AMD/Nvidia rely on people upgrading in order to keep the lights on.

If AMD went bankrupt tomorrow, Intel and Nvidia would keep on plugging on doing what they do.

NTMBK · Mar 18, 2013

Lonbjerg said:
Funny... people talk about profits... when everyone knows the consoles are sold at a loss... but they earn that money back on higher game prices.

Every time a new console comes out, it appears the community gets collective memory loss, and the same failed, used arguments as when the last-generation consoles came out are regurgitated over and over again.

People were all up in arms about the RSX... then this happened:

http://forums.anandtech.com/showthread.php?t=1685433

You're right, people get collective memory loss. They think that just because there is a PC GPU with better specs on paper, the console is going to be "gimped" and worthless. And yet that "less powerful than a 7800" GPU has lasted 8 years, still outputting games as good looking as God Of War III.

Meekers · Mar 18, 2013

Intel17 said:
Intel has not needed to compete with AMD for years, and yet prices on high end SKUs have been flat to down, with increasing performance, higher integration, and decreasing power envelopes. Guys like Intel/AMD/Nvidia rely on people upgrading in order to keep the lights on.

If AMD went bankrupt tomorrow, Intel and Nvidia would keep on plugging on doing what they do.

Intel has completely stagnated as far as increased performance goes. Their focus has shifted almost entirely to lower power usage. They have also refused to take more than 4 cores mainstream. If AMD were actually competing, I think it is reasonable to expect we would see more focus on increasing performance from Intel. So while Intel has maintained their level of innovation, it has gone in a direction that does not help raw performance. Just read the Haswell performance preview, which is showing a 3-8% increase. That will be two generations in a row of tiny increases in performance.

Nvidia has a history of extreme price gouging when there is no direct competitor from AMD, just look at Titan. If they did not have strong competition from AMD at all the other performance levels, do you really think they would not slot everything in under Titan? $800 for a 680, $700 for a 670, etc.

ShintaiDK · Mar 18, 2013

Lonbjerg said:
Funny... people talk about profits... when everyone knows the consoles are sold at a loss... but they earn that money back on higher game prices.

Last-gen consoles are still a loss even with all gaming and online services added: -$3B for MS and -$5B for Sony.
