I don't see why the PS4 would spur on GPGPU apps.
I'll give you tessellation seeing as how it's not implemented at all in SNB, but neither OpenCL nor 'UVD' can be counted against SNB die space. OpenCL is more a matter of a few bug fixes than actual die size, and last I checked the SNB decode was on par with both AMD and NVIDIA offerings. As for IQ, no question that it's behind there due to its lack of proper anisotropic filtering, but that's effectively fixed with IVB and again didn't affect area much as all the logic was present already, just not working properly.
It is? That must be why, according to notebookcheck.net, the 2557M in a MacBook Air gets 1360 on the 3DMark Vantage P GPU test while the 2620M in a MacBook Pro gets 1477... Oh wait, that's pretty much exactly the difference between their 1.2GHz and 1.3GHz turbo speeds.
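For what it's worth, a quick ratio check on those numbers (just plugging in the scores and turbo clocks quoted above, and assuming the two parts are otherwise comparable):

```python
# Ratio check using the scores and turbo clocks quoted above:
# the 3DMark Vantage GPU score gap tracks the GPU turbo clock gap almost 1:1.
score_2557m, score_2620m = 1360, 1477   # 3DMark Vantage P GPU scores (notebookcheck.net)
turbo_2557m, turbo_2620m = 1.2, 1.3     # peak GPU turbo clocks in GHz

print(round(score_2620m / score_2557m, 3))  # ~1.086
print(round(turbo_2620m / turbo_2557m, 3))  # ~1.083
```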
It will be interesting to see Trinity in a CrossFire setup with the IGP and a discrete GPU. Intel cannot do this.
There's a lot of room for improvement in this CPU: the high-latency cache, the longer pipeline. The L1 data cache was cut from Phenom II's 64K to 12K. The L2 and L3 caches are 40 percent slower than Phenom's, and since that's where the CPU pulls its instructions from, who's to say this CPU isn't hamstrung by 40 percent. Imagine if it had 40 percent more performance over its current state.
That's an oversimplification, and it's a 16KB L1, not 12KB.
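To put the 40 percent figure in perspective, here's a rough back-of-the-envelope sketch of how L2/L3 latency folds into average memory access time. The latencies and miss rates below are made up for illustration, not measured Bulldozer or Phenom II figures, and this only models memory access time, not overall performance:

```python
# Back-of-the-envelope sketch with illustrative (not measured) numbers:
# how a ~40% slower L2 folds into average memory access time.
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """Average memory access time (cycles) for a simple two-level hierarchy."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)

baseline  = amat(l1_hit=4, l1_miss_rate=0.05, l2_hit=15, l2_miss_rate=0.20, mem_latency=150)
slower_l2 = amat(l1_hit=4, l1_miss_rate=0.05, l2_hit=21, l2_miss_rate=0.20, mem_latency=150)

print(baseline, slower_l2)                 # 6.25 vs 6.55 cycles
print(round(slower_l2 / baseline - 1, 3))  # ~0.048, i.e. roughly 5%, not 40%
```

The point is simply that a 40 percent slower cache doesn't translate directly into a 40 percent slower CPU, because most accesses never see that latency.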
And GPU-Z tells me that the HD 2000/3000 is actually 66mm^2.
And anisotropic filtering still isn't on par with AMD/Nvidia on Ivy.
Let me tell you why Ivy Bridge's AF differences won't show up in real-world scenarios.
Now the Ivy Bridge one: http://www.anandtech.com/show/5626/ivy-bridge-preview-core-i7-3770k/16 "What you won't see however is a difference, particularly with our static screenshots. When discussing the matter, AMD noted that the difference in perceived quality between the old algorithm and the new one was practically the same."
Throughput (incl. instructions) is not bound by latency. Many people, including Agner Fog, Andreas Stiller, and me, think the decoder's throughput is the main bottleneck.
For a single instruction thread, Bulldozer offers more front end bandwidth than its predecessor. The front end is wider and just as capable so this makes sense. But note what happens when we scale up core count.
Since fetch and decode hardware is shared per module, and AMD counts each module as two cores, given an equivalent number of cores the old Phenom II actually offers a higher peak instruction fetch/decode rate than the FX. The theory is obviously that the situations where you're fetch/decode bound are infrequent enough to justify the sharing of hardware. AMD is correct for the most part. Many instructions can take multiple cycles to decode, and by switching between threads each cycle the pipelined front end hardware can be more efficiently utilized. It's only in unusually bursty situations where the front end can become a limit.
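For a concrete sense of the peak numbers being compared, here's a small sketch; the decode widths used (3 instructions per cycle per Phenom II core, 4 per cycle per shared Bulldozer module front end) are the commonly cited figures and should be treated as assumptions here rather than quotes from the article:

```python
# Peak x86 decode rate per "core" per clock, under the assumed widths:
# 3/cycle per Phenom II (K10) core, 4/cycle per Bulldozer module front
# end shared between two cores.
phenom_per_core  = 3
module_decode    = 4
cores_per_module = 2

fx_per_core_shared = module_decode / cores_per_module  # 2.0 when both cores in the module are busy
fx_per_core_alone  = module_decode                     # 4 when only one core of the module is active

print(phenom_per_core, fx_per_core_shared, fx_per_core_alone)
```

So core for core, the shared front end gives each FX core a lower peak decode rate than a Phenom II core, unless only one core in the module is busy.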
Cyclos and AMD didn't go into too much detail about Piledriver, though they did say it will consist of a 4GHz+ x86-64 core built on a 32nm CMOS process.
Will they actually implement aCF? Trinity's GPU will be VLIW4 and the only other VLIW4 products are the 6970 and 6950. They would have to do VLIW4+VLIW5 CF.
AFAIK Crossfire will work with any two AMD GPUs. So long as they're similar in performance it should work fine.
How much, if at all, can this be improved, if it is the bottleneck?
Possibly up to +100% (extreme case): Agner Fog found that 256-bit AVX instructions can only be decoded at a rate of 1 per cycle, while the hardware would actually allow decoding 2.
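As a toy illustration of that extreme case, assuming a stream that is purely decode-bound and made up entirely of 256-bit AVX instructions (real code mixes instructions, so the actual gain would be smaller):

```python
# Extreme-case arithmetic: going from 1 to 2 AVX-256 instructions decoded
# per cycle halves the cycle count for an all-AVX, decode-bound stream.
instructions   = 1_000_000
cycles_current = instructions / 1   # 1 AVX-256 instruction decoded per cycle
cycles_ideal   = instructions / 2   # 2 per cycle, if the decoder allowed it

print(cycles_current / cycles_ideal - 1)   # 1.0, i.e. up to +100% throughput
```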