Originally posted by: evolucion8
But I'm not talking about the P4, I'm talking about the C2D/Athlon 64 generation. Is the Phenom 2 much faster on a per-clock basis than the Athlon 64? Is the Core i7 much faster than the C2Q on a per-clock basis? NO. We've reached a point where it's very hard and very expensive to increase the parallelism inside a CPU (IPC), so the best way forward is going multi-core, and that's what Intel is currently doing with its Core i7 architecture.
In the case of the i7, the IPC improvements were there (e.g. a better TLB, a larger out-of-order window, an integrated memory controller, etc.), but they were partly neutered by the 256 KB of L2 cache per core, compared to 3 MB per core on Penryn.
Of course when 32 nm comes along, Intel will no doubt increase the L2 cache, and it'll not only be faster clock-for-clock but also run at higher clocks. As an example, look at the i5's much higher Turbo Boost on the same manufacturing process as the i7.
I haven't been following the Phenom closely, but IIRC there were some architectural issues that held it back, along with very low initial clock speeds.
In any case, my original point still stands: even if IPC stays the same, a die shrink almost always brings higher clock speeds.
Why didn't Intel make the Core i7 a quad-core/4-thread CPU?
Because HT is very cheap in terms of die cost and design complexity. Most of the hardware is already there; you mainly need to replicate a small amount of per-thread architectural state and add some extra tracking to manage the two threads in flight.
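As a toy illustration of that point (not how real silicon works, and all names below are made up): each logical thread only needs its own small architectural state, while the expensive execution resources are shared, so two instruction streams can interleave on one pipeline:

```python
# Toy model of SMT: two logical threads share one execution pipeline.
# Each thread keeps only its own small architectural state (here, just
# a program counter); the expensive execution resources are shared.

def run_smt(threads):
    """Round-robin issue from each logical thread until all finish."""
    trace = []
    pcs = [0] * len(threads)           # per-thread state: just a PC
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        for tid, t in enumerate(threads):
            if pcs[tid] < len(t):      # issue one op from this thread
                trace.append((tid, t[pcs[tid]]))
                pcs[tid] += 1
    return trace

# Two instruction streams sharing the "pipeline":
t0 = ["load", "add", "store"]
t1 = ["mul", "sub"]
print(run_smt([t0, t1]))
# Ops from both threads interleave into the shared issue slots.
```

The cheap part is exactly what the sketch shows: adding a second thread only added a second entry in `pcs`, not a second pipeline.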
But that's far-fetched. The same could be said of CPUs.
Yes exactly, that's my point. A single CPU/GPU is the building block of multi-GPU/CPU; if the former hits a wall, then so does the latter.
But GPU graphics work is highly parallel; while there is no silver bullet yet, the possibility is there. CPU workloads aren't, and yet the benefits are there, but developers are definitely far behind in exploiting the new technology.
Again, I'm not sure you understand how multi-GPU works, especially AFR. Yes, graphics rendering is inherently more parallel than general-purpose code, and that's exactly why adding extra execution resources to a single GPU yields performance gains with very little effort.
However, multi-GPU breaks that parallelism, since AFR is serial in nature. That means you run into all sorts of interdependencies not present on a single GPU, and these need to be managed by hand on a per-application basis. Even SFR isn't optimal, since there's still duplicate data storage and processing happening.
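A hypothetical sketch of why AFR's serial nature hurts: if frame N+1 reads a result produced while rendering frame N (say, a render-to-texture effect), the GPU assigned to frame N+1 must wait for its partner, serializing the work. Everything below is a made-up toy model, not a real driver's scheduler:

```python
# Toy AFR model: two GPUs alternate frames. If each frame depends on
# data produced by the previous frame, the GPUs end up stalling on
# sync points instead of overlapping their work.

def afr_schedule(num_frames, frame_time, depends_on_prev):
    """Return per-frame finish times for two GPUs doing AFR."""
    gpu_free = [0.0, 0.0]              # when each GPU is next available
    finish = []
    for f in range(num_frames):
        gpu = f % 2                    # alternate-frame assignment
        start = gpu_free[gpu]
        if depends_on_prev and f > 0:
            start = max(start, finish[f - 1])  # wait for frame f-1
        end = start + frame_time
        gpu_free[gpu] = end
        finish.append(end)
    return finish

print(afr_schedule(4, 10.0, depends_on_prev=False))
# Independent frames: the two GPUs overlap, near-2x throughput.
print(afr_schedule(4, 10.0, depends_on_prev=True))
# Inter-frame dependency: frames finish one after another, as if
# a single GPU were doing all the work.
```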
This is much like how you can't simply take any general-purpose code, throw it into some kind of "threading machine", and expect meaningful performance gains. Most of it is done by hand on a per-application basis, and it's extremely complex compared to traditional code.
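The same issue in miniature: a loop whose iterations are independent parallelizes trivially, but a loop-carried dependency forces iterations to run in order no matter how many cores you throw at it. A small illustrative sketch (the functions are invented for the example):

```python
from concurrent.futures import ThreadPoolExecutor

# Independent iterations: each result depends only on its own input,
# so the work can be handed out to a pool of workers in any order.
def brighten(pixels):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda p: min(p + 10, 255), pixels))

# Loop-carried dependency: each value needs the previous result, so
# no worker can start iteration i before iteration i-1 finishes.
def running_total(values):
    acc, out = 0, []
    for v in values:        # inherently serial; extra threads won't help
        acc += v
        out.append(acc)
    return out

print(brighten([100, 250, 30]))   # [110, 255, 40]
print(running_total([1, 2, 3]))   # [1, 3, 6]
```

Graphics looks like `brighten`; AFR's inter-frame dependencies turn it into something closer to `running_total`.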