Kabini Rumors

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Yes, OS support. You'd need new CPUID functions to identify the "fast" core and the OS would need to know how to schedule workloads for that. In embedded that would be rather easy, but on the desktop... pretty much no way.

You must have something in mind that's more complex than what I described, if you're bringing up scheduling.

If you look at AMD's CPUs right now it's impossible (outside of overclocking) to run all modules at the same clock speed you can run a single module at. That's a limitation of turbo. The CPU controls turbo. I'm saying that if the CPU can't set the clock higher than some limit once more than one module is active, due to TDP conservation, then all but one CPU can be synthesized for a lower clock target (if it wins anything).

The only thing the OS needs to know is to use the fast core when only one is on. If the OS doesn't know that and powers on one of the other ones in isolation then the CPU turbo mechanism would prevent it from clocking as fast. The CPU wouldn't need to communicate which core is the fast one, it could make it core #0 by convention and the OS could choose #0 when it's only powering one core (which is probably already what happens most if not all of the time).
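A minimal sketch of the convention described above, assuming a four-core part where core #0 is the one synthesized for the higher clock. The clock figures and function names are hypothetical, chosen only to illustrate the idea; this is not how any real OS scheduler or AMD's turbo logic is implemented.

```python
FAST_CORE = 0      # assumption: the fast core is core #0 by convention
TURBO_MHZ = 3000   # hypothetical single-core turbo clock
BASE_MHZ = 2000    # hypothetical clock limit for every other case

def pick_core(active_cores, num_cores=4):
    """Choose which core to power up for a newly runnable thread."""
    if not active_cores:
        # Nothing is running, so wake the fast core first.
        return FAST_CORE
    for core in range(num_cores):
        if core not in active_cores:
            return core  # lowest-numbered idle core
    return None  # every core is already busy

def max_clock_mhz(active_cores):
    """Turbo is reachable only when the fast core runs alone."""
    if set(active_cores) == {FAST_CORE}:
        return TURBO_MHZ
    return BASE_MHZ
```

An OS that ignores the convention and powers up core #2 alone would still work under this model; it would simply be held to the lower clock, which is the fallback the post describes.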
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
@Idontcare

Fair enough. Do you know if the jaguar cores are hand drawn?
Could they too see 15-30% gains from simply being redrawn by an automated program?

Does Intel do the same thing? How long have they had it, if so?
If not, why not?

Considering that the Bobcat core was basically 100% done by synthesis (it wasn't really 100%, but it was a large portion), I expect Jaguar was as well. But I can't confirm that.

Synthesis isn't an "all or none" proposition. You can have a varying admixture of both synthesis and hand-layout circuits.

Don't know how much synthesis goes on in Intel's chips, but I recall they used synthesis in portions of the Prescott P4.

....but one of the big flaws as stated by certain engineers in the press about bulldozer was synthetic design?
That particular engineer had questionable expertise in both the areas of synthesis design as well as bulldozer itself, having left AMD many years before bulldozer was brought to market.

I think what people got there was an outdated but educated opinion on the state of things five years prior, which made it seem kinda like a compelling story (it was believable) but at the same time there were parts of it that just failed standard sanity checks (like the fact the chip clocked crazy high, not a hallmark of traditional synthesis which would suggest a "game changer" in synthesis had transpired in the creation of bulldozer itself).

Automated processes - because it was cheaper - not more effective.

Using more of that - seems dubious if you want to maximize whatever process you're on?

Remember your project management triangle.

[Image: the project management triangle (The_triad_constraints.jpg)]


Automated processes enable the scope and the schedule to do things that would not otherwise be feasible within the given cost envelope.

Had bulldozer (or bobcat) not heavily relied on synthesis for computer-determined circuit optimizations then the respective development schedules would have needed to be even lengthier or the scope (complexity and performance of the cores) would have had to have been dialed way back.

As enthusiasts it is easy to get the cart-before-the-horse with synthesis and see it as a way to turn out sub-par designs. But it really is the opposite, it is an enabler. Were it not for synthesis the designs would have been even more sub-par.

We only think of the existing synthesis designs as being sub-par because (1) self-proclaimed experts tell us to, and (2) we forget that not everyone gets to have Intel-like R&D budgets.

The trivial solution is the one in which we define the optimal product sans all fiscal restrictions during development. "Bulldozer would have been teh awesome if only it had been 100% hand-designed!"

Yawn, not interesting. AMD didn't have another billion dollars to throw at bulldozer's development. And if they didn't reach for synthesis to get the job done then the final product would have been even more derpdozerish.

In a practical world, in a budget-constrained and time-constrained world, synthesis enables the scope of your project to reach degrees of complexity that would not otherwise be attainable if you limited your design team to hand-design methods of the 90's.

Everybody must evolve in how they do their job, synthesis is the future.

edit: Just saw/read CTho9305's post, his +1
 

Pilum

Member
Aug 27, 2012
182
3
81
The only thing the OS needs to know is to use the fast core when only one is on. If the OS doesn't know that and powers on one of the other ones in isolation then the CPU turbo mechanism would prevent it from clocking as fast. The CPU wouldn't need to communicate which core is the fast one, it could make it core #0 by convention and the OS could choose #0 when it's only powering one core (which is probably already what happens most if not all of the time).
Okay, yes, you could always use core 0. But still, many if not most operating systems don't prefer any core for scheduling and will wildly switch threads around. So you'd need to establish this convention in the first place and wait until all important OS versions/distributions support this before bringing out CPUs with this feature. Otherwise the CPU performance would suffer for OSs without "prefer core 0". And such a change can take many years until it's widely supported.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
8 GiB is not technically possible at this time, as 4 GBit (512 MiB) GDDR5 chips are just entering the market. The PS4 has a 256-bit bus and can access 16 of these chips for a total of 8 GiB. Kaveri with its 128-bit bus is limited to 8 RAM chips, so the maximum is 4 GiB. Bigger chips will certainly appear, but I'd guess that will take a year.

And for a potential 4 GiB system, you have to subtract the RAM dedicated for GPU use. If you really want to play modern games, you need to reserve 1 GiB. So you're left with 3 GiB for Windows and a game. That will be fun...

Completely forgot that the RAM was shared. Yeah, 4GB shared is simply not going to cut it. 512MB-1GB of VRAM plus 1GB for Windows (probably more for OEM systems with crapware) is going to leave you with ~2-2.5GB free. Windows likes keeping 15% of RAM free (it caches a lot and you sure notice it). It's not going to be pretty.
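The budget above works out as follows. This is only a back-of-the-envelope sketch using the figures from the post; the function name is made up for illustration, and the 15% figure is applied to total RAM as an assumption.

```python
def free_ram_gb(total_gb, vram_gb, os_gb):
    """RAM left for a game after the iGPU carve-out and the OS."""
    return total_gb - vram_gb - os_gb

TOTAL_GB = 4.0

# 1GB of VRAM plus 1GB for Windows (pessimistic case)
low = free_ram_gb(TOTAL_GB, vram_gb=1.0, os_gb=1.0)

# 512MB of VRAM plus 1GB for Windows (optimistic case)
high = free_ram_gb(TOTAL_GB, vram_gb=0.5, os_gb=1.0)

# Assumption: the 15% Windows tries to keep free is taken from total RAM
reserve = 0.15 * TOTAL_GB

print(f"free for a game: {low:.1f}-{high:.1f} GB, minus ~{reserve:.1f} GB kept free")
```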

They already tested with a 15.31 driver because Luxmark made a huge jump due to the new OpenCL 1.2 driver. I don't expect a speedup from a newer 15.31 driver. I expect a small speedup from a final platform (their ES platform had bandwidth issues which isn't nice for the integrated iGPU).

15.31 was released in April. The Haswell preview was in March.
 

Pilum

Member
Aug 27, 2012
182
3
81
Synthesis isn't an "all or none" proposition. You can have a varying admixture of both synthesis and hand-layout circuits.

Don't know how much synthesis goes on in Intel's chips, but I recall they used synthesis in portions of the Prescott P4.
Check the ITJ, they have articles on the design of several of their CPUs. Volume 14 Issue 3 centers on NHM/WST, and the article "The Toolbox for High-performance CPU Design" goes into detail on the design methodology used for NHM. As far as I can understand that (which is very little), the CPU design gets broken up into functional blocks (FUBs) whose placement and interconnects are manually controlled (but use auto-layout). Within these blocks, automation is often used extensively.

One of the biggest problems seems to be the creation of the right tools for aiding in the design process, as you have to reach ever higher levels of abstraction without losing the ability to control the design at the lower levels. So this turns big parts of CPU design into software engineering, requiring more personnel. And of course you need huge server farms to run all that software in a timely manner.

So this isn't about "automation" vs. "hand-designed" but rather about "proper and efficient use of automation in the right places of the design process".
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Check the ITJ, they have articles on the design of several of their CPUs. Volume 14 Issue 3 centers on NHM/WST, and the article "The Toolbox for High-performance CPU Design" goes into detail on the design methodology used for NHM. As far as I can understand that (which is very little), the CPU design gets broken up into functional blocks (FUBs) whose placement and interconnects are manually controlled (but use auto-layout). Within these blocks, automation is often used extensively.

One of the biggest problems seems to be the creation of the right tools for aiding in the design process, as you have to reach ever higher levels of abstraction without losing the ability to control the design at the lower levels. So this turns big parts of CPU design into software engineering, requiring more personnel. And of course you need huge server farms to run all that software in a timely manner.

So this isn't about "automation" vs. "hand-designed" but rather about "proper and efficient use of automation in the right places of the design process".

I view it as an analog to what happened in the software world when it transitioned from coding in machine language with assembly to programming in user-friendly codes and having the complexity one-step removed by virtue of the compilers (which then become the "innovation" vortex as resources must be poured into compiler development to make the magic happen).
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Okay, yes, you could always use core 0. But still, many if not most operating systems don't prefer any core for scheduling and will wildly switch threads around. So you'd need to establish this convention in the first place and wait until all important OS versions/distributions support this before bringing out CPUs with this feature. Otherwise the CPU performance would suffer for OSs without "prefer core 0". And such a change can take many years until it's widely supported.

The OSes already need to turn off 3/4 of the modules to get the turbo speed. You can't wildly switch which core a thread is on if all but one of the cores have to be off. Cycling which core is powered makes no sense, that's just an arbitrary waste of performance and power.

If OSes aren't already favoring core #0, I don't really think this is going to be a very big change. It would be considerably smaller than the changes AMD already got MS to implement pretty quickly.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
On clock speed, I'm still remembering those slides that compared Clover Trail with Bay Trail and listed frequency as 1.5GHz vs 2.1GHz, who knows what really applies for the comparison they made...

I'm not sure if I said this, but the Base frequency of Clover Trail Z2760 is 1.5GHz.

Also, the slide that claims a "50-60%" gain lists it alongside the move from "2 to 4" cores, meaning the loss of Hyperthreading limits the gain from doubling the cores to that amount rather than 2x.

They are obviously being ambiguous on the recent slide, with the launch still many months away.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Yeah, base frequency 1.5GHz - although this has been a little confusing since ark still doesn't list it like this (http://ark.intel.com/products/70105/Intel-Atom-Processor-Z2760-1MB-Cache-1_80-GHz)

You said that the comparison could have been vs turbo @ 1.8GHz since it can sustain that. But in comparison tables they've used 1.5GHz. It could be either. Or something in between. Who knows.

I agree with that interpretation of what the 50-60% means, I said something similar a few pages ago.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Not a surprise, when your product will be launching a few months after a competitor it's a bit foolish to pre-announce in too much detail. The vague "it will be amazing" style seems to be the standard operating procedure in computing. Jaguar was in that stage until the AMD embedded announcement, lots of imprecise talk about great efficiency and speedup from Brazos. Now we know the embedded direct replacement of the dual core 1GHz keeps the clockspeed and 9W TDP while integrating the chipset.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Jaguar was in that stage until the AMD embedded announcement, lots of imprecise talk about great efficiency and speedup from Brazos. Now we know the embedded direct replacement of the dual core 1GHz keeps the clockspeed and 9W TDP while integrating the chipset.

Let's not judge based on these embedded parts. At least for Intel, the embedded SKUs are specced less than the standard consumer parts. That may be true for AMD, I don't know.

Also, since we are talking about small details, it's worth mentioning that a chipset's TDP is reached less often than a CPU's or GPU's. The reason is that a chipset's TDP is usually characterized with most (if not all) of its ports in use, while a CPU is relatively easy to max out. That means the real power reduction under practical load from integrating the I/O chip is a certain percentage of that chip's TDP rather than 100% of it.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
(Bobcat has ~70% higher IPC in integer than the first-gen Atoms). Even with the higher speeds, +50% IPC would mean rough performance parity with Jaguar@2.0. I don't think Intel is aiming at being only as good as the competition... but we'll see. they'd be the best foundry in the world).

In no world does Bobcat have 70% higher average IPC than Atom. It happens in certain scenarios where the Atom is especially weak, such as SSE-intensive applications like Cinebench. But redesigning the FP unit for a big gain is a lot easier than doing the same for average IPC (it's almost the same logic as why CPUs gain a lot less percentage-wise per generation than GPUs do). That's why we can't rule out that gap closing significantly with Silvermont cores.

Otherwise, there'd be a bigger gap between similarly clocked Atom and Bobcat. I'd presume that 40% is enough to close the gap with Bobcat.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
Okay, yes, you could always use core 0. But still, many if not most operating systems don't prefer any core for scheduling and will wildly switch threads around. So you'd need to establish this convention in the first place and wait until all important OS versions/distributions support this before bringing out CPUs with this feature. Otherwise the CPU performance would suffer for OSs without "prefer core 0". And such a change can take many years until it's widely supported.

That's why they call it "SMP - Symmetric Multi-Processing". The idea that they can use any core or any CPU to run a thread, that the OS isn't dependent on certain threads only being able to be run on certain cores.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
In no world does Bobcat have 70% higher average IPC than Atom. It happens in certain scenarios where the Atom is especially weak, such as SSE-intensive applications like Cinebench.

But Bobcat's SIMD capabilities are weaker than Atom's.

Bobcat has 64-bit FADD and FMUL, and 2x 64-bit simple integer ALU, probably 1x64-bit multiply.

All Atoms so far have had 128-bit FADD, 64-bit FMUL, 2x 128-bit simple integer ALU, and again probably 1x64-bit multiply - don't remember 100% and have to check that.

I tried looking for programs to quantify this some posts back but found it was really hard to find a lot of things that reliably isolated the difference in cores vs threads (really nothing that can utilize > 2 cores). But from what I found there were a number of things that did get much higher than 70%, I'd call 70% a fair average.

It is however possible that the newer Atoms have improved things, these comparisons were with Pine Trail or even older.
 

CHADBOGA

Platinum Member
Mar 31, 2009
2,135
833
136
Roy Taylor was right, this could be another Apple style turnaround.
I think a VIA style outcome is far more likely.

AMD have some outstanding products on and coming to the market that largely mitigate any process node advantage of intel (although that node advantage could be narrowing significantly).
How would Intel's node advantage be narrowing significantly, when it is increasing? :confused:
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Exophase, the Atom chip shares some of the Integer SSE operations with the FP ports.

Also, the Linpack results show 0.4GFlops for the single core 1.6GHz Atom while a dual core E-350(1.6GHz) gets 2.4GFlops. Doubling cores on Atom should get 0.8GFlops, which is still only 1/3rd the result. Linpack is a good benchmark for isolating pure FPU power.

70% is again for best-case scenarios. On average the E-350 is only 10-20% faster than a similarly clocked Atom. Hyperthreading is, in the best case, 35-40% faster (coincidentally in the application where the gap with Bobcat is greatest), while we see places where there's almost no gain at all (http://www.tomshardware.com/reviews/Intel-Atom-Efficient,1981-13.html).
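The Linpack comparison above can be checked with a few lines. This is just the arithmetic from the post, assuming perfect scaling when the Atom's core count is doubled (an optimistic assumption in Atom's favor); the variable names are illustrative.

```python
atom_single_core_gflops = 0.4   # single-core 1.6GHz Atom, from the post
e350_dual_core_gflops = 2.4     # dual-core 1.6GHz E-350, from the post

# Assumption: a second Atom core scales perfectly (2x)
atom_dual_core_estimate = 2 * atom_single_core_gflops

# Fraction of the E-350 result the hypothetical dual-core Atom would reach
ratio = atom_dual_core_estimate / e350_dual_core_gflops

print(f"dual-core Atom estimate: {atom_dual_core_estimate:.1f} GFlops")
print(f"fraction of the E-350 result: {ratio:.2f}")
```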
 

Arzachel

Senior member
Apr 7, 2011
903
76
91
How would Intel's node advantage be narrowing significantly, when it is increasing? :confused:

Because each node provides increasingly diminishing returns with increasingly higher R&D and fab costs.

On GDDR5, I doubt the power usage would be higher than for DDR3. As far as I know they're using conservatively clocked, low-voltage GDDR5M chips to hit bandwidth similar to what Intel is aiming for with GT3e. Capacity isn't much of an issue either; they could do 8GB with 512MB chips running in clamshell mode (16 with 1GB chips). The cost is the only unknown, but it's a pill several magnitudes easier to swallow than stacking embedded DRAM.
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,079
3,915
136
This is going off memory, but when Bobcat was released the key differentiator was that Windows 7 ran well enough on Bobcat but sub-par on Atom. The way I remember it, which AnandTech's benchmarks all but confirm (http://www.anandtech.com/show/4023/the-brazos-performance-preview-amd-e350-benchmarked/3), was that in multithreaded benchmarks throughput was approximately the same per core. In single-thread Bobcat was dominant, which is what made the Bobcat core far more passable for the Win7 OS.

Jaguar is another beast entirely; there is a reason they are scaling it to 25 watts. I'm only a "layman", but from what I have read all around the net since the publication of the Jaguar core details, from people whose opinions I have come to respect, Jaguar is anything but the anaemic CPU core that AnandTech's strong crowd of pro-Intelers want to believe it is.

The phrase I would use is pocket battleship. It is lowish clocked, but its instruction latencies are low to match.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
On top of that, it takes a lot of work to get the best results from the tools; if you take an optimized hand design and compare it to a one-off automation experiment, you're likely going to conclude you should stick with hand design. If you do a quick and dirty hand design and compare it to an optimized automated design, you're going to conclude you should stick with automation. A fair comparison requires a large effort.

What's your opinion on the matter? Are synthesis tools up to hand-designed parts or is there still a way to go?
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
These embedded models do support ECC, it's just you'll have to do some real digging to find a board you can buy in single quantity for a reasonable price (due to being targeted at embedded applications). Going to be keeping an eye out, will update if I have any luck.
We aren't that market. We may never see them. If you ever do see them, don't get sticker shock when the price could be $300-1000 each.
 

Pilum

Member
Aug 27, 2012
182
3
81
Capacity isn't much of an issue either, they could do 8GB with 512MB chips running in clamshell mode (16 with 1GB chips).
The 4 GiB limit for a 128-bit controller is for clamshell mode:
128 bit ÷ 16 bit = 8.
8 × 512 MiB = 4 GiB.

And the 4Gbit chips are brand-new and just entering the market. It's unlikely we'll see 8Gbit for another year. Okay, that may end up being the point at which Kaveri enters the market, so it all fits together.
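The capacity limit above can be written out as a small calculation. It is only a sketch: the function name is made up, the 16 bits per chip is the clamshell figure used in the post, and binary units are assumed throughout (a 4 Gbit chip holds 512 MiB).

```python
def max_gddr5_capacity(bus_width_bits, bits_per_chip, chip_density_gbit):
    """Return (chip count, total capacity in GiB) for a memory bus."""
    chips = bus_width_bits // bits_per_chip
    chip_gib = chip_density_gbit / 8  # 8 Gbit per GiB
    return chips, chips * chip_gib

# 128-bit controller, clamshell mode (16 bits per chip), 4 Gbit chips
kaveri_now = max_gddr5_capacity(128, 16, 4)    # 8 chips, 4 GiB

# PS4-style 256-bit bus with the same chips
ps4 = max_gddr5_capacity(256, 16, 4)           # 16 chips, 8 GiB

# Same 128-bit controller once 8 Gbit chips are available
kaveri_later = max_gddr5_capacity(128, 16, 8)  # 8 chips, 8 GiB
```

Under these assumptions an 8 GiB configuration on a 128-bit bus only becomes possible with the 8 Gbit chips mentioned above.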
 

Asterox

Golden Member
May 15, 2012
1,058
1,864
136
Not a surprise, when your product will be launching a few months after a competitor it's a bit foolish to pre-announce in too much detail. The vague "it will be amazing" style seems to be the standard operating procedure in computing. Jaguar was in that stage until the AMD embedded announcement, lots of imprecise talk about great efficiency and speedup from Brazos. Now we know the embedded direct replacement of the dual core 1GHz keeps the clockspeed and 9W TDP while integrating the chipset.

And what about this comparison? Did someone deliberately screw it up? :biggrin:

Dual-core Bobcat APU E-450: 1.65GHz / 18W TDP

Quad-core Jaguar APU GX-415GA: 1.5GHz / 15W TDP

As for the 9W TDP Jaguar APU: the model is still 9W, but it has a much better graphics core and all the other improvements Jaguar brings. Or is that completely irrelevant? :biggrin:
 

Pilum

Member
Aug 27, 2012
182
3
81
Jaguar is another beast entirely; there is a reason they are scaling it to 25 watts. I'm only a "layman", but from what I have read all around the net since the publication of the Jaguar core details, from people whose opinions I have come to respect, Jaguar is anything but the anaemic CPU core that AnandTech's strong crowd of pro-Intelers want to believe it is.

The phrase I would use is pocket battleship. It is lowish clocked, but its instruction latencies are low to match.
Let's be realistic here. Jaguar is a nice improvement over Bobcat, but it is just that: an improved Bobcat. That is, an affordable low-power x86 CPU with adequate performance. It will be vastly better for FP and multithreaded workloads, but for single-threaded integer work it will be a 35-40% improvement. It certainly is no Wunderwaffe which will magically make AMD competitive in the enthusiast market again. Its competition is Silvermont Atom and ARM, not Ivy Bridge or Haswell.