Richland & Kabini rumours

Abwx · Feb 8, 2013

ShintaiDK said:
Any instruction add-ons AMD has developed (besides x64) were stillborn.

If you can't make it universal, then you shouldn't waste time and money on it. Another one of AMD's blunders.

Still trashing AMD without even checking the instructions in question?

They weren't adopted because Intel wants to maintain its grip on the
instruction sets, nothing else, but Intel does not mind using some
of those instructions after rebranding them with in-house names.

One advantage of 3DNow! is that it is possible to add or multiply the two numbers that are stored in the same register. With SSE, each number can only be combined with a number in the same position in another register. This capability, known as horizontal in Intel terminology, was the major addition to the SSE3 instruction set.

http://en.wikipedia.org/wiki/3DNow!

ShintaiDK · Feb 8, 2013

3DNow! never became universal. So it's completely irrelevant. Just like SSE4A and SSE5.

NTMBK · Feb 8, 2013

Abwx said:
Still trashing AMD without even checking the instructions in question?

They weren't adopted because Intel wants to maintain its grip on the
instruction sets, nothing else, but Intel does not mind using some
of those instructions after rebranding them with in-house names.

http://en.wikipedia.org/wiki/3DNow!

That wasn't the same instruction. 3DNow! had horizontal add for 2 floats, whereas SSE3 had hadd using 4 floats. Totally different instruction- it's like the difference between SSE2 and AVX2. http://msdn.microsoft.com/en-us/library/yd9wecaa(v=vs.100).aspx
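To make the distinction concrete, here is a minimal Python sketch that models registers as plain lists of floats. The function names are made up for illustration; they only mirror the documented behaviour of a lane-wise SSE add (ADDPS), the 3DNow! PFACC instruction, and the SSE3 HADDPS instruction, and are not any real API.

```python
def vertical_add(a, b):
    """Lane-wise add, as in SSE ADDPS: result[i] = a[i] + b[i]."""
    return [x + y for x, y in zip(a, b)]

def pfacc(dst, src):
    """3DNow! PFACC model: two floats per 64-bit MMX register.
    The low result lane sums the pair in dst, the high lane sums the pair in src."""
    return [dst[0] + dst[1], src[0] + src[1]]

def haddps(dst, src):
    """SSE3 HADDPS model: four floats per 128-bit XMM register.
    Adjacent pairs are summed; dst pairs fill the low half, src pairs the high half."""
    return [dst[0] + dst[1], dst[2] + dst[3], src[0] + src[1], src[2] + src[3]]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

vertical = vertical_add(a, b)       # [11.0, 22.0, 33.0, 44.0]
horizontal4 = haddps(a, b)          # [3.0, 7.0, 30.0, 70.0]
horizontal2 = pfacc(a[:2], b[:2])   # [3.0, 30.0]
```

The two horizontal operations share the idea of summing neighbours inside one register, but they work on different register files and widths, so neither can stand in for the other at the binary level.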

ShintaiDK- I wasn't saying that AMD's extensions are going to become widely used at all, simply pointing out that they are compatible with the VEX encoding.

EDIT: Although given that AMD have won the next gen console contracts, I at least expect the instructions used in Jaguar to become more widely used in console games and compilers, if not Windows applications. http://semiaccurate.com/assets/uploads/2012/08/slide-1-728.jpg Not that the most useful one (FMA4) is implemented.

ShintaiDK · Feb 8, 2013

Not going so well for Abwx

NTMBK: But Intel compilers already generate the fastest code for AMD CPUs, and Jaguar, assuming it will be used for consoles, also supports AVX. And if you also wish to port games to a PC segment that outgrows consoles, why use anything but universal instructions and the fastest code-generating compiler there is?

NTMBK · Feb 8, 2013

ShintaiDK said:
Not going so well for Abwx

NTMBK: But Intel compilers already generate the fastest code for AMD CPUs, and Jaguar, assuming it will be used for consoles, also supports AVX. And if you also wish to port games to a PC segment that outgrows consoles, why use anything but universal instructions and the fastest code-generating compiler there is?

I'm honestly not sure what the compiler situation would be- Intel would probably give the best results at first, I agree. But I can't see MS being happy to not use all the instructions of the processor that they can when they have a world-class x86-64 compiler and team- it's certainly possible that they would bring out an updated compiler tuned very specifically to the Jaguar cores. Look at the situation with the PS3 and 360- dev tools improved significantly over the course of its life, as the compiler teams got to grips with tuning code for the in-order PowerPC cores. It could honestly go either way, though.

Abwx · Feb 8, 2013

NTMBK said:
That wasn't the same instruction. 3DNow! had horizontal add for 2 floats, whereas SSE3 had hadd using 4 floats. Totally different instruction- it's like the difference between SSE2 and AVX2. http://msdn.microsoft.com/en-us/library/yd9wecaa(v=vs.100).aspx

My bad for the wording. What I wanted to point out is that Intel
used a feature already present in 3DNow!, not that SSE3
has any relation to 3DNow!. I guess that ShintaiDK jumped
eagerly on the interpretation that suits his agenda, though.

SocketF · Feb 8, 2013

ShintaiDK said:
NTMBK: But Intel compilers already generate the fastest code for AMD CPUs, and Jaguar, assuming it will be used for consoles, also supports AVX. And if you also wish to port games to a PC segment that outgrows consoles, why use anything but universal instructions and the fastest code-generating compiler there is?

Concerning the console wins, I wonder if these will prevent a fast move to AVX256. Like the Bulldozer cores, Jaguar splits AVX256 instructions into two 128-bit pieces. Thus it is better to stick with AVX128 from the beginning.

Because of the shorter VEX prefix, AVX is still a bit better than the equivalent SSEx instructions. Furthermore, you can use 3 operands instead of only 2.

Thus it might help PCs with FX processors a bit, depending on how much effort the publishers put into the PC ports.
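As an illustration of what the VEX prefix carries, the following Python sketch decodes the 2-byte form of the prefix (a 0xC5 byte followed by one payload byte). The function name and dictionary keys are hypothetical, and the 3-byte form (0xC4) is deliberately not handled. The payload layout assumed here is the published one: an inverted R bit, an inverted 4-bit register field (the extra source operand), a vector-length bit, and two bits that stand in for a legacy prefix byte.

```python
def decode_vex2(code):
    """Decode a 2-byte VEX prefix (0xC5 followed by one payload byte)."""
    if len(code) < 3 or code[0] != 0xC5:
        raise ValueError("not a 2-byte VEX-encoded instruction")
    payload = code[1]
    return {
        # R and vvvv are stored inverted (ones' complement) in the prefix.
        "R": ((payload >> 7) & 1) ^ 1,
        "extra_source_reg": (~(payload >> 3)) & 0xF,
        # L selects the vector length: 0 = 128-bit, 1 = 256-bit.
        "vector_bits": 256 if (payload >> 2) & 1 else 128,
        # pp stands in for a legacy SIMD prefix byte.
        "implied_prefix": {0: None, 1: 0x66, 2: 0xF3, 3: 0xF2}[payload & 3],
        "opcode": code[2],
    }

# vaddps xmm0, xmm1, xmm2  ->  C5 F0 58 C2
xmm_add = decode_vex2(bytes([0xC5, 0xF0, 0x58, 0xC2]))
# vaddps ymm0, ymm1, ymm2  ->  C5 F4 58 C2
ymm_add = decode_vex2(bytes([0xC5, 0xF4, 0x58, 0xC2]))
```

The two encodings differ only in the L bit, which is why "AVX128" and "AVX256" are the same instruction format at two vector lengths, and the register field is where the third operand comes from.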

Olikan · Feb 8, 2013

NUSNA_Moebius said:
Oh? I thought it was an improved Piledriver? LOL......

ooops!

2 years delay

ShintaiDK · Feb 8, 2013

SocketF said:
Concerning the console wins, I wonder if these will prevent a fast move to AVX256. Like the Bulldozer cores, Jaguar splits AVX256 instructions into two 128-bit pieces. Thus it is better to stick with AVX128 from the beginning.

Because of the shorter VEX prefix, AVX is still a bit better than the equivalent SSEx instructions. Furthermore, you can use 3 operands instead of only 2.

Thus it might help PCs with FX processors a bit, depending on how much effort the publishers put into the PC ports.

There is a benefit, even when using 2 cycles to execute it.

We also only got single-cycle SSE with Core 2, and SSE was heavily used before that.

SocketF · Feb 8, 2013

ShintaiDK said:
There is a benefit, even when using 2 cycles to execute it.

Which one? AMD changed their part of the GCC compiler to generate AVX128 only. Guess why... yes, it is a few % faster.

We also only got single-cycle SSE with Core 2, and SSE was heavily used before that.

That's not the point. The point is that the consoles probably have Jaguar with AVX128 only, and that won't ever change now. Intel already has single-cycle AVX256 execution today, but that is of no concern. Jaguar is inside, not Intel.

Knowing that the console's single-thread performance will be quite low, due to low clocks, I assume that the console programmers will try to squeeze everything out of the Jaguar cores, i.e. use AVX128 only, not 256.

ShintaiDK · Feb 8, 2013

SocketF said:
Which one? AMD changed their part of the GCC compiler to generate AVX128 only. Guess why... yes, it is a few % faster.
That's not the point. The point is that the consoles probably have Jaguar with AVX128 only, and that won't ever change now. Intel already has single-cycle AVX256 execution today, but that is of no concern. Jaguar is inside, not Intel.

Knowing that the console's single-thread IPC will be quite low, due to low clocks, I assume that the console programmers will try to squeeze everything out of the Jaguar cores, i.e. use AVX128 only, not 256.

Haswell is not released yet. So no single cycle 256bit AVX yet.

Phynaz · Feb 8, 2013

SocketF said:
Knowing that the console's single-thread IPC will be quite low, due to low clocks

Me thinks you are confused.

SocketF · Feb 8, 2013

ShintaiDK said:
Haswell is not released yet. So no single cycle 256bit AVX yet.

Seems you are mixing up AVX with FMA. AVX has been supported since Sandy Bridge, and yes, 256-bit in one cycle.

But this is not the topic here anyway.

Phynaz said:
Me thinks you are confused.

Sorry, I meant performance; you are right, that doesn't make sense.

ShintaiDK · Feb 8, 2013

SocketF said:
Seems you are mixing up AVX with FMA. AVX has been supported since Sandy Bridge, and yes, 256-bit in one cycle.

But this is not the topic here anyway.

Sorry, I meant performance; you are right, that doesn't make sense.

256-bit AVX instructions take 2 cycles on SB/IB. The datapaths on those CPUs are also only 128-bit.

If it were single-cycle, the difference would have been huge:
% gain/loss, AVX256 vs AVX128
(negative % indicates loss, positive % indicates gain)

Benchmark         AMD BD   Intel SB
410.bwaves         -2.34      -1.52
416.gamess         -1.11      -0.30
433.milc            0.47      -1.75
434.zeusmp         -3.61       0.68
435.gromacs        -0.54      -0.38
436.cactusADM     -23.56      21.49
437.leslie3d       -0.44       1.56
444.namd            0.00       0.00
447.dealII         -0.36      -0.23
450.soplex         -0.43      -0.29
453.povray          0.50       3.63
454.calculix       -8.29       1.38
459.GemsFDTD        2.37      -1.54
465.tonto           0.00       0.00
470.lbm             0.00       0.21
481.wrf            -4.80       0.00
482.sphinx3       -10.20      -3.65
SpecINT            -3.29       1.01

Benchmark         AMD BD   Intel SB
400.perlbench       0.93       1.47
401.bzip2           0.60       0.00
403.gcc             0.00       0.00
429.mcf             0.00      -0.36
445.gobmk          -1.03       0.37
456.hmmer          -0.64       0.38
458.sjeng           1.74       0.00
462.libquantum      0.31       0.00
464.h264ref         0.00       0.00
471.omnetpp        -1.27       0.00
473.astar           0.00       0.46
483.xalancbmk       0.51       0.00
SpecFP              0.09       0.19

Haserath · Feb 8, 2013

Sandy Bridge can sustain a full 16 single precision FLOP/cycle or 8 double precision FLOP/cycle double the capabilities of Nehalem. This guarantees that software which uses AVX will actually see a substantial performance advantage on Sandy Bridge and should spur faster adoption.

http://www.realworldtech.com/sandy-bridge/6/
?

Abwx · Feb 8, 2013

That's 2 double precision FLOPs/cycle/core, 2 x 64-bit ops.

SocketF · Feb 8, 2013

ShintaiDK said:
256-bit AVX instructions take 2 cycles on SB/IB. The datapaths on those CPUs are also only 128-bit.

I am talking about this:

Jaguar is splitting AVX256 instructions into two 128bit pieces

I thought it was clear from that that the topic is decoding, not execution.

Speaking in general about "2 cycle execution" of any AVX256 instruction does not make sense either, e.g. multiplications take many more cycles than additions.

Some comments to your numbers:

a) Using AVX for SpecINT is "suboptimal" because AVX256 is usable only for FP. AVX256 for INT is called AVX2, in that case you really have to wait for Haswell. So I assume there are some reasons for the bad results, but they have nothing to do with 256bit vs. 128bit, because there is only 128b anyways. Source: http://www.drdobbs.com/tools/intel-avx2-will-bring-integer-instructio/231000372

b) For SpecFP: Code never consists of pure AVX256 parts. There is lots of other code. Check out the explanation from the y-cruncher program as an example:

Q: Why does AVX (v0.5.5) only give about 10% speedup over SSE4.1 (v0.5.4)? Shouldn't it be double the speed?
A: Unlike the majority of compute-intensive applications, y-cruncher does not exclusively use floating-point. As of v0.5.4, only about 30% of a Pi computation is floating-point bound. The remainder of the time is spent on integer operations and stalling on memory access. So cutting that 30% in half yields little overall speedup. Speeding up the code in this manner exposes more memory bottlenecks - which ends up reducing the speedup to only 10%...

Integer operations can largely be emulated using floating-point (albeit with overhead). But most of the integer work involves carry-propagation, so it is not very vectorizable. For now, integer operations are still faster using the normal integer instructions.

http://www.numberworld.org/y-cruncher/
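The arithmetic in that FAQ answer is Amdahl's law. A minimal sketch, using only the 30% figure from the quote and assuming, as a best case rather than a measurement, that AVX exactly halves the floating-point part (the function name is illustrative):

```python
def overall_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: speedup of the whole run when only part of it gets faster."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / local_speedup)

# Figures from the quoted FAQ: ~30% of the run is floating-point bound,
# and AVX is assumed to halve that part (a best case, not a measurement).
ideal = overall_speedup(0.30, 2.0)
print(f"ideal overall speedup: {ideal:.3f}x")
```

Even in this best case the whole program gains only about 18%, so the roughly 10% the author reports after memory stalls is consistent with execution units that really are twice as wide.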

Conclusion: Just because the scores are not doubled does not mean that the execution units are not doubled.

SocketF · Feb 8, 2013

Abwx said:
That's 2 double precision FLOPs/cycle/core, 2 x 64-bit ops.

No, the numbers were already for one core only. The article is about the Sandy Bridge architecture, not about some Sandy Bridge quad core.
Edit: That is the important part:

As Figure 5 above indicates, Sandy Bridge can execute a 256-bit FP multiply, a 256-bit FP add and a 256-bit shuffle every cycle.

Furthermore, there are also Sandy Bridge Xeons with 8 cores... think about what it would mean if you applied your math in that case, too ;-)
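For reference, the per-core figures from the article can be reproduced with a one-line calculation. This is only a sketch: the function name is made up, and the "2 ops per cycle" input is the article's statement that one 256-bit multiply and one 256-bit add can issue every cycle.

```python
def peak_flops_per_cycle(vector_bits, element_bits, ops_per_cycle):
    """Peak FLOP/cycle/core = lanes per vector * vector FP ops issued per cycle."""
    lanes = vector_bits // element_bits
    return lanes * ops_per_cycle

# Assumption taken from the quoted article: one 256-bit multiply plus
# one 256-bit add can issue every cycle on a Sandy Bridge core.
sp = peak_flops_per_cycle(256, 32, 2)   # single precision
dp = peak_flops_per_cycle(256, 64, 2)   # double precision
```

With these inputs `sp` is 16 and `dp` is 8, matching the 16 single precision and 8 double precision FLOP/cycle quoted from the article for a single core.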

inf64 · Feb 8, 2013

Are we sure SB/IB can sustain 2x256bit ops per core per cycle? Isn't it limited by the effective L/S BW?

Anyway, Jaguar, like SocketF said, is inside next-gen consoles (or SR) and it supports basic AVX1.1. Devs will probably use AVX128, but that's not a big issue since I doubt games would benefit largely from 256-bit FP ops anyway. Game code is usually integer heavy and branch heavy, so 256-bit FP ops are probably useless there (apart from maybe some physics on the CPU, but they have GCN cores on die anyway which can do a better job).

Nemesis 1 · Feb 8, 2013

NTMBK said:
Actually, AMD are already using VEX encoding - they changed their proposed "SSE5" instructions to match the new encoding, for instance in their FMA4 instructions.

What, you guys just want to ignore the facts? AMD has their own prefix. They do not have, nor will they ever have, the vec prefix. It's Intel exclusive and works with Intel software and hardware only, for auto recompile, or runtime I believe. No more red herrings. I spent too much time debating this with the AMD fellow who spread his good cheer on this very forum and got me banned for a few days. Truth always rises to the surface. Of course the general public didn't know about AVX2. I see it as Intel's old mitosis: same elements, instruction set coupled with hardware and software for 2x+ performance increases, less in some cases. But what the hell do I know, can't spell and my grammar sucks. No one takes someone that can't spell, with poor grammar, seriously. Hiding in the open.

inf64 · Feb 8, 2013

AMD supports the same VEX encoding as Intel does... How do you think it can run AVX code, by magic pixie dust? What AMD does differently is the encoding for their own proprietary XOP ISA, a unique set of media instructions they have built.

Nemesis 1 · Feb 8, 2013

inf64 said:
AMD supports the same VEX encoding as Intel does... How do you think it can run AVX code, by magic pixie dust? What AMD does differently is the encoding for their own proprietary XOP ISA, a unique set of media instructions they have built.

JUST STOP! I am not talking about the instruction set. THE prefix of vec or vex is an Intel exclusive. It works with Intel hardware and software together. AMD does not have this software or hardware. They may have a prefix, but it's not Vec. It's for auto recompile, I believe at run time, not sure. So now you're going to say AMD has Intel compilers. I know they can run Intel compilers, but it won't work the same, and it's legal. This is not the result AMD and NV wanted when they complained to the FTC about Intel compilers. The change Intel had to make to keep the FTC happy: Intel had to label the compilers as not performing as well on non-Intel products. A big win for Intel.

Nemesis 1 · Feb 8, 2013

inf64 said:
AMD supports the same VEX encoding as Intel does... How do you think it can run AVX code, by magic pixie dust? What AMD does differently is the encoding for their own proprietary XOP ISA, a unique set of media instructions they have built.

MODS, I WANT NO MORE RED HERRINGS BY THIS MAN. The info is freely available and he is not telling the truth, as we have all witnessed since 2006.

SocketF · Feb 8, 2013

As long as you don't use Intel's compiler (there are lots of others), you can use AVX, including VEX-prefixed instructions, on AMD chips too. They are 100% compatible.

Soon Intel should also provide a compiler option for "slow-AMD-AVX" code. Even if it is slower, it will of course use the VEX prefix. Prefixes are hardware; you are talking about software.

Funny side note: the y-cruncher programmer mentioned above stated that Microsoft's compiler generates better/faster AVX256 code for Intel CPUs than Intel's compiler ;-)

inf64 · Feb 8, 2013

Oh man, my fail for even trying to respond to that poster. Won't happen again, let him fall into the hole he dug himself.

For those who are interested in the VEX prefix/coding scheme, Wikipedia is the easiest source.

History

In August 2007, AMD proposed the SSE5 instruction set extension which includes a new coding scheme for instructions with three operands, using an extra byte named DREX intended for the Bulldozer processor core, due to begin production in 2011.[2][3]

In March 2008, Intel proposed the AVX instruction set, using the new VEX coding scheme.[4]

In August 2008, commentators deplored the expected incompatibility between AMD and Intel instruction sets, and proposed that AMD revise their plans and replace the DREX scheme with the more flexible and extensible VEX scheme.[5]

In May 2009, AMD announced a revision of the proposed SSE5 instruction set to make it compatible with the AVX instruction set and the VEX coding scheme. The revised SSE5 is called XOP.[6]

January 2011. The AVX instruction set is supported in Intel's Sandy Bridge microprocessor architecture.

2011. The AVX, XOP and FMA4 instruction sets, all using the VEX scheme, are supported in the AMD Bulldozer processor.[7]

Unknown date. The FMA3 instruction set, but possibly not FMA4, will be supported in Intel processors.

PS: The guy cannot tell the difference between a compiler (a piece of software) and an instruction coding standard (which is what VEX is)...
