podspi
Golden Member
He was talking about the possible outcome when manufactured at 22nm.
22nm is scheduled for end of 2012, right? I wonder if that will be a die-shrink of Gen2 BD, or Gen3... Is it too early to start the speculation? 😀
If it can do 4.1 on one core, then it can do 4.1 on 4 cores (one per module) at the very least, assuming that you have this level of granularity in terms of overclocking options.
The problem is going to be cooling and power at this point, and that will come down to your motherboard and heatsink.
Don't expect GloFo 22nm sooner than Q3 2013....
Actually, GF won't do 22nm for CPUs. They will have 20nm in late 2013, according to the last roadmap I saw.
Yes but for address calculation in call and lea e.g. And yes that helps for lea and call as can be seen in the latency tables.
This assumption is not true, yet all your twisted logic is built upon this erroneous point...
BD, as pointed out by AMD, is 4 issues wide for each integer core. How they manage to do it is still unknown, but as I already posted, the optimisation manual says explicitly that the AGLUs perform not only address generation, but also logical and arithmetic operations....
So, to prevent Bulldozer from lagging behind, clock speed and Turbo Core will have to be cranked up considerably. As for the much-discussed topic of IPC (instructions per clock), Bulldozer will probably not be able to compete with its predecessor in spite of some architectural improvements, especially because each pair of two cores has to share the frontend with the decoders while also using the same FPU.
Yes, but for address calculation in call and lea, e.g. And yes, that helps for lea and call, as can be seen in the latency tables.
The BD architecture is 2 wide: 2 ALUs, 2 x86 ops / cycle. That is the information we have so far, and it comes from an official AMD document. As long as there is no revision to that document, those are the facts. Even more, as I constantly repeat: since the decoders cannot decode more than 4 instructions for 2 cores, it would also make no sense to be able to process more than 2 instructions / core. 2 Llano or Sandy Bridge cores can however decode much more than that, and Sandy Bridge has in addition a loop trace cache of already decoded instructions.
add_loop:
    vmovsd  xmm0, QWORD PTR [rax]        ; Load double pointed to by RAX.
    vaddsd  xmm0, xmm0, QWORD PTR [rbx]  ; Add double pointed to by RBX.
    vmovsd  QWORD PTR [rax], xmm0        ; Store double result.
    add     rax, 8                       ; Point to next element of array a.
    add     rbx, 8                       ; Point to next element of array b.
    dec     rcx                          ; Decrement counter.
    jnz     add_loop                     ; If elements remain, then jump.
cyc  #instrs  instrs               note
 1      3     movsd, addsd, movsd  Only one load/store pair per dispatch
 2      4     add, add, dec, jnz   Max of 4 per cycle
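For readers who don't speak assembler, the loop being counted above is just an element-wise double add. A C equivalent (my own sketch; the function name simply mirrors the listing's label) looks like this:

```c
#include <stddef.h>

/* Element-wise double add, equivalent to the add_loop assembly:
   a[i] += b[i] for n iterations, with the pointers walking the arrays
   the way rax/rbx do and n counting down the way rcx does. */
static void add_loop(double *a, const double *b, size_t n)
{
    while (n != 0) {
        *a += *b;   /* vmovsd load, vaddsd, vmovsd store */
        a++;        /* add rax, 8 */
        b++;        /* add rbx, 8 */
        n--;        /* dec rcx / jnz add_loop */
    }
}
```

Each C statement maps onto one or two of the 7 counted instructions, which is why the cycle table groups them into a load/add/store triple and a four-instruction pointer/counter group.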
http://www.planet3dnow.de/vbulletin/showthread.php?p=4416527#post4416527

The execution unit supports single-cycle operand bypass from an instruction to a dependent instruction. Two ALU ops and two AGU ops can be executed in a cycle. AGU ops include increment/decrement (INC), address generate, and x86-64 LEA instructions.
I've got the original paper mag containing this column on Saturday. It's actually the rumour column ("whispers"). Articles like the one about Llano contain much more detail and are located in the articles section of the c't mag.
I found just a lengthy article about Bulldozer from a CPU expert who comes to the exact same conclusions as I did.
http://www.h-online.com/newsticker/...rs-About-latencies-and-compilers-1232290.html
[...]
Just read the article at the link.
Here we see another problem of Bulldozer's CMT approach. Let me quote the Bulldozer SW optimization manual, which I had for a couple of months (NDA version) before it went public, so I've already seen some interesting things. First the code (now on p. 136), i.e. the add_loop listing above. Then the cycle counting: they finally count it as 7 instructions in 2 cycles, as in the table above. There is just no specific information on the type of counted cycles. OTOH this code would work, since the FPU and the integer cores can issue in parallel. The new rax, rbx, rcx values could be calculated in the same cycle as the FPU instructions, since they use renamed internal registers (OOO).
Can anyone tell me for sure if one thread can use both 128-bit FMACs per cycle per module???
I was under the impression that only one 128-bit FMAC could be used per core per cycle.
Okay, 3 AVX instructions issued in two cycles, 4 integer instructions in 2 cycles. Appears to be possible. But in detail it is very strange. All three AVX instructions have a strong dependency, so not only can they not be issued in subsequent cycles, you additionally have to wait the full latency of all three instructions. Of course the scheduler can register-rename and pull the vmovsd much in advance, so this sequence is fine for the scheduler, but that helps only if you get your scheduler filled with a lot of other stuff. A main issue, and why latency is important even if you have great schedulers. Since Core2, Intel is just fantastic on latency.
Now integer. There is also a strong dependency, so no chance to do that in one cycle. Even Intel could not do it because of the dependency. [Update] Yes, Intel can do that because of MacroOp fusion, and the integer dependency is much less; there is only the status flag dependency (dec/jnz).
I am very afraid that they mean that the given cycles 1/2 imply that, in addition to the latency, you have to wait an additional cycle to let the previous result propagate! As you might know, AMD achieved the small FO4 count especially by dropping the ability to propagate results at 0 cycle cost. An AMD engineer working on Bulldozer mentioned this, but I expected it to work in a different way. If you really need an additional cycle for this propagation because of the high-speed design, it would mean that performance gets even much worse than I have expected so far.
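If such an extra propagation cycle really exists, it would show up directly in a microbenchmark that compares a serial dependency chain against independent accumulators: the first is bound by latency (plus any bypass penalty), the second by throughput. A hedged C sketch (the function names are mine, and the timing harness is omitted; only the arithmetic is shown):

```c
#include <stddef.h>

/* Dependent chain: every add needs the previous result of s, so the
   run time is roughly n * (add latency + any extra bypass cycle). */
static double dep_chain(double x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s = s + x;                  /* serial dependency on s */
    return s;
}

/* Independent chains: four accumulators hide the latency, so the run
   time approaches n / throughput instead of n * latency. The ratio of
   the two timings exposes the per-result latency, penalty included. */
static double indep_chains(double x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i + 3 < n; i += 4) {
        s0 += x;                    /* four chains with no */
        s1 += x;                    /* dependency on each other */
        s2 += x;
        s3 += x;
    }
    return s0 + s1 + s2 + s3;
}
```

Timing both loops on real silicon would settle the question better than parsing the manual's wording.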
Okay, this is new information about increment/decrement. But that is really not much. And again, how to feed that with the limited decoders? Maybe the inc/dec capability is related to the:
    dec ryx
    jnz [address]
so that this pair can be fused. But I don't remember AMD saying anything about MacroOp fusion, and it would also be very difficult, especially in a high-speed design like Bulldozer.
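For context, the loops where a fused dec/jnz pair would pay off are ordinary counted loops; a countdown loop in C usually compiles to exactly such a tail (this is compiler-dependent, and the example is only illustrative):

```c
#include <stddef.h>

/* Sums n doubles using a countdown counter. With optimization on,
   the loop tail typically becomes a dec/jnz (or sub/jne) pair on
   the counter register, which is the candidate for fusion. */
static double sum_countdown(const double *p, size_t n)
{
    double s = 0.0;
    while (n != 0) {    /* dec + jnz candidate */
        s += *p++;
        n--;
    }
    return s;
}
```

Since almost every hot loop ends in such a decrement-and-branch pair, fusing it into one macro-op effectively buys back an issue slot per iteration, which is why the question matters for the narrow per-core decode.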
And to put more bad news on Bulldozer:
http://www.chiphell.com/thread-190177-1-1.html
I have absolutely no idea whether this is fake or not, but it would somehow fit the information we have (remember: slower cores with less throughput, but at double the core count and higher frequency).
Now, as this comparison is at the same frequency (which is of course stupid for Bulldozer), it would fit the capabilities of the processors once you take away Bulldozer's high-frequency advantage.
However, I hope that it is fake and Bulldozer performs somewhat better; especially the 8-core result on Cinebench is too low in my opinion (fdiv?).
Regarding SuperPI, I wonder how it is compiled that it always performs so terribly on AMD CPUs. I would throw SuperPI away as an indicator of CPU performance anyway.
Again I say: from the architectural standpoint, AMD failed with Bulldozer because of the heavy "gain vs die space" mismatch.
If it really turns out to be true that the removed register-result propagation ability means that dependent instructions need an additional wait cycle, then Bulldozer's performance is irreparably doomed. I still hope that I just misunderstood this, because the performance outlook for Bulldozer is not great even without such an additional catastrophic issue.
If I have time I will reread this part of the manual. But anyway, I am not very happy with the poor quality of the public version of this manual. When I read it the first time I was very happy, because I understood the AGLU as MacroOP ALU functionality (for mov/add/sub), which would have been great, until another user here pointed me to the Appendix information.
Dude, seriously, what is your deal? I remember when you first started posting on this thread: you were praising BD for its design and saying how great it was and how AMD actually stood a chance. I was actually looking up to you; you were pointing out its strong points when others said you were wrong, and bla bla bla...
Now you're like the complete opposite. Others are trying to praise BD and you're ripping them a new one... seriously, WTF?!?! LOL
I am very confused.
I'm trying to think of real code that would do this.... Two FMACs in a row without anything between them??
You are absolutely correct on this. What can I say?
At the beginning I thought that would be no problem, because (if you reread my posts from then) I said that AMD would then come out with a 16-core BD. But with ~600 mm² die space, that will just not happen. I assumed they could do it in less than 400 mm².
From a scheduling and execution standpoint, for sure yes; that was part of a question clarification on AMD's Bulldozer preview web site. Regarding sustained usage, it depends of course on what the other thread does, and throughput information was missing from the optimization manual.
I am pretty sure AMD is releasing a 16 core Bulldozer, and I am pretty sure we have seen leaks of performance from it. It has been stated over and over again that there will be a 16 core Bulldozer CPU, as that is where we got the "50% more performance from 33% more cores" line from in the first place (16 cores versus the current 12 cores).
I honestly think you are worrying way too much about something you have no control over. It will be released soon enough, and its release won't significantly change the world for either the better or worse. Just check it out in a couple of months, when we will have actual reviews and benchmarks of the processor. It isn't that long of a wait anymore.
Okay, hey, the 4-wide decoder and 32-byte fetch is great, but you forget that for one core that means only 2-wide decoding and 16-byte fetch.
I think you are overplaying the importance of decoding. The decoders don't need to be as wide as the execution units, because most real code is full of data stalls, during which you get to run a little ahead with decode. Intel is decode-limited without the uop cache mostly because they really suffer whenever crossing 16-byte alignment; AMD has somewhat faster decode simply because their window is 32 bytes. If you can decode a full 4 instructions every second cycle without exceptions, I'm pretty sure you can extract most of the ILP in normal x86 code.
Decode bandwidth is also tightly dependent on branch prediction, as a mispredict causes you to waste decoding. Looking at all the added SRAM in the decode units, it's frankly possible that the real decode BW goes up in BD, because of reduced mispredicts.
As for execution width, BD is 4-wide, making it the widest AMD CPU ever (still narrower for integer code than Intel's 5-wide, though). Sure, there is one less add pipe, but way more than 33% of x86 is movs (either as separate instructions or as memory operands in ALU instructions), making it a win versus Phenom.
If your fears are founded and there is always a wait cycle between dependent instructions, then yes, that would totally ruin performance. I'm still hoping that the rumors for that are unfounded, and it only happens in some (not very common) situations.
Yes, I know Interlagos. But JF-AMD said it is only issued as a server part and that they will not release it as a desktop part.
Therefore, in the server market everything seems to be okay, because CPU prices are not so much of an issue there and AMD uses an MCM. What I mean is that they cannot do a single-die 16-core part, whereas I think Sandy Bridge E will be a single-die part.
Oh I am not worrying about the Bulldozer launch. As I said to compete with current Intel offerings it is enough. I am worrying about the Sandy Bridge E launch! And then about Ivy Bridge launch ...