Intelligent AMD vs. Intel thread...

Jeff7181

Lifer
I hate to bring up this subject again because of the flaming that goes on in these threads... but I'm curious...

What are the main architectural differences between the P4 and the AXP? I've heard brief descriptions... and I've heard that AXPs do more operations per clock cycle... but how many more? What does AMD's Quantispeed technology actually do? Is it like 4 paths for data to take? I looked at the tech docs on AMD's site but they're not much help... they just say Quantispeed makes it faster, not how.

Intel seems to have the lead right now in terms of top performance... do you see the Hammer taking the crown away from the P4 HT chips?

Could someone explain why a 64 bit CPU is faster than a 32 bit CPU?
 
Intelligent AMD vs. Intel thread...

Good Luck.

The best intentions deteriorate into a flame war no matter what; just look at what happened in all the recent Barton threads.

Chiz
 
For 32-bit vs. 64-bit, I believe this analogy is correct (correct me if I'm wrong, anybody):

Think of a 32-bit CPU as a 2-lane highway... one going and one coming... compare that to the 4-lane highway of a 64-bit CPU... 2 going, 2 coming... pretty much doubles the performance.

If you look at AMD's website, somewhere they have a comparison between AXPs and P4s... unless that is what you were referring to in your post?
 
Intelligent AMD vs. Intel thread? Damnit! I was hoping for a stupid fanboy AMD v. Intel flamefest... moving along 🙁
 
Quantispeed is a marketing term. It doesn't actually refer to any specific feature of the Athlon, but merely points to the fact that it does do more operations per clock on average than the P4 (specifically) does.

As for 64-bitness, look to the Anandtech FAQ and also here for some answers.

As for architectural differences. You've gotta get past this idea that there is such a thing as absolute performance. Performance varies according to how software is written. You'll see that in most modern code that the Athlon can finish more instructions on average than the P4 can, and then you'll see instances such as certain scenes in Lightwave in which a 1.8 GHz P4 is able to render scenes faster than a 2.25 GHz AthlonXP.

For a detailed look at both architectures, refer to these articles: 1, 2, 3, as well as Aceshardware for a detailed look at both.
 
Originally posted by: imgod2u
Quantispeed is a marketing term. It doesn't actually refer to any specific feature of the Athlon, but merely points to the fact that it does do more operations per clock on average than the P4 (specifically) does.

As for 64-bitness, look to the Anandtech FAQ and also here for some answers.

As for architectural differences. You've gotta get past this idea that there is such a thing as absolute performance. Performance varies according to how software is written. You'll see that in most modern code that the Athlon can finish more instructions on average than the P4 can, and then you'll see instances such as certain scenes in Lightwave in which a 1.8 GHz P4 is able to render scenes faster than a 2.25 GHz AthlonXP.

For a detailed look at both architectures, refer to these articles: 1, 2, 3, as well as Aceshardware for a detailed look at both.


If the application applies the same algorithm to many items of data (single instruction, multiple data), then it is possible for the P4 to do well, which is why it has done well in Lightwave, although that doesn't occur too often. However, there are a few examples where AMD's strong and efficient FPU still wins, e.g. a CAD test a while back in a real industrial environment showed the Athlon MP 2000+ giving a 10% advantage over an Intel 2.2 GHz Northwood.
 
Originally posted by: txxxx
Originally posted by: imgod2u
Quantispeed is a marketing term. It doesn't actually refer to any specific feature of the Athlon, but merely points to the fact that it does do more operations per clock on average than the P4 (specifically) does.

As for 64-bitness, look to the Anandtech FAQ and also here for some answers.

As for architectural differences. You've gotta get past this idea that there is such a thing as absolute performance. Performance varies according to how software is written. You'll see that in most modern code that the Athlon can finish more instructions on average than the P4 can, and then you'll see instances such as certain scenes in Lightwave in which a 1.8 GHz P4 is able to render scenes faster than a 2.25 GHz AthlonXP.

For a detailed look at both architectures, refer to these articles: 1, 2, 3, as well as Aceshardware for a detailed look at both.


If the application applies the same algorithm to many items of data (single instruction, multiple data), then it is possible for the P4 to do well, which is why it has done well in Lightwave, although that doesn't occur too often. However, there are a few examples where AMD's strong and efficient FPU still wins, e.g. a CAD test a while back in a real industrial environment showed the Athlon MP 2000+ giving a 10% advantage over an Intel 2.2 GHz Northwood.

I would say it is done quite a lot. Vector computing has been in development for quite some time and it's brought about quite a bit of improvement. Even if it was to be used in a scalar way, SSE2 can still offer a performance increase vs conventional x87. Aceshardware did a test with Intel's compiler a while back and just by enabling the auto-vectorization option (which pretty much just wraps x87 operations inside a scalar SSE2 instruction), the performance increase was up to 300%.
I mean, think about how much vectors are used in software. I dare say more than half of all data structures use vectors (some are pointer-based, but those aren't really practical IMO for modern computing).
 
>You could always make your own thread....

Maybe the mods could lock it now while it is still civil?

I don't think it is quite true that the P4 inherently does fewer instructions per cycle than the Athlon, other than for floating point. It is HARDER for the P4 to AVERAGE as many instructions per cycle. If programmers diligently applied Intel's recommendations, the P4 could probably come very close to an Athlon. But is it worth the extensive amount of work to do so, the program "bloat", and the confusing code, when most parts of most programs are already more than "fast enough" as they are normally used? No. A lot of "office apps" are like this. Now run such non-optimal (for the P4) code as a benchmark. Its average instructions per cycle is not as high as the Athlon's (70%?). Real users using the program probably could not tell the difference; the slower parts are where it isn't noticeable.
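To put rough numbers on that trade-off, here is a minimal sketch that treats throughput as average instructions per cycle times clock speed. The 0.7 relative IPC is the guess from the post above, and the 2.0 GHz Athlon clock is an arbitrary illustrative value, not a measurement of any real chip.

```python
def throughput(ipc, clock_ghz):
    """Billions of instructions completed per second: IPC times clock."""
    return ipc * clock_ghz


def clock_needed_to_match(target_ipc, target_clock_ghz, own_ipc):
    """Clock (GHz) a chip with own_ipc needs to equal the target's throughput."""
    return throughput(target_ipc, target_clock_ghz) / own_ipc


# Assumed values for illustration only.
athlon_ipc = 1.0      # normalised baseline
p4_ipc = 0.7          # the "70%?" guess from the post
athlon_clock = 2.0    # GHz, hypothetical

needed = clock_needed_to_match(athlon_ipc, athlon_clock, p4_ipc)
print(f"P4 clock needed to match: {needed:.2f} GHz")
```

Under these assumptions the P4 would need roughly 43% more clock to break even on unoptimized code, which is the gap that careful P4-specific tuning is meant to close.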

Next take programs that require careful and skillful optimization because they need to be as fast as possible - the latest games for instance. You don't really need the coding oddities just to work around the P4's pitfalls in order to optimize for the Athlon. The lack of the consequent bloat makes the cache more effective; more things will fit easier. The Athlon has a fabulous FPU. Leading edge games are a very strong point for Athlons.

Maybe this simplification will be helpful. Present x86 CPUs break up instructions into parts. A part of an instruction takes less time to execute than the whole thing. They execute several of these parts simultaneously. CPUs have duplicated "resources" so they can do so, with complex scheduling and interlocks to make it all work right. (Some resources that take up a lot of chip space may not be fully duplicated, or not at all.) It is also not necessary to wait for one instruction (or group of parts) to complete before another is started (pipelining). But some parts depend on the result of previous instructions to operate correctly (leading to waits or stalls). And some parts require the use of resources that are already in use at the precise time they are needed. This makes it difficult to predict how long any series of instructions will take to get through. In general, more than one x86 instruction is completed per cycle. The P4 does not like instructions that depend on the results of previous instructions at all, but there is no way to get around this completely. The Athlon seems to be more adept.

There is a reason the P4 is more sensitive to dependencies. It relies on doing as little as possible in a cycle in order to be able to complete that cycle faster. If you have a dependency, you have an additional process - getting the previous result to the current instruction - which takes some time.

The problems due to a long pipeline have frequently been mentioned. When you have a long series of instructions in the process of execution, and then come to a branch instruction, it may happen that the wrong set of instructions is being processed in the pipeline; the ones at the branch not taken. Then you have to undo the wrong parts and start over, which is a longer delay for a longer pipeline. Some types of branching are for convenience or compactness. You can get rid of those, although compact code is better for the cache. But there are branches which are "intelligent" and can't be removed. The average distance between branches is probably something like every 5 to 10 instructions, so this branching problem is not trivial. I personally believe the P4's slowdown due to this is exaggerated, once the code has been optimized. It seems to me it should not be difficult in most cases to arrange code so that the P4 predicts branches correctly.
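A back-of-the-envelope model makes the point about prediction accuracy concrete. This is only a sketch: the flush penalties (10 and 20 cycles) and misprediction rates are made-up illustrative values, and only the "one branch every 5 to 10 instructions" figure comes from the post.

```python
def effective_cpi(base_cpi, instructions_per_branch, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once branch misprediction flushes are added.

    Each mispredicted branch is assumed to cost a fixed flush penalty.
    """
    branch_fraction = 1.0 / instructions_per_branch
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Hypothetical numbers: a shorter pipeline (10-cycle flush) vs a longer one (20-cycle flush),
# with one branch every 7 instructions.
for label, penalty in (("short pipeline", 10), ("long pipeline", 20)):
    for rate in (0.10, 0.02):
        cpi = effective_cpi(1.0, 7, rate, penalty)
        print(f"{label}, {rate:.0%} mispredicted: CPI = {cpi:.3f}")
```

In this model the longer pipeline pays twice as much per misprediction, but if well-arranged code brings the misprediction rate down to a few percent, the extra cost shrinks to a few percent of total cycles.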

There is more than one way of conceiving of a 64 bit CPU. 64 bit instructions do not make programs run faster unless the data you operate on happens to be 64 bits. Only specialized programs make use of data over 32 bits. However a CPU that is designed to pull in and put out chunks 64 bits at a time, as opposed to 32, will have twice the bandwidth, even if the instructions are 32 bits and the data is 32 bits. Assuming the Athlon 64 does this, it would enable more resources to be employed usefully per cycle than current Athlons. It would also make a higher bandwidth memory system more useful to an Athlon 64.

Then there is 64 bit addressing. A 32 bit address can reach 4Gig of bytes. (64 bits is 4Gig times as big.) Real memories are closing in on 4Gig. The address space used by an OS is much larger than real memory; virtual addresses. In order to work efficiently, virtual addresses should be directly addressable and will need to be larger than 32 bits. For instance, the memory on a video board occupies some address space. 128M is now common. The swap file on the hard drive occupies address space and is addressed as virtual memory. The OS swaps what is in virtual memory (HD) for what is in real memory when appropriate, and that part of real memory takes on the proper new address as if there were really a larger memory. XP likes to make the swap file twice as big as real memory. 32 bits for memory addressing is going to become obsolete.
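The arithmetic behind those figures is easy to check. The sketch below computes the two address-space sizes and then applies the post's simplified accounting (real memory, a swap file twice that size, plus video memory) to a hypothetical 2 GiB machine; that configuration is an assumption for illustration.

```python
GIB = 2**30
MIB = 2**20


def address_space_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given address width."""
    return 2**address_bits


space_32 = address_space_bytes(32)
space_64 = address_space_bytes(64)
print(space_32 // GIB, "GiB addressable with 32 bits")
print(space_64 // space_32 // GIB, "Gi times larger with 64 bits")

# Simplified accounting from the post: RAM + swap (2x RAM) + video memory.
ram = 2 * GIB            # hypothetical machine
swap = 2 * ram           # "twice as big as real memory"
video = 128 * MIB        # "128M is now common"
needed = ram + swap + video
print("fits in 32-bit space:", needed <= space_32)
```

With only 2 GiB of real memory, this accounting already exceeds what 32 bits can address directly.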
 
To clarify, the Athlon does indeed have more parallel execution resources than the P4 (or P3) has. Three beefy x86 decoders vs 1 on the P4 (although I have no idea how much the execution trace cache levels the playing field as it does only store repeatable code), ability to issue 3-6 micro-ops (or was it macro-ops or whatever they call it) to the schedulers, 9 scheduling ports to the 9 execution units and a retirement rate of 3 micro-ops per clock (?). This vs the P4 with a maximum of 3 micro-op issue rate to the scheduler, 4-6 issue port (depending on the type of instruction) to the 7 execution units (2 of which are double-pumped 16-bit ALU's).
At the latter design periods of the Willamette, it was mentioned that it was taken through a "severe transistor shrink". I'm not sure how much of this was taken off the parallel execution resources.
 
Again, without flaming, and without creating a post to say "good luck" and give no information...

Which architecture do you think has more potential to last into the future? Intel seems to be extremely good at increasing clock speeds while AMD seems to be good at getting more performance without increasing clock speeds as much. Obviously the Athlon XP is going to be kinda phased out as the Athlon-64 becomes a proven performer. In your opinion can Intel's method of increasing clock speed keep up?

One would assume once a CPU is running so fast, it's impossible for it to run any faster due to physical limits... so it would be necessary to have 2 CPU's working in parallel, or having them do more work at a time. But we're not nearly at that limit yet, so is AMD handicapping themselves by not moving towards that limit? Or are they "ahead of the times?"
 
Originally posted by: Jeff7181
Again, without flaming, and without creating a post to say "good luck" and give no information...

Which architecture do you think has more potential to last into the future? Intel seems to be extremely good at increasing clock speeds while AMD seems to be good at getting more performance without increasing clock speeds as much. Obviously the Athlon XP is going to be kinda phased out as the Athlon-64 becomes a proven performer. In your opinion can Intel's method of increasing clock speed keep up?

One would assume once a CPU is running so fast, it's impossible for it to run any faster due to physical limits... so it would be necessary to have 2 CPU's working in parallel, or having them do more work at a time. But we're not nearly at that limit yet, so is AMD handicapping themselves by not moving towards that limit? Or are they "ahead of the times?"

I think the P4 architecture has much more future potential than the Athlon. The Athlon is really really getting close to its limits. I don't think the Athlon will make it past 3 GHz (no idea what XP rating that would be). The P4 is going to make it to 5 GHz easy. The big question is 64 bit. AMD's been pushing it back a bit - I'm really concerned we may lose this nice competition that we all benefit from if AMD's 64 bit doesn't work out. The other thing I wonder about is the Pentium 5 (or whatever it will be called). Does anyone know what Intel's roadmap is past the P4? Are they going to try to bring IA-64 down to the desktop or will they release another 32 bit CPU? Or maybe Intel has their own 32 bit/64 bit plans going?

I'm quite amazed this thread hasn't degenerated into a fanboy fest yet. It's a pleasant surprise.
 
My opinion is that the original design for the P7 core never called for so much "skimping" as far as parallel execution resources. I think that with the two future releases (Prescott and Tejas), we'll see more and more parallel execution resources added to the current design as well as a general move away from x87 in favor of SSE2. Intel seems to strongly favor its own SSE2 above x87 (which, as I recall, they created themselves) so as software uses it more, we'll see less and less of a parallel execution difference between the K7/K8 design and the P7 implementations (be it the Pentium 4 or 5 or whatever they'll get up to before they move on to the next core in Nehalem). It's kinda weird really, while many would argue that today's modern x86 (scalar) processors are actually achieving performance better than most vector processors, the designs of these x86 processors are becoming closer and closer to vector processors in the form of these SIMD implementations.
 
If architecture is what the block diagrams represent, I see no reason the AMD architecture could not match the clock of Intel using the same processing technology. I think it has more to do with the chip layout (and what it would cost) than architecture. It would need a different implementation, it would seem, and I doubt if AMD has any intention of completely reworking that.

About the macro ops: The Athlon breaks up instructions into parts. Then it repacks the parts into groups AMD calls macro ops which proceed as a unit. I suppose the idea is that there mostly are sufficient resources for the whole group (macro op) to be executed simultaneously, with minor contention for resources. AMD just duplicated resources to the extent necessary. From the block diagrams, there do not appear to be distinct multiple pipelines, just resources that can be used simultaneously. A whole macro op is pipelined.

This macro op idea is reminiscent of the Itanium, where the instruction set itself has 3(?) operations packed into one instruction. The idea is that a compiler can work out what group of instructions can best be packed together. It is a lot easier than having the CPU try to figure it out on the fly.

I imagine the Athlon 64 is going to carry the macro op idea further.
 
Pentium 5 would be a funny name, as the Pentium originally got its name from it being a 5th generation chip.

I am really hoping to see the Athlon-64 do well, as competition is healthy for this industry. I've flip-flopped on all of my processors, going back and forth between AMD and Intel. I'm currently running a tbird, but will likely go with Intel next, as they're faster for what I do most of - encode media.
 
OK I'm going to ask a question. I see this statement at one link:
"The longer the pipeline, the greater the number operations that get hung up if one instruction stalls somewhere in it. ... If one instruction hung, you wouldn't lose much time."

Do more pipelined operations getting hung up mean more time delay? It seems to me it doesn't. Even if there were 1000 instructions in a pipeline and it stalled one cycle, the total delay is still one cycle. Am I missing something?

Suppose a worker at the end of an assembly line drops his wrench and makes 1000 people wait for 1 minute. Does that add up to a 1000 minute delay? It's still one minute, isn't it?
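One way to check the assembly-line intuition is an idealized cycle count. The sketch below assumes a single-issue, in-order pipeline that freezes as a whole on a stall and needs a full refill after a flush; real CPUs are more complicated, so treat the function and its numbers as illustrative only.

```python
def total_cycles(num_instructions, depth, stall_cycles=0, flushes=0):
    """Cycles to finish num_instructions in an idealized pipeline of the given depth.

    fill: cycles before the first instruction completes.
    stall_cycles: cycles during which the whole pipeline is frozen.
    flushes: times the pipeline is emptied and must refill (e.g. a mispredicted branch).
    """
    fill = depth - 1
    refill_per_flush = depth - 1
    return fill + num_instructions + stall_cycles + flushes * refill_per_flush


for depth in (10, 20):
    base = total_cycles(1000, depth)
    one_stall = total_cycles(1000, depth, stall_cycles=1)
    one_flush = total_cycles(1000, depth, flushes=1)
    print(f"depth {depth}: stall costs {one_stall - base} cycle, "
          f"flush costs {one_flush - base} cycles")
```

In this model a one-cycle stall costs one cycle no matter how deep the pipeline is, just like the dropped wrench. What does scale with depth is a flush, where the work in flight has to be thrown away and the pipeline refilled.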
 
Originally posted by: KF
If architecture is what the block diagrams represent, I see no reason the AMD architecture could not match the clock of Intel using the same processing technology. I think it has more to do with the chip layout (and what it would cost) than architecture. It would need a different implementation, it would seem, and I doubt if AMD has any intention of completely reworking that.

The block diagrams do not represent the entire implementation. But yes, if you hyperpipeline that design to an extent, it could achieve the same clockrate as the current P4 can. Conversely, you can add parallel execution resources on the P4 design as well.

About the macro ops: The Athlon breaks up instructions into parts. Then it repacks the parts into groups AMD calls macro ops which proceed as a unit. I suppose the idea is that there mostly are sufficient resources for the whole group (macro op) to be executed simultaneously, with minor contention for resources. AMD just duplicated resources to the extent necessary. From the block diagrams, there do not appear to be distinct multiple pipelines, just resources that can be used simultaneously. A whole macro op is pipelined.

Micro-ops fusion is the idea that the processor doesn't have to spend resources scheduling micro-ops that can obviously be executed in parallel. This is similar to the IA-64 concept of VLIW but having the x86 decoder dynamically pack micro-ops sorta defeats this purpose. For now, the only purpose is to help execution efficiency by having a sorta "pre-schedule scheduling".

This macro op idea is reminiscent of the Itanium, where the instruction set itself has 3(?) operations packed into one instruction. The idea is that a compiler can work out what group of instructions can best be packed together. It is a lot easier than having the CPU try to figure it out on the fly.

While micro-ops fusion does resemble some features of VLIW, it is not the same. The compiler has no control over micro-ops fusion, it is still figured out on-the-fly.

I imagine the Athlon 64 is going to carry the macro op idea further.

As I recall, it will, there will be an additional "packing" stage in which micro-ops are packed and told which scheduling port to go into (I think) which alleviates some of the cases in which an issue port is filled while another stays idle.
 
amd sux, intel rox

amd prices rox, intel prices sux

me wants dual p3 system

me wants stability

me cant afford stability

stability also think pentium 3

me go with amd t-bred w/ 5000db heatsink

me deaf.
 
Don't know much about the CPU's design. But I know MHz for MHz, Intel costs less than AMD. So I stick with them. Otherwise I would have to pay prime dollar for AMD's clock cycles.

-------------------------------------Edited 2/17-------------------------------------

AMD Barton, XP3000+ 2.16GHz retail = $598.00
Intel 2.4GHz = $197.00

AMD TBred XP2800+ 2.25GHz = $395.00, however you must purchase a qualifying mobo for this price.
Intel, see above. But for $381.00 we can get a retail 2.8GHz CPU

AMD TBred XP2400+ 2.0GHz = $160.00
Intel 2.0GHz = $165.00

Overall you see I was correct. Also Intel based motherboards tend to cost less than the ones required for AMD of 'equal quality', so you still save some money that route.
AMD's PR rating seems to hold true only in benchmarks from what I've seen. After building both types of systems I notice no noticeable difference in speed.
So overall I see no reason to spend more for less. Sure I can buy an XP1800+ and OC it to 2.2GHz. But why? Overall Intel OCs seem to be better. Many reports of 1GHz or higher. So if I buy a 2.4GHz and get it to over 3.4GHz, more power to me as I will never achieve that with an AMD setup. Unless the new Barton core is a quiet OC'ing king waiting to happen.

Prices done based on retail CPU's via Newegg.

But this is getting off topic. All I stated was I don't know the difference between the CPUs and that Intel costs less per MHz. I thought everybody already knew that as a rule of thumb. Intel based systems cost less per cycle except for the bottom end stuff. But I don't want a bottom end system... who would unless they have no decent computer.
 
Originally posted by: SinfulWeeper
MHz for MHz, Intel costs less than AMD

Meh? I just paid $65 for my 1800+. I am running it at 1826 MHz, 28.09 MHz/dollar. Let's say you got the cheapest P4 at Newegg (1.7 GHz, $123) and overclocked by 20% (same as me). That would give you 2040 MHz, at 16.59 MHz/dollar. I think that your logic is flawed.

BTW, stock:
1800+ (1.53 GHz) 23.54 MHz/dollar
P4 (1.7 GHz) 13.82 MHz/dollar
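For anyone who wants to redo the MHz-per-dollar arithmetic with other prices, here is a small sketch using the clocks and prices quoted in this post; the helper name is just for illustration.

```python
def mhz_per_dollar(clock_mhz, price_usd):
    """Clock speed bought per dollar spent."""
    return clock_mhz / price_usd


# (label, clock in MHz, price in USD) as quoted in the post above.
chips = [
    ("1800+ overclocked", 1826, 65),
    ("P4 1.7 overclocked 20%", 2040, 123),
    ("1800+ stock", 1530, 65),
    ("P4 1.7 stock", 1700, 123),
]

results = {label: round(mhz_per_dollar(mhz, usd), 2) for label, mhz, usd in chips}
for label, value in results.items():
    print(f"{label}: {value} MHz/dollar")
```

Swapping in the retail prices from the earlier list gives a different ranking at the high end, so the answer depends heavily on which price tier you compare.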
 
Ok ok, don't start that... it started to get argumentative so I tried to let this fall off the first page... we had a nice discussion going for a while, please don't start that stuff
 
Originally posted by: Jeff7181
Ok ok, don't start that... it started to get argumentative so I tried to let this fall off the first page... we had a nice discussion going for a while, please don't start that stuff

I don't see what's wrong; mostly everything I've read is intelligent.
 