Intel's Tulsa


imported_Questar

Senior member
Aug 12, 2004
235
0
0
Originally posted by: BrownTown
Well, we were talking about Clovertown, which is a 4-core CPU and will have a 1066 FSB, meaning it gets 266 MHz per core, and that has to cover both memory and cache coherency data. If you don't think people around here know enough about CPUs to understand how things like low memory bandwidth will affect performance, you are mistaken. Personally, I am an electrical engineering student who has taken computer architecture classes, but more importantly, I can look at benchmarks of CPUs running at different FSB speeds and see that a lower FSB is bad. Now Woodcrest isn't too bottlenecked with each core getting 667 MHz of bandwidth, but with each core getting 266 MHz you clearly don't have to have a PhD to see that you are gonna have a big problem. It's not like I'm the only one saying it; look at any site where people have a clue what they are talking about and you will see the same.

Your understanding is false. You can't divide the FSB by four and say each chip only gets that much; it just doesn't work that way, especially since these chips will have multiple FSBs. You also do not know how much FSB traffic cache coherency actually uses, or how the cache coherency algorithms work.

If FSB speed is so important, how come AMD still uses 200 MHz?

As far as people saying the same thing as you, well, it's pretty typical of herd mentality. One person says something that supports somebody's position - correct or not - and it gets repeated.

No one except Intel knows how the four-core and multiple-socket Core Duos will perform. Nobody besides Intel knows anything about their coherency algorithms, FSB overhead, or just about anything else about how they will work.

If you really want to learn about how CPUs work, an enthusiast forum is the wrong place to be. Find the boards where the actual CPU designers post. Some of the IBM guys are scary smart. It's a whole new world of physics when you are talking about gate thickness that's measured in atoms.

Congrats on being an engineering student; we certainly need more. When you graduate you will learn how much the books didn't teach you. And I have 20 years of experience to back up that claim.
 

blackllotus

Golden Member
May 30, 2005
1,875
0
0
Originally posted by: Questar
Originally posted by: BrownTown
Well, we were talking about Clovertown, which is a 4-core CPU and will have a 1066 FSB, meaning it gets 266 MHz per core, and that has to cover both memory and cache coherency data. If you don't think people around here know enough about CPUs to understand how things like low memory bandwidth will affect performance, you are mistaken. Personally, I am an electrical engineering student who has taken computer architecture classes, but more importantly, I can look at benchmarks of CPUs running at different FSB speeds and see that a lower FSB is bad. Now Woodcrest isn't too bottlenecked with each core getting 667 MHz of bandwidth, but with each core getting 266 MHz you clearly don't have to have a PhD to see that you are gonna have a big problem. It's not like I'm the only one saying it; look at any site where people have a clue what they are talking about and you will see the same.

Your understanding is false. You can't divide the FSB by four and say each chip only gets that much; it just doesn't work that way, especially since these chips will have multiple FSBs. You also do not know how much FSB traffic cache coherency actually uses, or how the cache coherency algorithms work.

If FSB speed is so important, how come AMD still uses 200 MHz?

[blah blah blah...]

Congrats on being an engineering student; we certainly need more. When you graduate you will learn how much the books didn't teach you. And I have 20 years of experience to back up that claim.

And you also have no idea what you are talking about. AMD has HyperTransport links, which provide vastly lower latencies than Intel's FSB and allow for multiprocessing without a large latency overhead.
 

imported_Questar

Senior member
Aug 12, 2004
235
0
0
And you also have no idea what you are talking about. AMD has HyperTransport links, which provide vastly lower latencies than Intel's FSB and allow for multiprocessing without a large latency overhead.

Funny, I don't think I ever mentioned AMD. This was a discussion about multicore Intel systems.

Did you have something constructive to add about the subject?
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Well, for one thing AMD doesn't use 200 MHz HT, it uses 2000 MHz, and the memory data goes straight to the CPU, not over the HT. Plus, on the Intel side we are talking about the 2-way Clovertown servers with dual independent buses: that's 4 cores on each 3-point bus running at 1066 MHz, so that's 266 MHz of effective bandwidth to each core. It does work this way, and it will be a big problem for Intel. The cache coherency data isn't even what I'm talking about here; we're talking straight-up bandwidth to main memory, and the cache coherency data will only act to reduce that number even further.

You are certainly correct that enthusiast forums like this are a bad place to try to learn about CPUs, because so many people are just reposting stuff they have heard in other places with little understanding of what they are talking about. However, if you cannot see that Clovertown will be bottlenecked then you are wrong. Perhaps it is just because of a lack of knowledge of the actual implementation that Intel will use. Clovertown is the server quad core; it will have two Woodcrest dies on an MCM. The two Clovertowns will be connected via two independent buses to the chipset. Each FSB has 3 points (the 2 dies and the chipset), and because these are 3-point buses instead of Woodcrest's 2-point buses they will not be able to run as fast, since it is harder to drive the signal due to the additional capacitive load. Each core will effectively have the bandwidth of single-channel DDR266 memory (sure, if half the cores aren't working then this number gets a little better, but that kind of negates the whole point, doesn't it?).
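For reference, here is a rough back-of-the-envelope sketch in Python of the worst-case split being described. The 1066 MT/s bus rate, 64-bit bus width, and four cores per bus are the figures from this thread; the even four-way division is the pessimistic assumption under debate, not a measured number.

```python
# Back-of-the-envelope worst-case FSB bandwidth split for a Clovertown
# package (4 cores sharing one front-side bus). Figures are from the
# thread; the even split is an assumption, not a measurement.

FSB_TRANSFERS_PER_SEC = 1066e6   # 1066 MT/s front-side bus
BUS_WIDTH_BYTES = 8              # 64-bit data bus
CORES_PER_BUS = 4                # one quad-core package per bus

total_bw = FSB_TRANSFERS_PER_SEC * BUS_WIDTH_BYTES   # ~8.5 GB/s total
per_core_bw = total_bw / CORES_PER_BUS               # ~2.1 GB/s each

print(f"Total bus bandwidth : {total_bw / 1e9:.1f} GB/s")
print(f"Worst-case per core : {per_core_bw / 1e9:.1f} GB/s "
      f"(~{FSB_TRANSFERS_PER_SEC / CORES_PER_BUS / 1e6:.0f} MT/s)")
```

Single-channel DDR266 also provides roughly 2.1 GB/s, which is where the comparison in the post comes from.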
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Questar
And you also have no idea what you are talking about. AMD has HyperTransport links, which provide vastly lower latencies than Intel's FSB and allow for multiprocessing without a large latency overhead.

Funny, I don't think I ever mentioned AMD. This was a discussion about multicore Intel systems.

Did you have something constructive to add about the subject?

Sigh...I see Questar has forgotten his meds again.

What you said was "If FSB speed is so important, how come AMD still uses 200 MHz?"...so yes, I think you actually DID mention AMD...

Edit: BTW, what was YOUR constructive addition here...I think I must have missed it.
 

stardrek

Senior member
Jan 25, 2006
264
0
0
Originally posted by: Questar

As far as people saying the same thing as you, well, it's pretty typical of herd mentality. One person says something that supports somebody's position - correct or not - and it gets repeated.

No one except Intel knows how the four-core and multiple-socket Core Duos will perform. Nobody besides Intel knows anything about their coherency algorithms, FSB overhead, or just about anything else about how they will work.

I would just like to say that while we may not know the algorithms used for cache coherency, we do know some facts that we can follow to some logical conclusions.

We know that cache coherency and inclusive caches are something Intel seems to really like. Cache coherency generates traffic that grows roughly as the square of the number of caches that must be kept coherent. That is important because of the following:

I will use Tulsa as an example of why this has an impact, since this thread started with someone asking about Tulsa.

Example:
  1. Tulsa has a 16MB shared L3 cache for 2 cores. Tulsa will also have a three-load FSB. Each core is basically the current Pentium 4-derived Xeon core. Intel realized with Cedar Mill, which is a die shrink of Prescott, that there was a scalability problem in multi-processor systems above 2 sockets. They could not pass enough data back and forth to the processors over the 800 MT/s bus, which provides 6.4 GB/s. This created the need for a way to reduce the amount of traffic on the bus.

    They are dealing with this in 2 ways, both of which Tulsa was designed around, since it is built from two Cedar Mill-derived cores.
    1. One is splitting the chipset into two components: the North Bridge and the PSB controller in one chip, and the XNB (memory bridge) in another. This allowed flexibility in the memory types they wanted to support and divided some of the data paths.

    2. The other is the shared 16MB L3 cache. On the surface, the shared 16MB of cache does not look as useful as some of the other things that could have been done, but it allows for a clever use of cache coherency and inclusion. Like I said above, cache coherency generates an amount of traffic that grows roughly as the square of the number of caches. One might think that the extra-large L3 cache on this processor would make cache coherency a major pain. But in a 4-socket system, say, the computer only needs to worry about coherency across 4 caches with Tulsa, rather than 8 caches as with a normal dual core with independent L2 caches, because of the single 16MB L3 cache. What cache inclusion does is that when Processor A needs to tell Processor B (or C or D) that a segment of data is no longer valid, it only needs to search the single L3 cache of Processor B for that information and does not need to ask each independent core's L2 cache on Processor B. So we are looking at an amount of FSB traffic that scales as 4^2 with Tulsa, which is 16 proportional units of data, instead of 8^2, which is 64 proportional units, for a standard dual core with no shared L3 cache (a short sketch after this example walks through that count). Regardless of how small or large each unit of data is, provided it fits in the caches, the amount of traffic is significantly different.

    From an overall system perspective, taking into account how much data cache coherency actually moves, we are looking at about a 10% difference for this sort of 4-socket setup. That might not sound like a lot, but once you start getting to huge systems, 10% can be a lot of crunching power.
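Here is the sketch referenced above: a minimal Python illustration of the snoop-count comparison, assuming (as the post does) that coherency traffic grows as the square of the number of caches that have to be snooped. The quadratic premise is the post's simplification, not a statement about Intel's actual protocol.

```python
# Snoop-count comparison under the post's simplifying assumption that
# coherency traffic grows as the square of the number of snooped caches.

def traffic_units(num_caches: int) -> int:
    """Proportional traffic units under the quadratic assumption."""
    return num_caches ** 2

sockets = 4
shared_l3 = traffic_units(sockets)       # one inclusive L3 per socket: 4^2 = 16
split_l2 = traffic_units(sockets * 2)    # two independent L2s per socket: 8^2 = 64

print(f"Shared-L3 design : {shared_l3} units")
print(f"Split-L2 design  : {split_l2} units")
print(f"Reduction        : {1 - shared_l3 / split_l2:.0%}")  # 75% fewer units
```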

This goes to show that however efficient their algorithm is, it is still constrained by the external hardware it runs on, and that constraint can be used as a basis for further assumptions. This is what BrownTown and others did.

It is a complex topic, but I hope that was a fairly lucid explanation. If there are any corrections that need to be made, please note them.
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Excellent post, stardrek!
You should really post more often, mate...nice work.
 

imported_Questar

Senior member
Aug 12, 2004
235
0
0
Yes, nice writeup, but there are a couple of assumptions that may not hold true:

1. That cache coherency generates an amount of traffic that grows as the square of the number of caches.

Because...

2. You are assuming that all caches must always be kept coherent, which is false. Coherency traffic is only needed when data is loaded from a cache line that has been invalidated by a store performed in another cache. Put another way: threads a and b are running on processors 1 and 2. If threads c and d, running on processors 3 and 4, are not loading data that has been modified by threads a and b, there is no reason to synchronize their caches, and therefore no FSB traffic for that reason.
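To make that point concrete, here is a toy Python sketch of a generic snoop-invalidate scheme (not Intel's undisclosed protocol): invalidation traffic is only generated when a store hits a line that some other cache actually holds, so cores working on disjoint data generate none.

```python
# Toy model of snoop-invalidate coherency: bus traffic is counted only
# when a store touches a line that another cache actually holds.
# Generic illustration; Intel's real protocol is not public.

from collections import defaultdict

class ToyCoherentBus:
    def __init__(self):
        self.holders = defaultdict(set)   # address -> caches holding the line
        self.invalidations = 0            # stand-in for coherency bus traffic

    def load(self, cache_id, addr):
        self.holders[addr].add(cache_id)

    def store(self, cache_id, addr):
        sharers = self.holders[addr] - {cache_id}
        self.invalidations += len(sharers)   # snoop only the actual sharers
        self.holders[addr] = {cache_id}

bus = ToyCoherentBus()
# Threads a/b share a line on caches 1 and 2; threads c/d use private data.
bus.load(1, 0x100); bus.load(2, 0x100); bus.store(1, 0x100)   # 1 invalidation
bus.load(3, 0x200); bus.store(3, 0x200)                       # 0 invalidations
bus.load(4, 0x300); bus.store(4, 0x300)                       # 0 invalidations
print(bus.invalidations)   # -> 1: traffic tracks actual sharing, not cache count
```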

Again, it comes down to the fact that no one here knows Intel's algorithms, therefore no one can say how it will perform.

Interesting that the same people who say Woodcrest will be hampered at 4S and above were the ones saying let's wait for Conroe benchmarks - the Intel-supplied benches weren't good enough. Yet here, where there is even less information available, the same people can say for sure how bad it's going to be for the larger Intel server systems. Fanboyism is alive and well.
 

imported_Questar

Senior member
Aug 12, 2004
235
0
0
that's 4 cores on each 3-point bus running at 1066 MHz, so that's 266 MHz of effective bandwidth to each core. It does work this way,

You are assuming that each CPU would be running at 100% memory bandwidth utilization all the time.

It's very likely that one chip will be busy processing when another chip wants to access memory. Therefore the chip that wants memory would have the full 1066 MT/s bus available to it.
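As an illustration of demand sharing versus a static split, here is a small Python sketch. The 25% duty cycle (how often each other core is also using the bus) is purely an assumed number for the example; the point is only that a requesting core usually sees far more than a fixed quarter of the bus.

```python
# Demand-shared bus vs. static four-way split. The duty cycle below is
# an illustrative assumption, not a measurement of any real workload.

FSB_MT_S = 1066
CORES = 4
duty_cycle = 0.25   # assumed fraction of time each OTHER core is on the bus

expected_requesters = 1 + (CORES - 1) * duty_cycle   # 1.75 on average
demand_share = FSB_MT_S / expected_requesters        # ~609 MT/s
static_share = FSB_MT_S / CORES                      # ~266 MT/s

print(f"Static split  : {static_share:.0f} MT/s per core")
print(f"Demand-shared : {demand_share:.0f} MT/s seen by a requesting core")
```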

As I said over and over - and I'm not going to say any more because frankly I'm bored with it - no one here knows ANYTHING about how much FSB traffic cache coherency generates, period.

Take care,
Q
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Questar
Yes, nice writeup, but there are a couple of assumptions that may not hold true:

1. That cache coherency generates an amount of traffic that grows as the square of the number of caches.

Because...

2. You are assuming that all caches must always be kept coherent, which is false. Coherency traffic is only needed when data is loaded from a cache line that has been invalidated by a store performed in another cache. Put another way: threads a and b are running on processors 1 and 2. If threads c and d, running on processors 3 and 4, are not loading data that has been modified by threads a and b, there is no reason to synchronize their caches, and therefore no FSB traffic for that reason.

Again, it comes down to the fact that no one here knows Intel's algorithms, therefore no one can say how it will perform.

Interesting that the same people who say Woodcrest will be hampered at 4S and above were the ones saying let's wait for Conroe benchmarks - the Intel-supplied benches weren't good enough. Yet here, where there is even less information available, the same people can say for sure how bad it's going to be for the larger Intel server systems. Fanboyism is alive and well.

Certainly there will be MANY cases where the bottleneck isn't reached...but the very concept of a "limit" envisions scenarios where it is.

BTW, I'm not sure why you are referring to 4S systems...this is about 2S quad core. I don't think anyone here has said how bad it will be for the MCMs; we have just been pointing out obvious bottlenecks in the design based on existing hardware. We've seen much of the bottleneck WRT MCM cache in existing Presslers and Smithfields already...
As to Conroe, I don't think anyone had anything in the way of design issues with it...most people advising caution (including myself) were doing so in order to correct people making wild claims about actual performance numbers.
 

stardrek

Senior member
Jan 25, 2006
264
0
0
He was making a reference to the example I gave about the Tulsa systems, I believe. Thank you for the compliment, Viditor and Questar; it took a little while to write that.

As for your response, Questar, about cache coherency not generating traffic that grows as the square of the number of caches, I may not have explained it well enough. By definition, the inclusive cache hierarchy that Intel uses requires that any data held in a core's L2 also be present in the larger shared cache on the same processor. That is why they are always so gung-ho about the size of their caches.
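A minimal sketch of why that inclusion property matters for snooping, per the explanation above: if every line in a core's L2 is guaranteed to also be in the shared L3, a remote snoop only has to check the one L3 instead of every core's L2. This is a generic illustration, not a description of Tulsa's actual tag or directory structure.

```python
# Inclusive shared L3 acting as a snoop filter for the whole socket.
# Generic sketch; Tulsa's real tag/directory organization is not public.

class InclusiveSocket:
    def __init__(self, num_cores):
        self.l2 = [set() for _ in range(num_cores)]   # per-core L2 tags
        self.l3 = set()                               # shared, inclusive L3 tags

    def fill(self, core, addr):
        self.l2[core].add(addr)
        self.l3.add(addr)        # inclusion: every L2 line is also in the L3

    def snoop(self, addr):
        return addr in self.l3   # one lookup covers every core on the socket

socket = InclusiveSocket(num_cores=2)
socket.fill(core=0, addr=0x40)
print(socket.snoop(0x40), socket.snoop(0x80))   # True False
```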
 

ahock

Member
Nov 29, 2004
165
0
0
Intel claimed Tulsa will give them a 100% performance improvement over Paxville. Will this be enough to overtake AMD's lead in the MP space? How many processors can Intel and AMD each scale to?
 

Accord99

Platinum Member
Jul 2, 2001
2,259
172
106
Originally posted by: ahock
Intel claimed Tulsa will give them a 100% performance improvement over Paxville. Will this be enough to overtake AMD's lead in the MP space? How many processors can Intel and AMD each scale to?

It should make Intel's currently mediocre Truland platform competitive. It should perform extremely well in IBM's and Unisys's Xeon MP platforms, which are already competitive with Opteron servers using Paxville. IBM's and Unisys's Xeon MP platforms can scale to 32 sockets.
 

blackllotus

Golden Member
May 30, 2005
1,875
0
0
Originally posted by: Questar
And you also have no idea what you are talking about. AMD has HyperTransport links, which provide vastly lower latencies than Intel's FSB and allow for multiprocessing without a large latency overhead.

Funny, I don't think I ever mentioned AMD. This was a discussion about multicore Intel systems.

Actually you did. You said "If FSB speed is so important, how come AMD still uses 200 MHz?", and I pointed out that AMD's HyperTransport links make FSB speed much less important on AMD processors.
 

Accord99

Platinum Member
Jul 2, 2001
2,259
172
106
Originally posted by: ahock
Xeon for 32 sockets max? How about Opteron? Does it also scale to 32 sockets?
8 sockets is the max for Opteron using AMD's glueless design. To go further, AMD would need another company to produce a chipset, as IBM and Unisys have done for the Xeon MP. So far there is only one, called Horus, but it has not been officially released.
 

ahock

Member
Nov 29, 2004
165
0
0
So using IBM and Unisys means Xeon MP is still an attractive option compared to AMD Opteron MP, since Xeon scales further thanks to the IBM and Unisys chipsets?

Do you know whether IBM/Unisys will continue to develop chipsets for Xeon MP (any roadmap aside from X3)? Likewise, is there any company that develops a chipset for Opteron to scale beyond 8 sockets? Is there a link/roadmap for where this is headed?

How about Intel's chipsets? Are they capable of scaling to 32 sockets? My guess is not, basically because HP and Dell, which rely more on Intel's chipsets, are opting for AMD boxes.
 

stardrek

Senior member
Jan 25, 2006
264
0
0
Once you start getting into 32 sockets, you are likely looking at workloads that need something like the Itanium (although this is a broad generalization, once you write programs for something at this scale, it would seem that making them for the Itanic might be a good idea). HP makes some great boxes called Superdomes that scale way up there...some up to 128 procs.
 

Accord99

Platinum Member
Jul 2, 2001
2,259
172
106
Originally posted by: ahock
So using IBM and Unisys means Xeon MP is still an attractive option compared to AMD Opteron MP, since Xeon scales further thanks to the IBM and Unisys chipsets?
For certain applications, they're attractive because there are no other choices if you need big x86 systems.

Do you know whether IBM/Unisys will continue to develop chipsets for Xeon MP (any roadmap aside from X3)?
I'm sorry, I don't know the answer to that. Though IBM and Unisys do have a nice, lucrative market, so I'd imagine as long as it's profitable they will continue. Also, being able to offer a higher-performing platform than Dell or HP means they can differentiate themselves in the marketplace. In the meantime, Tulsa should just be a drop-in upgrade for the Xeon MP platforms.

Intel's next platform is Caneland, which will be for the first Core-architecture-based Xeon MP. Here's an interesting Intel PDF discussing some of their future server plans.

ftp://download.intel.com/intel/finance/...tions/PDF_Files/Skaugen_2006_06_09.pdf
Likewise, is there any company that develops a chipset for Opteron to scale beyond 8 sockets? Is there a link/roadmap for where this is headed?
It doesn't seem like there will be one before K8L arrives. Horus seems to be almost dead, as HyperTransport 3.0 will provide most of its benefits.

How about Intel's chipsets? Are they capable of scaling to 32 sockets? My guess is not, basically because HP and Dell, which rely more on Intel's chipsets, are opting for AMD boxes.
Intel's Xeon MP platform is 4 sockets only.

 

Dthom

Junior Member
May 28, 2006
21
0
0
Originally posted by: dexvx
Originally posted by: Viditor
Originally posted by: ahock
Do you have any info on whether Tulsa can outperform AMD in the MP space? I'm not sure about this part, except in IBM's case thanks to their superior X3 chipset. I guess this is one reason why Dell picked AMD Opteron, due to the lack of scalability of Intel's chipset for MP. I heard IBM X3/Paxville had outperformed AMD Opteron in one of the benchmarks (can't remember the link).

Brown is correct...as to the X3 (Hurricane), the reason it outperformed in TPC is that it had MUCH more expensive I/O (Fibre Channel and drives). The I/O cost more than the rest of the system ($600-700k IIRC), and that greatly affects the transaction-based TPC benchmark...

I keep telling you, the I/O front-end will have a negligible effect on performance.

First of all, I/O is of course a huge factor in a real-world test. However, all of your analysis is faulty and biased. Having tested hundreds of systems with all sorts of different CPUs and chipsets, I can say the end controllers are generally a constant across the chipset/CPU. The quality of the driver is a huge factor, as is the hardware itself, so unless you use the same NIC hardware and the same driver and OS version for both tests, you have a factor that skews the results. However, the comment about "more expensive fiber hardware" is just plain wrong. Fiber is more expensive than copper products, of course, because of production volume and the cost of the transceivers, but it is in no way faster. Fiber cards use the same chips as copper cards, and they run at the same speed. In the PC world, an Intel copper card costs $80 and performs exactly the same as a $900 fiber card. If the cards use the same NIC chip, then the NIC can reasonably be considered a constant. If they are different (i.e. 3Com vs. Intel NICs), then the entire test is flawed and irrelevant.