L2 cach ECC, On or off???

Sugadaddy · Dec 14, 2000

Is it faster to have l2 cache ECC enabled or disabled in the bios? This is for a Thunderbird.

Sugadaddy · Dec 14, 2000

Hello, anyone home?

formulav8 · Dec 14, 2000

ruckb · Dec 14, 2000

Off is faster

Which processor/BIOS is haveing a L2 ECC setting ?????
Have not seen this on Athlon Gigabyte board.

HR

Sugadaddy · Dec 14, 2000

MSI K7T-Pro2A...

My Asus K7V also has that setting.

Whitedog · Dec 14, 2000

EvilDonnyboy · Dec 14, 2000

with my PIII and P3V4X, it gives no performance difference. but with it on, it seems that my pc crashes less at unstable speeds, but makes no diff at stable speeds.

azeeb · Dec 14, 2000

Certainly if you have the function use it cos you paid for it.
What does is check your during memory access that the data is not corrupt and can actually fix some of the error on the fly. At only a slight performance hit and invariably you can run at higher freq. without less problems.

ruckb · Dec 15, 2000

Hello,

I aggree, that with overclocking normal DRAM main memory an ECC would help to improve the stability. Normally the first failing cells will be some e.g. tRP related single cell fails, that can be corrected if you have ECC DRAM.

But I do not know how an SRAM like the L2 cache will react on overclocking. Will it start with some single bit fails too? I would expect a SRAM much less dependent on some analog times then a DRAM.

I guess pm would know about this.

But I would have some additional questiones. Maybe somebody can answer:

how wide is the L2 Cache databus to the processor ?
If it has ECC it should be 72bit, or if it is a 128bit databus about 138bit should be neccessary to implement an ECC!

why is the ECC slower? I have read this somewhere, but I haven't found it again.
the calculation for right or wrong data should be done in parallel on the fly, because if you are adding some additional time for every read, I would expect a bigger performance impact.

bye

ruckb

Sephiroth_IX · Dec 15, 2000

Oh, i didnt read that you actully knew what you are talking about... Ill get in a little deeper.

ruckb · Dec 15, 2000

accepted, for DRAM main memory,
but the orignal question was about ECC setting for L2 Cache.

I remember, that Intel based systems are having the same data in
L1 and L2 cache. So an L2 ECC check would have to look for the data in the L1 cache.

But AMDs cache strategy is to have different data in L1 and L2 cache.
therefore this time consuming step would not be neccessary. Therefore
for L2 settings ECC should be as fast as non ECC !?!

What about the databus ?

ruckb

Sephiroth_IX · Dec 15, 2000

(I would have edited, but when you edit you cannot look at the other posts as reference)

<< the calculation for right or wrong data should be done in parallel on the fly, because if you are adding some additional time for every read, I would expect a bigger performance impact. >>

The main reason that it is having trouble is usually bandwidth. The Thunderbird has a 64-bit bandwidth to the L2, which is why the P3 is able to gain on it and even surpass it in non-FPU situations at high clock speeds. The Thunderbird simply cant get all the data across like the P3 can, with it's 128-bit pipe.

Now, down to the nitty gritty stuff that i dont understand yet

<< If it has ECC it should be 72bit, or if it is a 128bit databus about 138bit should be neccessary to implement an ECC! >>

Well, as i understand it, it is a set bus. If ECC works like parity, using another bit (although not just checking for odds and evens, of course) the bus would have to drop a data bit and replace it with the ECC bit, right?

<< the calculation for right or wrong data should be done in parallel on the fly, because if you are adding some additional time for every read, I would expect a bigger performance impact. >>

Well, for the ECC to check the data, it has to be the final information. So then, instead of using the clock to send the data to the registers, memory, wherever, it is used to check data, and then sent on the next clock?

Lets keep chatting about things like this, Ruck, i think we could learn from eachother

ruckb · Dec 15, 2000

Hi Sephiroth_IX,

thanks for the info about the L2 databuses!

first one question: what is a "set bus" ?

now some explanation:

ECC is getting more and more interesting over Parity, as wider your databus is getting. The (normal) algorithm behind ECC needs some additional bits to the normal databus. With a 64 bit databus there is the special case, that the ECC needs additional 8bit. This is the same number of additional bits as parity would need (Parity needs 1 bit per byte). If your databus is smaller, then the ECC correction would have an overhead and need more additional bits then parity. If the bus gets wider the ECC algorithm needs not so much additional bits as Parity.
The number of parity bits is just increasing linear with the wide of the databus (1 per byte).
the 138bit, that i mentiones where just a rough guess. I would have to check the algorithm to find out how many ECC bits you need for a 128bit databus. I assumed 10 bit for ECC, Parity would need 16 bit!

But normally you do not skip any data from your real databus for ECC, but you add some additional data, that can be used for ECC.
Therefore if the L2 cache is having an ECC, then the physical amount of data (databuswires) should be higher then the 64 or 128 bit, that are used for real information.

ECC calculation:
For this I would like to start with the normal configuration of an ECC for the main memory. there I know a little bit what I'm talking about ;-)
The following are just some thoughts that are coming into my mind, when thinking about the possible handling of ECC in the chipset.

Assuming the chipset (northbirdge) is starting a read to the DRAM, because the data was not in the L2.
the chipset is getting the data on the databus (including the ECC info).
Than it has to do two things:
-comparing the ECC data to the real data,
- and forwarding the real data to the processor.

I do not know, how fast the forwarding will work, but I assume, that the internal data handling in the chipset will take some time too.
If so, then this time can be used to calculate in parallel the info whether the data is correct or not. This calculation should be done very fast (I would assume a very small amount of hardware gatter). Therfore the dataforwarding to the processor can be stopped before it has started (I would assume some kind of pipelining in the Northbridge for the data to the processor).
If the northbridge detects some wrong data it can correct it and start to write the corrected data to the processor. Dependend on the ECC setting it will start a corrective Write to the DRAM too.

For this process there sould be only a performance panelty in case of wrong data.
I do not know how to include in this theory the performance panelty for ECC ON.

But this is valid only for the access to the main memory DRAM.
For an ECC check of L2 this should look a little bit different!

bye

ruckb

pm · Dec 15, 2000

I agree with most of what Ruckb posted. About the only exception is parity. Technically parity can be on any number of bits, whereas 1-bit corrected, 2-bits detected ECC requires a set number of bits. Ie. you could have 1 bit parity on a 64-bit bus (making it 65 bits) whereas any form ECC requires 8 bits minimum for 64-bits.

I've never heard of a design that took a performance penalty on memory reads to check ECC. PC133 SDRAM ECC has a one cycle latency on writes and a three cycle (IIRC) penalty on ECC corrected reads.

Most caches that I have seen do not require any additional latency for ECC reads or writes. If you think about it, it's crazy to require any latency on a cache design for ECC. For reads, if there's an ECC correction required, you can pull an exception and replay the data. For writes, you know where the data is, you can slap it in the cache and then write in the ECC data one cycle later. If the data gets read out in the meantime, you are bypassing the data through the cache anyway and ECC is irrelevant (for alpha/beta particles, anyway).

I design cache for a living, but I've never been required to work on a cache with ECC (yet, I'm sure it's coming soon). I know parity calculations intimately (I could draw three or four fast 8-bit parity calculation circuits from memory), but I'm not that certain of ECC. But it doesn't make sense that enabling ECC would result in lower performance.

Does anyone have any benchmarks? Surely they'd be easy to perform.

Edit: ok, ran benchmarks on my Pentium III 933MHz home machine with L2 ECC on and off (Asus P3V4X motherboard). I ran three different benchmarks and there was no difference at all between two of them (CPUMark99 and SisSoft's CPU benchmark) and the tiny difference in the third (3DMark2k) is too small (<0.5%) to be real and is presumably caused by the hard disk. So, on a Pentium III it doesn't seem to make a difference at all.

Anyone with a Tbird want to put this issue to rest?

Sugadaddy · Dec 15, 2000

Hey, my thread is back from the dead!!

Thanks for the explanation guys.

I did Sandra CPU benchmarks on both an Athlon classic 900 (K7V), and a T-Bird 900 (MSI K7Tpro2A) and the scores are the same in both cases.

Fisher999 · Dec 15, 2000

What a GREAT thread, have to subscribe to this one!!!

TTT

pm · Dec 15, 2000

Thanks, Sugadaddy. I think that proves it for Pentium III's and Thunderbird Athlons. L2 ECC doesn't affect performance at all, so leave it enabled.

Seph:

Ruckb:<< the calculation for right or wrong data should be done in parallel on the fly, because if you are adding some additional time for every read, I would expect a bigger performance impact. >>

Seph:<The main reason that it is having trouble is usually bandwidth. The Thunderbird has a 64-bit bandwidth to the L2, which is why the P3 is able to gain on it and even surpass it in non-FPU situations at high clock speeds. The Thunderbird simply cant get all the data across like the P3 can, with it's 128-bit pipe.>

I agree with ruckb here, you are performing the calculation in parallel. So, it's not really a matter of bus bandwidth because the calculation is within the cache unit and is not sent on the bus between the core and the L2.

< Well, as i understand it, it is a set bus. If ECC works like parity, using another bit (although not just checking for odds and evens, of course) the bus would have to drop a data bit and replace it with the ECC bit, right? >

No, ECC works similarly to, but not just like, parity. I can post up the 1-bit corrected, 2-bits detected ECC algorithm if you like.

But anyway, essentially with both parity and ECC you are doing something like coming up with a checksum for the bus. Parity is only capable of checking 1-bit and if there's an error detected you have no idea which bit is hosed. 1-bit corrected ECC is capable of checking for 2-bits being wrong and is capable of correcting one bit being wrong (so if there are two bits which are wrong, then it pulls a parity exception, but if there is one bit wrong it can correct it). The checksum in the case of parity is usually one bit. In the case of ECC, the checksum varies on the size of the memory chunk. I can't remember how many bits it is for various sizes (that's what I have books for

), but Ruckb is right that it's 8 bits for a 64-bit value, and I believe him when he guesses it's 10 bits for a 128-bit value. Sounds about right to me.

I keep saying "1 bit corrected" ECC because you can have higher values of correction in ECC. You can add more bits to the checksum and make it "2 bit corrected ECC" (which I think is capable of checking for 4 bits being wrong, not 3 like you might think, but I don't have my reference book in front of me to check this). You can do this for more and more protection. But nowadays 1-bit corrected ECC is what everyone calls "ECC" because this is all that's necessary.

< Well, for the ECC to check the data, it has to be the final information. So then, instead of using the clock to send the data to the registers, memory, wherever, it is used to check data, and then sent on the next clock?>

Not sure that I follow here. If the ECC calculation on reads takes longer than simply reading the data out of the cache (which is does, in my experience), then you pull a CPU exception (the CPU equivalent of throwing down a penalty flag in US Football) and then you can just do the equivalent of a branch misprediction pipeline flush. IE. the CPU can say, "I'm doing the wrong thing, flush and redo it". This is a performance hit, of course, but you should only encounter it on those few unlikely times when an alpha or beta particle nukes the contents of a memory bit and flips it the other way around. This is rare enough that the performance hit is negligible.

It definitely is rare enough that you don't want to add a way to keep the pipeline going and then slide the correct data in the place of the incorrect data. This would add a lot of control complexity and it's unnecessary.

Ruckb What are your feelings on cosmic ray statistics in memory? I've been quoting the statistics from that paper from that guy who does memory over at IBM (can't recall his name, but he presented a paper at a conference that I attended a few years ago... ISSCC, I think). It was (statistically) on DRAM, one bit flip per month per 256MB at sea level on 0.25um. You design memory, what kind of statistical error rates do you guys see/design to?

Nice discussion, Seph and Ruckb! Thanks for sending me the PM inviting me to participate, Ruckb.

Either of you read the cover story of Forbes (US economics magazine) on Sun having bit-flipping problems in the L2 caches on their UltraSPARC systems? People were quoted as seeing bit corruptions as frequently as once per month in the 2MB L2 cache modules at elevations of cities like Denver, CO. The $50k+ system would pull a parity error and crash since the L2 on the UltraSPARC is parity - not ECC.

Technonut · Dec 16, 2000

<< Anyone with a Tbird want to put this issue to rest? >>

Sandra 2001 Pro benchmarks:

T-Bird 800
Multiplier @ 850
FSB 112/37 + 1 = 113 FSB
RAM @ 150MHz (137 + 13 Host CLK + PCI CLK = 150MHz CAS2 Turbo)
WCPUID....962MHz

L2 cache ECC enabled:

Memory 544/641
CPU 2673/1319
MM 5376/6521

L2 cache ECC disabled:

Memory 546/643
CPU 2674/1320
MM 5379/6534

Not any difference to worry about.

AndyHui · Dec 16, 2000

PM: ECC on the L2 cache CANNOT be disabled for the Coppermine Pentium III processors....that's why you are not seeing any difference in performance.

BIOS settings are ignored.

Intel QIT/S-Spec information.
NOTE: All processors listed above have an L2 cache with a 4 GB address range. All the processors listed above have an L2 cache that supports ECC. The L2 cache's ECC cannot be disabled.

ruckb · Dec 16, 2000

HI,

I shouldn't sleep, because I'm missing to much ;-)
pm, thanks for joining!
From that what you are telling me, I would think, that only AMD is haveing a L2 ECC implementation !?! Right ????

of course you are right about the 1bit parity. I havn't thought about the theory, but just how it is implemented normally. Maybe that's the reason why I will never make a carrier ;-)

But the implemantation in the praxis is asking me the next questiones. If not a system like Rambus is winning, and we are doing a wider databus for the next generartions, how can a ECC setting can usefully be implemented. Now you have normally x8 devices (maybe x4 for expensive server apps, maybe x16 for very cheap memory). There you can add one additinal x8 devices to have the neccessary 8bit for ECC. If you are going to a 128bit bus you can still use one additional device for a Parity (but this time one per double byte), or use two additional devices.
But what would be a practical implementation of ECC in a 128b Databus?
To add two x8 devices to get the necessarry 10 bit does not make sense!
What is done on the graphic cards (no ECC at all, as only unimportent infos!?!)
Is there another algorithm, that can be implementet ?

pm, if I would appreciate it, if you could post the algorithms, or at least an online source of it.
Do you have informations about the IBM chipkill implementation, to repair a complete chip? I have some papers about it, and I knew, that there are different effective ways to implement it, but I have never read it so close to understand it. IBM is having a patent on it, and so companys like Serverworks are using only something similar!

About our FSER failures I would have to check our last studies. This values (and the theory behind them) is changing for every memory generation. But normally the failures are only an a small level compared to the spec.
What kind of failures are caused by particles in an SRAM ?
In a DRAM the most critical things are the cell, that has only a small amout of charge, or the BL when the SAs are just sensing the information.
But if we would discuss this we would have to open a new thread ;-)

I have not read the Forbes story. When was it, or is there an online source of the story ?

thanks for the discussion !!!!

ruckb

pm · Dec 16, 2000

Thanks, AndyHui. You are, of course, correct.

Thanks, Technonut. Interesting that it makes any difference at all. There's goes my theory. Clearly it's not a penalty on reads though... otherwise, as ruckb pointed out much earlier, it would be a huge difference. I presume that it's a penalty on writes (similar to ECC in SDRAM) and you are only seeing this penalty when there's a write-on-read. Maybe the Tbird design takes a one-cycle hit on L2 bypasses (where data is written and then read in the same cycle). All the caches that I've worked on (three) dealt with this differently. <shrugs>

As far as an ECC algorithm... It'll have to wait until Monday since I don't keep any books on this at home. I think one of my college textbooks has a good description of the standard algorithm.

Chipkill? All I know about that is that it's IBM's design to eliminate SER failures. I figured it was a fancy implementation of ECC. But I'm not a system level designer, so I never looked too closely at it.

As far as SER rates, Tim Dell from IBM wrote a paper on Chipkill and SER and he quoted some amazing numbers. Of course, IBM was/is trying to promote their ChipKill solution to the problem, so it makes sense that they might try to overexaggerate it, but still... Here's a story on it that has the quote:

'"This clearly indicates that because of cosmic rays, for every 256 Mbytes of memory, you'll get one soft error a month," said Tim Dell, senior design engineer for IBM Microelectronics.'

<What kind of failures are caused by particles in an SRAM ?>

Generally it's a bit flip in an SRAM cell - long BL's can have problems in front of sense-amps as you mentioned, but it's mostly the former. You have two back-back-inverters in a SRAM cell, and you hit it with an alpha, beta or gamma particle (beta particles are pretty insignificant though) and this ionizes atoms and creates enough additional electrons to flip the bit, flipping the inverter, which flips the other inverter and the cell is now holding the wrong value. This is simplified of course.

The Forbes article might be online, but I don't think so. If you have access to a library, it might (I've never been to a German library - how much English stuff do you guys have in your libraries?) have back issues of Forbes. It's not a very technical article. It actually makes the point that Sun was too secretive about the problem and that this backfired on them since they are losing customers as a result of it. It only generally talks about what the real problem is.

Edit: Found the Forbes article online. Interesting reading is found here.

Technonut · Dec 17, 2000

I heard about the Cosmic Ray induced errors before. When the researchers moved the test setup to an underground area under 50 feet of rock, the Cosmic Rays were eliminated resulting in no soft errors compared to to the ones recorded at sea level with no shielding. Also the larger the DRAM, the larger the error rate, with ECC memory being affected much more in theory. I am in no way an expert in the matter, just something I read awhile back, and found interesting.

It is just strange to consider that as a source of errors.

ruckb · Dec 17, 2000

Hi,

pm, what difference are you talking about? I thought all benchmarks showed that there is no difference between L2ECC on or OFF?
Do I have missed anything ?
As the L2 cach eCC of the PIIIs can not be disabled, is your board having this option, or have you disabled the main memory ECC for your benchmark?

Chipkill has nothting to do with SER. It is a kind of ECC, that corrects the fail of a complete chip on a DIMM. I think it can be used only on x4 based DIMMs. therefore it is some kind of ECC, that can correct 4 failing bits. I think Serverworks needs a 288bit databus to implement it.
The link to the SER Story is interesting. What I do not understand is, if it is really a problem, why was there is no option for DDR to implement something more effective then the single bit ECC. It was a new definition of a standard, and it would have been possible to implement something new!

pm. I'm not sure, but as far as I remember, the ECC algorithm is able to detect whether the fail is in the data information or the ECC information. Right ???

The behaviour of SUN in case of their L2 problem is unbelievable. Ther you can see how much difference it makes, when a lot of people are knowing a problem (and discussing it in the internet).
In case of SUN the problem is much bigger, and the reaction of SUN is close to Zero.

ruckb

L2 cach ECC, On or off???

Banned

Banned

Diamond Member

Member

Banned

Diamond Member

Banned

Junior Member

Member

Diamond Member

Member

Diamond Member

Member

Elite Member Mobile Devices

Banned

Golden Member

Elite Member Mobile Devices

Diamond Member

Administrator Emeritus<br>Elite Member<br>AT FAQ M

Member

Elite Member Mobile Devices

Diamond Member

Member