Samsung TLC erase block sizes

sambrightman

Junior Member
Sep 21, 2015
11
0
0
There are quite a few references on the web to the Samsung 840 & 840 EVO having an unusual erase block size of 1.5MiB due to TLC (and page size 8KiB). I assume this applies to the 850 drives too.

This would mean that the most often-quoted alignment advice for them ("Windows 7+ does it for you") is incorrect. However, the Anandtech review is the sole dissenter, saying the EBS is 2MiB. Note that the Pro versions of these drives use 2-bit cells and thus should have a regular EBS according to conventional wisdom.

This looks like a potential typo and you can see someone asking about it in the comments (no answer). However, it could also be that the conventional wisdom is wrong - why would TLC have an odd EBS but not an odd page size? This post claims the information came from Samsung.

Does anyone have a solid source for this information and/or know how to contact Anandtech for corrections? More generally, perhaps page and block sizes could be standard stats in SSD reviews?
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
Another user claims to have got the EBS directly from Samsung technical support:

https://bbs.archlinux.org/viewtopic.php?pid=1385980#p1385980

They have just flat-out refused to give me the information via e-mail. They claim Windows 7+ does it automatically, which is obviously not true (it doesn't know the size, it just guesses that 1MiB fits everyone).

I'm starting to suspect something strange is going on here: it seems odd not to give out the information, and odd that the page size would not also be a multiple of three. As I understand it, this actually creates a much bigger problem.

If Anandtech's figure of 256 pages per block is correct and the users reporting a 1536KiB erase block size are also correct, then a page would be 6KiB. Windows would not be aligned by default, FS blocks would frequently straddle pages, etc.
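To spell out the arithmetic (using the two claimed figures above, which are not confirmed specs):

Code:
KIB = 1024

erase_block = 1536 * KIB      # the 1.5 MiB figure some users report
pages_per_block = 256         # Anandtech's figure
page = erase_block // pages_per_block
print(page // KIB)            # 6 -> 6 KiB per page

# Default Windows 7+ partition offset vs. a hypothetical 6 KiB page:
offset = 1024 * KIB
print(offset % page)          # 4096 -> the default 1 MiB offset is NOT page-aligned
# and only every third 4 KiB FS block would start on a 6 KiB page boundary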
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,732
155
106
It is a shame that the information isn't more readily available, but I doubt it makes much if any difference. The hardware and drivers are able to optimize for this stuff and shouldn't require any real user attention. The people who read your emails likely don't even know the answer, tbh. This is also something that can change across firmware, NAND, and controller combinations.
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
It is a shame that the information isn't more readily available, but I doubt it makes much if any difference. The hardware and drivers are able to optimize for this stuff and shouldn't require any real user attention. The people who read your emails likely don't even know the answer, tbh. This is also something that can change across firmware, NAND, and controller combinations.

What makes you think the hardware and driver can optimise for odd-sized or misaligned pages/blocks? That doesn't sound possible to me.
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,732
155
106
What makes you think the hardware and driver can optimise for odd-sized or misaligned pages/blocks? That doesn't sound possible to me.

The buffering and the algorithms used to read/write the hardware aren't tied to the filesystem configuration. You're likely dealing with 4K blocks on the OS/FS side, but the hardware will decide when to actually read/write bunches of these optimally.
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
The buffering and the algorithms used to read/write the hardware aren't tied to the filesystem configuration. You're likely dealing with 4K blocks on the OS/FS side, but the hardware will decide when to actually read/write bunches of these optimally.

I don't think anyone doubts the value of alignment. The disk knows nothing of partitions and cannot wait indefinitely for another write that fits the gap. Furthermore, the next gap will then need filling too.

The odd thing here is that we don't actually have the details of the "geometry" to calculate from.
 

AlienTech

Member
Apr 29, 2015
117
0
0
A LONG time ago people found this info and made their sector size and partition info match, but these days it changes between capacities, versions, controllers and whatnot. I don't think it really matters for normal users. No one is going to write 1TB every day; these drives are designed to handle something like 10GB, 20GB or 50GB of writes a day. If you want more, get a professional drive. Why would you spend so much time trying to figure out something they don't publish any data on? Unless you know someone who worked on the drive controller software, no one would be able to tell you this info. Write amplification is around 3x or higher on a regular drive, which means data is written three times for every byte you write. Don't worry about it. Keeping the drive less than half full would save a lot more write cycles than any alignment you could do.
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
I don't agree this is deep knowledge that only a controller programmer knows. Look at the depth that the average Anandtech review goes into. It is even stated in the 840 EVO review (although possibly incorrect), along with much deeper information.

I would like to get this right, get the most out of the drive, gain knowledge and be able to measure the difference. It is not just about durability (performance matters too), but regardless, doubling any existing amplification doesn't sound like a good idea. I'd also prefer not to pay for a 1TB drive and only use 512GB as a workaround. If you prefer not to care and do it that way, that's fine with me, of course. I'd like to understand what is going on with TLC geometry.
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,732
155
106
If the internal page size for the device is 8K or 1M, the drive is just gonna do its thing regardless. Alignment is an entirely different matter. For all you know, you can write 128MB to the drive and it'll just sit in the buffer until the controller/firmware decides it's a good time to flush it to the NAND. SSDs are basically just computers. Your OS has internal page sizes and deals with "kernel pages" in much the same way. You might get huge pages or 4K pages for a particular scenario.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Alignment past 4K really doesn't mean much, these days. If you have a bigger write, Windows will try to align it to an appropriate power of 2 (and, they were doing this before SSDs became common, since they found it improved HDD performance in servers). Some SSDs may recommend 64KB, or 128KB partition alignment, but even that isn't too common, anymore.

The filesystem uses 4K. You can change it, but you won't get any benefits. With 4K alignment, the SSD can map to pages that are 4K, or multiples of 4K, with relative ease. After you've used the drive for awhile, and have written over it once, the actual mappings will be arbitrary, and should be considered random (they aren't, as evidenced by background GC improving performance, after letting the drive rest while on, but you have no visibility into how the data is actually stored).

Needing to know more for alignment was because of the performance fragility of earlier SSDs, not that you really should care about it. The SSD maps writes where it wants them to go, how it wants to. With modern SSDs, you are far removed from the actual pages and erase blocks.
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,732
155
106
The one thing you can count on, as far as I've seen, is that you'll always be dealing with multiples of 4K with modern SSDs. So if your software, fs, os, etc. are dealing with 4K chunks, you'll pretty much always get near 100% performance.
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
The one thing you can count on, as far as I've seen, is that you'll always be dealing with multiples of 4K with modern SSDs. So if your software, fs, os, etc. are dealing with 4K chunks, you'll pretty much always get near 100% performance.

That is exactly what I'm asking. It is not clear that this is the case with Samsung TLC. Contrary to what you suggest in your previous post, SSD page size is part of the physical architecture of the NAND and cannot change as the controller sees fit. If there are 3 bits per cell and X cells per page (and Y pages per block), this would imply page size is not a multiple of 4K - it would have to be a multiple of 3. Ignore the OS/FS page size for now, just talking about the NAND architecture.

Writes cannot sit indefinitely in controller cache. This would be highly unusual and a huge data loss risk for a non-battery/UPS-backed drive. As far as I'm aware, it is unusual even for a battery-backed drive to flush less often than every few seconds - much like the OS cache. Also, writes must eventually be committed.

I can imagine a controller possibly coalescing writes (even random ones with remapping) into clusters of pages so that only the last one requires a double-write. Still presents a double read for each odd one later on. [EDIT: they would not have to be contiguous]
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
Alignment past 4K really doesn't mean much, these days. If you have a bigger write, Windows will try to align it to an appropriate power of 2 (and, they were doing this before SSDs became common, since they found it improved HDD performance in servers). Some SSDs may recommend 64KB, or 128KB partition alignment, but even that isn't too common, anymore.

The filesystem uses 4K. You can change it, but you won't get any benefits. With 4K alignment, the SSD can map to pages that are 4K, or multiples of 4K, with relative ease. After you've used the drive for awhile, and have written over it once, the actual mappings will be arbitrary, and should be considered random (they aren't, as evidenced by background GC improving performance, after letting the drive rest while on, but you have no visibility into how the data is actually stored).

Potentially you would get some benefit from having e.g. 8K filesystem blocks on an SSD with 8K NAND pages: one FS write of 8K would be one SSD write instead of two. However, it seems like this is *very* likely to be solved by the controller cache anyway (and you waste more space on files smaller than 8K). So I agree, this kind of thing is probably not worth the time.

All of this presumes that the drive itself is 4K though (or a multiple). That is exactly my question - it seems like it might not be for these Samsung drives.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
That is exactly what I'm asking. It is not clear that this is the case with Samsung TLC. Contrary to what you suggest in your previous post, SSD page size is part of the physical architecture of the NAND and cannot change as the controller sees fit.
Where is that contradicted?

If there are 3 bits per cell and X cells per page (and Y pages per block), this would imply page size is not a multiple of 4K - it would have to be a multiple of 3.
No, it wouldn't have to be. The number of raw bits stored in a page must be a multiple of 3 (or have waste), but the externally visible data need not be. Any adherence to multiples of 3 in the page or block structure will be due to engineering convenience.

Writes cannot sit indefinitely in controller cache.
With a pSLC cache, yes, they could. Most SSDs do not cache writes in volatile memory, but only buffer them in memory long enough to write them (up to 5ms w/ TLC, <2ms w/ MLC, and IIRC, <100uS w/ SLC). pSLC is a cheap way to not need expensive capacitors, provided that MLC/TLC NAND data not in the page being written can be prevented from becoming corrupt if power is lost mid-write.

I can imagine a controller possibly coalescing writes (even random ones with remapping) into contiguous clusters of pages so that only the last one requires a double-write. That starts to get pretty complicated: still presents a double read for each odd one later on and would maybe stop working with fragmentation of erased cells.
It only requires added reads if there is insufficient RAM to cache the page tables (which is why really heavy workloads tend to bog down drives w/ only a little bit of SRAM, but not most drives with big DRAM caches). Most SSDs with DRAM use the DRAM to store mapping info, so that only the data itself must be read from the NAND.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Potentially you would get some benefit from having e.g. 8K filesystem blocks on an SSD with 8K NAND pages: one FS write of 8K would be one SSD write instead of two. However, it seems like this is *very* likely to be solved by the controller cache anyway (and you waste more space on files smaller than 8K). So I agree, this kind of thing is probably not worth the time.
There are still two writes needed, though, since you need one for the metadata (though, data from many writes may be coalesced, and written later).

There should not be any benefit to matching the sizes precisely. Two 4K FS blocks could be put together in a single 8K NAND page, for example; making 8K FS blocks more wasteful, even with 8K NAND pages. I would in fact expect that to be done, for smaller-than-page-size FS allocations, once the data is moved out of the block that it was originally written to.

File system block size need not match the NAND page size. All that we really care about there is that filesystem blocks do not partially overlap NAND pages (as was the case with the old 63 starting sector).

There's always wasted space on very small writes. But, the larger page sizes, within reason, are part of the compromises that allow for greater density at some acceptable cost, and that waste tends not to be much.
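To illustrate the partial-overlap point with the old 63-sector start (assuming 4 KiB filesystem blocks and 8 KiB NAND pages purely for the sake of example):

Code:
FS_BLOCK = 4096
NAND_PAGE = 8192
SECTOR = 512

def straddles_page(partition_offset, n_blocks=8):
    # Does any of the first n filesystem blocks cross a NAND page boundary?
    return any((partition_offset + i * FS_BLOCK) % NAND_PAGE + FS_BLOCK > NAND_PAGE
               for i in range(n_blocks))

print(straddles_page(63 * SECTOR))   # True  - old MS-DOS style 63-sector start
print(straddles_page(1024 * 1024))   # False - modern 1 MiB alignment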
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
No, it wouldn't have to be. The number of raw bits stored in a page must be a multiple of 3 (or have waste), but the externally visible data need not be. Any adherence to multiples of 3 in the page or block structure will be due to engineering convenience.

I'm not following where you disagree with me. There is an integer number of cells per SSD page, the cells are 3 bits each, therefore the pages are multiples of 3 bits. These are the actual pages that the controller must eventually read from or write to, and their size does not change. You seem to agree, but say "no"?

With a pSLC cache, yes, they could. Most SSDs do not cache writes in volatile memory, but only buffer them in memory long enough to write them (up to 5ms w/ TLC, <2ms w/ MLC, and IIRC, <100uS w/ SLC). pSLC is a cheap way to not need expensive capacitors, provided that MLC/TLC NAND data not in the page being written can be prevented from becoming corrupt if power is lost mid-write.

The pSLC cache is not designed to be used indefinitely, and most information about it suggests it only provides very temporary asynchronous buffering. Perhaps all the details are not made available. I agree that in principle a write that overlapped two oddly-sized physical pages could be cached here (the SLC itself should then align, of course) and only written when there was another write to fill up the second physical page.

It only requires added reads if there is insufficient RAM to cache the page tables (which is why really heavy workloads tend to bog down drives w/ only a little bit of SRAM, but not most drives with big DRAM caches). Most SSDs with DRAM use the DRAM to store mapping info, so that only the data itself must be read from the NAND.

If a write overlaps two physical pages that are mapped to be non-consecutive, how is the RAM cache helping? What about random reads?
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
There are still two writes needed, though, since you need one for the metadata (though, data from many writes may be coalesced, and written later).

There should not be any benefit to matching the sizes precisely. Two 4K FS blocks could be put together in a single 8K NAND page, for example; making 8K FS blocks more wasteful, even with 8K NAND pages. I would in fact expect that to be done, for smaller-than-page-size FS allocations, once the data is moved out of the block that it was originally written to.

File system block size need not match the NAND page size. All that we really care about there is that filesystem blocks do not partially overlap NAND pages (as was the case with the old 63 starting sector).

There's always wasted space on very small writes. But, the larger page sizes, within reason, are part of the compromises that allow for greater density at some acceptable cost, and that waste tends not to be much.

I think we agree on all of this. I said in the post you replied to that this seems likely to be dealt with by the cache, and that trying to match the sizes has more downside because it increases wasted space. Alignment is much more important, and that's why I'm trying to understand whether TLC drives have every second NAND page unaligned.

I'm only saying the ability to do a *good* job of combining two 4K writes into 8K depends on the exact implementation of the caching. Many drives do not have non-volatile cache as I understand it: Samsung 840 has TLC and no TurboWrite. I think most 2-bit MLC drives have no SLC region but haven't looked into it. So it still leaves a question of how those drives try to mitigate the mismatch. Writes are initially coalesced in RAM, but that's flushed every few seconds. Maybe it's just that a single NAND page being double-written every few seconds (and they have to be "active" seconds too) is not a big deal over the lifetime of the drive.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I think we agree on all of this. I said in the post you replied to that this seems likely to be dealt with by the cache, and that trying to match the sizes has more downside because it increases wasted space. Alignment is much more important, and that's why I'm trying to understand whether TLC drives have every second NAND page unaligned.
Again, though, TLC doesn't mean it has to be 12K instead of 8K, or anything of that sort. I don't know what they actually have for bits per page, off the top of my head, but let's say they stored 8KB of user data (65,536 bits) in a 73,728-bit page, with the remaining 8,192 bits as ECC info. That is a whole number of 3-bit cells (24,576 of them) and still gives a nice even 8KB page. The 3 bits is not for your data, just the raw flash.
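Just to make that arithmetic concrete (these per-page figures are illustrative, not Samsung's actual layout):

Code:
data_bits = 8 * 1024 * 8          # 8 KB of user data = 65,536 bits
page_bits = 73728                 # hypothetical physical page size in bits
cells = page_bits // 3            # TLC: 3 bits per cell
spare_bits = page_bits - data_bits

print(page_bits % 3 == 0, cells)  # True 24576 -> a whole number of TLC cells
print(spare_bits)                 # 8192 bits left over for ECC/spare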

I'm only saying the ability to do a *good* job of combining two 4K writes into 8K depends on the exact implementation of the caching. Many drives do not have non-volatile cache as I understand it
No common consumer drive today should have a volatile write cache. They have to do limited buffering of writes in between coming in on the SATA or PCIe links, but as soon as they can get those to NAND, they do. Like, in a matter of milliseconds (or less).

Samsung 840 has TLC and no TurboWrite. I think most 2-bit MLC drives have no SLC region but haven't looked into it.
Toshiba, Sandisk, and Micron have now also done this. It is likely that all will shortly. But as far as actually handling how things get written and rewritten, what it offers is the same as MLC or TLC w/o the pSLC caches (overall it's just plain superior). The pSLC is so quick to write to that, as long as the drive can prevent address scrambling, or write disturbs, during a bad shutdown, it should not need a capacitor bank, beyond what it normally operates with, to handle bad shutdowns gracefully.

So it still leaves a question of how those drives try to mitigate the mismatch. Writes are initially coalesced in RAM, but that's flushed every few seconds. Maybe it's just that a single NAND page being double-written every few seconds (and they have to be "active" seconds too) is not a big deal over the lifetime of the drive.
It results in higher wear, since it's 1-3K MLC/TLC to 1-3K MLC/TLC, instead of 50K? 100K? pSLC to 1-3K MLC/TLC. It could be minutes, or hours, though, with normal use, in between moving them. What any drive will do is take data written to a block that's either in need of remapping (causing imbalance in mapping trees, for instance, from excess fragmentation), or that's in a pool due to be used to keep wear even, and move any live data to pages on other blocks free for writing. In the process, it will optimally arrange that data (what defines optimal may vary by drive). Doing so from the pSLC cache reduces undue wear on the MLC/TLC it backs.
 

Hellhammer

AnandTech Emeritus
Apr 25, 2011
701
4
81
Again, though, TLC doesn't mean it has to be 12K instead of 8K, or anything of that sort. I don't know what they actually have for bits per page, off the top of my head, but let's say they stored 8KB of user data (65,536 bits) in a 73,728-bit page, with the remaining 8,192 bits as ECC info. That is a whole number of 3-bit cells (24,576 of them) and still gives a nice even 8KB page. The 3 bits is not for your data, just the raw flash.

And that is exactly how it is. The physical page/block size is always larger due to spare ECC bytes.
 

sambrightman

Junior Member
Sep 21, 2015
11
0
0
Again, though, TLC doesn't mean it has to be 12K instead of 8K, or anything of that sort. I don't know what they actually have for bits per page, off the top of my head, but let's say they stored 8KB of user data (65,536 bits) in a 73,728-bit page, with the remaining 8,192 bits as ECC info. That is a whole number of 3-bit cells (24,576 of them) and still gives a nice even 8KB page. The 3 bits is not for your data, just the raw flash.

OK, this is the key. My assumption was that the (data) page size was an integer multiple of the cell size. In fact it is the physical page size that is an integer multiple of the cell size, and the cells remaining after the data page are used for ECC/admin.


No common consumer drive today should have a volatile write cache. They have to do limited buffering of writes in between coming in on the SATA or PCIe links, but as soon as they can get those to NAND, they do. Like, in a matter of milliseconds (or less).

Toshiba, Sandisk, and Micron have now also done this. It is likely that all will shortly. But as far as actually handling how things get written and rewritten, what it offers is the same as MLC or TLC w/o the pSLC caches (overall it's just plain superior). The pSLC is so quick to write to that, as long as the drive can prevent address scrambling, or write disturbs, during a bad shutdown, it should not need a capacitor bank, beyond what it normally operates with, to handle bad shutdowns gracefully.

The SLC cache benefit in terms of admin and latency is clear to me already, but again: the original Samsung 840 had no SLC cache (it was touted as a feature of the EVO). Presumably a large number of 2-bit MLC drives in the last few years also had none.

ECC explains how writes do not overlap pages on TLC drives. But you agree the frequency of double-writing a NAND page on drives without SLC cache is higher? Ignoring volatile cache for now, two FS 4KiB pages get written to a device with 8KiB NAND pages. The drive controller can manage this to some extent by writing to two different empty NAND pages. Even after the disk is half-full, you can sometimes re-arrange the data: modifications on two different NAND pages can be relocated to one single page. But eventually you have a disk that is 70% full or something and receive two such (FS) writes a couple of seconds apart and have to write twice instead of once. A drive with either NAND page size = FS page size or SLC cache can avoid this.
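As a toy illustration of that scenario (my own sketch, not how any real controller works), counting NAND page programs for a stream of 4 KiB FS writes on 8 KiB pages, with and without coalescing in pairs:

Code:
PAGE = 8 * 1024
FS_BLOCK = 4 * 1024

def page_programs(n_fs_writes, coalesce):
    if coalesce:
        # two 4 KiB writes share one 8 KiB page program (an odd one still costs a full page)
        return (n_fs_writes * FS_BLOCK + PAGE - 1) // PAGE
    # each 4 KiB write arriving on its own triggers its own page program
    return n_fs_writes

for n in (1, 2, 100):
    print(n, "FS writes:", page_programs(n, False), "programs uncoalesced vs",
          page_programs(n, True), "coalesced")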

It results in higher wear, since it's 1-3K MLC/TLC to 1-3K MLC/TLC, instead of 50K? 100K? pSLC to 1-3K MLC/TLC. It could be minutes, or hours, though, with normal use, in between moving them. What any drive will do is take data written to a block that's either in need of remapping (causing imbalance in mapping trees, for instance, from excess fragmentation), or that's in a pool due to be used to keep wear even, and move any live data to pages on other blocks free for writing. In the process, it will optimally arrange that data (what defines optimal may vary by drive). Doing so from the pSLC cache reduces undue wear on the MLC/TLC it backs.

I was more talking about the wear on the SLC region itself. A small region of the disk sees essentially every write.
 

Hellhammer

AnandTech Emeritus
Apr 25, 2011
701
4
81
I was more talking about the wear on the SLC region itself. A small region of the disk sees essentially every write.

This varies depending on the implementation. Some have fixed cells for SLC portion (e.g. SanDisk Ultra II), whereas in others the physical location of the cache can be dynamic (i.e. consumes cells equally).
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
But you agree the frequency of double-writing a NAND page on drives without SLC cache is higher?
http://forums.anandtech.com/showthread.php?t=2412220

Yes, and that has been observed. There are two parts to it: one is the lower wear on SLC, and the other is that a lot of small writes are temporary, and may be overwritten or TRIMed within a short amount of time (leaving less to be copied to the backing MLC/TLC flash).

But eventually you have a disk that is 70% full or something and receive two such (FS) writes a couple of seconds apart and have to write twice instead of once. A drive with either NAND page size = FS page size or SLC cache can avoid this.
Sure, but you're not saving much even in that case, unless you're dealing with only <4K sync writes to sequential LBAs, which is going to be pretty rare, I think.

I was more talking about the wear on the SLC region itself. A small region of the disk sees essentially every write.
I haven't seen specs on what that can take, but I would imagine it would be in the tens of thousands of p/e cycles, leaving no reasonable situation where the MLC/TLC behind it (which it may or may not share at different times) isn't worn out first.