這是用戶在 2024-4-3 15:10 為 https://kcall.co.uk/ssd/index.html 保存的雙語快照頁面,由 沉浸式翻譯 提供雙語支持。了解如何保存?
Everything I know about SSDs

Solid State Devices using NAND Flash, how they differ from Hard Drives, and how they affect file deletion and recovery

March 2019: 2019年3月:


Introduction: 簡介:

I started writing this rather long page for my own benefit, when I acquired a upgrade from my old 2006 Win 8 250 gb HDD PC to a Dell Optiflex 3010 with a 128 gb SSD. I never used more than 30 or 40 gb of the system drive, and I'm not a gamer or an avid film or music collector either. Not on a PC anyway. As I played with my new kit the further I went I realised that I knew very little about NAND flash in SSDs, just how SSDs work, how do they read and write and store data, and what sort of trickery do they employ? I can visualise an HDD, writing tiny magnetic patterns on a rotating surface, but SSDs are different, vastly different.
我開始寫這個相當長的頁面是為了我自己的利益,當我從我的舊2006年Win 8 250 GB HDD PC升級到128 GB SSD的Dell Optiflex 3010時。我從來沒有使用超過30或40 gb的系統驅動器,我也不是一個遊戲玩家或狂熱的電影或音樂收藏家。不是在PC上,無論如何。當我玩我的新工具包時,我意識到我對SSD中的NAND快閃記憶體知之甚少,只是SSD如何工作,它們如何讀取,寫入和存儲數據,以及它們採用了什麼樣的技巧?我可以想像一個硬碟驅動器,在旋轉的表面上寫微小的磁性圖案,但固態硬碟是不同的,有很大的不同。

There's also quite a few misconceptions about SSDs which seem persistent, and it would be nice to examine them if not perhaps quash a few of them. Perhaps I was guilty of harbouring quite a few misconceptions myself. However it started, this article grew into, shall we say, a mid-level technical discussion. If all you need to know is that SSDs are quiet, reliable, fast, and will work for years, then there's no need to read any further. If however, you think that knowing how to read a 3D TLC NAND flash cell is interesting, then you have little option but to plough on.
關於SSD也有很多誤解,這些誤解似乎是持久的,如果不能消除其中的一些,那麼檢查它們會很好。也許我自己也犯了不少錯誤。無論如何開始,這篇文章逐漸發展成為一個中級的技術討論。如果您只需要知道SSD安靜、可靠、快速,並且可以工作多年,那麼就沒有必要進一步閱讀。然而,如果你認為知道如何讀取3D TLC NAND快閃記憶體單元是有趣的,那麼你別無選擇,只能繼續努力。

As much of the detailed information as possible has been sourced from corporate and private technical articles, with quite a lot from Seagate and WD, and the wonderfully named Flash Memory Summit. Some of the conclusions I've made are from just trying to apply what logic I can along with common sense. Such is the complexity of NAND flash controllers, the variance in their methods of operation, and the speed of their development, that trying to comprehend let alone keep up with them is difficult to say the least. I can't say whether what I've written isn't confusing or is even true, but it's more of a guide than a bible. There'll be some repetition too. And it will soon be out of date.

I am obliged to those I have borrowed from, and will also be obliged to those who point out any errors without any reward apart from that of contribution. I've tried to explain what is different with SSDs, and why it is so hard to grasp with our ingrained HDD minds.

The first misconception might be the plural of SSD: gramatically it should be, so I'm told, SSDs, but SSD's is almost as commonplace. Here I will stick to one SSD, many SSDs.


Software and hardware: 軟體和硬體:

This article was written in 2019 onwards and deals almost exclusively with NAND flash in the form of pc or laptop storage devices we know as SSDs. I shan't complicate things even more by referring to the ubiquitous flash drive or other NAND flash devices. If significant differences exist I shall try to note them as and when that occurs, but the default is the internal drive. Nowhere here is there anything about flash storage in phones, etc.

Most of the detail was produced whilst my PC was running Windows 10 Home, with a fairly modest internal 2.5" WD Green 120 gb SSD. This uses a Silicon Motion SM2258XT controller and four 32 GiB SanDisk 05497 032G 15nm 3D TLC memory chips with an inbuilt SLC cache of unknown capacity. As this article tries to discuss the behaviour of SSDs as a whole it shouldn't matter what host operating or file system is used, but in my case it's Windows and NTFS. Nothing here is specific to a particular brand or type of SSD, it should all be generic. We're really dealing with the principles of SSD operation.
大部分的細節是在我的PC運行Windows 10 Home時產生的,內部有一個相當溫和的2. 5」WD綠色120 GB SSD。它使用Silicon Motion SM2258XT控制器和四個32 GiB SanDisk 05497 032G 15 nm 3D TLC內存晶片,內置容量未知的SLC緩存。由於本文試圖從整體上討論SSD的行為,因此使用什麼主機作業系統或文件系統並不重要,但在我的情況下,它是Windows和Linux。這裡沒有什麼是特定於特定品牌或類型的SSD,它應該都是通用的。我們實際上是在處理SSD操作的原理。

The only additional software applications I have used are Piriform's excellent Recuva, which can list both live and deleted files and their cluster allocations, and HexDen (HxD), a very usable and capable hex editor. Recuva is free from www.piriform.com, and HexDen is also free from www.mh-nexus.de. I use the portable versions of both pieces of software.

All the conclusions and opinions here are entirely my own work, and any data taken from my own pc. It would be wise to verify, or at least agree with my reasoning, before accepting these words as the truth. Much of this is a simplified explanation of a very complex subject.


SSD Physical Internals: SSD物理內部組件:

Poking inside an SSD is something of a disappointment, a small pc board with a few NAND flash chips and a controller chip, lightweight and a little flimsy. As for the software inside the controller, I can only summarise the basic tasks. It seems commonplace that controllers are bought in from external manufacturers, as indeed are the memory chips. SSD controller software is proprietary, very complex and highly guarded, but all controllers have to do basic tasks, even if we don't quite know how. Only those tasks can be discussed here, the very clever tweaks and tricks will have to remain known only to the manufacturer. I'll start with a little groundwork.


NAND Flash: NAND快閃記憶體:

I wasn't going to delve into the internals of NAND flash, there are enough frankly bemusing articles on Wikipedia for all that. All you really need to know is that NAND (NOT-AND) flash memory stores information in arrays of cells made from floating-gate transistors. The floating gate can either have no charge of electrons, and be in an 'empty' logical state, or be charged with electrons at various voltage thresholds and be in a logical state which represents a value. NAND flash is non-volatile and retains its state even when the SSD is not powered up. Oh yes, it's called flash because a large chunk of cells can be erased (flashed) at a time.

But if you want to know more, go ahead. Here the term cell and transistor refer to the same physical entity and are used interchangeably, and I won't keep saying NAND all the time.

Flash memory comprises multiple two-dimensional arrays of transistors, and supports three basic operations, read, program (write) and erase. Apart from the flash arrays, the flash chip includes command and status registers, a control unit, decoders, analogue circuits, buffers, and address and data buses. A separate chip holding the SSD controller sends read, program, or erase commands to the flash chip. In a read operation the controller passes the physical address to the flash chip which locates the data and sends it back to the controller. in a program operation the data and physical address are passed to the chip. In an erase operation, only the physical address is passed to the chip.

The flash chip's latches store data transferred to and from the flash arrays, and the sense amplifiers detect bit line voltages during read operations. The controller monitors the command sent to the chip using the status register. The controller also includes Error Checking and Correction (EEC) algorithms to manage error and reliability issues in the chip and to ensure that correct data is read or written.

Each row of an array is connected by a Word line, and each column by a Bit line. At the intersection of a row and column is a Floating Gate Transistor, or cell, where the logical data is stored. Word lines are connected to the transistors in parallel, and bit lines in series. The ends of the bit lines are connected to sense amplifiers.

Flash arrays are partitioned into blocks, and blocks are divided into pages. Within a block the cells connected to each word line constitute a page. The cells connected to the bit lines give the number of pages in a block. Common page sizes are 4k, 8k or 16k, with 128 to 256 pages making a block size between 512k and 4mb. A page is the smallest granularity of data that can be addressed by the chip control unit.

Read or program operations involve the chip controller selecting the relevant block using the block decoder, then selecting a page in the block using the page decoder. The chip controller is also responsible for activating the correct analogue circuitry to generate the voltages needed for program and erase operations.

Although the number of cells in each row is nominally equivalent to the page size, the actual number of cells in each row is higher than the stated capacity of each page. This is because each page contains a set of spare cells as well as data cells. The spare cells store the ECC bits for that page as well as the physical to logical address mapping for the page. The controller may also save additional metadata information about the page in the spare area. During a read operation, the entire page (including the bits in the spare area) is transmitted to the controller. The ECC logic in the controller checks and correct the read data. During a program operation the controller transmits both the user data and the ECC bits to the flash memory.

Upon system boot the controller scans the spare area of each page in the entire flash array to load the logical to physical address mapping into its own memory (there may be other techiques for holding mapping data in the controller). The controller holds the logical to physical address mapping in the Flash Translation Layer (FTL). The FTL also performs garbage collection to clear invalid pages following writes, and performs wear-leveling to ensure that all the flash blocks are used, evenly.

Since flash does not support in-place updates, a page needs to be erased before its contents can be programmed; but unlike a program or a read operation which work at a page granularity, the erase operation is performed at a block granularity.


2D and 3D, and Layers:

In flash architecture a block of planar flash, a two-dimensional array of cells, is rather unsurprisingly called 2D flash. If one (or more) array is stacked on top of each other then it's 3D flash. 3D NAND flash is built on one chip, up to 32 layers, and was devised to drive costs down when planar flash reached its scaling limit: 3D flash costs little more than 2D to produce, but multiplies the storage capacity immensely. In both 2D and 3D the cells in each page (the rows) are connected by Word Lines, and the cells at each offset within a page (the columns) are connected with a Bit Line (to put it very simply).

3D flash is not the same as layered flash, where separate very thin chips are arranged in a stack. This is prohibitively expensive. Most modern consumer SSDs (in the 2010's) use 3D TLC flash.


Can I see one?

The cell size on end-user flash is minute, with 15nm being common, and ranges from 43nm down to 12nm. Actually cell size, or cell diameter, is misleading, as the stated size is not a measurement of any dimension of a cell but a measure of the distance between discrete components on the chip. The silicon layers on the chip are approximately 0.5 to 3nm thick: by comparison a hydrogen atom is 0.1nm in diameter, and the silicon atoms used in chip manufacture 0.2nm. A nanometre (nm) is indeed exceedingly small, a billionth of a metre, and as an analogy if one mn were the size of a standard marble (about 13mm) then one metre would be the size of the earth. The power of a billion is impressive.


SLC, MLC, TLC, QLC and Beyond:

A Single-Level Cell (SLC) has one threshold of electron charge to indicate the state of one bit, one or zero. A Multi-Level Cell (MLC) holds a voltage denoting the state of two bits, with three different thresholds representing 11, 10, 00 and 01. A Triple-Level Cell (TLC) holds the state of three bits, 111, 110, 100, 101, 001, 000, 010, and 011. The 15 thresholds used in Quad-level cells (QLC) can be deduced if anyone is at all interested. (I have seen other variations of what these threshold values represent in bit terms.)

Unfortunately when the double level cell was developed it was called a multi-level cell and given the acronym MLC, thus forcing everyone to type out multi-level cell laboriously when they want to refer to multiple level cells. If only it had been called a double-level cell we could use DLC, TLC, and QLC freely and use MLC to describe the lot, but it's too late for that now. If only flash had stopped at SLC, with its yes/no one/zero state, these explanations would be far easier to write, and hopefully far easier to grasp.

With multi-level cells physical NAND pages represent two or more logical pages. The two bits belonging to a MLC are separately mapped to two logical pages. Odd numbered pages (including zero) are mapped to the least significant (RH) bit, and even numbered pages are mapped to the most significant (LH) bit. Similarly, the three bits belonging to a TLC are separately mapped to three logical pages, and a QLC is mapped to four logical pages (The page numbering for TLC and QLC is unknown).

The more bits a multi-level cell has to support affects the cell's performance. With SLC the controller only has to check if one threshold has been exceeded. With MLC the cell can have four values, with TLC eight, and QLC 16. Reading the correct value of the cell requires the SSD controller to use precise voltages and multiple reads to ascertain the charge in the cell. It's also apparent that if a single physical page supports multiple logical pages then that page will be read and written more frequently than a SLC page, with consequent affect on its life expectancy. Furthermore it would seem self-evident that a TLC SSD would need only a third of the physical cells required in an SLC device, so my 120 gb TLC SSD would actually hold only 40 gb of NAND cells.

High-use enterprise SSDs used to be the province of the SLC, with it's greater speed, endurance, reliability and read/write capabilities, MLC and TLC are gaining acceptance for enterprise use. The end-user consumer SSD market gets the cheaper higher capacity but slower and more fragile multiple level cells.


Why is Nothing One?

Anyone still following this may have noticed a common factor in both single and multi-level cells, in that an empty cell - where the floating gate has no charge - represents one. Unlike HDDs, where any bit pattern can be written anywhere, a default logical state of ones is present on an empty SSD page. This is because there is only one programming function on the cells, to move electrons across the floating gate. NAND flash cells can only be programmed to a state of zero, there is no ability to program a one. With multi-level calls the default is still one across all pages, but a logical one can be represented even after the cell has been programmed and there are electrons present across the gate.

Ever since Fibonacci introduced the Hindu-Arabic numeral system with its concept of zero into European mathematics in 1202, the human mind associates zero with empty and one with full. To be empty and represent one is rather perplexing, and appears to be mainly from convention (an empty state could represent zero but would required inverters on the data lines). Possibly the circuitry is less complex, and possibly the ability of an empty cell to conduct a charge implies that it is a one.


They're all SLC anyway:

After all this it's perhaps worth emphasising that NAND flash, whatever its intended use, is all physically SLC. If you could look into a TLC cell you wouldn't see 101, or 011, or whatever. There can only ever be one quantity of electrons in a cell, no matter how that quantity is interpreted. The SSD controller knows whether the cells are to be treated as SLC, MLC etc and programs them accordingly, measures the electron count, and determines what logical value it represents. But even quad cells can only contain one value, just as do SLC cells.


The Myths and Misconceptions:

And now we come to the myths, misconceptions and the real reason for writing this article, what happens when an SSD page is read, written and rewritten, and how does this affect deleted file recovery? On one hand we have NTFS, designed specifically for HDDs way before SSDs became easily available, NAND flash with its own unique way of operating, and several billion humans with years of ingrained HDD use and expectations. And here, if I haven't already, I shall use SSD interchangeably if incorrectly for NAND flash.


Storage Device Controllers:

All HDDs, SSDs and flash drives have an internal controller. It's the way that the storage device can be, in the words of Microsoft, abstracted from the host. That abstraction is done by logical block addressing, where each cluster capable of being addressed on the storage device is known to the host by an ascending number (the LBA). The storage device controller maps that number to the sectors or pages on the device. To the host this mapping is constant - a cluster remains mapped to the same LBA until the host changes it. On an HDD this relationship is physical and fixed: in its simplist deconstruction an HDD controller just reads and writes whatever sectors the host asks it to. It doesn't have to think about what was there before, it just does what it's told and writes new data on top of the old. It does that because it can, there's nothing preventing a new cluster being written directly on top of the same sectors of an old one. On an SSD it's different.

With an SSD the host still uses the LBA addressing system with the constant reconciliation between LBA and cluster number. It knows that the device is a SSD and has a few tricks to accommodate this, but they will come later. The SSD controller however has many tricks to reconcile the host's file system, written for HDDs, with the demands of NAND flash.


Flash Translation Layer:

The host still uses LBA addressing to address the SSD for read and writes, as it knows no other. These commands are intercepted by the Flash Translation Layer on the SSD controller. The FTL maintains a map of LBAs to physical block addresses, and and passes the translated PBA to the controller. This map is required because unlike an HDD the LBA to PBA relationship is volatile. It's volatile because of the way data is written to NAND flash.

An empty page, with all cells uncharged, contains by default all ones. If a hex editor is used to look at an SSD's empty sectors however, it will be presented with clusters of zeroes. This is because empty pages are not allocated to the LBA/PBA mapping table. Instead, if a read request is issued for an empty page a default page of zeroes is returned. This applies to both unallocated clusters and those which are part of a file: the SSD does not allocate a page and change all its cells from ones to zeroes.


Floating Gate Transistors:

This section might be helpful before plunging into reads and writes, and here cell and (FG)transistor become interchangable (a cell is a transistor). For more, much more, about floating gate MOSFET (Metal-Oxide-Semiconductor Field-Effect Transistors) there is always Wikipedia.

A FGMOS transistor has three terminals, gate, drain, and source. When a voltage is applied to the gate a current can flow from the source to the drain. Low voltages applied to the gate cause the voltage flowing from source to drain to vary proportionally to the gate voltage. At a higher voltage the proportional response stops and the gate closes regardless.

The charge in the floating gate alters the voltage threshold of the transistor, i.e. at what point the gate will close. When the gate voltage is above a certain value, around 0.5 V, the gate will always close. When the voltage is below this value, the closing of the gate is determined by the floating gate voltage.

If the floating gate has no charge then a low voltage applied to the gate closes the gate and allows current to flow from source to drain. If the floating gate has a charge then a higher voltage needs to be applied to the gate for it to close and current to flow. The charge in the floating gate changes how much voltage must be applied to the gate in order for it to close and conduct.


SSD Reads:

There's nothing inherent in the design of NAND flash that prevents reading and writing to and from individual cells. However in line with NAND flash's design goal to be simple and small the standard commands that NAND chips accept are structured such that a page is the smallest addressable unit. This eliminates space that would be needed to hold additional instructions and cell-to-page maps.

To read a single page, and the cells within it, the page needs to be isolated from the other pages within the block. To do this the pages not being read are temporarily disabled.

All cells/transistors in the same block row (a page) are connected in parallel with a Word Line to the transistors' gates. All transistors in the same block column (cell offset) are chained in series with a Bit Line connecting the drain of one to the source of the next. At the end of each bit line is a sense amplifier. When a read takes place a pass-through voltage is applied to all word lines except the page being read. The pass-through voltage is close to or higher than the highest possible threshold voltage and forces the transistors in all pages not being read to close whether they have a stored charge or not. All bit lines are energised with a low current.

The word line for the page being read is given a reference voltage, and all the bit line sense amplifiers read. Transistors holding a high enough electron count will not be closed by the reference voltage, and the bit line current will not pass through the source/drain chain to the sense amplifier. Transistors with no charge, or a charge below the threshold, will be closed by the reference voltage and conduct the bit line current to the sense amplifier. Several reads at varying threshold voltages are required to determine the logical state of a multi-level cell.

(To add, or avoid, more confusion a floating gate transistor can be either open or closed. An open gate does not conduct an electrical charge, and a closed gate does. So if the gate is open nothing can get through, and if it's closed it can. No wonder we're confused.)

It can be seen from this that to read one page in a block requires that all transistors in every page in the block receive either a pass-through or one or more reference voltages. It also appears that this will still apply even if some or all of the other pages in the block are empty. This becomes significant in Read Disturbance below.


Interpreting the results:

It is quite easy to grasp the concept behind reading an SLC. Only one threshold applies to SLC flash so only one test voltage is required - the floating gate either will or will not close. if the threshold voltage closes the gate then the bit line current passes to the sense amplifier and the stored value is one. If it doesn't then it's zero.

Multi-level cells are different, and the reasoning behind the stored value bit order becomes apparent. In a MLC the possible user bit combinations are 11, 10, 00 and 01, separated by three threshold values. To read the most significant (l/h) bit only requires one read, of the middle threshold voltage. If the gate closes then the MSB is one, if it doesn't then the MSB is zero, no matter what is in the least significant bit. To read the LSB (r/h) two reads are required, one of threshold one, and one of threshold three. If the read of threshold one closes the gate then the LSB is set to one and read two is not required. If the gate opens then a read of threshold three is taken. If it closes the LSB is set to zero. If it doesn't then the value is one.

The bit combinations in TLC cells are 111, 110, 100, 101, 001, 000, 010, and 011, separated by seven threshold values, and are more tricky to grasp. The MSB bit again only requires one read of the middle threshold, as in MLC. The central bit requires two reads, at threshold two and six, and the LSB requires four reads, at thresholds one, three, five and seven.

All multi-cell pages based on the MSB (l/h) are treated as SLC, with only one read required to determine the user bit value.


SSD Writes:

The most significant aspect of NAND flash, the widest fork in the HDD/SSD path, and the fundamental, pivotal factor in what follows, is that data can only be written to an empty SSD page. This is not new, nor is it in any way unknown, but it has the greatest implications for data security and recovery.

While SSDs can read and write to individual pages, they cannot overwrite pages, as the voltages required to revert a zero to a one would damage adjacent cells. All writes and rewrites need an empty page. Unlike HDDs, where a compete cluster is written to the disk whatever was there previously, the act of writing an SSD page allocates an empty page with its default of all ones, and an electrical charge is applied to the cells that require changing to zeroes. This is as true for multi-level cells as it is for SLCs, as the no-charge all-ones pattern is either replaced with a charge representing another pattern, or is left alone. This is a once-only process.

When a write request is issued an empty page is allocated, usually within the same block, and the data written. The LBA/PBA map in the FTL is updated to allocate the new page to the relevant LBA. The LBA will always remain the same to the host: no matter which page is allocated the host will never know. This is the same process if the user data is being rewritten or if it is a new file allocation: the only difference is that the rewrite will have slightly more work to do. The old page will be flagged as invalid and will be inacessible to the host, but will still take up space within its block as it cannot be reused.

Whilst it's easy to grasp writing to SLC pages, multi-level cell pages are more difficult to visualise. The controller accumulates new writes in the SSD cache until enough logical pages to fill a physical page are gathered, and then writes the physical page. This entails the fewest writes to the page. If a logical page in a multi-level page is amended it would require a new page to be allocated and all logical pages rewritten, as the individual values in the physical page can't be altered. If a logical page is deleted then I surmise that the deleted logical page is flagged as invalid, and when the block becomes a candidate for garbage collection any valid logical pages are consolidated before writing. In other words a multi-level page, or at least the majority of them, will always contain a full compliment of logical pages.

It's apparent that if NAND flash handles data writes in this way - and it does - the SSD will eventually become full of valid and invalid pages, and performance will gradually slow to a crawl. Although an individual SSD page can't be erased a block can, and this method is used to return blocks to a writable state. To expedite this, and to ensure that a pool of empty blocks is always available for writes, the SSD controller uses Garbage Collection.


Garbage Collection:

Garbage Collection is enabled on the humblest up to the highest capacity SSD: without it NAND flash would be unusable. Garbage Collection is part of the SSD controller and its work is unknown to the host. In its simplest form GC takes a block holding valid and invalid pages, copies the valid pages to a new empty block, updates the LBA mapping tables, and consigns the old block to the invalid block pool. There the block and its pages are reset to empty state, and the block added to the available block pool. Thus a pool of available blocks should always be available for write activity. As long as there is power to the SSD GC will do its work, it cannot be stopped. There are various sophisticated techniques for GC routines, all proprietary and mainly known only to the manufacturers.

When an SSD arrives new from the factory writes will gradually fill the drive in a progressive, linear pattern until the addressable storage space has been entirely written. However once garbage collection begins, the method by which the data is written - sequential vs random - affects performance. Sequentially written data writes whole blocks, and when the data is replaced the whole block is marked as invalid. During garbage collection nothing needs to be moved to another block. This is the fastest possible garbage collection - i.e. no garbage to collect. When data is written randomly invalid pages are scattered throughout the SSD. When garbage collection acts on a block containing randomly written data, more data must be moved to a new block before the block can be erased.


The Garbage Collection Conundrum:

Garbage Collection can either take place in the background, when the host is idle, or the foreground, as and when it is needed for a write. Whilst background GC may seem to be preferrable, it has drawbacks. If the host uses a power-saving mode when idle, GC will either wait for the device to restart with a consequent user delay for GC to complete, or wake the device up and reduce battery life whilst the host is 'idle'. Furthermore GC has no knowledge of the data it is collecting. Inevitably some data will be subject to GC and then be deleted shortly afterwards, incurring another bout of GC and consequent additional and unnecessary writes (write amplification, the ratio of actual writes to data writes). Foreground GC, seemingly the antithesis of performance, avoids the power-saving problems, only incurs writes when they are actually required, and with fast cache and highly developed GC algorithms presents no noticeable performance penalty to the user. The trend in modern GC appears to be foreground collection, or a combination of foreground and background collection.

Based on foreground garbage collection, and that most user activity is random, then the inevitable conclusion is that the SSD will spend most of its life at full capacity, if by that we mean available blocks, even though the allocated space appears to the host to be low.

However there is another potential problem with SSDs, and that is to do with a historical event: the way that file systems were designed.


File Systems - What you see isn't what you get:

Host file systems were designed in the days when HDDs reigned supreme, simply because SSDs had yet to arrive in an available and affordable form. The file system does not take into account the needs of NAND flash. Files are constantly being updated: they get allocated, moved and deleted, and grow and shrink in size. The way the file system handles this is incompatible with the workings of NAND flash.

It's worth emphasising that storage devices are abstracted from the host operating system. Whilst an array of folders and files are displayed by Explorer in a form wholly comprehensible to a human, it's all an illusion. What Explorer is showing is a logical construct created entirely from metadata held within the file system's tables. The storage device controller knows nothing about files or folders, or tables or operating systems: all an HDD or SDD sees are commands to read or write specific sectors, which it does faithfully. An SSD has one advantage over an HDD however, it knows that some pages hold data, and are mapped to an LBA, and some pages are empty, hold no valid data, and are not mapped to an LBA. Conversely an HDD does not need to know this, to an HDD all sectors are the same.


File Deletion:

In NTFS, when a file is deleted the entry in the Master File Table is flagged as such, and the cluster bitmap is amended to flag the file's clusters as available for reuse. The delete process takes place entirely within the MFT and the cluster bitmap. This is perfectly adequate for an HDD, as NTFS can simply reuse the MFT entry and the clusters whenever it wishes. On an SSD the process from NTFS's point is exactly the same, as NTFS has no other way of deleting files. However all the SSD sees is exactly what an HDD would see, updates to a few pages. Neither an HDD nor an SSD knows that it's the MFT and cluster bit map being updated, as they have no knowledge of such things. As there is no activity on the deleted file's clusters, the SSD's pages holding the clusters remain mapped to their LBAs in the FTL. The SSD's FTL has no way of knowing that these pages are no longer allocated by NTFS: to the SSD the pages are still valid and will not be cleaned up by garbage collection.

As these 'dead' pages are allocated to an LBA they could be released when files are allocated or extended and the host uses that LBA. In this case the page will be flagged as invalid and a new page used. However it is inevitable that eventually a significant amount of unused and unwanted baggage which is not flagged for garbage collection will be pointlessly maintained by the SSD controller and be unavailable for reuse. To overcome this, and to correlate the hosts view of allocated and unallocated pages with the SSD's, NTFS from Windows 7 onwards acquired the TRIM command.


SSD Detection:

Although the storage device is abstracted from the File System, to enable some of the file system's SSD tweaks it needs to know whether the device is an HDD or SSD. There are various ways to do this, including querying the rotational speed of the device, which on an SSD should be zero (or perhaps one). This seems the most widely used and most proficient method.



TRIM (it isn't an acronym) is a SATA command sent by the file system to the SSD controller to indicate that particular pages no longer contain live data, and are therfore candidates for garbage collection. TRIM is only supported in Windows on NTFS volumes. It is invoked on file deletion, partition deletion, and disk formatting. TRIM has to be supported by the SSD and enabled in NTFS to take effect. The command 'fsutil behavior query disabledeletenotify' returns 0 if TRIM is enabled in the operating system. It does not mean that the SSD supports it (or even if an SSD is actually installed) but all modern SSDs support a version of it.

There are three different types of TRIM defined in the SATA protocol and implemented in SSD drives. Non-deterministic TRIM: where each read command after a TRIM may return different data; Deterministic TRIM (DRAT): where all read commands after a TRIM return the same data (i.e. become determinate) and do not change until new data is written; and Deterministic Read Zero after TRIM (DZAT): where all read commands after a TRIM return zeroes until the page is written with new data. By the way whilst DRAT returns data on a read it is not the userdata that was ptrviously there bafore the TRIM: it is random.

Fortunately Non-Deterministic TRIM is rarely used, and Windows does not support DRAT, so a read of a trimmed page - which is easily done with a hex aditor - invokes DZAT and returns zeroes immediately after the TRIM command is issued. The physical pages may not have been cleaned immediately following the TRIM command, but the SSD controller knows that there is no valid data held at the trimmed page address.

TRIM tells the FTL that the pages allocated to specific LBAs are to be classed as invalid. When a block no longer has any free pages, or a specific threshold is reached, the block is a candidate for garbage collection. Live data is copied to a new empty block, and the original block is erased and made available for reuse.

TRIM is an asynchronous command that is queued for low-priority operation. It does not need or send a response. The size of the TRIM queue is limited and in times of high activity some TRIM commands may be dropped. There is no indication that this takes place, so some unwanted pages may escape garbage collection.



Windows Defragger - now called Storage Optimiser - has an option to Optimise SSDs. This does not defrag the SSD but sends a series of TRIM commands to all unallocated pages identified in NTFS's cluster bitmap. This global TRIM (or RETRIM) command is run at a granularity that the TRIM queue will never exceed its permitted size and no RETRIM commands will be dropped. A RETRIM is run automatically once a month by the storage optimiser.



All NAND flash devices use over-provisioning, additional capacity for extra write operations, controller firmware, failed block replacements, and other features utilised by the SSD controller. This capacity is not physically separate from the user capacity but is simply an amount of space in excess of that which can be allocated by the host. The specific pages within this excess space will vary dynamically as the SSD is used. According to Seagate, the minimum reserve is the difference between binary and decimal naming conventions. An SSD is marketed as a storage device and its capacity is measured in gigabytes (1,000,000,000 Bytes). NAND flash however is memory and is measured in gibibytes (1,073,741,824 bytes), making the minimum overprovisioning percentage just over 7.37%. Even if an SSD appears to the host to be full, it will still have 7.37% of available space with which to keep functioning and performing writes (although write performance will be diabolical). Manufacturers may further reduce the amount of capacity available to the user and set it aside as additional over-provisioning, in addition to the built-in 7.37%. Additional over-provisioning can also be created by the host by allocating a partition that does not use the drive's full capacity. The unallocated space will automatically be used by the controller as dynamic over-provisioning.

My humble WD SSD has four 32 gb chips but a specified capacity of 120 gb, meaning that it has 8 gb set aside as additional over-provisioning. Add this to the 7.37% minimum (9.4 gb) and the 17.4 gb equates to almost 15% over-provisioning space.


Wear Levelling:

Some files are written once and remain untouched for the rest of their life. Others have few updates, some very many. As a consequence some blocks will hardly ever see the invalid block pool and have a very low erase/write count, and some will be in the pool every few minutes and have a very heavy count. To spread the wear so that all blocks are subject to erase/writes equally, and the performance of the SSD is maintained over its life, wear levelling is used. Wear levelling uses algorithms to indentify blocks with the lowest erase count and move the contents to high erase count blocks; and to select low erase count blocks for new allocations. As with garbage collection, wear levelling is far more complex than I could possibly deduce, let alone explain.


Read Disturbance:

SSD reads are not quite free, there is a price to pay. As described above, a read of one page generates a pass-through voltage on all other cells in the block. This voltage is likely to be below the highest threshold value that could be held by the cell, but it still generates a weak programming effect on the cells, which can unintentionally shift their threshold voltages. The pass-through voltage induces electric tunnelling that can shift the voltages of the unread cells to a higher value, disturbing the cell contents. As the size of flash cells is reduced the transistor oxide becomes thinner and in turn increases this tunnelling effect, with fewer read operations required to neighbouring pages for the unread flash cells to become disturbed, and move into a different logical state. Cells holding lower threshold values are more susceptible to read disturbance.

Thus each read can cause the threshold voltages of other unread cells in the same block to shift to a higher value. After a significant amount of reads this can cause read errors for those cells. A read count is kept for each block and if it is exceeded the block is rewritten. The count is high for SLC cells, around 1m, lower for 25 nm MLC at around 40,000, and much lower for 15 nm TLC cells.


File Recovery:

And now we come to deleted file recovery. NTFS goes through exactly the same process to delete a file on an SSD as it does on an HDD, with the exception of the additional TRIM command. And the TRIM command (assuming it's executed) and a few SSD quirks destroys any practicable chance of deleted file recovery.

TRIM commands, as described above, have a complimentary setting within the SSD controller in the form of DRAT and DZAT. (I don't believe that non-deterministic TRIM is used in any reputable SSD, and I don't think that Windows supports DRAT, but I have no proof.) The implementation of DZAT means that immediately on successful execution of the TRIM command (which will in most cases be immediately on file deletion) any attempt to read the TRIMed page will return zeroes. The data on the page will still exist until the block is processed by the garbage collector, but that data is not accessible from the host by any practicable means, or any general software.

Garbage collection is independent of the host device and will be invoked at the will of the SSD's controller. Once the process is started it cannot be stopped, apart from powering off the SSD. Once powered up again the garbage collector will resume its duties to completion.

Deleted file recovery on a modern SSD is next to impossible for the end user, and under Windows as close to impossible as you can get. A theoretical examination of the chips would most likely show compressed and encrypted data, striped over multiple blocks, and no possibility of relating one page of data to another across the multiple millions of pages. There is a very small possibility of recovering recently deleted files by powering off the SSD immediately and sending it to a professional data recovery company. They may recover some data, given enough time and money.

After a session of file deletion, such as running Piriform's CCleaner, run Recuva on the SSD. The headers of the deleted files found (and presumably the rest of the file) will all be zeroes. This is TRIM and DZAT doing their work in a few seconds, killing any chance of deleted file recovery. Of course TRIM can be disabled, at the cost of performance, but it's probably better to be a little less cavalier when deleting files that might be wanted later.


Deletd File Security:

The notion of secure file deletion - overwriting a file's data before deletion - is irrelevant, and if any other pattern except zeroes is chosen is just additional and pointless wear on the SSD. Even overwriting with zeroes will cause transaction log and other files to be written, so secure file deletion on an SSD should never be used. Wiping Free Space is far worse for pointless writes, and is even more futile than secure file deletion. The deleted files just aren't there any more.


The OCZ Myth:

Some years ago (as a little light relief to all these acres of text) the OCZ forums were buzzing with the latest method of regaining performance on their SSDs: run Piriform's CCleaner Wipe Free Space, with one overwrite pass of zeroes. Although performance may have been regained, logic, and common sense, went out of the window. The theory was that overwriting the pages with zeroes was equivalent to erasing blocks (this was before the days of TRIM). This was nonsense, and should have been apparent from the start. The default state of an empty page is all ones, not zeroes, and how could a piece of software possibly erase NAND flash?. The real reason was that as CCleaner was filling the pages with zeroes the SSD controller simply unmapped the pages and showed default pages of zeroes to the host. The invalid pages were then candidates for garbage collection, which gave a much greater pool of blocks to call upon on writes, and hence a better performance. A sort of RETRIM before that was invented.


SSD Defragmentation:

One of the SSD mantras is that an SSD should never be defragged. Whilst there is little (there is a little) to be gained from rearranging clusters into adjacent pages - an SSD has no significant overhead in random reads - an SSD defrag is not entirely verboten. In fact from Windows 8 onwards the Storage Optimiser will defrag an SSD if certain conditions are met. If System Restore is enabled, the fragmentation level is above 10%, and at least one month has passed since the last defrag, Windows Storage Optimiser Scheduled Maintenance will defrag the SSD. This is what Microsoft calls a Traditional Defrag, it is not an Optimise (RETRIM). The defrag is required to reduce the extents on the volume snapshot files when system restore is enabled.

There is nothing to be afraid of in a monthly defrag. Most users won't hit the 10% fragmented criteria so a simple RETRIM will be run, and Windows 10 users won't get defragged anyway (System Restore is disabled in Widows 10 by default). The reduction in life of an SSD will not be noticed. Furthermore, although SSDs are not fazed by random reads, files do get fragmented and that means a significant increase in I/Os. An occasional clearup is a boon.


SSD Lifetime: There are many users worried about the life expectancy of their SSDs. Yes, continuous write/erase cycles, and the added and unseen write amplification, do take a toll on the life of NAND flash. Using an SSD does wear it out. My WD Green 120 gb SSD, a TLC SSD from a reputable manufacturer but at the very lowest cost, has an estimated life of 1 million+ hours and a write limit if 40 terabytes. One million hours is 114 years, so we can forget that. As for writes, at 1 gb a day - far more than my current rate of data use - it would take the same 114 years to reach 40 tb. Even with massive write overhead this SSD is not going to wear out in the forseeable future. If all 128 gib of available flash is used equally, the 40 tb equates to 312 writes per cell, a very conservative number.


The End:

The only thing to add is that NAND flash, SSDs, and especially SSD controllers, are far more sophisticated, complex and incomprehensible than what has been written here, what I know, what I could possibly comprehend, and what I could possibly explain. I should also add secret, as their software is proprietary. Whilst an HDD is a marvel of complex electro-mechanical engineering at a ridiculously low cost, the SSD is an equally marvellous and complex piece of electronics and software at a minimally higher cost. We should be thankful for both.



You can return to my home page here

If you have any questions, comments or criticisms at all then I'd be pleased to hear them: please email me at kes at kcall dot co dot uk.

© Webmaster. All rights reserved. Last modified on Friday January 24th, 2020.