Aug 30, 2009

SSD vs. Linux

An interesting quote on the risks of using SSD drives (especially under Linux):
<...>The bug is an assumption in most standard Linux filesystems (including ext2 and ext3). They all assume that the update granularity of the underlying block device they're writing to is at least as fine as the filesystem's granularity. (i.e. that the smallest write you can do to the block device is the same size or smaller than a filesystem block.) Oh, and also that the alignment works out, so that data you're not writing doesn't get destroyed as "collateral damage" by a failed write operation.

This isn't true for flash, which has "erase blocks" up to a couple megabytes each. As with rewritable CDs, you have to do an "erase" pass on an area of flash memory before it can have new data written to it. You can't erase just part of an erase block; it's all or nothing.
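
To get a feel for how many sectors share a single erase block, here is a small arithmetic sketch. The 1 MiB erase block size and the assumption that erase blocks are aligned to sector 0 are illustrative guesses, not values taken from any particular device; `erase_block_span` is a hypothetical helper.

```python
SECTOR_SIZE = 512                # bytes per sector as seen by the host
ERASE_BLOCK_SIZE = 1024 * 1024   # ASSUMPTION: 1 MiB erase block; real sizes vary


def erase_block_span(sector, sector_size=SECTOR_SIZE,
                     erase_block_size=ERASE_BLOCK_SIZE):
    """Return (first_sector, last_sector) of the erase block holding `sector`.

    ASSUMPTION: erase blocks start at sector 0 and are laid out contiguously.
    """
    sectors_per_block = erase_block_size // sector_size
    first = (sector // sectors_per_block) * sectors_per_block
    return first, first + sectors_per_block - 1


# Every sector in this range lives or dies together with sector 5000.
first, last = erase_block_span(5000)
print(f"sector 5000 shares an erase block with sectors {first}..{last}")
```

Under these assumptions a single 512-byte sector has 2047 neighbours that can only be erased along with it.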

Actual Linux flash filesystems (like jffs2) are aware of this. That's why you can't mount them on an arbitrary block device, you have to feed them actual raw flash memory hardware (or something that emulates it) so they can query extra information (erase block boundaries). They cycle through the erase blocks and keep one blank at all times, copying the old data to the blank one before making the old one the new blank one. (That's why mounting them is so slow, the filesystem has to read all the flash erase blocks to figure out which one's newest, and which order to put them in.)
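
The "keep one blank, copy, then erase the old one" rotation can be sketched as a toy model. This is not jffs2's actual on-flash format; the class `ToyLogFlash`, its sequence numbers, and the `crash_before_erase` flag are all made up for illustration.

```python
class ToyLogFlash:
    """Toy model: each erase block is either blank (None) or (sequence, payload)."""

    def __init__(self, num_blocks):
        self.blocks = [None] * num_blocks
        self.seq = 0

    def update(self, old_index, payload, crash_before_erase=False):
        """Write the new copy into a blank block, then erase the old copy."""
        target = self.blocks.index(None)       # find a blank erase block
        self.seq += 1
        self.blocks[target] = (self.seq, payload)
        if crash_before_erase:
            return target                      # power lost: old copy still there
        if old_index is not None:
            self.blocks[old_index] = None      # old block becomes the new blank
        return target

    def newest(self):
        """Mount-time scan: read every block and pick the highest sequence."""
        live = [b for b in self.blocks if b is not None]
        return max(live, key=lambda b: b[0])[1] if live else None


flash = ToyLogFlash(3)
first = flash.update(None, "v1")
flash.update(first, "v2", crash_before_erase=True)
print(flash.blocks)    # both the old and the new copy are present
print(flash.newest())  # the scan still finds the most recent one
```

Because the new copy is written before the old one is erased, an interruption at any point leaves at least one intact copy, and the scan in `newest` is the reason mounting has to look at every erase block.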

But flash USB keys pretend to be normal USB hard drives, which you format FAT, or ext3 or some such. And when you write to them, what they're actually doing internally is reading a megabyte or so of data into internal memory, erasing that block of data, and then re-writing it with your modifications. Generally they'll cache some writes first so they're not rewriting a whole megabyte of flash every time you send another 512 byte sector (which would not only be insanely slow but would run through the limited number of flash write cycles pretty quickly).
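
The read, erase, rewrite cycle described above can be simulated in a few lines. This is only a sketch of the idea, not how any real controller firmware works: `ToyFlashDrive` is a hypothetical class, the 4 KiB erase block is deliberately tiny, and there is no write cache.

```python
SECTOR = 512
ERASE_BLOCK = 4096   # ASSUMPTION: tiny erase block so the example stays readable
ERASED = 0xFF        # erased flash typically reads back as all ones


class ToyFlashDrive:
    """Pretends to be a sector-addressable disk on top of erase blocks."""

    def __init__(self, num_erase_blocks):
        self.flash = bytearray([ERASED]) * (num_erase_blocks * ERASE_BLOCK)
        self.erase_count = [0] * num_erase_blocks

    def write_sector(self, sector, data):
        assert len(data) == SECTOR
        block = (sector * SECTOR) // ERASE_BLOCK
        start = block * ERASE_BLOCK
        # 1. Read the whole erase block into internal memory.
        buf = bytearray(self.flash[start:start + ERASE_BLOCK])
        # 2. Patch the one sector the host asked to change.
        offset = sector * SECTOR - start
        buf[offset:offset + SECTOR] = data
        # 3. Erase the whole block (all or nothing).
        self.flash[start:start + ERASE_BLOCK] = bytes([ERASED]) * ERASE_BLOCK
        self.erase_count[block] += 1
        # 4. Program the patched copy back.
        self.flash[start:start + ERASE_BLOCK] = buf

    def read_sector(self, sector):
        return bytes(self.flash[sector * SECTOR:(sector + 1) * SECTOR])


drive = ToyFlashDrive(num_erase_blocks=2)
for s in range(8):                       # eight sectors = one erase block here
    drive.write_sector(s, bytes([s]) * SECTOR)
print(drive.erase_count)                 # erase cycles spent per block
```

Without any caching, eight consecutive sector writes cost eight erase cycles on the same block, which is the wear and slowness the quote mentions.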

This sounds fine... until you unplug one during a write and the device loses power between the internal "erase" and "rewrite". Suddenly, the sector you were writing to might be ground zero in the middle of a megabyte of zeroes. You can lose data before, after, or on both sides of the sector it was updating.
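
Here is a minimal simulation of that failure, again under made-up parameters (a 4 KiB erase block, a hypothetical `rewrite_sector` function, and a `PowerLoss` exception standing in for the unplug). The erased area is modelled as 0xFF bytes, which is what raw erased flash typically reads as; what the host actually sees depends on the controller.

```python
SECTOR = 512
ERASE_BLOCK = 4096   # ASSUMPTION: tiny erase block for illustration
ERASED = 0xFF


class PowerLoss(Exception):
    """Stands in for the device being unplugged mid-operation."""


def rewrite_sector(flash, sector, data, fail_after_erase=False):
    start = (sector * SECTOR // ERASE_BLOCK) * ERASE_BLOCK
    buf = bytearray(flash[start:start + ERASE_BLOCK])   # copy held only in RAM
    offset = sector * SECTOR - start
    buf[offset:offset + SECTOR] = data
    flash[start:start + ERASE_BLOCK] = bytes([ERASED]) * ERASE_BLOCK
    if fail_after_erase:
        raise PowerLoss()            # the RAM copy is gone along with the power
    flash[start:start + ERASE_BLOCK] = buf


flash = bytearray(b"A" * (2 * ERASE_BLOCK))   # 16 sectors of known-good data
try:
    rewrite_sector(flash, 3, b"B" * SECTOR, fail_after_erase=True)
except PowerLoss:
    pass

damaged = [s for s in range(len(flash) // SECTOR)
           if flash[s * SECTOR:(s + 1) * SECTOR] != b"A" * SECTOR]
print(damaged)   # sectors whose old contents are gone
```

The host only asked to change sector 3, yet every sector in the same erase block, on both sides of it, has lost its contents.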

Journaling doesn't protect you from this, because it was built on the assumption that data you weren't writing to didn't change. The "collateral damage" effect of flash erase blocks undermines what journaling is trying to do, by violating the design assumptions of conventional Linux filesystems. In fact, if the journal isn't what got zeroed, the journal replay may complete happily and mount the filesystem and hide the fact anything went wrong... until you try to use the missing data, anyway. (Not that using a non-journaled filesystem would actually improve matters, but it's less likely to give you a false sense of security.)
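
A toy model makes the point concrete. This is a drastic simplification and not how ext3's journal is actually structured: the disk is a dict, `None` stands for an erased sector, `replay` is a hypothetical function, and the journal is assumed to live in a different erase block that survived.

```python
def replay(journal, disk):
    """Re-apply every journaled write; the journal knows nothing else."""
    for sector, data in journal:
        disk[sector] = data


# Eight sectors that all sit in one erase block (ASSUMPTION for this sketch).
disk = {s: f"old-{s}" for s in range(8)}

# The filesystem journals its intent to update sector 3.
journal = [(3, "new-3")]

# Power is lost between the device's internal erase and rewrite:
# the whole erase block is wiped, not just sector 3.
for s in range(8):
    disk[s] = None

# On the next mount the journal replays without complaint.
replay(journal, disk)

lost = [s for s, contents in disk.items() if contents is None]
print(disk[3])   # the journaled write was restored
print(lost)      # neighbours the journal never mentioned are still gone
```

Replay restores exactly what the journal recorded and reports success, while the seven neighbouring sectors stay lost because nothing in the journal refers to them.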

For further reading, see Linux filesystems expert Valerie Aurora's excellent post on why she personally doesn't want a flash disk in her laptop.

Rob Landley.
