A model of file update

I wanted to document (part of) the model of file behaviour that I use. Understanding this model is important for anyone writing reliable, crash-safe systems.

Simple scenario, no file length update

We consider the case of a single file, whose existence at a given path has been made persistent (i.e., after a crash, the file will still be found at that path). 

We restrict to the case where we write over existing data, without extending the file length (again, the file length is assumed to already be persistent). 

The model of the file data is as follows: there is an array of bytes which represents the data which is definitely on disk. There is also a "partial" array of bytes (some entries may be missing) which have been written by the application (e.g. via a call to pwrite), but which have not yet made it to disk, and so are "pending".

An example: the file contents are initially: abcdefghij (10 bytes), and we use pwrite to write hello (5 bytes) at offset 2. Then the pending array is: ..hello... (where I have used a . character to indicate that the byte at that position is missing). Using a fixed-width font:

..hello... <- pending array

abcdefghij <- on disk array
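The two arrays can be sketched directly in code. This is a toy simulation of the model, not a real filesystem API; the class and method names are my own invention:

```python
class FileModel:
    """Toy simulation of the file-update model: an on-disk byte array
    plus a pending map from offsets to not-yet-flushed bytes."""

    def __init__(self, disk: bytes):
        self.disk = bytearray(disk)   # bytes definitely on disk
        self.pending = {}             # offset -> byte value, not yet on disk

    def pwrite(self, data: bytes, offset: int):
        # Writes land in the pending array first, never directly on disk.
        for i, b in enumerate(data):
            self.pending[offset + i] = b

f = FileModel(b"abcdefghij")
f.pwrite(b"hello", 2)
# pending now covers offsets 2..6 (the "..hello..." picture above),
# while the on-disk array is unchanged
assert sorted(f.pending) == [2, 3, 4, 5, 6]
assert bytes(f.disk) == b"abcdefghij"
```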

After the call to pwrite, the bytes in the pending array may move to persistent storage on disk at any time, or they may not. For example, it may be that the h and the first l move to persistent storage:

...e.lo... <- pending array

abhdlfghij <- on disk array

The order in which the bytes flush is arbitrary in the model. In real-world filesystem implementations, the file data is usually handled in blocks of 4096 bytes, and individual blocks (rather than individual bytes) flush to disk in arbitrary order. Our model is slightly weaker than this, but if a program is correct wrt. our model, then it is also correct with the stronger guarantees of block-based flushes. Very occasionally a program will depend on the property "block-aligned block-sized writes flush to disk atomically". In this case, we would need to strengthen our model (e.g. use an array of blocks, rather than an array of bytes). 
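Because individual bytes flush in arbitrary order, any subset of the pending bytes may have reached disk at the moment of a crash. A small enumeration (again a toy simulation, not a filesystem API) makes the set of possible post-crash contents concrete:

```python
from itertools import combinations

def crash_states(disk: bytes, pending: dict) -> set:
    """Every subset of the pending bytes may have reached disk before a
    crash, so enumerate all subsets to get the possible on-disk states."""
    states = set()
    offsets = list(pending)
    for r in range(len(offsets) + 1):
        for subset in combinations(offsets, r):
            d = bytearray(disk)
            for off in subset:
                d[off] = pending[off]
            states.add(bytes(d))
    return states

# The running example: "hello" pending at offset 2 over "abcdefghij".
pending = {2: ord("h"), 3: ord("e"), 4: ord("l"), 5: ord("l"), 6: ord("o")}
states = crash_states(b"abcdefghij", pending)
assert len(states) == 2 ** 5          # 32 possible post-crash contents
assert b"abhdlfghij" in states        # the partial-flush state shown above
assert b"abhellohij" in states        # everything flushed
```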

The behaviour of an fsync in our model is simple: After an fsync, all bytes in the pending array will be stored on disk. For example, after an fsync we have:

.......... <- pending array

abhellohij <- on disk array
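Flushing and fsync can both be sketched as operations on the two arrays. This is a toy simulation under the same assumptions as before (disk as a byte array, pending writes as an offset-to-byte map); the function names are illustrative:

```python
import random

def flush_some(disk: bytearray, pending: dict):
    # The filesystem may flush any subset of pending bytes, in any order,
    # at any time, without being asked.
    for off in random.sample(list(pending), k=random.randint(0, len(pending))):
        disk[off] = pending.pop(off)

def fsync(disk: bytearray, pending: dict):
    # After fsync returns, every pending byte is on disk.
    for off, b in list(pending.items()):
        disk[off] = b
    pending.clear()

disk = bytearray(b"abcdefghij")
pending = {2: ord("h"), 3: ord("e"), 4: ord("l"), 5: ord("l"), 6: ord("o")}
flush_some(disk, pending)   # nondeterministic partial flush
fsync(disk, pending)        # now everything is durable
assert bytes(disk) == b"abhellohij"
assert pending == {}
```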

One thing to note about this model is that further writes by the application go to the pending array first, and in general we do not know if any bytes from a previous write have actually made it to disk. For example, suppose I write "01" to the beginning of the file, and then I write "89" to the end. The situation might be:

01......89 <- pending array

abhellohij <- on disk array

And it is perfectly possible for the "89" bytes to flush to disk before the "01". In other words: writes do not necessarily commit to disk in program order.

Of course, later writes will overwrite earlier writes in the pending array at the same positions (but we typically don't know if an earlier write still remains in the pending array, or whether it was flushed to disk). 

More advanced scenario: file length update

Typically we might write to a file and cause the length to be updated. The important thing to realise is that the update to the file's length is also "pending" and not directly linked (in the model) to the size of the arrays. Suppose we start with contents "abcde" (i.e., the file is 5 bytes long) and write another 5 bytes "hello" to the end of the file. Then we have this situation:

.....hello <- pending array (pending length update: 10)

abcde <- on disk array

The question then is: what happens next? Recall that the "on-disk array" models the data that will be recovered after a crash. One possibility is that the length update flushes first, and the file is padded with zeros (it should be the case on all filesystems that new data is initialized to 0). 

.....hello <- pending array 

abcde00000 <- on disk array

Here, in an abuse of notation, we have used the 0 character to indicate that the relevant byte has a value of 0. 

From this point on, bytes flush from the pending array as before.
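This sequence of states can be sketched as follows, under the assumption made above that the length update flushes first, in a single step, zero-filling the new space:

```python
# Toy simulation of a pending length update (an assumption of this sketch:
# the length flushes as one step, padding the new space with zeros).

disk = bytearray(b"abcde")            # on-disk length is 5
pending = {5: ord("h"), 6: ord("e"), 7: ord("l"), 8: ord("l"), 9: ord("o")}
pending_length = 10                   # the length update is itself pending

# One possible next step: the length flushes first, zero-padding the file.
disk.extend(b"\0" * (pending_length - len(disk)))
assert bytes(disk) == b"abcde\x00\x00\x00\x00\x00"

# From here, pending bytes flush as before; after an fsync all are on disk.
for off, b in pending.items():
    disk[off] = b
assert bytes(disk) == b"abcdehello"
```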

---

When an update to length is pending, other behaviours are also possible. For example, perhaps the length is first updated to 8 on disk, then the 3 bytes "hel" flush to disk, then finally the length is updated to 10 and the last 2 bytes flush. Although I have not seen this behaviour in the wild, it would not surprise me at all to be told it was observable with some filesystems. Of course, if there were two individual writes that caused "hello" to be pending, this behaviour is probably not surprising at all. Filesystems that "journal metadata" might be better in this regard (the file length is considered metadata): a single write of "hello" to the end of a file might result in either the file length being 10 after a crash, or it being 5, with no other possibilities observable.

---

The role of fsync is as before: after fsync completes, all pending bytes (and any pending length update!) are stored on-disk and will survive a crash.

The above description gives a flavour of what the formal model would be. There are still some unanswered questions, but the above is probably sufficient for the majority of application writers who want to know what happens when writing to a file and a crash occurs.

There are further complexities which we have not touched on: When creating a file, for example, one typically has to call fsync on the containing directory in order for the new entry to reliably survive a crash. 
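A sketch of that pattern, using the POSIX-style calls exposed by Python's os module (the function name and path handling are illustrative, not a prescribed API):

```python
import os

def create_durably(path: str, data: bytes):
    """Create a file and try to make both its contents and its directory
    entry survive a crash. Illustrative sketch, not a complete recipe."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # flush the file's data and length
    finally:
        os.close(fd)
    # fsync the containing directory so the new entry itself is durable
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```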

More interesting are scenarios such as "cross-directory rename" where a file is renamed from one directory to another. In this case, application programmers would like to assume that after a crash, the file is available under the new path or the old path. However, some filesystems likely do not guarantee this (perhaps there are two links to the file, or perhaps neither the old nor the new path exists and the file is lost, or perhaps the file is preserved in the special directory "lost+found", or something else). 

Within a single directory, renaming one file over another typically IS atomic wrt. crash recovery. The reason is that application writers expect this to work, and will complain bitterly if a filesystem doesn't support it. However, it is certainly debatable whether this is required by POSIX (the standard limits itself to what happens in normal operation, without a host crash; in this case, applications see a rename as atomic... but this may be unrelated to what actually happens on disk and what the resulting recovery state might be after a crash). And of course, renaming a file over another in the same directory doesn't guarantee that the result is recorded on disk - for that you need (again) an fsync on the containing directory.
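Putting the pieces together, the common "write a temporary file, fsync it, rename it over the target, fsync the directory" pattern might look like this. This is a sketch; the temporary-file naming is an assumption, and real code would also handle cleanup on error:

```python
import os

def replace_durably(path: str, data: bytes):
    """Atomically replace a file's contents wrt. crash recovery:
    write a temp file, fsync it, rename over the target, fsync the dir."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # new contents are durable under the temp name
    finally:
        os.close(fd)
    os.rename(tmp, path)          # atomic within a single directory
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)             # make the rename itself durable
    finally:
        os.close(dfd)
```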

Further reading: Of course, the number of web pages that discuss these sorts of issues is huge. A good place to start is ext4 developer Ted Ts'o's blog article "Don't fear the fsync" https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
