How to safely append to a file? (Can we?)

[NOTE: this post was inspired by looking at the mirage/index code and trying to understand the properties that code provides in a crash scenario]

In systems with reasonable claims to robustness in the event of system crash, the issue arises that we want to write to a file in an "atomic" way. Let's simplify by considering appending data to a file. Let's also assume that the filesystem performs blk-sized blk-aligned writes atomically (but multiple such writes may be reordered). Simplify further by assuming we only want to append a single block of data to a file at a position that is a multiple of the block size.

What can go wrong? 

The problem is that there are many moving parts, each of which can result in incorrectness. Here are some possibilities:

We append a new block at the end of the file (at a position that is a multiple of the blk size). This should be atomic. Has the FS (filesystem) freelist correctly recorded that the block is now in use? Has the FS updated the inode to record the new length of the file? Has the new block been linked into the file blocklist in the correct way? Has the new block been filled with the correct data? Has all of this happened "atomically" wrt the underlying block device, taking into account the caches at each level?

Let's assume that the freelist functions correctly, that is, a block allocated to a file is definitely recorded as allocated in the freelist even through a crash. Then the block cannot be later allocated to another file and somehow receive new data etc. Let's also assume that the filesystem+block device are resistant to eg data corruption issues.

We are left with issues relating to the file object itself. 

Let's assume that the block is zero-ed before being allocated to the file. If the operations are non-atomic, we may find post-crash that we get an appended block with zeros in it. This seems like an error that might well occur. Let's call it E_GARBAGE_APPEND.

If the block is not correctly allocated to the file data list, then we may get a dangling block pointer post-crash. Since this would likely be considered an FS error, let's assume this doesn't happen, i.e., that the allocation of the block from the freelist, and the linking of that block into the file's data, happens atomically wrt crashes, and precedes the update to the file size.

Recap: we perform the append of a block at a position that is a multiple of the block size. The block may contain garbage or not. The file size may be increased or not. Apart from that, there is not much else that can go wrong.

How can we guard against the E_GARBAGE_APPEND issue? The problem is really that the FS did not commit the file size change atomically with the block write (and with the filling of the block with the correct data). One solution is for the application to record the file size separately (say, as a header in the first bytes of the file itself). Then we can ignore the FS size, and use the header instead as the definitive source of information about the file size. Then to append a block, we perform the append as before, issue a write barrier, and then update the size in the header. This ensures that, for the size to update on disk, the correct contents of the buffer must have been written to disk previously. 

This trick also works to allow non-block-sized appends at positions that are not block-size-multiples.

One downside of this approach is that the barrier operations are most likely implemented by fsync on the underlying file. fsync is expensive. This approach requires 1 fsync for each append (2 if we want to ensure the updated size in the header actually makes it to disk). Applications probably find this too expensive. Thus, it is likely that appends will be applied without updating the header size. At some point (in mirage/index this is determined by whether the amount of data that has not been flushed is larger than some set amount), we can issue an fsync and update the size (and perhaps issue another fsync). Thus, the fsync cost is amortized over multiple appends. The trade-off is that these appends are not made persistent-on-disk until the size update is made persistent on disk.

This approach is quite "pessimistic" in that it takes steps to prevent the bad behaviours happening. An alternative is to be "optimistic" and try to detect and recover from bad behaviours.

In this approach, appends to a file would be followed by a checksum (cryptographic hash) of the appended data and the previous checksum (with some nonce checksum to start off). After a crash, we could read correct entries up to the first invalid checksum. A problem with this approach is that, since filesystems can reorder writes fairly aggressively, it may be that the first invalid checksum appears very early in the file. To combat this, we can issue fsyncs relatively regularly (every 10 seconds say). This would ensure that we lose at most 10 seconds worth of appends. Clearly the guarantees here are more subtle (because they involve "time"). An alternative would be to issue an fsync every 10 appends (say). This is a relatively strong guarantee (we can read all but possibly the last 10 appends), which avoids the need to flush on every append.

Finally, we might decide this is all too much, and just use a database such as LMDB or SQLite. However, these likely require an fsync for each transaction, so that if we issue an append as a single transaction, performance will be similar to the fsync-every-append scenario. So even here it looks like some form of batching is necessary. An advantage is that we don't have to deal with the "size in header" complications from the pessimistic approach, or the "hashing" required for the optimistic approach.


Addendum 1: Reading about LMDB, fsync vs fdatasync, here: https://www.openldap.org/lists/openldap-devel/201410/msg00008.html the author describes that a possible fix to delayed metadata update is for LMDB itself to track the allocated length of the file internally (as we suggest above). The issue is that fdatasync does not atomically increase the file size if the sync causes the file to grow.


Addendum 2: An implementation of these ideas in OCaml can be found here: https://github.com/tomjridge/kv-hash/blob/master/src/log_file.ml


Comments

Popular posts from this blog

What is so special about cross-directory rename of a directory?

The Economics of Writing (Good) Code Documentation

A model of file update