What is so special about cross-directory rename of a directory?
This short note details the issues raised by the filesystem operation of renaming a directory from its parent directory into some other directory.
Background
A filesystem typically allows a nested collection of directories: the root directory contains subdirectories, which in turn contain subdirectories, and so on.
The on-disk state of the filesystem will typically lag the in-memory state. Indeed, more advanced filesystems may even permit the on-disk state to diverge from the in-memory state, so that the on-disk state may not represent any state that occurred in-memory.
For example, we create a file "foo.txt" then a file "bar.txt". It is quite possible that a filesystem that then crashed might restart in a state where the file "bar.txt" exists, but "foo.txt" does not, even though this state was never seen in-memory.
Implementations of filesystems would like to treat every file and every directory as independent objects. Then operations on each object can be reordered before being made persistent. If we want maximum performance, we allow arbitrary reorderings of operations between independent objects. Even for a particular object, we might allow significant reordering of operations: for example, the two operations of adding an entry "foo.txt" to a directory, then adding "bar.txt", might be reordered. The more reordering we allow, the looser the specification, and the faster the possible implementations.
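The effect of reordering on the set of possible post-crash states can be sketched with a toy model. This is only an illustration, not any real filesystem: the function name crash_states, and the idea of modelling each operation as "add one name to a single directory", are assumptions made for the example. If order is preserved, only a prefix of the operation sequence can be on disk after a crash; if arbitrary reordering is allowed, any subset can.

```python
from itertools import combinations

def crash_states(creates, reordering_allowed):
    """Return the directory contents that could be on disk after a crash.

    `creates` is the in-memory sequence of file creations (hypothetical
    model: each operation adds one name to a single directory).
    """
    if reordering_allowed:
        # Any subset of the operations may have reached disk.
        return {
            frozenset(subset)
            for k in range(len(creates) + 1)
            for subset in combinations(creates, k)
        }
    # Order preserved: only a prefix of the sequence can be on disk.
    return {frozenset(creates[:k]) for k in range(len(creates) + 1)}

ops = ["foo.txt", "bar.txt"]
ordered = crash_states(ops, reordering_allowed=False)
reordered = crash_states(ops, reordering_allowed=True)

# The state "bar.txt exists but foo.txt does not" is only possible
# when reordering is allowed.
only_bar = frozenset({"bar.txt"})
print(only_bar in ordered, only_bar in reordered)
```

The looser model admits strictly more post-crash states, which is exactly the freedom an implementation can exploit for performance.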
Cross-directory rename of a directory introduces performance problems
This all seems fine, except that the operation of renaming a directory "d" from a source directory src to a different destination directory dst introduces some problems.
In order to perform this operation, the filesystem has to check that the directory d is not being renamed to a subdirectory of itself. This is a global check, which involves many filesystem objects.
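A minimal sketch of this check, assuming the hierarchy is represented as a map from each directory to its parent (the names ROOT, parent and is_valid_dir_rename are hypothetical, chosen for illustration). The check walks upwards from dst to the root, so it may touch every directory on that path, which is why it is not local to src, dst and d.

```python
ROOT = "/"

def is_valid_dir_rename(parent, d, dst):
    """Return True if directory `d` may be moved into directory `dst`.

    `parent` maps each directory to its parent directory (hypothetical
    representation; ROOT has no entry). The move is invalid if `dst` is
    `d` itself or lies anywhere beneath `d`.
    """
    node = dst
    while True:
        if node == d:
            return False  # dst is d, or a descendant of d
        if node == ROOT:
            return True   # reached the root without meeting d
        node = parent[node]

# Tree: /a, /a/b, /c (names are purely illustrative)
parent = {"a": ROOT, "b": "a", "c": ROOT}

print(is_valid_dir_rename(parent, "a", "c"))  # moving a into c
print(is_valid_dir_rename(parent, "a", "b"))  # moving a into its own child b
```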
It is not sufficient to perform this check on the in-memory state of the filesystem: we may find that the rename is valid, and perform the rename, but then crash. Because the on-disk state does not match the in-memory state, it might be the case that in the on-disk state the rename is invalid, and after restart we find that the directory has been renamed to a subdirectory of itself and the basic filesystem wellformedness property (that the directory hierarchy forms a tree) is broken.
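One concrete instance of this scenario can be sketched as follows, again using a hypothetical parent-pointer model (the helper names and the directories a and b are assumptions for illustration). Starting from /a/b, the in-memory sequence first moves b up to the root and then moves a into b; each step is valid when checked in memory. If the second rename reaches disk but the first does not, the on-disk state has a and b as each other's parent.

```python
ROOT = "/"

def reaches_root(parent, node):
    """Follow parent pointers from `node`; False if we loop before ROOT."""
    seen = set()
    while node != ROOT:
        if node in seen:
            return False
        seen.add(node)
        node = parent[node]
    return True

def is_tree(parent):
    """True if every directory is connected to ROOT via parent pointers."""
    return all(reaches_root(parent, d) for d in parent)

def apply_rename(parent, d, dst):
    """Return a new parent map with directory `d` moved into `dst`."""
    new = dict(parent)
    new[d] = dst
    return new

on_disk = {"a": ROOT, "b": "a"}  # initial state: /a and /a/b
op1 = ("b", ROOT)                # move b from a up to the root
op2 = ("a", "b")                 # move a into b (valid once op1 is applied)

in_memory = apply_rename(apply_rename(on_disk, *op1), *op2)
after_crash = apply_rename(on_disk, *op2)  # op2 persisted, op1 lost

print(is_tree(in_memory), is_tree(after_crash))
```

In the recovered state neither a nor b is reachable from the root, so the tree property is broken even though every in-memory check succeeded.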
For this reason, a filesystem must treat this operation carefully.
Filesystems using a log of operations
Almost all (all?) filesystems which are crash-safe use a log of operations so that, indeed, the on-disk state lags the in-memory state. Then it is safe to perform the rename check using the in-memory state, since we know that operations will not be reordered before they reach disk, so that the scenario outlined above cannot occur.
The problem with using a log in this way is that it significantly reduces performance. For example, suppose we have two processes, one modifying files under "/dir1/" and the other under "/dir2/". If the first process issues a sync, then it will block until its last operation (and all previous operations) have completed. Since no reordering of operations is allowed, this sync will also flush all operations that have been made by the second process up to that point. So the behaviour of one process affects the performance of another, which is not ideal. The standard example of this is process A creating hundreds of files as part of downloading a large media file (say), while another process B (a text editor, say) makes a very small change to a very small text file and tries to sync the result to disk. In most filesystems this small change to a text file by B will also require the sync of the hundreds of files made by A.
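The coupling between processes can be sketched with a toy model of a single, strictly ordered log. The class GlobalLog and its methods are hypothetical and do not correspond to any particular filesystem's journal; the point is only that a sync from one process must flush every earlier entry, whoever wrote it.

```python
class GlobalLog:
    """Toy model of a single, strictly ordered operation log."""

    def __init__(self):
        self.pending = []  # (process, operation) pairs not yet on disk
        self.on_disk = []  # entries already flushed, in log order

    def append(self, process, operation):
        self.pending.append((process, operation))

    def sync(self, process):
        """Flush up to and including the last pending entry of `process`.

        Returns the entries written. Earlier entries from other processes
        are flushed too, because log order must be preserved.
        """
        last = max(
            (i for i, (p, _) in enumerate(self.pending) if p == process),
            default=-1,
        )
        flushed = self.pending[: last + 1]
        self.on_disk.extend(flushed)
        self.pending = self.pending[last + 1:]
        return flushed

log = GlobalLog()
for i in range(300):                      # process A: the downloader
    log.append("A", f"create /dir1/part{i}")
log.append("B", "write /dir2/notes.txt")  # process B: the text editor

flushed = log.sync("B")
from_a = sum(1 for p, _ in flushed if p == "A")
print(len(flushed), from_a)
```

Here B's single small write cannot reach disk until all of A's earlier entries have been written as well.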
Summary
Filesystems which use a log to ensure crash consistency are typically giving up opportunities for optimizations (because the "specification", that the on-disk state lag the in-memory state, is very strong). Moreover, because of this strong specification, the behaviour of one process is likely to impact on other processes using the filesystem, so that processes are not effectively separated from each other (in terms of performance).