Posts

What is so special about cross-directory rename of a directory?

This is a short note detailing the issues raised by the filesystem operation of renaming a directory from its parent directory to some other directory. Background: A filesystem typically allows a nested collection of directories: the root directory contains subdirectories, which in turn contain subdirectories, and so on. The on-disk state of the filesystem will typically lag the in-memory state. Indeed, more advanced filesystems may even permit the on-disk state to diverge from the in-memory state, so that the on-disk state may not represent any state that ever occurred in-memory. For example, suppose we create a file "foo.txt" and then a file "bar.txt". It is quite possible that a filesystem that then crashed might restart in a state where the file "bar.txt" exists but "foo.txt" does not, even though this state was never seen in-memory. Implementation optimizations: Implementations of filesystems would like to treat every file and every directory as independent…
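To make the crash behaviour concrete, here is a minimal OCaml sketch (mine, not from the post) that models the in-memory state after the two creates as a set of file names, and enumerates every subset as a possible on-disk state after a crash, on the pessimistic assumption that the filesystem gives no ordering guarantee between the two creates:

    (* Possible on-disk states after a crash, assuming creates may reach
       disk in any order (or not at all). *)
    module SS = Set.Make (String)

    let in_memory = SS.of_list [ "foo.txt"; "bar.txt" ]

    (* all subsets of [s] *)
    let subsets s =
      SS.fold (fun x acc -> acc @ List.map (SS.add x) acc) s [ SS.empty ]

    let () =
      subsets in_memory
      |> List.iter (fun s ->
             Printf.printf "possible on-disk state: { %s }\n"
               (String.concat "; " (SS.elements s)))

One of the enumerated states contains "bar.txt" but not "foo.txt", i.e., precisely the state that was never seen in-memory.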

Waypoints 2022Q3: An OCaml profiling library in one file

I wrote a "one file" implementation of my favourite profiling technique, "waypoints". The idea of waypoints is that you mark points (waypoints!) in your code, and the profiling library records, for each pair of waypoints, the count (how many times that pair of waypoints was traversed) and the total time (from which it can compute the average time). Here is some example code: there are 5 waypoints, w1... The instrumented code consists of a for-loop wrapping some trivial branching code, with random sleeps at various points. By default, when the program terminates the profiling data is printed to stdout. This tells you (for example) that the w2 branch was taken 7 times, and the w3 branch was taken 3 times. The average time for the w2-w2' section was 3.7 * 10^9, compared with 117 * 10^6 for w3-w3'. Clearly it is worth trying to optimise the w2-w2' section first... The "time" here is measured using the time stamp counter…
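For readers who want the flavour of the technique without the library, here is a minimal sketch of the waypoint idea in OCaml. It is not the library's actual API; it uses Unix.gettimeofday rather than the time stamp counter, and the instrumented loop only loosely follows the example described above (names such as waypoint and report are mine):

    (* Requires the unix library. Each call to [waypoint] charges the time
       since the previous waypoint to the (previous, current) pair. *)
    let tbl : (string * string, int * float) Hashtbl.t = Hashtbl.create 16
    let last = ref None

    let waypoint name =
      let now = Unix.gettimeofday () in
      (match !last with
       | Some (prev, t) ->
           let count, total =
             Option.value (Hashtbl.find_opt tbl (prev, name)) ~default:(0, 0.0)
           in
           Hashtbl.replace tbl (prev, name) (count + 1, total +. (now -. t))
       | None -> ());
      last := Some (name, now)

    let report () =
      Hashtbl.iter
        (fun (a, b) (count, total) ->
          Printf.printf "%s -> %s: count=%d avg=%fs\n" a b count
            (total /. float_of_int count))
        tbl

    (* an instrumented for-loop with trivial branching and random sleeps *)
    let () =
      for _ = 1 to 10 do
        waypoint "w1";
        if Random.bool () then (waypoint "w2"; Unix.sleepf 0.01; waypoint "w2'")
        else (waypoint "w3"; Unix.sleepf 0.001; waypoint "w3'")
      done;
      report ()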

A model of file update

I wanted to document (part of) the model of file behaviour that I use. This model is important to understand for anyone writing reliable, crash-safe systems. Simple scenario, no file length update: We consider the case of a single file whose existence at a given path has been made persistent (i.e., after a crash, the file will still be found at that path). We restrict to the case where we write over existing data, without extending the file length (again, the file length is assumed to already be persistent). The model of the file data is as follows: there is an array of bytes which represents the data that is definitely on disk; there is also a "partial" (some entries may be missing) array of bytes which have been written by the application (e.g. via a call to pwrite) but which have not yet made it to disk, and so are "pending". An example: the file contents are initially abcdefghij (10 bytes), and we use pwrite to write hello (5 bytes) at offset 2. Then the pending…
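As a concrete illustration, here is a small OCaml sketch (my own rendering of the model, not code from the post) in which the on-disk data is a byte array and the pending bytes are a map from offsets to bytes:

    (* [on_disk] is the data definitely on disk; [pending] holds bytes written
       by the application (e.g. via pwrite) that have not yet reached disk. *)
    module IM = Map.Make (Int)

    type file_model = { on_disk : Bytes.t; pending : char IM.t }

    (* pwrite within the existing file length: the written bytes become pending *)
    let pwrite t ~off data =
      let pending = ref t.pending in
      String.iteri (fun i c -> pending := IM.add (off + i) c !pending) data;
      { t with pending = !pending }

    (* the state where every pending byte has reached the disk *)
    let flush_all t =
      let on_disk = Bytes.copy t.on_disk in
      IM.iter (fun off c -> Bytes.set on_disk off c) t.pending;
      { on_disk; pending = IM.empty }

    let () =
      let t = { on_disk = Bytes.of_string "abcdefghij"; pending = IM.empty } in
      let t = pwrite t ~off:2 "hello" in
      IM.bindings t.pending
      |> List.map (fun (o, c) -> Printf.sprintf "%d->%c" o c)
      |> String.concat ", "
      |> Printf.printf "pending: %s\n";
      Printf.printf "after full flush: %s\n" (Bytes.to_string (flush_all t).on_disk)

Running this prints the pending bytes at offsets 2..6 and the fully flushed contents abhellohij; under this model, after a crash only some of the pending bytes may have reached the disk, which is what makes reasoning about crash-safety delicate.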

Separation of concerns: Buffered file IO with OCaml

This piece of code adds write buffering for append operations on a file, and read buffering to attempt to make repeated preads quicker. Ultimately the implementation supports (buffered) pread, pwrite and (buffered) append. I'm posting it here because I thought it nicely demonstrated separation of concerns: first we build a file which maintains its own length in memory (V2); then we add the write buffer (V3); then finally we add the read buffer (V4). Each implementation builds on the previous. But each level is reasonably self-contained and addresses the single feature at that level. Previously I tried to do this by implementing everything in a single level, but the code was considerably more complex and harder to follow.
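As a rough illustration of the layering (a hypothetical sketch, not the post's actual code), each feature can be added by wrapping the previous implementation behind the same signature; here is what the write-buffer layer (the V3 step) might look like as an OCaml functor:

    module type FILE = sig
      type t
      val append : t -> string -> unit
      val size : t -> int
      val flush : t -> unit
    end

    (* add an append buffer on top of any existing FILE implementation *)
    module Buffered (F : FILE) : FILE with type t = F.t * Buffer.t = struct
      type t = F.t * Buffer.t
      let limit = 4096
      let flush (f, buf) =
        F.append f (Buffer.contents buf);
        Buffer.clear buf;
        F.flush f
      let append ((_, buf) as t) s =
        Buffer.add_string buf s;
        if Buffer.length buf >= limit then flush t
      let size (f, buf) = F.size f + Buffer.length buf
    end

In the same spirit, the in-memory length (V2) and the read buffer (V4) are just other layers over similar signatures, which is what keeps each level self-contained.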

The Economics of Writing (Good) Code Documentation

TL;DR: Writing documentation can be beneficial even for a lone programmer writing code that will never be read by anyone else; when other programmers are involved, however, the benefits are cumulative and potentially huge in terms of the time those programmers save when trying to grok the existing codebase. It is well known that programmers in general hate to write documentation. There are lots of reasons for this. For example, in the heat of writing code, the programmer simply cannot envisage what it would be like for someone else to come along in a couple of years and try to understand the code from scratch... And anyway, what does the programmer owe to that future person? (The programmer will likely have moved on by that point...) etc. etc. So, poor documentation (or the lack of documentation altogether) is a problem that stems from the programmer, and is probably left unfixed by the programmer's managers, right up the chain of responsibility. This post isn't an attempt to add…

How to safely append to a file? (Can we?)

[NOTE: this post was inspired by looking at the mirage/index code and trying to understand the properties that code provides in a crash scenario] In systems with reasonable claims to robustness in the event of a system crash, the issue arises that we want to write to a file in an "atomic" way. Let's simplify by considering appending data to a file. Let's also assume that the filesystem performs block-sized, block-aligned writes atomically (but multiple such writes may be reordered). Simplify further by assuming we only want to append a single block of data to a file, at a position that is a multiple of the block size. What can go wrong? The problem is that there are many moving parts, each of which can result in incorrectness. Here are some possibilities: We append a new block at the end of the file (at a position that is a multiple of the block size). This should be atomic. Has the FS (filesystem) freelist correctly recorded that the block is now in use? Has the FS updated the…
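For concreteness, here is a minimal OCaml sketch (mine, not code from the post or from mirage/index) of the append step under the stated assumptions: the data is padded to a single block and written at a block-aligned offset, followed by an fsync. Note that this only covers the data write; the questions above about the freelist and the file metadata are exactly the parts this sketch does not answer.

    (* requires the unix library; blk_size is an assumed filesystem block size *)
    let blk_size = 4096

    let append_block fd data =
      assert (String.length data <= blk_size);
      (* pad to a full block so the write is exactly block-sized *)
      let block = Bytes.make blk_size '\000' in
      Bytes.blit_string data 0 block 0 (String.length data);
      let off = (Unix.fstat fd).Unix.st_size in
      assert (off mod blk_size = 0);  (* block-aligned append position *)
      ignore (Unix.lseek fd off Unix.SEEK_SET);
      let n = Unix.write fd block 0 blk_size in
      assert (n = blk_size);
      Unix.fsync fd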

kv-hash, a new key-value store

At OCamlLabs/Tarides, I have been working on kv-hash ( https://github.com/tomjridge/kv-hash ), a replacement for irmin/index ( https://github.com/mirage/index ). The original index code implements a key-value store. New key-value pairs are written first to a log, and when the log is large it is merged into the main data store and a new log is started. The performance is good initially, but deteriorates over time (essentially because the entire store needs to be rewritten whenever the log is merged into it). kv-hash is an alternative where the performance remains constant over time, and where the merge duration in particular is fast and constant. In our tests, this avoids various issues with off-the-shelf solutions such as LMDB, RocksDB, SQLite etc. LMDB is great, provided the active set of keys fits in memory; if it doesn't, the use of an mmap means that performance degrades quite significantly. RocksDB has amazing write performance, but the relatively poor read performance…
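To make the log+merge design concrete, here is a tiny in-memory caricature in OCaml (hypothetical names, not kv-hash's or index's actual API): new bindings go into a small log map, reads consult the log before the main store, and when the log exceeds a limit it is merged into the main store and a fresh log is started.

    module SM = Map.Make (String)

    type store = { mutable main : string SM.t; mutable log : string SM.t }

    let log_limit = 1_000

    (* the expensive step: in the original index design, this is where the
       whole main store gets rewritten *)
    let merge t =
      t.main <- SM.union (fun _ log_v _main_v -> Some log_v) t.log t.main;
      t.log <- SM.empty

    let set t k v =
      t.log <- SM.add k v t.log;
      if SM.cardinal t.log >= log_limit then merge t

    let find t k =
      match SM.find_opt k t.log with
      | Some _ as r -> r
      | None -> SM.find_opt k t.main

kv-hash keeps this general shape, but is designed so that the merge stays fast and constant rather than growing with the size of the main store.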