In October, I’ll be in New York at the O’Reilly Velocity Conference, giving a “What We Talk About When We Talk About On Disk IO” talk. I’ve decided to release some of my preparation notes as a series of blog posts.

Knowing how the IO works, which algorithms are used and under which circumstances, can make the lives of developers and operators much better: they will be able to make better choices upfront (based on what is in use by the database they’re evaluating), troubleshoot performance issues when the database misbehaves (by comparing their workloads to the ones the database stack is intended to be used against) and tune their stack (by spreading the load, switching to a different disk type, file or operating system, or simply picking a different index type).

While Network IO is frequently discussed and talked about, Filesystem IO gets much less attention. Of course, in modern systems people mostly use databases as their storage means, so applications communicate with them through drivers over the network. I believe it is still important to understand how the data is written onto the disk and read back from it. Moreover, Network IO has many more things to discuss and ways to implement them, very different from one operating system to another, while Filesystem IO has a much smaller set of tools.

There are several “flavours” of IO (some functions omitted for brevity):

Syscalls: open, write, read, fsync, sync, close
Standard IO: fopen, fwrite, fread, fflush, fclose
Vectored IO: writev, readv
Memory mapped IO: open, mmap, msync, munmap

Today, we’ll discuss the Standard IO combined with a series of “userland” optimisations. Most of the time, application developers use it plus a couple of different flags on top of this. Let’s start with that.

Buffered IO

There’s a bit of confusion in terms of “buffering” when talking about stdio.h functions, since they do some buffering themselves.
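To make the difference between the first two flavours concrete, here is a minimal sketch of the same write done once through the syscall interface and once through Standard IO. The function names and file paths are made up for illustration:

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* The syscall flavour: file descriptors, no user-space buffering. */
int write_with_syscalls(const char *path, const char *data) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    ssize_t n = write(fd, data, strlen(data)); /* straight into the kernel */
    if (fsync(fd) != 0) n = -1;                /* flush kernel caches to disk */
    close(fd);
    return n == (ssize_t)strlen(data) ? 0 : -1;
}

/* The Standard IO flavour: FILE streams with a user-space buffer. */
int write_with_stdio(const char *path, const char *data) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    size_t n = fwrite(data, 1, strlen(data), f); /* lands in the stdio buffer */
    if (fflush(f) != 0) n = 0;                   /* push it down to the kernel */
    fclose(f);
    return n == strlen(data) ? 0 : -1;
}
```

Note that `fflush` only moves data from the user-space buffer into the kernel; only `fsync` asks the kernel to move it further down to the disk.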
When using the Standard IO, it is possible to choose between full and line buffering, or opt out from any buffering whatsoever. This has nothing to do with the buffering that will be done by the Kernel further down the line. You can also think about “user space” buffering as a distinction between “buffering” and “caching”, which should make the concepts distinct and intuitive.

Disks (HDDs, SSDs) are called Block Devices, and the smallest addressable unit on them is called a sector: it is not possible to transfer an amount of data that is smaller than the sector size. Similarly, the smallest addressable unit of the file system is a block (which is generally larger than a sector). The block size is usually smaller than (or the same as) the page size (a concept coming from Virtual Memory).

Everything that we’re trying to address on disk ends up being loaded into RAM, and most likely cached by the Operating System for us in-between. The Page Cache (previously the entirely separate Buffer Cache and Page Cache, which got unified in the 2.4 Linux kernel) helps to cache the buffers that are more likely to be accessed in the nearest time. The temporal locality principle implies that the read pages will be accessed multiple times within a small period of time, and spatial locality implies that related elements have a good chance of being located close to each other, so it makes sense to save the data to amortise some of the IO costs. In order to improve IO performance, the Kernel buffers data internally by delaying writes and coalescing adjacent reads.

The Page Cache does not necessarily hold whole files (although that certainly can happen). Depending on the file size and the access pattern, it may hold only the chunks that were accessed recently. Since all the IO operations are happening through the Cache, sequences of operations such as read-write-read can be served entirely from memory, without accessing the (meanwhile outdated) data on disk.
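The full/line/no-buffering choice mentioned above is made with `setvbuf`. A minimal sketch, with a hypothetical file path; the mode constants are the standard stdio ones:

```c
#include <stdio.h>

/* Write a message through a fully buffered stream: nothing reaches the
   kernel until the buffer fills up or the stream is flushed/closed. */
int write_fully_buffered(const char *path, const char *msg) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;

    /* _IOFBF: full buffering (flush when the buffer is full)
       _IOLBF: line buffering (flush on every '\n')
       _IONBF: no buffering  (every fwrite becomes a write syscall) */
    char buf[BUFSIZ];
    if (setvbuf(f, buf, _IOFBF, sizeof buf) != 0) { fclose(f); return -1; }

    fputs(msg, f);    /* sits in the user-space buffer */
    return fclose(f); /* flushes the buffer into the kernel Page Cache */
}
```

`setvbuf` has to be called before any other operation on the stream; calling it later is undefined behaviour.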
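The filesystem block size for a given file can be inspected with `stat(2)`: `st_blksize` reports the preferred IO transfer size. A small sketch:

```c
#include <sys/stat.h>

/* Returns the filesystem's preferred IO block size for the given path,
   or -1 on error. Typically 4096 on Linux, and a multiple of the sector. */
long fs_block_size(const char *path) {
    struct stat st;
    if (stat(path, &st) != 0) return -1;
    return (long)st.st_blksize;
}
```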
When the read operation is performed, the Page Cache is consulted first. If the data can already be located in the Page Cache, it is copied out for the user. Otherwise, it is loaded from the disk and stored in the Page Cache for further accesses. When the write operation is performed, the page gets written to the cache first and gets marked as dirty there.

Pages that were marked dirty, since their cached representation is now different from the persisted one, will be flushed to disk. This process is called writeback. Of course, writeback has its own potential drawbacks, such as queuing up too many IO requests, so it’s worth understanding the thresholds and ratios that are used for writeback when it’s in use, and checking queue depths to make sure you can avoid throttling and high latencies.

Delaying Errors

When performing a write that’s backed by the kernel and/or a library buffer, it is important to make sure that the data actually reaches the disk, since it might be buffered or cached somewhere. The errors will appear when the data is flushed to disk, which can be while syncing or closing the file.

Direct IO

O_DIRECT is a flag that can be passed when opening a file. It instructs the Operating System to bypass the Page Cache. This means that for a “traditional” application, using Direct IO will most likely cause a performance degradation rather than a speedup.

Using Direct IO is often frowned upon by the Kernel developers, and it goes so far that the Linux man page quotes Linus Torvalds: “The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid”. However, databases such as PostgreSQL and MySQL use Direct IO for a reason. Developers can ensure a more fine-grained control over the data access patterns, possibly using a custom IO Scheduler and an application-specific Buffer Cache.
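The delayed-errors point above deserves a sketch: since writeback is asynchronous, an IO error may only surface at `fsync` or `close` time, so both return values must be checked. A minimal illustration (function name is made up):

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 only once the data has reached the disk (as far as the kernel
   can tell); deferred writeback errors surface in fsync() or close(). */
int durable_write(const char *path, const char *data, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }

    /* The page is now dirty in the Page Cache; fsync forces writeback
       and reports errors that would otherwise appear "later". */
    if (fsync(fd) != 0) { close(fd); return -1; }

    /* close() can report deferred errors too, so its result matters. */
    return close(fd) == 0 ? 0 : -1;
}
```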
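A sketch of the Direct IO path itself, on Linux. The buffer start, offset and length all have to be aligned (the Block Alignment section below explains why), which is what `posix_memalign` takes care of here; 512 bytes is assumed as the sector size, and real code should query it instead:

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 on success, -2 if the file could not be opened with O_DIRECT
   (some filesystems, e.g. tmpfs, do not support it), -1 on other errors. */
int direct_aligned_write(const char *path) {
    const size_t alignment = 512;   /* assumed logical sector size */
    const size_t size = 4096;       /* length is a multiple of 512 */

    void *buf = NULL;               /* start address aligned to 512 bytes */
    if (posix_memalign(&buf, alignment, size) != 0) return -1;
    memset(buf, 'x', size);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { free(buf); return -2; }

    ssize_t n = write(fd, buf, size); /* bypasses the Page Cache entirely */
    free(buf);
    close(fd);
    return n == (ssize_t)size ? 0 : -1;
}
```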
For example, PostgreSQL uses Direct IO for the WAL (write-ahead log), since they have to perform writes as fast as possible while ensuring their durability, and can use this optimisation since they know that the data won’t be immediately reused, so writing it bypassing the Kernel Page Cache won’t result in performance degradation.

Direct reads will read directly from the disk, even if the data was recently accessed and might be sitting in the cache. This helps to avoid creating an extra copy of the data. The same is true for writes: when the write operation is performed, the write is done directly from the user space buffers.

Block Alignment

Because DMA (direct memory access) makes requests straight to the backing store, bypassing the intermediate Kernel buffers, it is required that all the operations are sector-aligned (aligned to the 512-byte boundary). In other words, every operation has to have a starting offset that is a multiple of 512, and the buffer size has to be a multiple of 512 as well. For example, RocksDB makes sure that the operations are block-aligned by checking it upfront (older versions were allowing unaligned access by aligning in the background).

Whether or not the O_DIRECT flag is used, it is always a good idea to make sure your reads and writes are block-aligned: making an unaligned access will cause multiple sectors to be loaded from the disk (or written back to disk). Using the block size, or a value that fits neatly inside a block, guarantees block-aligned IO requests and prevents extraneous work inside the kernel.

Nonblocking Filesystem IO

I’m adding this part here since I very often hear “nonblocking” in the context of Filesystem IO. It’s quite normal, since most of the programming interface for Network and Filesystem IO is the same. But it’s worth mentioning that there’s no true “nonblocking” IO which can be understood in the same sense.
O_NONBLOCK is generally ignored for regular (on disk) files, because block device operations are usually considered non-blocking (unlike sockets, for example). Filesystem IO delays are not taken into account by the system. Possibly this decision was made because there’s a more or less hard time bound on when the data will arrive. For the same reason, the tools you would usually use, such as select and epoll, do not allow monitoring and/or checking the status of regular files.

Closing Words

It is hard to find an optimal post size given there’s so much material to cover, but it felt about right to have a clear break after Standard IO before moving on to mmap and Vectored IO. If you find anything to add, or there’s an error in my post, do not hesitate to contact me, I’ll be happy to make the corresponding updates.