Understanding durability by building our own Write-Ahead-Log
This issue was supposed to be about LSM Trees, but like most things in life - it didn't go according to plan
For this week’s issue, I initially planned to do a deep dive into Log-Structured Merge trees (LSM Trees), including a hands-on attempt to implement them. LSM Trees are fundamental to many NoSQL databases, including HBase and Cassandra, and storage engines like RocksDB.
However, while I was coding, I realized that a Write-Ahead-Log was an essential component for the durability of an LSM Tree which I couldn’t overlook. I began writing a simple WAL implementation and soon discovered that it wasn’t trivial and that it made for a fun weekend project.
This prompted a shift in focus. Instead of discussing LSM Trees - we’ll take a step back and dive a little deeper into how Write-Ahead-Logs are implemented to guarantee high write throughput, durability, data integrity, and recovery.
WAL 101
Write-ahead logs are essential for ensuring durability, especially in scenarios where a system failure could result in the loss of all in-memory state.
When a system that needs to ensure durability, be it a database or a node in a distributed system, receives a command that alters its state, it first records this command in a file on disk.
In such systems, each node operates a single log for this purpose. Every request for a state change received by the server is sequentially recorded in this log, maintaining a total ordering guarantee. This sequential storage offers three primary advantages:
Easy Replay: Replaying log entries is straightforward. Upon restart, the system only needs to iterate through the log from beginning to end to reconstruct any lost in-memory state.
High Write-Throughput: Appending entries to the end of the log file eliminates the need for disk seeks before writing. This enables exceptionally fast write operations (even on SSDs).
Data Integrity: The append-only nature of the process significantly reduces the likelihood of corruption in previously written data during a failure. Since the file is only extended and earlier data remains untouched, the integrity of existing data is largely preserved.
Each log entry is assigned a unique identifier, commonly referred to as a Log Sequence Number (LSN). The LSN serves a critical function, particularly in systems like databases. It tracks the log entries that have been successfully persisted by the system. This tracking is essential for determining which entries need to be replayed in case of a system failure.
Consider the example of a simple Key-Value store’s put function:
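A minimal sketch in Go of what that might look like is below; the KVStore struct, the Logger interface, and the encodePut helper are illustrative stand-ins rather than code from any particular system.

```go
package kv

import "sync"

// Logger stands in for whatever WAL implementation we plug in;
// WriteEntry appends a serialized state change and returns once it
// has been recorded in the log.
type Logger interface {
	WriteEntry(data []byte) error
}

// KVStore is a toy key-value store backed by a write-ahead log.
type KVStore struct {
	mu   sync.Mutex
	data map[string]string
	wal  Logger
}

// encodePut serializes a put command; a real system might use
// Protocol Buffers or a custom binary format here.
func encodePut(key, value string) []byte {
	return []byte("put\x00" + key + "\x00" + value)
}

// Put persists the state change to the WAL first, and only then
// applies it to the in-memory map, so a crash after the WAL write can
// be recovered by replaying the log.
func (s *KVStore) Put(key, value string) error {
	if err := s.wal.WriteEntry(encodePut(key, value)); err != nil {
		return err
	}
	s.mu.Lock()
	s.data[key] = value
	s.mu.Unlock()
	return nil
}
```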
We see that before the system makes any change to its internal state, it first ensures that the state change is safely persisted to the Write-Ahead-Log. This ensures that in the event of a crash, it can successfully recover any entries it was processing by simply replaying the log.
More than just a log file
Often perceived as simple append-only logs, especially when part of larger systems like databases, Raft (and other consensus algorithms), or LSM Trees, WALs are intricate structures essential for system reliability. Their functionality extends beyond basic logging.
Each implementation of a WAL is tailored to its specific environment, influencing overall system architecture far more than one might initially think. A WAL written for a consensus algorithm (for instance, the RaftLog in Ratis) might look very different from a WAL written for a database such as SQLite.
Although specific implementations and the APIs they expose can vary, certain key functionalities are commonly expected in a WAL:
WriteEntry: Allows adding new entries to the log.
ReadAll/Recover: Enables the recovery of all entries from the log, or its most recent segment, ensuring that data can be restored in the event of a system failure.
Log Rotation: To enhance efficiency during startup and recovery, the log is divided into multiple files. This approach also aids in managing log size.
Integrity Checks: A vital aspect of WALs is ensuring the unaltered state of data on disk. Integrity checks confirm that the stored data has not been modified or corrupted, maintaining data reliability.
Auto-Repair/Recovery: In cases of minor corruption, typically resulting from a crash, the log is designed to autonomously repair itself, preserving the continuity and integrity of the data.
Let’s spend some time diving deeper into how a WAL usually supports these expectations.
Writes
A WAL must operate efficiently to prevent becoming a bottleneck for the main system it supports. Fast writes are crucial, and several factors must be considered to ensure this:
Serialization Performance: Data to be written to the disk must first be serialized into a byte stream. This process should be quick and efficient, producing a compact byte stream to minimize file size and write time. Many systems use Google's Protocol Buffers for this purpose, which provides rapid serialization and efficient encoding. Their strong type guarantees also simplify data reading and writing.
Write Performance: To optimize write times, the WAL should append data only at the end of the log, eliminating the need for disk seeks before writing. Sequential writes are faster not just on HDDs but on SSDs too (because of the way writes and Garbage Collection are handled in SSDs; see references).
In-Memory Buffers with Periodic Syncing: Rather than writing data directly to disk, WALs typically use in-memory buffers to store incoming writes, flushing these buffers to disk at set intervals. This flush frequency is adjustable and should be calibrated based on system needs. Shorter intervals offer better durability but slower write performance, while longer intervals enhance write speed at the cost of durability. The ideal balance depends on the specific requirements of the system.
FSync: Operating systems use their own buffers for disk file operations, which means data written to disk might initially only be stored in an OS buffer and can be lost in case of power failure. This can be problematic for WALs, which require strong write guarantees. To address this, operating systems provide an FSync API to force the synchronization of the OS buffer with the disk. However, since FSync can slow down write operations, WALs often allow configuration of this behavior to strike a balance between write performance and data safety.
Checksums: To ensure the integrity of data, each log entry should include a checksum. This checksum is vital during reads and repairs, helping to identify and eliminate corrupted entries. It adds a layer of data verification, further securing the log against errors. The sketch after this list shows how buffering, fsync, and checksums typically come together in the write path.
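To make this concrete, here is a rough sketch of a WriteEntry path that combines a buffered writer, a per-entry CRC32 checksum, and an optional fsync. The [length][checksum][payload] framing and the field names are assumptions made for illustration, not a prescribed format.

```go
package wal

import (
	"bufio"
	"encoding/binary"
	"hash/crc32"
	"os"
)

// WAL is a minimal write-ahead log: entries are appended to the end
// of a single file through an in-memory buffer.
type WAL struct {
	file        *os.File
	buf         *bufio.Writer
	syncOnWrite bool // fsync after every entry (durable but slower)
}

// WriteEntry frames the payload as [length][crc32][payload] and
// appends it to the log. The checksum lets readers detect corruption.
func (w *WAL) WriteEntry(payload []byte) error {
	var header [8]byte
	binary.LittleEndian.PutUint32(header[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(header[4:8], crc32.ChecksumIEEE(payload))

	if _, err := w.buf.Write(header[:]); err != nil {
		return err
	}
	if _, err := w.buf.Write(payload); err != nil {
		return err
	}

	if w.syncOnWrite {
		// Flush our buffer and ask the OS to push its own buffers to disk.
		if err := w.buf.Flush(); err != nil {
			return err
		}
		return w.file.Sync()
	}
	return nil // otherwise a background timer flushes the buffer periodically
}
```

Real implementations often push the fsync decision out to a background goroutine or a group-commit mechanism, so that many entries can share the cost of a single sync.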
Reads
It's essential to understand that a WAL is written to far more frequently than it is read from. This high write-to-read ratio underscores the WAL's primary function in recording data changes for durability. Additionally, it's important to note that many WAL implementations do not support simultaneous reading and writing. This design choice is practical because reading from the WAL typically occurs only during the startup and recovery phases when the system is not actively processing new requests.
When reading, the WAL processes entries sequentially, starting from the beginning of the file and moving towards the end. An integral part of this process is verifying the checksums for each entry to ensure data integrity. If a checksum verification fails, the WAL halts the reading process and returns an error. This indicates that the WAL requires repair before it can be successfully read again. In cases where a single WAL entry is found to be corrupted, all subsequent entries must be discarded. This is because, beyond the point of corruption, there's no longer a reliable guarantee of data integrity, and continuing to process potentially compromised entries could lead to further inconsistencies or data loss.
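Assuming the same [length][checksum][payload] framing as in the write sketch above, a recovery read might look roughly like this; ReadAll and ErrCorrupt are illustrative names, and the function returns every valid entry read before the first failure.

```go
package wal

import (
	"bufio"
	"encoding/binary"
	"errors"
	"hash/crc32"
	"io"
	"os"
)

var ErrCorrupt = errors.New("wal: corrupted entry")

// ReadAll scans the log from the beginning, verifying each entry's
// checksum. It stops at the first corrupted entry and reports it; the
// caller decides whether to repair (truncate) the log before writing.
func ReadAll(path string) ([][]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	r := bufio.NewReader(f)
	var entries [][]byte
	for {
		var header [8]byte
		if _, err := io.ReadFull(r, header[:]); err != nil {
			if err == io.EOF {
				return entries, nil // clean end of log
			}
			return entries, ErrCorrupt // truncated header
		}
		length := binary.LittleEndian.Uint32(header[0:4])
		sum := binary.LittleEndian.Uint32(header[4:8])

		payload := make([]byte, length)
		if _, err := io.ReadFull(r, payload); err != nil {
			return entries, ErrCorrupt // truncated payload
		}
		if crc32.ChecksumIEEE(payload) != sum {
			return entries, ErrCorrupt // checksum mismatch
		}
		entries = append(entries, payload)
	}
}
```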
Log Segmentation
Since each state change is recorded in the log, it can lead to rapid growth in file size. This becomes a performance bottleneck during system startup and recovery, as larger log files take longer to process. To mitigate this, most WAL implementations employ a technique known as Log Segmentation. This involves dividing the log into multiple smaller files, or segments, each uniquely identified by a segment number. Smaller segments are quicker to scan during startup, allow for fault isolation (protecting older entries from corruption in newer segments), and facilitate faster checkpointing.
Log Rotation is the process that manages these segments. Whenever the WAL needs to write a new entry, it first checks if the current segment has sufficient capacity. If not, it creates a new segment. This decision is based on various metrics like the number of entries or the total size of the current log file.
Additionally, a WAL can be configured to limit the number of segments it maintains. Once this limit is reached, the oldest segments are removed. This strategy not only helps in managing disk space but also ensures that the WAL doesn't become overly large, maintaining efficient system performance.
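A rough sketch of that rotation check follows; it assumes the WAL struct from the earlier sketches also carries a dir string and a segment counter, and that segments are named by a zero-padded segment number. The size threshold and retention limit are arbitrary illustrative values.

```go
package wal

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

const maxSegmentSize = 16 << 20 // 16 MiB per segment (arbitrary)
const maxSegments = 8           // keep at most 8 segments on disk

// maybeRotate opens a new segment once the current file has grown past
// the size threshold, then removes the oldest segments if we are over
// the retention limit.
func (w *WAL) maybeRotate() error {
	info, err := w.file.Stat()
	if err != nil {
		return err
	}
	if info.Size() < maxSegmentSize {
		return nil // current segment still has room
	}

	if err := w.buf.Flush(); err != nil {
		return err
	}
	if err := w.file.Close(); err != nil {
		return err
	}

	// Open the next segment, e.g. wal-00000042.log.
	w.segment++
	f, err := os.OpenFile(
		filepath.Join(w.dir, fmt.Sprintf("wal-%08d.log", w.segment)),
		os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	w.file = f
	w.buf = bufio.NewWriter(f)

	return w.removeOldSegments()
}

// removeOldSegments deletes the oldest segment files beyond maxSegments.
func (w *WAL) removeOldSegments() error {
	names, err := filepath.Glob(filepath.Join(w.dir, "wal-*.log"))
	if err != nil {
		return err
	}
	sort.Strings(names) // zero-padded names sort in segment order
	for len(names) > maxSegments {
		if err := os.Remove(names[0]); err != nil {
			return err
		}
		names = names[1:]
	}
	return nil
}
```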
Log Repair
Log corruption is an unavoidable risk in any system and can arise from various causes, including application crashes, server failures, or disk corruption. While log segmentation effectively isolates faults across different segments, it's crucial to recover as much data as possible from corrupted files.
Typically, the focus of repair efforts is on the latest log segment. Earlier segments, which are locked and unaltered once written, are generally less susceptible to corruption barring major disk failures.
Among the various strategies for data recovery, we'll explore a straightforward method similar to that used by etcd. This approach involves reading through the entries of the latest segment and verifying their integrity. The process continues until we encounter a damaged entry, an unexpected end-of-file marker, or an invalid checksum. The log file is truncated at this point, effectively removing the corrupted part. The result is a clean log file containing only valid entries.
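Here is a sketch of that truncation-based repair under the same illustrative framing as the earlier snippets; it approximates the idea rather than reproducing etcd's actual code.

```go
package wal

import (
	"bufio"
	"encoding/binary"
	"hash/crc32"
	"io"
	"os"
)

// Repair scans the segment at path, finds the byte offset of the last
// fully valid entry, and truncates the file there. Everything after
// the first damaged entry is discarded, since its integrity can no
// longer be trusted.
func Repair(path string) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	r := bufio.NewReader(f)
	validOffset := int64(0) // end of the last good entry

	for {
		var header [8]byte
		if _, err := io.ReadFull(r, header[:]); err != nil {
			break // EOF or truncated header: stop scanning
		}
		length := binary.LittleEndian.Uint32(header[0:4])
		sum := binary.LittleEndian.Uint32(header[4:8])

		payload := make([]byte, length)
		if _, err := io.ReadFull(r, payload); err != nil {
			break // truncated payload
		}
		if crc32.ChecksumIEEE(payload) != sum {
			break // corrupted entry
		}
		validOffset += int64(8 + len(payload))
	}

	// Drop everything after the last valid entry.
	return f.Truncate(validOffset)
}
```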
After the repair, the log can be reopened for new write operations.
Implementing a WAL in Go
We’ll be implementing our Write-ahead log in Go. Even though it’s said that benchmarks are often not worth the paper they’re written on, I still ran a few simple benchmarks. The WAL we will talk about can achieve writes in the range of a few hundred thousand entries per second on my SSD, where each entry is around 56 bytes in size. Again, these numbers don’t mean much, but it’s nice to see them.
I go over the implementation in the following YouTube video, and the complete code is available on my GitHub.