A central tenet of databases is that any committed data survives a crash or a failure. Write-ahead logging is a fundamental primitive that ensures all changes to data are first written safely to stable storage before being applied. Coupling that with some careful use of sequence numbers and we can guarantee that changes made to a database can survive system crashes.

Motivation

Let’s start with a simple transaction T1 that reads object A, and updates the value for A with a write. To simplify matters, A is stored on disk as a single page as in Figure 1.

Figure 1. A simple transaction

Databases execute transactions in main-memory, which is facilitated by the buffer pool. The buffer pool holds an in-memory representation of the state of the database and periodically writes that state to permanent storage on disk. When the simple transaction starts, the value for A is read from disk and stored in the buffer pool to take part in the transaction, as in Figure 2.

Figure 2. Reading A from disk to the buffer pool

As the transaction proceeds, it writes a new value for A. This value is immediately updated in the buffer pool.

Figure 3. Writing A to the buffer pool

Lastly, the transaction commits.

Figure 4. Committing the transaction

At this point in time, the view of the database in-memory is that object A has value 2, and the view of the database on disk is that object A has value 1. To guarantee the atomicity and durability of transactions, the database needs to determine what happens if the system fails in a way that causes the loss of main memory. The database must ensure that any updates to transactions that had committed prior to the crash are reflected in the database and that any updates from transactions that were in progress or aborted at the time of the crash are removed from the database.

Buffer Management

Both file systems and databases use a buffer residing in main memory to cache frequently or recently used data. When the buffer fills up, or an event triggers a write to disk, the data in the buffer is flushed to non-volatile storage permanently and the space in the buffer can be freed for use for new data.

Buffers provide excellent performance advantages when accessing data, but they introduce reliability concerns when true crash recovery is required, particularly when you require ACID guarantees that database management systems (DBMS) provide. Any data in volatile storage, such as main memory, will be lost in the case of a process crash. This means that any data that is kept in a buffer without being written (or flushed) to disk is lost in the case of a crash. Non-volatile data, such as a magnetic disk or solid-state drive, store data durably and can survive process crashes.

The buffer manager divides changes to data into a unit of storage called a “page”. During normal operation, the pages in the buffer pool representing changed data become “dirty”. The buffer manager writes dirty pages from the buffer pool to stable storage periodically. When and how the buffer manager writes dirty pages in volatile memory to non-volatile disk dictates both the performance characteristics of the database and the complexity of any algorithm that recovers data in the face of a crash.

Figure 5. Database buffer management

Stealing and Forcing

When using a buffer manager, the DBMS needs to uphold certain guarantees in the face of process crashes. First, it must ensure that any changes made a transaction are durable once the DBMS has told a user application that the transaction is committed. We cannot lose commits. Second, the DBMS must ensure that no partial changes are written to durable non-volatile storage if a transaction is aborted. Providing these two guarantees is complicated by the fact that data in the buffer may include transactions that must support typical ACID guarantees through system failures.

A key decision for the database recovery system to make is what to do about transactions that have been processed in main memory, but that have not been written to disk. Policies around handling this scenario are divided into two categories called “STEAL” and “FORCE”.

A STEAL policy states whether or not the database allows an uncommitted transaction to overwrite the most recent committed value of object in non-volatile storage. If you are using a STEAL policy, the buffer manager is allowed to overwrite data in memory with uncommitted data to increase performance. If you are using a NO-STEAL policy, this action is not allowed. Another way to think about this is, if the buffer pool is running out of space, is a transaction allowed to steal a slot in the buffer pool from another transaction that has not committed yet?

A FORCE policy requires that all updates made by a transaction are reflected on non-volatile storage before the transaction is allowed to be committed. A FORCE policy requires all changes to be written to disk before announcing a transaction is complete, whereas a NO-FORCE policy does not require this. Another way to think about FORCE is, are we allowed to acknowledge to the application that a transaction has completed before it is written to disk?

To make the ideas of STEAL and FORCE more concrete, let’s consider two transactions like those in Figure 6 below. Two transactions T1 and T2 are scheduled for execution and their execution paths are interleaved so that T2 begins after T1 and commits before T1. T1 reads and writes to object A, and T2 reads and writes to object B. Both A and B happen to be stored on the same page on disk.

Figure 6. Two interleaved transactions

Transaction T1 begins and the page containing object A is read from disk into the buffer pool, as in Figure 7.

Figure 7. Read A into the buffer pool

Transaction T1 continues, updating the value for A in the buffer pool (but not yet on disk).

Figure 8. Write A into the buffer pool

At this point, the transaction scheduler switches context and works on T2. The READ(B) statement executes and since the data for B is already in the buffer pool nothing needs to happen. Execution continues on to WRITE(B=7). The change is made to the data in the buffer pool as in Figure 9.

Figure 9. Write B into the buffer pool

The next step in transaction T2 is to commit the data. At this point in time, the database implementer has two decisions they need to make. First, do we return to the calling application that transaction T2 is complete and has committed? Or should we force a write of the dirty pages changed by T2 to disk before returning to the calling application? In other words, do we force the buffer pool to write out all of its changes to disk before advertising a transaction as committed? Second, note that transaction T1 has made changes to the buffer pool, even though data is not committed. If we write the changes made to the buffer pool out to disk, the changed value of A will be written out, even though the transaction has not yet committed. Should we allow transaction T1 to steal data by writing out uncommitted transactions?

Figure 10. Commit T2

A NO-STEAL and FORCE policy is the simplest set of policies to implement. A NO-STEAL policy does not allow overwriting data with something that has not been committed to disk. This provides atomicity without the need for UNDO logic when recovering because changes that were not part of a transaction committed to disk cannot be included in any other transaction. A FORCE policy makes sure that every update is on disk before a transaction commits. This provides durability without the need to handle complicated REDO logic when recovering from a crash because all committed data has been safely stored on non-volatile storage.

Though NO-STEAL/FORCE is the simplest policy to use when implementing a recovery system, it leads to the worst case performance scenario because every change to the database must be fully written to disk before completing a transaction. For this reason, commercial databases choose a STEAL/NO-FORCE policy to increase performance, even though it complicates the logic for recovering from failure. That choice dictates how we use logging to recover from failures, which is covered in the next section.

Write Ahead Logging

Most crash recovery systems are built around a STEAL/NO-FORCE approach, accepting the risks of writing possibly uncommitted data to memory to gain the performance improvements of not having to force all commits to disk. The STEAL policy imposes the need to UNDO transactions and the NO-FORCE policy imposes the need to REDO transactions. Databases rely on a log that stores transaction data and system state with enough information to make it possible to undo or redo transactions to ensure atomicity and durability of the database.

A database log is a file on disk that stores a sequential list of operations on the database. Each entry in the log is called a log record. During normal database operation, one or more log entries are written to the log file for each update to the database performed by a transaction. Each entry has a unique sequence number called the Log Sequence Number, or LSN that uniquely identifies each record in the log. A second type of log record is a checkpoint. Periodically, these records are written to describe the overall state of the database at a certain point in time and may contain information about the contents of the buffer pool, active transactions, or other details depending on the implementation of the recovery system. The log file contains enough information that we are able to undo the effects of any incomplete or aborted transactions, and that we are able to redo any effects of committed transactions that haven’t been flushed to disk.

Log records are only valuable if they can ensure recovery from failure. Write Ahead Logging (WAL) is a protocol stating that the database must write to disk the log file records corresponding to database changes before those changes to the database can be written to the main database files. The WAL protocol ensures that:

  1. All log records updating a page are written to non-volatile storage before the page itself is over-written in non-volatile storage.
  2. A transaction is not considered committed until all of tis log records have been written to non-volatile storage.

WAL ensures that all log records for an updated page are written to non-volatile storage before the page itself is allowed to be over-written in non-volatile storage. This ensures that UNDO information required by a STEAL policy will be present in the log in the event of a crash. It also ensures that a transaction is not considered committed until all of its log records (including its commit record) have been written to non-volatile storage. This ensures that REDO information required by NO-FORCE policy will be present in the log.

To highlight the performance advantages achieved with write-ahead logging, imagine that a transaction makes an update 1000 objects in the database and that each of those objects reside on different pages on disk. Without write ahead logging, the DBMS would need to write 1000 pages to disk to successfully complete the transaction. With WAL and a STEAL/NO-FORCE policy, we can update the data in-memory and write a single log record to disk that includes all information necessary to REDO the 1000 object update. Writing the remaining 1000 pages to disk can be done asynchronously without impacting running user-facing transactions.

The WAL Protocol

The actual WAL protocol needs to mark when a transaction begins using a <BEGIN> record a transaction starting point. When a transaction finishes, we write a <COMMIT> record to the log. Each additional entry in the log contains information about the change to a single object. This includes identifiers for the transaction being performed, the objects being modified, the before value of the object (to facilitate UNDO) and the after value of the object (to facilitate REDO). The database must write all log records to disk before it returns an acknowledgement to the calling application.

Figure 11 shows the starting state of a database using WAL. Here, we have an additional in-memory buffer dedicated to write-ahead logging called the WAL buffer. Changes destined for the log are buffered here in-memory before being written to non-volatile storage. For the sake of simplicity, the buffer pool has already been populated with the same page of data needed by the transaction.

Figure 11. A transaction using WAL

When the transaction begins, we start by marking the start of the transaction in the WAL buffer.

Figure 12. Marking the start of a transaction

When it is time to update the data for the transaction, we first write the log entry to the WAL buffer with a sequence number, followed by the transaction responsible for the change, the object being changed, and the before and after value of the object. Once this change has been written to the WAL buffer, we make the change to the same object in the buffer pool.

Figure 13. Writing data using WAL

When the transaction reaches the COMMIT statement, we again write an entry in the WAL buffer. This time, because we have reached a commit, the entire buffer is copied into the log on disk. Once the log record has been safely stored on disk, the transaction can be flagged as “committed” and execution is free to return back to the application — we have enough information in the log to restore the state of the database even if we lose data in the buffer pool.

The write to the log on disk is a sequential write that appends to the end of the log. Sequential append-only I/O is significantly faster than the random access I/O that required to write the contents of the buffer pool out to disk. The use of a log here increases the performance of the database by condensing the transaction into a single sequential I/O operation before signalling to the application that the transaction has committed.

At a later point in time, the database is free to flush the data in the buffer pool to disk.

Figure 14. Flushing the buffer pool to disk

There is enough information in the log to recreate the state of the database if the buffer pool is killed before it is flushed to disk. The next section covers exactly to recover data after a crash using the log.

Maintaining Log Records

In a system using WAL, we write all modifications to the database to a stable log before applying them to the database itself. This simple act facilitates crash-recovery because we have a record of what changes were made before they are actually applied. Imagine a process that is in the middle of performing some operation when the machine it’s running on loses power. Upon restart, that process might need to know whether the operation it performed succeeded, partially succeeded, or failed. If the process uses a write-ahead log, it can replay the changes in the log to recover state, and compare that state to the current state of the process. On the basis of this comparison, the process could decide to undo what it had started, complete what it had started, or keep things as they are.

Conceptually the log is a single ever-growing sequential file that is append-only. Every log record has a unique log sequence number (LSN), and LSNs are assigned in ascending order. Log records may contain redo and undo information. A log record containing both is called an undo-redo log record, there may also be undo-only log records and redo-only log records. A redo record contains the information to redo a change made by a transaction (if they have been lost). An undo record contains the information needed to reverse a change made by a transaction (in the event of rollback). This information can simply be a copy of the before/after image of the data object, or it can be a description of the operation that needs to be perform to undo/redo the change.

The database uses the log sequence numbers to facilitate recovery by tracking the state of database pages, transactions, and flushed data during normal database operation. Each component of the system required for crash recovery tracks the sequence number of the log records that have modified data they require for recovery:

  • flushedLSN:
    • The largest sequence number flushed to disk so far.
    • This number tracks whether a log record is durably persisted.
  • pageLSN:
    • The last log record that modified a page
  • recLSN:
    • The oldest update to the page since last flushed
  • lastLSN:
    • The last record of for each transaction

Each page of data contains the sequence number of the most recent update to the page (the pageLSN). In addition, the database system tracks the maximum sequence number that has been flushed so far (the flushedLSN). During operation, we write a page x to disk only after all pages up to x have been flushed to disk. That is, we can write a page x if pageLSN(x) <= flushedLSN. This means that before we flush a dirty page to disk, we first have to flush the log record that made the page dirty to disk.

Figure 15 shows a conceptual idea of how the different sequence numbers are maintained. For each page we maintain a pageLSN marks the most recent log sequence number that modified this page along with a recLSN that records the first log sequence number that modified this page to make it dirty. Together, recLSN and pageLSN provide a boundary recording the before and after state of the page and the log records that modified data at those points in time. Lastly, we flushedLSN records the last log sequence number that was written to disk. The pageLSN and recLSN are committed to disk in the database alongside each page when that page is written to disk.

Figure 15. Maintaining WAL Records

To following two diagrams show the case where pageLSN is less than flushedLSN, or greater than flushedLSN. In the first case, it is safe to write modified page to the master database because the pageLSN is less than flushedLSN and we know that the log record modifying the page has been written to disk. In the second case, the log record modifying the page has not been written to disk and we cannot write the modified page data to disk yet.

Figure 16. Safe to flush data to disk
Figure 17. Not safe to flush data to disk

Maintaining Abort Records

As an optimization for handling aborts, the database also maintains a prevLSN that records the previous log sequence number for this transaction and a “transaction end” record signalling that a transaction has completed and there will be no more log records for this transaction. These two additions allow the recovery system to easily walk through any log records that are part of the same transaction to make recovery more performant. We also add compensation log records (CLR). The CLR describes the changes made to undo any actions of a previous update. It has all the fields of a regular log record and it gets added to the write ahead log like any other record. For each undone update, the database creates a CLR entry in the log and restores the old value to the database. After the databases reverses the aborted transaction’s updates it then writes a transaction end log record to signal that we will not see any additional changes for this transaction.

As an example, we can look at a section of the log record as it is added to over time. In the table below, we have a transaction T1 that updates the value for A. At some later point in time, the transaction is aborted. To facilitate the abort of the transaction, we need to undo the change that was made to A. To do this, we create a compensation log record log for the transaction id T1. You will notice that the record with sequence number 026 sets A back to the value it had before the update was made to object A through sequence number 002. Lastly, a TXN-END log record is added to signal that we will never see any more changes made to the database from transaction T1.

LSNprevLSNtxnIdtypeobjectbeforeValueafterValueundoLSN
001nilT1BEGIN
002001T1UPDATEA3040
011002T1ABORT
026011T1CLRA4030001
027026T1TXN-END

Using compensation log records, whenever a transaction is aborted, the database plays back transaction updates in reverse order, and for each one writes a compensation log record noting what changes were made. At the end of the abort, write a TXN-END record to signal that this transaction is complete.

Recovery — ARIES

The most widely known and emulated implementation of crash recovery is ARIES — Algorithms for Recovery and Isolation Exploiting Semantics, which is based on the background we’ve covered so far. The ARIES recovery algorithm is split into there phases: analysis, redo, and undo.

  • Analysis: The analysis phase reads the log from the last recorded checkpoint to identify any dirty pages and active transactions at the time of the crash.
  • Redo: The redo phase repeats all changes to the database from a point in the log forward. This includes transactions that will abort. This retrace brings the database back to the exact state it was in before the crash.
  • Undo: Reverse the changes made by transactions that did not commit before the system crashed.

Although it may seem counter-intuitive to redo all changes, even ones that will eventually be aborted, it actually simplifies the overall algorithm considerably.

Analysis Phase

During the analysis phase, the log is scanned starting form the last checkpoint recorded via the database’s log sequence numbers. The scan proceeds forwards until it finds a transaction end record (TXN-END). Whenever a transaction end record is found, the associated record keeping for that transaction can be removed from the buffer pool. For all other records (ones that are not transaction end), we add a transaction to the buffer pool with the status of UNDO. The transaction is marked as UNDO because we don’t yet know if the transaction will succeed and we need to record the potential of having to undo these steps. On a commit record in the log, the transaction can be changed from UNDO to COMMIT. Lastly, for any UPDATE log records, if the associated page is not marked as dirty, we mark the page as dirty and set the pages recLSN to signal the oldest update to the page since it was last flushed.

Redo Phase

The redo phase repeats all history that was written to the log to reconstruct the state of the database at the exact moment of the crash. During this phase, we reapply all updates (even aborted transactions) and rewrite all compensation log records.

To perform the redo phase, the database scans forward from the log with the smallest recLSN in the dirty page table that was reconstructed during the analysis phase. For each update record or compensation record, for a given sequence number, the database re-applies the update. This update happens for all records unless the affected page has not been marked as dirty, or the records sequence number is greater than the recLSN, or if the pageLSN is already written to disk.

When redoing an action, the database acts as though it is a normal update to the database. All changes are first applied to the log record and then the affected page’s pageLSN is set to that log record’s sequence number. At the end of the redo phase, any committed transactions are included in the log with a TXN-END record and they are removed from the list of active transactions.

Undo Phase

In the last phase, the database reverses all transactions that were active at the time of crash but that have not been committed. Since we removed any committed transactions from the buffer pool during the redo phase, any remaining transactions in the buffer pool can be undone. These are all transactions with UNDO status in the ATT after the Analysis and REDO phase.

The database processes transactions in reverse sequence number order. As it reverses the updates of a transaction, the database writes a compensation log record entry to the log for each modification. Once the last transaction has been successfully aborted, the database flushes out the log and then is ready to start processing new transactions.

Conclusions

ARIES was designed to support the needs of industrial strength transaction processing systems. ARIES uses logs to record the progress of transactions and their actions which cause changes to recoverable data objects. The log is the source of truth and is used to ensure that committed actions are reflected in the database, and that uncommitned actions are undone. Conceptually the log is a single ever-growing sequential file (append-only). Every log record has a unique log sequence number (LSN), and LSNs are assigned in ascending order.

Crash-recovery is a fundamental primitive underpinning databases and data storage technologies. Although the implementation can be challenging, the ideas around recovery are fairly straightforward: write all changes to disk in a log before applying any changes to the master data record. By tracking sequence numbers of each change, we can facilitate redoing and undoing database changes and restoring the system to the state it was in before the crash.

References: