ποΈ 27012025 0108
π
sstables
File format where data is stored as sorted key-value pairs
[apple, banana, cherry, grape]
Key Characteristicsβ
- Sorted order makes range queries (e.g., keys between "banana" and "grape") efficient.
- Immutable once written (to avoid random writes and fragmentation).
How It Worksβ
- New data is written to a write-ahead log (WAL) and a memtable in memory.
- When the memtable is full, it is flushed to disk as a new SSTable file.
Merging SSTablesβ
- To avoid having too many small SSTables, compaction merges and re-sorts SSTables periodically.
Example Usageβ
- Foundations of databases like Apache Cassandra and LevelDB.
1. Write-Ahead Log (WAL)β
What is it?β
- The WAL is a sequential log file where all incoming writes are first recorded before they are applied to in-memory structures (like the memtable) or written to disk.
- The primary purpose is durability: If a crash occurs, the WAL ensures no data is lost because every write operation has been logged persistently.
How it Worksβ
- Write Operation:
- When a client writes data, itβs first appended to the WAL (a simple sequential write to a file).
- This is much faster than performing random writes on disk.
- Sync to Disk:
- The WAL is flushed to disk to ensure durability.
- If the system crashes, the WAL can be replayed to recover the latest state.
- Memtable Update:
- The same data is also written into the memtable (in-memory data structure).
- The WAL is effectively a safety net for the memtable.
Key Characteristicsβ
- Sequential Writes: Very efficient on disk since writes are always appended.
- Crash Recovery: If the database crashes, it replays the WAL to rebuild the memtable and recover lost data.
2. Memtableβ
What is it?β
- The memtable is an in-memory, write-friendly data structure (often a balanced tree like a Red-Black Tree or Skip List) that stores key-value pairs in sorted order.
- It acts as a buffer for writes before data is flushed to disk as a SSTable.
How it Worksβ
- Write Operation:
- When data is written to the database, it is added to the memtable after being logged in the WAL.
- The memtable keeps data sorted in memory, making it efficient for range queries and lookups.
- Threshold Trigger:
- When the memtable reaches a size threshold (e.g., 64MB), it is flushed to disk as an immutable SSTable.
- Flushing:
- During the flush, the data is written to a new SSTable in sorted order, and the WAL for that memtable is discarded.
Key Characteristicsβ
- Sorted Data: Makes range queries fast because data is organized as itβs written.
- Write Buffer: Reduces the number of direct disk writes by batching them in memory.
- Transient: The memtable exists only in memory and is lost on a crash unless the WAL is used for recovery.
Relationship Between WAL and Memtableβ
-
WAL ensures durability, while the memtable ensures efficiency:
- Durability: WAL guarantees that writes are not lost during a crash.
- Efficiency: Memtable minimizes the frequency of disk writes by batching writes in memory.
-
During a flush:
- Data in the memtable is written to disk as a sorted SSTable.
- Once the memtable is flushed, the corresponding WAL file can be deleted because the data is now safely persisted in the SSTable.
Flow of a Write Operationβ
-
Client Write:
- The database writes the operation to the WAL (sequential write).
- The data is then written to the memtable (in memory, sorted).
-
Memtable Reaches Threshold:
- When the memtable is full, it is flushed to disk as a sorted SSTable.
- The corresponding WAL is deleted since itβs no longer needed.
-
Crash Recovery:
- If the database crashes, the WAL is replayed to rebuild the memtable and ensure no data is lost.
Why Use Both?β
- Durability (WAL): Ensures that no data is lost even in the event of a crash.
- Write Efficiency (Memtable): Reduces frequent small writes to disk, improving performance by batching operations.
Example in Practiceβ
In a system like LevelDB or RocksDB:
- Incoming writes are logged in the WAL.
- Data is also added to the memtable.
- Periodically, the memtable is flushed to disk as a new SSTable.
This approach balances write durability, efficiency, and eventual read performance.
Let me know if you'd like an example or deeper explanation of anything!