Friday 26 July 2024

Log structured merge trees with compaction strategies

 

Log structured merge trees

Problem statement & requirements

Where is it used?

A database is just a server. We need to implement the specific software that makes it behave as a database. Since it is a server anyhow, it has the same components: a CPU, memory and a disc. Using these, we need to cater to our specific use case and build the database software.

LSMs serve use cases that are very write heavy. The (losing) competitor of the LSM for this specific use case is the B tree / B+ tree. But a B/B+ tree's write complexity is $O(\log_m N)$. This means that to write an item to a B/B+ tree, you might need to update the disc $O(\log_m N)$ times, which might seem small, but given the huge amount of data present inside these databases, this still adds up to a lot of time.

Therefore, for write-heavy use cases, we need a different solution. This is where LSMs come into the picture.

Why specifically LSM?

LSMs offer a write complexity of $O(1)$ and a read complexity of $O(N \times \log M)$ (where $N$ is the number of SSTables scanned, and $M$ is the largest SSTable's size).

Notice that in a B/B+ tree, the read complexity is $O(\log_m N)$. So we are compromising on read complexity when using LSMs.
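To make the gap concrete (the numbers are purely illustrative): with a branching factor of $m = 100$ and $N = 10^9$ keys, a B/B+ tree insert touches about $\log_{100} 10^9 = 4.5$, i.e. 4-5 disc pages, whereas an LSM write is a single $O(1)$ insert into the in-memory memTable, with the disc work deferred to background flushes and compactions.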

Design : How does it work?




Writes

All writes to an LSM-based database happen in the memTable, which lives in memory. The memTable is a sorted tree / map, which keeps the entries sorted as they are inserted. When the memTable is full, it is flushed to disc as an SSTable (Sorted String Table). All further writes then happen in a new memTable.
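A minimal sketch of this write path in Python (a plain dict standing in for the sorted memTable - sorted only at flush time - with a made-up MEMTABLE_LIMIT and an in-memory list of SSTables standing in for the disc):

MEMTABLE_LIMIT = 4  # flush threshold; tiny on purpose, just for demonstration

class LSMTree:
    def __init__(self):
        self.memtable = {}   # all writes land here, in memory
        self.sstables = []   # newest first; each SSTable is an immutable sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # The full memTable is written out as one sorted SSTable ("to disc"),
        # and all further writes go to a fresh memTable.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

For example, after five put() calls with distinct keys, the first four keys sit in one flushed SSTable and the fifth in the new memTable.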

Reads

During a read, we first check whether the key is present in the memTable. If it is, the corresponding value is returned. Note that we don't need to check the disc in this case, because whatever value exists for that key on disc is older than the one in the memTable, since the memTable was created later.

If the item is not present in the memTable, we start scanning the SSTables on disc. Since each of these is already sorted, searching one SSTable takes $O(\log M)$ time, where $M$ is the size of the SSTable. We start searching for the key from the highest level, L0. If the key is found, the corresponding value is returned without searching the lower levels.

In the worst-case scenario, we will need to search all SSTables across all levels. So if there are $S$ SSTables, it will take $O(S \times \log_2 M)$ time to search for an item.
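Continuing the toy sketch above, the read path could look like the following (lsm_get is an illustrative helper that takes the memTable dict and the newest-first SSTable list from that sketch):

import bisect

def lsm_get(memtable, sstables, key):
    # 1. The memTable always holds the newest value for a key, so it wins outright.
    if key in memtable:
        return memtable[key]
    # 2. Otherwise scan SSTables from newest to oldest; each one is sorted,
    #    so a single lookup is a binary search, i.e. O(log M).
    for sstable in sstables:
        i = bisect.bisect_left(sstable, (key,))
        if i < len(sstable) and sstable[i][0] == key:
            return sstable[i][1]   # found in a newer table; any older copies are stale
    return None                    # key is not present anywhere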

Bloom filters - an optimisation for reading SSTables

A precursor to searching an SSTable directly is consulting a bloom filter. A bloom filter can tell us that an item is definitely not present inside the SSTable; otherwise, it can only tell us that the item might be present.

So during a read, we first check the bloom filter to see whether the item is definitely not present. If the item is definitely not present, there is no need to search this SSTable. If instead the bloom filter says the item might be present, then we do need to search the SSTable.

Therefore, when a memTable is written to disc, the corresponding bloom filter is also created and attached to the SSTable.

Note that the read complexity still remains the same in the worst case scenario.

How do bloom filters work?
Bloom filters are essentially hash-addressed bit sets. When an item is added to the bloom filter, its key is hashed and the corresponding output bit is set to 1. Since the range of the hash function's output is small, many keys map to a single output bit.

Therefore, if for a key the hash function's output bit is set in the bloom filter, the key might be present in the corresponding data structure.

But if for a key the hash function's output bit is NOT set in the bloom filter, then the key is definitely not present in the corresponding data structure - the SSTable in this case.
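A toy version of this single-hash-function bloom filter (real bloom filters use several hash functions over a much larger bit array; the 1024-bit size and Python's built-in hash() are assumptions made for the sketch):

class BloomFilter:
    def __init__(self, num_bits=1024):
        self.num_bits = num_bits
        self.bits = [0] * num_bits   # the small output range of the hash function

    def add(self, key):
        # Hash the key into the small range and mark that bit as 1.
        self.bits[hash(key) % self.num_bits] = 1

    def might_contain(self, key):
        # Bit is 0 -> the key is definitely NOT in the SSTable, so skip it.
        # Bit is 1 -> the key MIGHT be there (other keys can map to the same bit),
        #             so the SSTable still has to be searched.
        return self.bits[hash(key) % self.num_bits] == 1

When a memTable is flushed, each of its keys would be add()-ed into a fresh filter stored alongside the SSTable, and the read path would call might_contain() before binary-searching that table.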

Compaction - the background process to optimise reads

As certain conditions are met, a level merges its SSTables and flushes the output into a lower level. This background process in LSM trees is known as compaction. There are different strategies for implementing this compaction mechanism.

Fundamental concepts

Size amplification / Space amplification
The actual size of the table and the optimal size of the table can differ, because multiple versions of the same key are present in different levels of an LSM tree.

This increase in actual size is called space amplification.
Write amplification
This refers to the total number of times a disc write can happen because of a single write operation on the LSM tree.

In the case of LSMs, one write operation can lead to writes into the memTable, then an L0 SSTable, then an L1 SSTable, and so on. Therefore a single write can lead to many disc writes.
Read amplification
Similar to write amplification, this refers to the total number of times a disc read can happen because of a single read operation on the LSM tree.

In the case of LSMs, one read operation may need to read through the memTable and multiple SSTables across multiple levels on disc.
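These three metrics are often written as ratios; a common way to state them (a general formulation, not a definition specific to this post) is:

$$\text{space amplification} = \frac{\text{bytes actually stored on disc}}{\text{bytes of live data}}$$

$$\text{write amplification} = \frac{\text{bytes written to disc}}{\text{bytes written by the client}}$$

$$\text{read amplification} = \frac{\text{disc reads performed}}{\text{read operations issued}}$$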

Strategy 1 : Size tiered compaction

In size-tiered compaction, a background process groups SSTables of similar size. Once the number of tables in such a group reaches a certain threshold (let's call this the "compaction threshold"), the process merges these tables and flushes the resulting new SSTable into the lower level.
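A minimal sketch of this grouping-and-merging step, assuming each SSTable is modelled as a (sequence number, sorted entries) pair - the sequence number just records which table is newer - and using made-up values for the compaction threshold and the "similar size" ratio:

COMPACTION_THRESHOLD = 4   # merge once this many similar-sized tables pile up
BUCKET_RATIO = 2.0         # tables within 2x of the bucket's smallest count as "similar size"

def bucket_by_size(sstables):
    # Group SSTables of similar size together.
    buckets = []
    for table in sorted(sstables, key=lambda t: len(t[1])):
        if buckets and len(table[1]) <= len(buckets[-1][0][1]) * BUCKET_RATIO:
            buckets[-1].append(table)
        else:
            buckets.append([table])
    return buckets

def merge_bucket(bucket):
    # Merge one bucket into a single SSTable, keeping only the newest value per key.
    newest = {}
    for seq, entries in sorted(bucket, key=lambda t: t[0], reverse=True):  # newest table first
        for key, value in entries:
            newest.setdefault(key, value)   # the first (newest) value seen for a key wins
    return (max(seq for seq, _ in bucket), sorted(newest.items()))

def compact_level(sstables):
    # One round of size-tiered compaction over a level's SSTables.
    out = []
    for bucket in bucket_by_size(sstables):
        if len(bucket) >= COMPACTION_THRESHOLD:
            out.append(merge_bucket(bucket))   # in a real LSM this result is flushed to the next level
        else:
            out.extend(bucket)
    return out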

Problem 1 : 50% of disc should be available for compaction
During compaction, enough space is required for both the original SSTables and the destination compacted SSTable. So in the worst-case scenario, where the original SSTables fill up 50% of the disc, compacting them needs up to 50% more space for the destination compacted SSTable.

Problem 2 : Data is scattered across levels and SS tables
Updates lead to stale data being present in various SSTables across different levels on disc. Lower levels have a lot of stale data, because recent updates reach the lower levels only after many compactions.

How is write amplification equal to $O(\log_m N)$?
Let's say we initially have $N$ SSTables, and that the minimum number of tables for a merge is 2. In this case, you need $O(\log_2 N)$ merge passes to reach a single SSTable.

Operations:
  • $N$ → $N/2$ SS tables : operation 1
  • $N/2$ → $N/4$ SS tables : operation 2
  • $N/2^{i-1}$ → $N/2^{i}$ SS tables : operation $i$
  • $N/2^{j-1}$ → $N/2^{j} = 1$ SS table : operation $j = \log_2 N$
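Each merge pass rewrites roughly all of the data once, and with a merge factor of $m$ there are about $\log_m N$ passes before everything ends up in one table, so every item gets written $O(\log_m N)$ times overall - which is exactly the $O(\log_2 N)$ figure above for $m = 2$.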
Pros & Cons
Pros:
  • Very good choice for write heavy workloads - low write amplification
Cons:
  • Data is split across many SS tables even at the same level, therefore reads are slow
  • Lower levels have a lot of obsolete data, increasing space amplification significantly
  • For $x$ size of SS tables required to be compacted, $2x$ space is required for the compaction process

Strategy 2 : Leveled compaction

In leveled compaction, unlike size-tiered compaction, we merge all the SSTables in a level in a sorted fashion to create the SSTables of the next level. Therefore, all the items inside a level get flushed into the next level as part of this compaction.

There are two key attributes in question here: the max SS table size and the max level size. Compaction itself is triggered when the max level size is reached. During compaction, the SSTables of the higher level are merged to create SSTables of the lower level. Lower-level SSTables are created up to max_ss_table_size each; once that size is reached, the next SS table is started.

Each level's size is a multiple of the previous level's size. So if level L0 has a size of 160 MB, then L1 can have a size of 1600 MB, L2 a size of 16000 MB, and so on.
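In general, if the growth factor is $m$ and L0 has capacity $C_0$, then level $i$ can hold up to $C_i = C_0 \times m^i$; the 160 MB → 1600 MB → 16000 MB progression above is simply the case $m = 10$.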

The special MemTable to L0 transfer
As data is written to the memTable, it eventually reaches its max size. The memTable is then flushed to the L0 level as is. Note that there is no compaction process when data is moved from memory to the L0 level.

Only once the max level size of L0 is reached does the sorted run happen, which merges the SSTables inside L0 in a sorted fashion to create SSTables inside L1.

How exactly is the compaction process happening?
Compaction is triggered when the max level size is reached. At that point we already have a list of SSTables in the higher level which need to be written into the lower level. Assuming that the SSTables themselves are sorted (i.e. the current level is not L0), we go through the SSTables one by one and identify each one's key range. For this key range, we find the overlapping SSTables in the lower level. We merge this data, remove duplicates, and then create new SSTables. The older SSTables are removed.

Because of this de-duplication process, all keys inside a level are unique.
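A minimal sketch of one such compaction step, assuming SSTables are sorted lists of (key, value) pairs and using an entry count (MAX_SSTABLE_ENTRIES) as a stand-in for the max SS table size; the function names are illustrative, not from any real engine:

MAX_SSTABLE_ENTRIES = 4    # stand-in for the 160 MB max SS table size

def key_range(sstable):
    return sstable[0][0], sstable[-1][0]

def overlaps(sstable, lo, hi):
    # Does this SSTable's key range intersect [lo, hi]?
    first, last = key_range(sstable)
    return not (last < lo or first > hi)

def compact_into_next_level(upper_sstable, next_level):
    # Merge one upper-level SSTable with the overlapping SSTables of the next level.
    lo, hi = key_range(upper_sstable)
    overlapping = [t for t in next_level if overlaps(t, lo, hi)]
    untouched   = [t for t in next_level if not overlaps(t, lo, hi)]

    # De-duplicate: the upper (newer) level wins whenever both levels hold the same key.
    merged = {}
    for table in overlapping:
        merged.update(dict(table))
    merged.update(dict(upper_sstable))     # newer values overwrite the older ones

    # Re-split the merged, sorted run into fixed-size SSTables for the next level.
    items = sorted(merged.items())
    new_tables = [items[i:i + MAX_SSTABLE_ENTRIES]
                  for i in range(0, len(items), MAX_SSTABLE_ENTRIES)]

    return untouched + new_tables          # the old overlapping SSTables are dropped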

Why does leveled compaction have a low read amplification?
Since the data inside each level is always kept sorted, reading within a level is very efficient. Once a key is found in a level, you don't need to search any lower level, or worry about duplicates of the same key within the same level.
Why does leveled compaction have a low space amplification?
The temporary space problem
The SSTables in the case of leveled compaction all stay the same size (generally 160 MB). Therefore, during compaction we don't need a lot of temporary space, unlike size-tiered compaction.

To be exact, we will have about 10 SSTables from the lower level which we are compacting in a sorted fashion, and 1 SSTable to which we are writing. Once that one output SSTable is written, we can place it in the next level and then start creating the next output SSTable from the 10 SSTables of the lower level.

Therefore, we only need about 11 × the SS table size of temporary disc space: 160 MB × 11 = 1760 MB, which is less than 2 GB.

The duplicate data problem
Since level sizes grow exponentially as we move from higher levels to lower levels, the majority of the data in leveled compaction sits in the lowermost level. Say we have three levels L0, L1 and L2, where L0 has size $x$, L1 has size $m \times x$ and L2 has size $m^2 \times x$. With $m = 10$, the sizes are $x$, $10 \times x$ and $100 \times x$, so the fraction of data inside the last level (L2) is $\frac{100 \times x}{111 \times x} \approx 90\%$.

Now, when data is pushed from L1 to L2, it is merged in sorted order. Because of this, there cannot be any duplicates within a single level. Therefore, 90% of the data cannot contain any duplicates whatsoever.
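More generally, with growth factor $m$ and levels L0 through L$k$, the last level holds $\frac{m^k}{1 + m + \dots + m^k} \approx 1 - \frac{1}{m}$ of the data, which is about 90% for $m = 10$.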

Why does leveled compaction have a high write amplification?
During compaction, a single item might be written multiple times: once for each move from a higher level to a lower level, and again whenever overlapping key ranges force lower-level SSTables to be rewritten into new SSTables. Therefore, leveled compaction has a higher write amplification than size-tiered compaction.
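A rough back-of-the-envelope estimate (not from this post): every byte that moves into a level is merged with up to $m$ times as much existing data there, so it is rewritten roughly $m$ times per level it passes through, giving a write amplification of about $m \times (\text{number of levels})$, compared with the $O(\log_2 N)$ behaviour of size-tiered compaction.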

Pros & Cons
Pros:
  • Low space amplification
  • Low read amplification
Cons:
  • High write amplification
Used by?
ScyllaDB and Apache Cassandra

Strategy 3 : [Bonus] Time window compaction

<TODO>





Appendix

Resources
