Thursday 15 August 2024

Database Sharding

 

Database Sharding

First question : Why is it required?

Let’s say you have a single node ( = server) as a database, which is serving your traffic comfortably. Once this traffic increases, you will need to scale. This can include vertical scaling (beefing up the server itself), or horizontal scaling (adding more nodes into the database cluster).

Vertical scaling has limits - you can’t go beyond a super computer. Therefore, in the end you’ll need to consider horizontal scaling. The way this is handled is called sharding.

Avoiding sharding?

Sharding is a complex topic, and even more complex to implement properly. Before deciding to adding this functionality to your database, you should exhaust all other options. We will discuss on other options first.

Partitioning within the single node

The data inside the single node is increasing - leading to an increase in the read and write times. In order to mitigate this, without a drastic measure, we can partition the table itself.

We can divide the primary key into ranges, and each such range is given a partition. Because of this, the indexes built over this data (like B-tree), become smaller, leading to decreased read and write times.

Note that all these partitions still reside inside the same node, so physically no change has been done. The table is logically partitioned, but it still resides within the same node.

Also note that we still would need to maintain a metadata store, which knows which ranges are held by which partition, so that it can route the write & read requests properly.

Replication

Let’s say partitioning didn’t work, and you have reads & writes being slowed down. One more thing for read heavy loads can be done is - replication. We can create a master-replica architecture, where writes happen to a master node, but reads happen from the replicas. Note that the whole table is replicated, not just some parts of it.

Whenever there is a write in the master node, it broadcasts this information to all the replicas, which then update their copy of the table.

Problem #1 : Write load not decreased

Since writes are still happening to a single master node, we still have the issue of writes not scaling up - they are taking more time than expected.

One thing that can be done here is vertically scaling the master node - so that it can handle the increased writes.

Problem #2 : Reads are eventually consistent

Once a write is done on the master node, it still needs to send the notification to the replicas, which then need to write this to their copy. This takes time, and therefore if you want to read just after writing, you might not get the latest value that you wrote, as that might not have been propagated into the replica yet.

Sharding : How can it help?

By dividing the database table itself on multiple nodes, we allow infinite horizontal scaling, theoretically. The writes can be routed to the specific node in which the partition is present, which has the data related to the key. Again, reads can also be routed in a similar fashion.

This solves the issues arising from the need to scale for a higher load.

Sharding

Which data resides on which node?

We have multiple nodes ( = servers) that can store data for the same database table. As part of sharding, we have decided to put different partitions in different nodes. Note that this doesn’t imply that 1 node will only contain only a single partition. It implies that a single partition will reside only inside a single node (and possibly it’s replicas)

How should this partitioning be done? We will need to decide in which node should an item reside if it were to be written.

Partitioning by key range

We can decide to split the primary key range into sub-ranges, which can denote individual nodes where this data can reside. For example, let’s say we are saving the dictionary inside our table, we can distribute words starting with a-q in a single partition, and r-z in another partition.

This means that if I want to read / write “car” then I will be dealing with the first partition, and if I want to read / write “rose” then I will be dealing with the second partition.

An obvious flaw with this partitioning scheme is distribution of words in both the partitions. The r-z partition will have more words, therefore this partition is skewed.

Therefore it is critical to specify the boundaries of a partition properly - data base administrators typically decide the partitioning boundaries.

In some cases, it is possible that a single partition is very hot because of the access pattern. For example, if we are saving timeseries data, then because of key range based partitioning, we will be writing to the same partition mostly.

Partitioning by hash of the key

We can use hash functions which guarantee uniform distribution even for keys that are close. The output of the hash can be used to decide to which partition the item is written / read from.

This avoids the issue of deciding where the partition boundaries lie & hot partitions because of close key writes.

Because of random data being stored inside a partition, we lose the ability to do efficient range queries, since the items are no longer present inside the table in a range format.

How to create secondary indexes now?

Local indexes / document-partitioned indexes

A local index is specific to a partition. You need to specify the partition key AND the index value while making a read request.

This is implemented by keeping track of writes to specific indices inside the partition itself. When a read on the index is done, this information is queried to get the index specific information.

Global indexes / term-partitioned indexes

A global index is not specific to partitions. Rather it encompasses the whole table.

Index terms are distributed amongst partitions such that they store information related to all the items for that term within the whole table. This “distribution” again, is a partitioning problem - in which node should the index “term’s” information be saved?

We can again use either a key range, or a key hash based approach. In the first one, we will compromise randomness (and therefore might encounter hot partitions). In the second one we will compromise efficient range queries.

Rebalancing partitions

Why?

The horizontal scaling enabled by “sharding” - means that we can add more nodes to the system, and the database is able to scale gracefully because of this node’s addition. This process will involve “re-balancing” existing partitions, so that the old nodes shift data to this newly added node.

There are three approaches to rebalancing partitions.

Approach #1 : Have a fixed number of partitions >> number of nodes

Under this approach, we will have a high number of partitions from the get to. So we can have initially 1000 partitions, and 10 nodes. Each node will then have 100 partitions. If we add another node to the system, then we will have 1000/11≈91 partitions per node. This would mean that each existing node will shift around 9 partitions to the new node.

Note that the partition → key mapping doesn’t change in this case. The node → partition mapping, on the other hand, does change. So change will be reflected inside the metadata store, which stores information on which partition is present inside which node.

Note that this “movement” of partition from an older node to a newer node is a costly operation, because all the data will have to be moved over the network.


ProsCons
SimpleDeciding the initial number of partitions is not simple
Initially the partition will have a very low amount of data, and later partitions will have a high amount of data, as more data is ingested by the database. We can not go ahead and split / merge existing partitions on the basis of data size

Approach #2 : Dynamic partitioning

This is simply based on thresholds for the size of a partition. If the size crosses a threshold, then the partition is split into smaller partitions, and moved to a different node. This would update the partition → key mapping. Similarly if there are multiple partitions smaller than a certain threshold, then they are merged together to create a larger partition.

ProsCons
Number of partitions grow / shrink according to data size - so partitions don't grow to abnormal sizes, nor there are multiple partitions with approximately no dataComplex
Initially a single partition will suffer all reads & writes - can be mitigated by setting initial partitions

Approach #3 : Partitions proportional to the number of nodes

In this approach, each node has a fixed number of partitions. We can increase partitions, by simply adding a new node in the system.

This new node, will get data from randomly selected partitions (from existing nodes), which will be split to send the data to the new node. The old partitions will be split in half and one half of the data is sent to the new partition inside the new node, and the other half stays with the old partition inside the old node.

Now in the case of key hash partitioning, this “splitting” activity is simple. We go ahead and split the partition from the middle. But in the case of key range partitioning, the “middle” split might not work because of the data skew issue mentioned above. The first half of the range might have no data, and the other half of the range might have all the data. This makes the splitting of the partition hard in the case of key range partitioning, and easy in the case of key hash partitioning.

Request routing

When a request comes to a sharded database, we get the partition information from the key - either via key range partitioning or key hash partitioning. Now since there are multiple nodes in this system, we need to understand which particular node has this partition.

This information can be stored at three places:
  • Store partition ↔ node mapping inside the client library
  • Store partition ↔ node mapping inside a custom routing tier
  • Store partition ↔ node mapping inside each node

Now consider this case:
  • Partition P1 was present inside node N1 - this information is present inside the partition ↔ node mapping wherever that is present
  • We are using a “fixed” number of partitions, which are transferred if new nodes are added
  • We then reach an upper threshold in the total amount of data present, therefore we add a new node N2, and shift P1 to N2

Now if the partition ↔ node mapping is not updated correctly and in time, the request will actually route to N1 instead of N2.

The handling of this, and more complicated scenarios is done by a custom coordination service like Zookeeper, which tracks cluster metadata. Whenever there are updates in the partition, the nodes notify ZooKeeper, and it notifies the routing tier (wherever that maybe present), to update the routing information.

An assignment looks like this:
Screenshot 2024-08-15 at 1.49.19 PM.png

What’s the difference between “routing tier” and “ZooKeeper”?

ZooKeeper is responsible for the actual “storage” & “updation” of the routing information for the sharded database system. The “action” of routing is actually handled by the routing tier, using the information present inside ZooKeeper.

Instead of the fancy ZooKeeper, can’t we just use a simple memory based alternative?

Let’s say instead of ZooKeeper we use another “config” server and it’s memory for saving the routing information, that looks like the diagram mentioned above.

The first issue is - reliability - this server can go down, and then the whole database system will go down, because routing doesn’t work anymore. This will make this single “config” server the single point of failure for this system.

Now if we fix reliability by keeping replicas of this simple config server, and keep a master - replica architecture, where writes happen in the master and the reads happen from the replicas - we have the issue of consistency.

This means that even if some write has happened, to update the partition to node mapping, the read right after this write might not get this updated value, as the change has not yet propagated to the replica. ZooKeeper also solves this issue.


Appendix

Sources

1498C - Planar reflections

1498C - Planar reflections Problem We have a set of $n$ 2D planes. A particle comes from the left, and goes through all the planes. Initiall...