Wednesday, 4 September 2024

1498C - Planar reflections

Problem

We have a set of $n$ 2D planes. A particle comes from the left, and goes through all the planes. Initially it has power $k$. When this particle touches a plane, another particle with power $k-1$ is created, but in the reverse direction.

Find the total number of particles created in this process.

Note that the created particles can create more particles as they themselves hit the planes.
 

Solution

As soon as a particle hits a plane, we know that a new particle with power $k-1$ will be created in the reverse direction, and the initial particle itself will carry on with the same power $k$.

Note that the new particle can now hit the planes that were behind (to the left of) the initial particle - therefore a particle's state should include how many planes are to its left, and how many are to its right.

Note that left + right will always be $n$, therefore we can just keep track of the number of planes on the left, and the right ones will be $n - left$.

So to represent the state of a particle, we can say that there are $left$ planes to the left of it, it has power $k$, and it is going in the direction $dir$, which can be left or right.

We can create a memoization based DP, where we save outputs for these values in a map.

The key of the map will be $\{left, \{k, dir\}\}$, which will be represented by `pair<long long, pair<long long, long long>>`. (Even though $dir$ can only be 0 / 1, I have used a `long long` here.)

This map will output the total number of particles created because of this particle.

Now we can create a simple recursion based on the original statement.

$$ getDp(left, k, dir) = getDp(left, k - 1, 1 - dir) + getDp(left + 1, k, dir) $$
Here (written for a right-moving particle - the left-moving case mirrors it, with $left$ decreasing instead), the first term on the RHS is the new particle created in the reverse direction with one less power, and the second term is the original particle, which now has one more plane behind it and one less plane to see ahead.
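For concreteness, here is a minimal sketch of this memoized recursion (my own reconstruction, not any of the submissions below). The 0 / 1 direction convention, the base cases (power 1, or no planes ahead), and the $10^9 + 7$ modulus are assumptions on my part:

```cpp
#include <bits/stdc++.h>
using namespace std;

const long long MOD = 1e9 + 7;
long long n; // number of planes in the current test case

// key = {left, {k, dir}} -> total particles contributed by this state (itself included)
map<pair<long long, pair<long long, long long>>, long long> dp;

// dir == 0 -> moving right, dir == 1 -> moving left (assumed convention)
long long getDp(long long left, long long k, long long dir) {
    long long ahead = (dir == 0) ? n - left : left;  // planes still in front of the particle
    if (k == 1 || ahead == 0) return 1;              // this particle can never spawn another one
    auto key = make_pair(left, make_pair(k, dir));
    auto it = dp.find(key);
    if (it != dp.end()) return it->second;

    long long res;
    if (dir == 0)  // the copy reverses with power k-1, the original keeps going with one more plane behind it
        res = (getDp(left, k - 1, 1) + getDp(left + 1, k, 0)) % MOD;
    else           // mirrored transition for a left-moving particle
        res = (getDp(left - 1, k - 1, 0) + getDp(left - 1, k, 1)) % MOD;
    return dp[key] = res;
}
```

For each test case we would set $n$, clear the map, and the answer is `getDp(0, k, 0)` for the initial right-moving particle.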

This is pretty much the logic to the problem.

The interesting part is in the implementation itself.

I tried to submit the solution 6 times - let's go through each submission and see how it improved until the AC.

Implementations

Submission 1 : 1:* DP


Link : https://codeforces.com/contest/1498/submission/278463113

The DP in this case was not the one described above - it was a different formulation.

As a particle goes through multiple planes, with power $k$, it will create particles of power $k-1$ in the reverse direction for each such plane.

So I created the DP around that with

If the particle is going in the right direction:

$$ getDp(left, right, k, direction) = 1 + \sum _{i = 1}^{right} getDp(left + i - 1, right - i + 1, k - 1, 1 - direction)$$

If the particle is going in the left direction:

$$ getDp(left, right, k, direction) = 1 + \sum _{i = 1}^{left} getDp(left - i + 1, right + i - 1, k - 1, 1 - direction)$$
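Roughly, the right-moving transition looks like this in code (a reconstruction of the idea, not the actual submission - `getDp` here memoizes on all four attributes and `MOD` is the $10^9+7$ modulus):

```cpp
// one call for a right-moving particle with `right` planes ahead and power k
long long res = 1;                                  // count the particle itself
for (long long i = 1; i <= right; ++i)              // one reversed copy per plane it crosses
    res = (res + getDp(left + i - 1, right - i + 1, k - 1, 1 - direction)) % MOD;
```

The inner loop over every remaining plane is exactly what the complexity discussion below is about.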

The time complexity of this is *high* (what exactly, and why? - see below), therefore this easily gave me a TLE.

On time complexity of the above recursion


It's basically something like $O(right \times left \times right \times ...\text{k times})$
Since $O(right) = O(left) = O(n)$, the upper bound of this is $O(n^k)$.

But let's look into the total number of items possible inside DP. That is $O(left) \times O(right) \times O(k) \times 2 = 10^3 \times 10^3 \times 10^3 \times 2 = 2 \times 10^9$

But note that $left + right = n$ always, therefore there can be only $10^3$ valid combinations of left and right (e.g. left = 1 and right = 1 can't both hold unless $n = 2$).

Therefore a tighter bound on the number of states would then be $2 \times 10^6$.

Which means the total number of possible states is within limits. But the number of times these states are accessed is high: each state's transition queries up to $n$ other states, so even with memoization the total work is roughly $2 \times 10^6 \times 10^3 = 2 \times 10^9$ operations - hence the TLE.

Submission 2 : 1:2 DP


Link : https://codeforces.com/contest/1498/submission/278717744

In this case, I improved things by using a DP with a lower time complexity - basically the one described above, except that I was still carrying a $right$ attribute which was not actually required. At this point I started to get a WA (wrong answer) instead of a TLE.
 

Time complexity of the new recursion


Earlier every node of the recursion branched into up to $n$ children; now every node branches into exactly two (the reversed copy and the continuing original).

And since we are memoizing the answers, what actually matters is the number of unique states times the work per state. The unique states are again only $2 \times 10^6$, and each state's transition now touches just two other states instead of up to $n$ of them - so the *number* of times a state is accessed is drastically reduced compared to the first submission.
 

Submission 3 : MOD


Link : https://codeforces.com/contest/1498/submission/278717889

I realized that I was not MODing the answer (by $10^9 + 7$). After adding that, I again started to get TLEs.

Submission 4 : Removing $right$


Link : https://codeforces.com/contest/1498/submission/279037448

I realized that $right$ is not actually required in the DP, so I removed it, and used only $left$.

Earlier I was getting a TLE on TC 3, and now I was getting a TLE on TC 25, so this was an improvement.
 

Exact improvements after removing the concept of `right`

I think the key size inside the map reduced, and that might have led to some constant-factor improvement in querying the map.

As mentioned here : https://stackoverflow.com/a/22607155/5345646, https://stackoverflow.com/a/1843264/5345646

While searching inside the map (which in C++ is a red-black tree), we compare keys at every node we visit. The bigger the key, the longer each comparison takes - though it is still a constant factor proportional to the key size.
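For comparison, the two key types would look something like this (the exact nesting of the four-attribute key is my guess):

```cpp
// Submissions 1-3: key carries the redundant `right` attribute
using KeyWithRight = pair<long long, pair<long long, pair<long long, long long>>>; // {left, {right, {k, dir}}}

// Submission 4 onwards: key without `right`
using Key = pair<long long, pair<long long, long long>>;                           // {left, {k, dir}}
```

Every lookup in a `std::map` walks $O(\log m)$ tree nodes and compares the full key lexicographically at each of them, so a smaller key shaves a constant factor off every query.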
 

Submission 5 : Emphasis on the reverse key


Link : https://codeforces.com/contest/1498/submission/279275375

So the key $\{left, k, direction\}$ should give the same output as $\{n - left, k, 1 - direction\}$, because it's basically the same situation, just mirrored.

I tried to also save the answer under this reverse key whenever the main key was calculated, but that caused an MLE.
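The extra write looked roughly like this (a reconstruction, not the exact submission):

```cpp
// after computing `res` for the state {left, {k, dir}} ...
dp[{left, {k, dir}}]         = res;
dp[{n - left, {k, 1 - dir}}] = res;  // also cache the mirrored state
```

Caching the mirrored state doubles the number of map entries, which is what blew past the memory limit.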
 

Submission 6 : Total removal of the reverse key concept


Link : https://codeforces.com/contest/1498/submission/279275427

As soon as I removed the concept of $reverseKey$ totally, the submission was accepted. This is probably not an asymptotic improvement but a constant-factor one - we are no longer making extra `count` calls on the `map`, each of which is $O(\log m)$ in the number of entries $m$. When the map is large enough, those calls add up.
 

How impactful was $reverseKey$?


As soon as we removed the reverse key, we removed both entries from the map and queries against it. Removing entries improves query performance for the remaining lookups when the map is large, and removing the queries themselves directly reduces the total running time.

Thursday, 15 August 2024

Database Sharding

First question : Why is it required?

Let’s say you have a single node ( = server) as a database, which is serving your traffic comfortably. Once this traffic increases, you will need to scale. This can include vertical scaling (beefing up the server itself), or horizontal scaling (adding more nodes into the database cluster).

Vertical scaling has limits - you can’t go beyond a super computer. Therefore, in the end you’ll need to consider horizontal scaling. The way this is handled is called sharding.

Avoiding sharding?

Sharding is a complex topic, and even more complex to implement properly. Before deciding to add this functionality to your database, you should exhaust all other options. We will discuss the other options first.

Partitioning within the single node

The data inside the single node is increasing - leading to an increase in the read and write times. In order to mitigate this, without a drastic measure, we can partition the table itself.

We can divide the primary key into ranges, and each such range is given a partition. Because of this, the indexes built over this data (like B-tree), become smaller, leading to decreased read and write times.

Note that all these partitions still reside inside the same node, so physically no change has been done. The table is logically partitioned, but it still resides within the same node.

Also note that we still would need to maintain a metadata store, which knows which ranges are held by which partition, so that it can route the write & read requests properly.

Replication

Let's say partitioning didn't work, and your reads & writes are still being slowed down. One more thing that can be done for read-heavy loads is replication. We can create a master-replica architecture, where writes happen on a master node, but reads happen from the replicas. Note that the whole table is replicated, not just some parts of it.

Whenever there is a write in the master node, it broadcasts this information to all the replicas, which then update their copy of the table.

Problem #1 : Write load not decreased

Since writes are still happening to a single master node, we still have the issue of writes not scaling up - they are taking more time than expected.

One thing that can be done here is vertically scaling the master node - so that it can handle the increased writes.

Problem #2 : Reads are eventually consistent

Once a write is done on the master node, it still needs to send the notification to the replicas, which then need to write this to their copy. This takes time, and therefore if you want to read just after writing, you might not get the latest value that you wrote, as that might not have been propagated into the replica yet.

Sharding : How can it help?

By dividing the database table itself across multiple nodes, we theoretically allow unlimited horizontal scaling. Writes can be routed to the specific node holding the partition that owns the key, and reads can be routed in a similar fashion.

This solves the issues arising from the need to scale for a higher load.

Sharding

Which data resides on which node?

We have multiple nodes ( = servers) that can store data for the same database table. As part of sharding, we have decided to put different partitions on different nodes. Note that this doesn't imply that one node will contain only a single partition. It implies that a single partition will reside inside only a single node (and possibly its replicas).

How should this partitioning be done? We will need to decide in which node should an item reside if it were to be written.

Partitioning by key range

We can split the primary key range into sub-ranges, each denoting a partition where that data resides. For example, if we are storing a dictionary inside our table, we can place words starting with a-q in one partition, and words starting with r-z in another.

This means that if I want to read / write “car” then I will be dealing with the first partition, and if I want to read / write “rose” then I will be dealing with the second partition.
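A minimal sketch of this lookup, assuming exactly the two partitions above (the boundaries and partition ids are illustrative):

```cpp
#include <map>
#include <string>

// lower bound of each key range -> partition id
std::map<std::string, int> rangeStart = {{"a", 0}, {"r", 1}};

int partitionFor(const std::string& key) {
    auto it = rangeStart.upper_bound(key); // first range starting after the key...
    --it;                                  // ...so the previous one owns it (assumes every key is >= "a")
    return it->second;
}

// partitionFor("car")  == 0  (the a-q partition)
// partitionFor("rose") == 1  (the r-z partition)
```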

An obvious flaw with this partitioning scheme is the uneven distribution of words between the two partitions - one partition will end up holding far more words than the other, i.e., the data is skewed.

Therefore it is critical to specify the boundaries of a partition properly - database administrators typically decide the partitioning boundaries.

In some cases, it is possible that a single partition becomes very hot because of the access pattern. For example, if we are saving timeseries data, then with key range based partitioning we will mostly be writing to the same partition - the one covering the current time range.

Partitioning by hash of the key

We can use hash functions which guarantee uniform distribution even for keys that are close. The output of the hash can be used to decide to which partition the item is written / read from.

This avoids the issue of deciding where the partition boundaries lie & hot partitions because of close key writes.

Because a partition now stores effectively random keys, we lose the ability to do efficient range queries - adjacent keys are no longer stored next to each other inside a partition.
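The same `partitionFor` lookup as in the previous sketch, now by hash - assuming a fixed number of partitions and `std::hash` as the hash function (a real system would use a stable hash such as MD5 or Murmur so the mapping doesn't change between builds or machines):

```cpp
#include <functional>
#include <string>

const std::size_t NUM_PARTITIONS = 8;

std::size_t partitionFor(const std::string& key) {
    return std::hash<std::string>{}(key) % NUM_PARTITIONS;
}

// Nearby keys such as "2024-08-15T10:00" and "2024-08-15T10:01" now (very likely)
// land on different partitions, but a range scan over a time window has to ask
// every partition.
```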

How to create secondary indexes now?

Local indexes / document-partitioned indexes

A local index is specific to a partition. You need to specify the partition key AND the index value while making a read request.

This is implemented by each partition keeping index entries only for the data it stores - writes update the local index of whichever partition the row lives in. A read then queries that partition's local index; if the partition key is not known, every partition has to be queried and the results merged.

Global indexes / term-partitioned indexes

A global index is not specific to partitions. Rather it encompasses the whole table.

Index terms are distributed amongst partitions such that a given partition stores the information for all items in the whole table that match the terms it owns. This "distribution", again, is a partitioning problem - in which node should a given term's information be saved?

We can again use either a key range, or a key hash based approach. In the first one, we will compromise randomness (and therefore might encounter hot partitions). In the second one we will compromise efficient range queries.
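A toy sketch of the difference between the two index styles (all structures and names here are illustrative, keyed on a made-up "color" attribute):

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Partition {
    std::map<std::string, std::string> rows;                // primary key -> color
    std::map<std::string, std::set<std::string>> localIdx;  // color -> primary keys stored in THIS partition
};

std::vector<Partition> partitions(4);

// Local (document-partitioned) index: without the partition key we have to ask
// every partition and merge the results.
std::set<std::string> readLocal(const std::string& color) {
    std::set<std::string> out;
    for (auto& p : partitions) {
        auto it = p.localIdx.find(color);
        if (it != p.localIdx.end()) out.insert(it->second.begin(), it->second.end());
    }
    return out;
}

// Global (term-partitioned) index: the term itself is partitioned, so one lookup
// goes to exactly the one index partition that owns that term.
std::vector<std::map<std::string, std::set<std::string>>> globalIdx(4); // term -> all matching primary keys

std::set<std::string> readGlobal(const std::string& color) {
    std::size_t idxPart = std::hash<std::string>{}(color) % globalIdx.size(); // routed by hash of the term
    return globalIdx[idxPart][color];
}
```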

Rebalancing partitions

Why?

The horizontal scaling enabled by "sharding" means that we can add more nodes to the system and the database scales gracefully as a result. Adding a node involves "re-balancing" the existing partitions, so that the old nodes shift some of their data to the newly added node.

There are three approaches to rebalancing partitions.

Approach #1 : Have a fixed number of partitions >> number of nodes

Under this approach, we will have a high number of partitions from the get-go. So we can initially have 1000 partitions and 10 nodes; each node will then hold 100 partitions. If we add another node to the system, we will have 1000/11 ≈ 91 partitions per node. This means each existing node will shift around 9 partitions to the new node.

Note that the partition → key mapping doesn't change in this case. The node → partition mapping, on the other hand, does change, and this change is reflected inside the metadata store, which tracks which partition is present inside which node.

Note that this “movement” of partition from an older node to a newer node is a costly operation, because all the data will have to be moved over the network.
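The arithmetic above, as a small rebalancing sketch for when a node joins (the data structures here are assumptions, not from any specific system):

```cpp
#include <vector>

// owner[p] = node that currently holds partition p; the key -> partition
// mapping never changes, only this assignment does.
void addNode(std::vector<int>& owner, int& numNodes) {
    int newNode = numNodes++;
    int target = (int)owner.size() / numNodes;     // e.g. 1000 partitions / 11 nodes -> 90
    std::vector<int> load(numNodes, 0);
    for (int node : owner) ++load[node];
    for (std::size_t p = 0; p < owner.size() && load[newNode] < target; ++p) {
        if (load[owner[p]] > target) {             // this node has partitions to spare
            --load[owner[p]];
            owner[p] = newNode;                    // the whole partition's data moves over the network
            ++load[newNode];
        }
    }
}
```

With 1000 partitions and 10 → 11 nodes, each old node ends up handing roughly 9 of its 100 partitions to the newcomer, leaving ~91 partitions per old node and 90 on the new one.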


Pros:
  • Simple
Cons:
  • Deciding the initial number of partitions is not simple
  • Initially the partitions will have a very low amount of data, and later a high amount of data as more data is ingested by the database - we can not split / merge existing partitions on the basis of data size

Approach #2 : Dynamic partitioning

This is simply based on thresholds for the size of a partition. If the size crosses a threshold, then the partition is split into smaller partitions, and moved to a different node. This would update the partition → key mapping. Similarly if there are multiple partitions smaller than a certain threshold, then they are merged together to create a larger partition.

Pros:
  • The number of partitions grows / shrinks according to the data size - partitions don't grow to abnormal sizes, nor are there many partitions with almost no data
Cons:
  • Complex
  • Initially a single partition will suffer all reads & writes - this can be mitigated by configuring an initial set of partitions

Approach #3 : Partitions proportional to the number of nodes

In this approach, each node hosts a fixed number of partitions. We increase the number of partitions simply by adding a new node to the system.

This new node picks a number of randomly selected partitions from the existing nodes and splits each of them: one half of the data goes to a new partition on the new node, while the other half stays with the old partition on the old node.

Now in the case of key hash partitioning, this “splitting” activity is simple. We go ahead and split the partition from the middle. But in the case of key range partitioning, the “middle” split might not work because of the data skew issue mentioned above. The first half of the range might have no data, and the other half of the range might have all the data. This makes the splitting of the partition hard in the case of key range partitioning, and easy in the case of key hash partitioning.

Request routing

When a request comes to a sharded database, we get the partition information from the key - either via key range partitioning or key hash partitioning. Now since there are multiple nodes in this system, we need to understand which particular node has this partition.

This information can be stored at three places:
  • Store partition ↔ node mapping inside the client library
  • Store partition ↔ node mapping inside a custom routing tier
  • Store partition ↔ node mapping inside each node
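Wherever it lives, the lookup itself is tiny - a sketch of a routing-tier style lookup (the names and structures here are illustrative):

```cpp
#include <functional>
#include <string>
#include <unordered_map>

const std::size_t NUM_PARTITIONS = 1000;

// refreshed whenever the coordination service reports a change
std::unordered_map<std::size_t, std::string> partitionToNode;

std::string nodeFor(const std::string& key) {
    std::size_t partition = std::hash<std::string>{}(key) % NUM_PARTITIONS; // key -> partition
    return partitionToNode.at(partition);                                   // partition -> node, e.g. "N1"
}
```

The interesting part is not the lookup but keeping `partitionToNode` correct, which is exactly the scenario below.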

Now consider this case:
  • Partition P1 was present inside node N1 - this is recorded in the partition ↔ node mapping, wherever that mapping lives
  • We are using a “fixed” number of partitions, which are transferred if new nodes are added
  • We then reach an upper threshold in the total amount of data present, therefore we add a new node N2, and shift P1 to N2

Now if the partition ↔ node mapping is not updated correctly and in time, the request will actually route to N1 instead of N2.

The handling of this and more complicated scenarios is done by a coordination service like ZooKeeper, which tracks cluster metadata. Whenever a partition changes, the nodes notify ZooKeeper, and it notifies the routing tier (wherever that may be) to update the routing information.

An assignment looks like this:
[Image: an example partition ↔ node assignment]

What’s the difference between “routing tier” and “ZooKeeper”?

ZooKeeper is responsible for the actual "storage" & "updating" of the routing information for the sharded database system. The "action" of routing is handled by the routing tier, using the information present inside ZooKeeper.

Instead of the fancy ZooKeeper, can’t we just use a simple memory based alternative?

Let's say that instead of ZooKeeper we use another "config" server, keeping the routing information in its memory - as in the diagram mentioned above.

The first issue is reliability - this server can go down, and then the whole database system goes down with it, because routing no longer works. This makes the single "config" server a single point of failure for the system.

Now if we fix reliability by keeping replicas of this simple config server, and keep a master - replica architecture, where writes happen in the master and the reads happen from the replicas - we have the issue of consistency.

This means that even if a write has happened to update the partition → node mapping, a read right after this write might not see the updated value, as the change might not have propagated to the replica yet. ZooKeeper solves this issue as well.

