The key problem with Prism's architecture is that it needs a cloned database per region. That isn't a bad idea from a logistical standpoint (larger-scale applications need this anyway), but it means there is always a chance of database desyncs. Here is an example of how one can currently happen:
NA1 <- Receives Update Request
NA1 = Performs Database Update on NA1 DB
NA1 -> Broadcasts Update to EU1 and AS1 peer nodes
EU1 <- Receives Update Request
AS1 <- Network Transmission Error (Never receives Update)
EU1 = Fails Database Update
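For reference, here is a minimal sketch of what the current fire-and-forget style broadcast looks like conceptually. The peer table, port, and `broadcast_update` name are illustrative assumptions, not Prism's actual code.

```python
# Hypothetical sketch of the current fire-and-forget broadcast. The peer
# table, port, and function name are illustrative, not Prism's actual code.
import json
import socket

PEERS = {"EU1": ("10.0.2.10", 9000), "AS1": ("10.0.3.10", 9000)}  # hypothetical addresses

def broadcast_update(update: dict) -> None:
    payload = json.dumps(update).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for name, addr in PEERS.items():
            # sendto() only hands the datagram to the OS; if it is lost in
            # transit (the AS1 case above), nothing is ever reported back,
            # so the peer's database silently falls out of sync.
            sock.sendto(payload, addr)
```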
In the example above, the EU and AS databases are now desynced from NA, since the update was never applied on either of them (EU1's query failed and AS1 never received it). There is really no way to avoid this at its core, but we can implement fail-safes to preserve data integrity. Here are some ideas I've collected (I'll probably implement all of them) to work around the desync.
- 1. Add ACKs/NACKs to background updates. Similar to how acknowledgements are done in the main packet sequence of Prism (a consequence of UDP), we can apply the same concept to background updates, retransmitting any update that is not acknowledged by a peer node. Here is an example of how this could work:
NA1 <- Receives Update Request
NA1 = Performs Database Update on NA1 DB
NA1 -> Broadcasts Update to EU1 (Awaiting an ACK)
EU1 <- Network Transmission Error (Never receives Update)
NA1 = Realizes an ACK was never returned from EU1
NA1 -> Retries Update to EU1 (Awaiting an ACK)
EU1 <- Receives Update Request
EU1 = Successfully Performs Database Update
EU1 -> Sends ACK back to NA1
NA1 <- Receives ACK from EU1 (Shifts the ACK queue)
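A rough sketch of what the ACK/retransmission loop could look like, assuming a UDP socket per peer and a JSON payload. The sequence numbers, timeout, and retry count are illustrative assumptions, not Prism's actual packet format.

```python
# Hypothetical sketch of idea 1: ACK-tracked background updates with
# retransmission. Sequence numbers, timeout, and retry count are assumptions.
import json
import socket

ACK_TIMEOUT = 1.0   # seconds to wait for an ACK before retransmitting
MAX_RETRIES = 5

def send_update_with_ack(update: dict, peer: tuple, seq: int) -> bool:
    payload = json.dumps({"seq": seq, "update": update}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(ACK_TIMEOUT)
        for attempt in range(MAX_RETRIES):
            sock.sendto(payload, peer)
            try:
                data, _ = sock.recvfrom(4096)
            except socket.timeout:
                continue  # no ACK within the timeout: retransmit
            if json.loads(data).get("ack") == seq:
                return True  # peer applied the update; shift the ACK queue
    return False  # never acknowledged after MAX_RETRIES attempts
```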
- 2. Utilize Failover Nodes. Prism was designed to be distributed and load balanced (still a WIP feature) such that multiple shards (e.g. NA1, NA2, NA3, etc.) can be spawned for a given cluster (e.g. NA). This gives us multiple peer nodes for a given shard (an existing feature), so the source node can specify a list of failover nodes for each of its peers. For example, if NA1 sends to EU1 and the request fails or is never acknowledged even after exhausting all retries, then EU1 is probably down or broken. In that case, we can go through EU1's list of failover nodes (e.g. EU2) and connect to one of those to eliminate the desync. Here is an example of how it could work:
NA1 <- Receives Update Request
NA1 = Performs Database Update on NA1 DB
NA1 -> Broadcasts Update to EU1 (Awaiting an ACK)
EU1 <- Network Transmission Error (Never receives Update)
NA1 = Realizes an ACK was never returned from EU1
NA1 -> Retries Update to EU1 (Awaiting an ACK)
EU1 <- Network Transmission Error (5x)
NA1 -> Retries Update to EU1 (5x)
NA1 = Realizes EU1 is not responding after 5 retries
NA1 = Switches to EU2 as the failover node
NA1 -> Sends Update to EU2 (Awaiting an ACK)
EU2 <- Receives Update Request
EU2 = Successfully Performs Database Update
EU2 -> Sends ACK back to NA1
NA1 <- Receives ACK from EU2 (Shifts the ACK queue)
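A sketch of the failover selection, reusing `send_update_with_ack` from the previous sketch. The failover map and peer addresses below are assumed configuration, not an existing Prism format.

```python
# Hypothetical sketch of idea 2: per-peer failover lists. The maps below are
# assumed configuration; send_update_with_ack is from the previous sketch.
FAILOVERS = {"EU1": ["EU2"], "AS1": ["AS2"]}
PEER_ADDRS = {
    "EU1": ("10.0.2.10", 9000),
    "EU2": ("10.0.2.11", 9000),
    "AS1": ("10.0.3.10", 9000),
    "AS2": ("10.0.3.11", 9000),
}

def replicate_with_failover(update: dict, primary: str, seq: int):
    # Try the primary peer first, then each of its failover shards in order.
    for shard in [primary, *FAILOVERS.get(primary, [])]:
        if send_update_with_ack(update, PEER_ADDRS[shard], seq):
            return shard   # this shard acknowledged the update
    return None            # primary and all failovers are unreachable
```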
- 3. Forceful Database Connection. A last-ditch effort is to bypass the background updates entirely and rely on a regular SQL connection to write to the foreign database. This is possible because the background update is triggered by a database update, meaning the logic for whatever query is being run already exists on Prism itself. That means we can just open an SQL connection to the foreign region and write the update directly. Here is an example of how it could work:
... NA1 has exhausted all retries with EU1 and all of its failover nodes
NA1 -> Creates SQL connection to the EU Database
NA1 -> Writes update to EU Database
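A sketch of the direct write, assuming a PostgreSQL backend and the psycopg2 driver. The DSN, driver choice, and the assumption that the original query and parameters are still available to replay are all assumptions, not details from Prism.

```python
# Hypothetical sketch of idea 3: a direct SQL write to the foreign database.
# The psycopg2 driver, DSN, and availability of the original query/params
# are all assumptions.
import psycopg2

EU_DSN = "host=10.0.2.50 dbname=prism user=prism"  # hypothetical connection string

def force_write_to_foreign_db(query: str, params: tuple) -> None:
    # Bypass background updates and run the same statement that was already
    # applied locally directly against the EU database.
    conn = psycopg2.connect(EU_DSN)
    try:
        with conn, conn.cursor() as cur:  # connection context manager commits on success
            cur.execute(query, params)
    finally:
        conn.close()
```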
Other Ideas:
- Undo requests from peer nodes back to the source node
- Awaiting the DB write on the main node until the writes succeed on all peer nodes
- Storing unsynced requests in the file system or a cache and retrying on startup (sketched below)
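For that last idea, here is a rough sketch of journaling unsynced updates to disk and replaying them on startup. The journal path and record format are assumptions; it reuses `send_update_with_ack` and `PEER_ADDRS` from the sketches above.

```python
# Hypothetical sketch of journaling unsynced updates and retrying on startup.
# Journal path and record format are assumptions; send_update_with_ack and
# PEER_ADDRS come from the earlier sketches.
import json
import os

JOURNAL_PATH = "unsynced_updates.jsonl"  # hypothetical location

def journal_unsynced(update: dict, peer: str, seq: int) -> None:
    # Persist the failed update so it survives a crash or restart.
    with open(JOURNAL_PATH, "a") as f:
        f.write(json.dumps({"peer": peer, "seq": seq, "update": update}) + "\n")

def retry_journal_on_startup() -> None:
    # Replay journaled updates; keep only the ones that still fail.
    if not os.path.exists(JOURNAL_PATH):
        return
    with open(JOURNAL_PATH) as f:
        records = [json.loads(line) for line in f if line.strip()]
    remaining = [r for r in records
                 if not send_update_with_ack(r["update"], PEER_ADDRS[r["peer"]], r["seq"])]
    with open(JOURNAL_PATH, "w") as f:
        for r in remaining:
            f.write(json.dumps(r) + "\n")
```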