Uber has started using driver phones to handle datacenter failovers so that active trip information is not lost when a failover occurs. Treating the phones as an external distributed storage system for recovery, Uber stores enough state on each driver's phone to ensure that the trip information remains available.
Uber chose this approach to protect the customer experience: losing the information for an active trip during a datacenter failover would make for a horrible one.
Even during datacenter failures, Uber has been able to preserve the trip data and deliver a seamless customer experience by building its syncing strategy around the phone. In a market with near-zero switching costs, Uber has made sure that its customers stay happy and satisfied with its service.
The ultimate goal is not to lose information about an active trip when a datacenter failover occurs. A traditional database replication strategy cannot guarantee that no information is lost. The reason lies in how network management systems function.
Networks are made up of devices and the network management system. The devices are the source of state information such as packet errors and packets sent and received. The network management system is responsible for configuration data such as customer information and alarm thresholds. Because the devices and the network management system function independently of each other, they are not always in sync. Events such as bootup, failover, and communication reconnection result in information loss, which has to be compensated for by merging information between the two through a complicated process that ensures consistency and correctness.
In Uber's case, the smartphones represent the devices, and it is imperative that active trip information is preserved when a bootup, failover, or communication reconnection occurs. Because the smartphone holds an accurate record of all trip data, trip data should not be synced from the datacenter down to the phone during failures. Doing so would result in the loss of the correct information held on the smartphone.
Another trick that Uber takes from network management systems is that the smartphones are periodically queried to test the integrity of information in the datacenter.
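The integrity check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout and the function names (`digest`, `find_stale_trips`, `repair_from_phones`) are hypothetical and do not come from Uber. It treats the phone's copy as authoritative and flags datacenter entries that are missing or differ.

```python
import hashlib
import json


def digest(record: dict) -> str:
    """Stable SHA-256 digest of a trip record (keys sorted for determinism)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_stale_trips(phone_records: dict, datacenter_records: dict) -> list:
    """Return trip IDs whose datacenter copy is missing or differs from the phone's."""
    stale = []
    for trip_id, phone_record in phone_records.items():
        dc_record = datacenter_records.get(trip_id)
        if dc_record is None or digest(dc_record) != digest(phone_record):
            stale.append(trip_id)
    return sorted(stale)


def repair_from_phones(phone_records: dict, datacenter_records: dict) -> dict:
    """Phone wins: return a copy of the datacenter view with stale entries replaced."""
    repaired = dict(datacenter_records)
    for trip_id in find_stale_trips(phone_records, datacenter_records):
        repaired[trip_id] = phone_records[trip_id]
    return repaired
```

Note that the repair only ever flows from the phone to the datacenter, never the other way, which matches the rule that datacenter state should not overwrite the phone during failures.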
Motivation behind the idea of using Phones as Storage for Datacenter Failure
- In the past, a datacenter failure would cause information about customer trips to be lost. Now, even when a datacenter fails, customers can complete their trips without any noticeable downtime.
- State change transitions are classified as:
- A requested trip which is offered to a driver
- Acceptance of a trip, which involves picking up the customer
- End of a trip
A particular trip transaction remains valid for the entire length of the trip.
- As soon as the trip starts, data about the trip continues to accumulate in the backend datacenter. It appears that a datacenter is designated per city.
- One common solution for datacenter failure is to replicate the data from the active datacenter to the backup datacenter. However, this has a few drawbacks:
- The solution gets complicated when more than two datacenters are involved.
- Replication lag occurs between the datacenters.
- A constant high-bandwidth link between datacenters becomes essential when the database does not provide support for datacenter replication or when the business model is not tuned for it.
- Uber used a creative, application-aware solution: because there is constant communication with driver phones, the data is saved to the driver's smartphone. An advantage of this solution is that it resolves the problem of a phone failing over to the wrong datacenter, a situation in which all active trip information would otherwise be lost.
- A replication protocol is required when using driver smart phones to hold datacenter backups.
- Challenges for implementing the solution:
- Care should be taken to ensure that the information saved about a trip is not accessible to the driver. A trip carries a lot of customer data that the driver should not be able to read.
- It has to be assumed that drivers' smartphones can be compromised, so it is imperative that the data saved on the phone is encrypted and tamperproof.
- It is advisable to keep the replication protocol as simple as possible so that it is easy for everyone involved to reason about and debug.
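As a rough illustration of the tamperproofing requirement, the sketch below seals a backup blob with an HMAC so that any modification on the phone is detected when the blob comes back. It is an assumption-laden example, not Uber's scheme: the `seal` and `unseal` names and the blob layout are hypothetical, and it deliberately covers only tamper detection. It does not encrypt the data; hiding it from the driver would additionally require authenticated encryption from a vetted cryptography library.

```python
import hashlib
import hmac
import json

TAG_LEN = 32  # size of a SHA-256 digest in bytes


def seal(trip: dict, key: bytes) -> bytes:
    """Serialize the trip and prepend an HMAC-SHA256 tag (integrity only, no secrecy)."""
    body = json.dumps(trip, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return tag + body


def unseal(blob: bytes, key: bytes) -> dict:
    """Verify the tag and return the trip; raise ValueError if the blob was altered."""
    tag, body = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("backup blob failed integrity check")
    return json.loads(body.decode("utf-8"))
```

The key stays on the server side in this sketch, so a compromised phone can store and return the blob but cannot forge a valid one.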
The Flow of data
When a driver makes an update or state change, such as picking up a passenger, the Dispatch Service receives the driver's request and updates the trip model for that trip. The update then goes to the Replication Service, which queues the request and returns success to the Dispatch Service. After updating its own datastore with the relevant details, the Dispatch Service returns success to the mobile client.
In the background, the Replication Service encrypts the data and sends it to the Messaging Service. Every Uber driver maintains a bidirectional channel with the Messaging Service. This channel is separate from the original request channel that drivers use to communicate with services, so normal business operations are not affected by the backup process. The smartphone then receives the backup data from the Messaging Service.
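The flow above can be sketched as a small in-memory model. Everything here is illustrative: the class names mirror the services named in the text, but the method names (`handle_update`, `enqueue`, `drain`, `push`) are assumptions, and `placeholder_encrypt` is a stand-in that does not actually encrypt anything.

```python
import json
from collections import deque


def placeholder_encrypt(trip):
    """NOT real encryption: a stand-in so the sketch runs on the standard library."""
    return json.dumps(trip, sort_keys=True).encode("utf-8")


class MessagingService:
    """Stands in for the separate bidirectional channel to each driver phone."""

    def __init__(self):
        self.phones = {}  # driver_id -> latest backup blob held by that phone

    def push(self, driver_id, blob):
        self.phones[driver_id] = blob


class ReplicationService:
    def __init__(self, messaging, encrypt):
        self.messaging = messaging
        self.encrypt = encrypt
        self.queue = deque()

    def enqueue(self, driver_id, trip):
        """Queue the update and report success right away."""
        self.queue.append((driver_id, dict(trip)))
        return True

    def drain(self):
        """Background step: encrypt each queued update and push it to the phone."""
        while self.queue:
            driver_id, trip = self.queue.popleft()
            self.messaging.push(driver_id, self.encrypt(trip))


class DispatchService:
    def __init__(self, replication):
        self.replication = replication
        self.datastore = {}  # trip_id -> trip model

    def handle_update(self, driver_id, trip_id, new_state):
        """Update the trip model, hand it to replication, persist, then acknowledge."""
        trip = {"trip_id": trip_id, "state": new_state}
        if not self.replication.enqueue(driver_id, trip):
            return False
        self.datastore[trip_id] = trip
        return True  # success returned to the mobile client


messaging = MessagingService()
replication = ReplicationService(messaging, placeholder_encrypt)
dispatch = DispatchService(replication)
dispatch.handle_update("driver-1", "trip-42", "on_trip")  # client is acknowledged here
replication.drain()  # the backup reaches the phone afterwards, off the request path
```

The point of the queue is that the driver's request is acknowledged before the backup is delivered, so a slow messaging channel does not slow down normal operations.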
Ensuring A Reliable Operation
Uber tested the failover system constantly to build confidence that it would work in the event of a failover. The first approach was to conduct manual failovers of individual cities. The plan was to then look at the success rate of restoration and to identify and solve issues by looking at the logs. However, this caused many operational issues, as performing the process manually was not feasible. It also resulted in a poor customer experience, as fares had to be adjusted for some trips that did not restore correctly. The process could be tested on only a few cities at a time, so the sample was not representative of the entire population, and some issues and their potential solutions could be missed. It was also unclear whether the backup datacenter could handle the flood of requests that can happen during a failover.
To address these problems, Uber focused on the following:
- Ensure that the flood of requests during failover can be handled by the backup datacenter
- Ensure that replication can make use of the stored data
- A Monitoring Service was introduced to monitor the health of the system. Each hour, this service receives a list of all active drivers and trips from the Dispatch Service. It helped Uber develop many useful health metrics, which were helpful in identifying problems that needed to be resolved.
- To test the backup datacenter, shadow restoration was introduced. The data collected by the Monitoring Service is sent to the backup datacenter and used for shadow restoration. Shadow restoration yields metrics on how efficiently the backup datacenter can handle the load, and it also helps identify configuration issues.
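One way to picture the kind of metric shadow restoration could produce is a restore success rate computed against the Monitoring Service snapshot. The sketch below is hypothetical: the snapshot shape, the `restore` callable, and the report fields are assumptions made for illustration, not details from Uber.

```python
def shadow_restore_report(active_trips, restore):
    """Compare a snapshot of active trips against what the backup can restore.

    active_trips: dict mapping trip_id -> expected state from the snapshot.
    restore: callable taking a trip_id and returning the restored state, or None.
    """
    restored = 0
    failed = []
    for trip_id, expected_state in active_trips.items():
        if restore(trip_id) == expected_state:
            restored += 1
        else:
            failed.append(trip_id)
    total = len(active_trips)
    return {
        "total": total,
        "restored": restored,
        "success_rate": restored / total if total else 1.0,
        "failed": sorted(failed),
    }
```

Because the check runs against the full list of active trips rather than a handful of manually failed-over cities, the resulting rate covers the whole population without affecting real trips.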
Thanks to Sandeep from Intergrid for this guest post.