Uber uses driver phones as backup datacenter

Uber has started using driver phones to handle datacenter failovers so that active trip information is not lost when a failover occurs. The driver phones act as an external distributed storage system for recovery: Uber stores enough state on each phone to ensure that trip information remains available to the driver.

Uber has chosen this approach because losing information about an active trip during a datacenter failover would result in a horrible customer experience, and the company wants to guarantee an excellent one.

By building its syncing strategy around the phone, Uber has been able to preserve trip data and deliver a seamless customer experience even during datacenter failures. In a market with near-zero switching costs, this keeps customers happy and satisfied with the service.

The ultimate goal is not to lose information about an active trip when a datacenter failover occurs. A traditional database replication strategy cannot guarantee that this information is never lost, and the reason becomes clear when you look at how network management systems function.

Networks are made up of devices and the network management system. The devices are the source of state information such as packet errors and packets sent and received, while the network management system is responsible for configuration data such as customer information and alarm thresholds. As the devices and the network management system function independently of each other, they are not always in sync. Events such as bootup, failover and communication reconnection result in information loss, which has to be compensated for by merging information between the two via a complicated process that ensures consistency and correctness.

In the case of Uber, the smartphones represent the devices, and it is imperative that active trip information is preserved across bootups, failovers and communication reconnections. Because the smartphone holds an accurate record of all trip data, trip data should never be synced from the datacenter down to the phone during failures; doing so would wipe out the phone's correct information.

Another trick that Uber borrows from network management systems is to periodically query the smartphones in order to test the integrity of the information in the datacenter.
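
This periodic integrity check can be pictured with a small sketch. Everything below is illustrative (Uber has not published its protocol details); both sides compute a digest over a canonical serialization of the trip state and compare:

```python
import hashlib
import json

def trip_digest(trip_state: dict) -> str:
    """Canonical hash of a trip's state, computable on the phone or in the datacenter."""
    canonical = json.dumps(trip_state, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Datacenter-side check: query the phone's digest and compare it to our own copy.
datacenter_copy = {"trip_id": "t1", "status": "en_route", "rider": "r42"}
phone_copy = {"trip_id": "t1", "status": "en_route", "rider": "r42"}

print("match" if trip_digest(datacenter_copy) == trip_digest(phone_copy)
      else "datacenter state is stale; reconcile from the phone")
```

A mismatch flags a trip whose datacenter copy has drifted, without shipping the full trip record on every probe.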

Motivation behind the idea of using phones as storage for datacenter failover

  1. In the past, a datacenter failure would cause information about customer trips to be lost. Now, even in the case of a datacenter failure, the customer can complete their trip without any noticeable downtime.
  2. State change transitions are classified as:
  • A requested trip which is offered to a driver
  • Acceptance of a trip, which involves picking up the customer
  • End of a trip

A particular trip transaction remains valid for the entire length of the trip.

  3. As soon as the trip starts, information about the trip continues to be populated in the backend datacenter. It appears that a datacenter is designated per city.
  4. One of the common solutions for datacenter failure is to replicate the data from the active datacenter to a backup datacenter. However, there are a few drawbacks:
  • The solution gets complicated when more than two datacenters are involved.
  • Replication lag occurs between the datacenters.
  • A constant high-bandwidth link between datacenters becomes essential when the database does not support cross-datacenter replication or the business logic is not tuned for it.
  5. Uber chose a creative, application-aware solution: save the data to the driver's smartphone, since there is already constant communication with driver phones. This also resolves the problem of a phone failing over to the wrong datacenter, where all active trip information would otherwise be lost.
  6. A replication protocol is required when using driver smartphones to hold datacenter backups.
  7. Challenges in implementing the solution:
  • Care should be taken to ensure that saved information about a particular trip is not accessible to the driver. A trip carries a lot of customer data which should not be exposed.
  • It has to be assumed that drivers' smartphones can be compromised, so the data saved on the phone must be encrypted and made tamperproof.
  • The replication protocol should be kept as simple as possible so that everyone involved can reason about and debug it.
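
The tamperproofing requirement can be sketched with a message authentication code. This is a minimal illustration, not Uber's actual scheme: the key, the field names and the use of HMAC are assumptions, and a real deployment would also encrypt the payload rather than store it in the clear:

```python
import hashlib
import hmac
import json

SERVER_KEY = b"server-side secret, never shipped to the phone"  # hypothetical

def seal(trip_state: dict) -> dict:
    """Attach an HMAC tag so the datacenter can detect tampering on restore."""
    payload = json.dumps(trip_state, sort_keys=True)
    tag = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def restore(sealed: dict) -> dict:
    """Verify the tag before trusting anything that came back from a phone."""
    expected = hmac.new(SERVER_KEY, sealed["payload"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sealed["tag"]):
        raise ValueError("trip data was tampered with on the phone")
    return json.loads(sealed["payload"])

blob = seal({"trip_id": "t1", "fare": 12.5})
print(restore(blob))  # round-trips only while the tag is intact
```

Because the key stays on the server side, a compromised phone can store the blob but cannot forge or alter it undetected.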

The Flow of Data

When a driver makes an update or state change, such as picking up a passenger, the Dispatch Service receives the driver's request and updates the trip model for that trip. The Replication Service then receives the update, queues the request, and returns success to the Dispatch Service. After updating its own datastore with the relevant details, the Dispatch Service returns success to the mobile client.

In the background, the Replication Service encrypts the data and sends it to the Messaging Service. Every Uber driver maintains a bidirectional channel with the Messaging Service. This channel is separate from the original request channel which the drivers use to communicate with services, so normal business operations are not affected by the backup process. The smartphone receives the backup data from the Messaging Service.
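
The flow just described can be sketched as follows. All class and method names are invented for illustration, the encryption step is stubbed out, and the bidirectional channel is modeled as a simple per-driver inbox:

```python
from queue import Queue

def encrypt(update):
    return update  # stub: real encryption would happen here

class MessagingService:
    def __init__(self):
        self.phone_inbox = {}  # stands in for the bidirectional driver channel

    def push_to_phone(self, driver_id, blob):
        self.phone_inbox.setdefault(driver_id, []).append(blob)

class ReplicationService:
    def __init__(self, messaging):
        self.backlog = Queue()
        self.messaging = messaging

    def submit(self, update):
        self.backlog.put(update)  # queue the request...
        return "success"          # ...and ack the Dispatch Service immediately

    def drain(self):
        # Background step: encrypt and hand off to the Messaging Service.
        while not self.backlog.empty():
            update = self.backlog.get()
            self.messaging.push_to_phone(update["driver_id"], encrypt(update))

class DispatchService:
    def __init__(self, replication):
        self.trips = {}
        self.replication = replication

    def handle_update(self, update):
        ack = self.replication.submit(update)   # replicate first
        self.trips[update["trip_id"]] = update  # then update own datastore
        return ack                              # then ack the mobile client

messaging = MessagingService()
repl = ReplicationService(messaging)
dispatch = DispatchService(repl)
ack = dispatch.handle_update({"trip_id": "t1", "driver_id": "d9",
                              "status": "picked_up"})
repl.drain()  # the backup now sits on the driver's channel
```

The key property the sketch preserves is that the client is acked on the fast path while the backup is shipped asynchronously on a separate channel.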

Ensuring A Reliable Operation

Uber tested the failover system constantly to build confidence that it would be operational in the case of a real failover. The first approach was to conduct manual failovers of individual cities, measure the restoration success rate, and identify and solve issues by looking at the logs. However, this caused a lot of operational problems: performing the process manually was not feasible, and it resulted in a poor customer experience because fares had to be adjusted for trips that did not restore correctly. The process could be tested on only a few cities at a time, so the sample did not reflect the entire population, and some issues and their potential solutions could be missed. There was also ambiguity over whether the backup datacenter could handle the flood of requests that occurs during a failover.

In order to rectify these problems, Uber adopted the following measures.

  • Ensure that the backup datacenter can handle the flood of requests during a failover.
  • Ensure that replication can make use of the stored data.
  • A Monitoring Service was introduced to track the health of the system. Each hour this service receives a list of all active drivers and trips from the Dispatch Service. It helped develop a number of good health metrics, which were useful in identifying problems that needed to be resolved.
  • To test the backup datacenter, shadow restoration was introduced. The data collected by the Monitoring Service was sent to the backup datacenter and used for shadow restoration. The shadow restoration service produces metrics on how well the backup datacenter can handle the load and also helps identify configuration issues.
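
A shadow restoration run might look like the following sketch. The service and field names are hypothetical; the point is replaying a monitoring snapshot into the backup datacenter and turning the outcome into health metrics:

```python
class BackupDatacenter:
    """Stand-in for the backup DC; misconfiguration surfaces as failures."""
    def __init__(self, known_drivers):
        self.known_drivers = known_drivers
        self.trips = {}

    def restore(self, trip):
        if trip["driver_id"] not in self.known_drivers:
            raise KeyError(trip["driver_id"])  # e.g. configuration drift
        self.trips[trip["trip_id"]] = trip

def shadow_restore(active_trips, backup_dc):
    """Replay the hourly monitoring snapshot and report restoration health."""
    restored = failed = 0
    for trip in active_trips:
        try:
            backup_dc.restore(trip)
            restored += 1
        except Exception:
            failed += 1
    return {"restored": restored, "failed": failed,
            "success_rate": restored / max(len(active_trips), 1)}

snapshot = [{"trip_id": "t1", "driver_id": "d1"},
            {"trip_id": "t2", "driver_id": "d9"}]
report = shadow_restore(snapshot, BackupDatacenter(known_drivers={"d1"}))
print(report)  # one trip restores, one surfaces a configuration issue
```

Running this continuously against the backup datacenter gives a success-rate metric without ever failing real traffic over.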

Thanks to Sandeep from Intergrid for this guest post. Intergrid offers a range of Australian dedicated servers to suit all businesses.

Local Energy Storage (LES) unit reduces Microsoft Datacenter Costs by a Quarter

In yet another example of how distributed systems occasionally fare better than centralized ones, hardware engineers at Microsoft have invented a new battery-backed power supply for their servers that allows the expensive and massive battery rooms to be removed from the cloud giant's datacenters.

Microsoft has named the new power supply the Local Energy Storage (LES) unit. It was designed as part of the Open Cloud Server hyperscale system that Microsoft donated to the Open Compute Project in 2014, and it received some major tweaks in an October 2014 update. In an unusual move for Microsoft, in the spirit of openness, the company has even published the new LES specification through the Open Compute community.

In 2012, Facebook put forth Open Compute designs that moved batteries into the Open Rack design to increase efficiency. In April 2009, Google mentioned that, in addition to the containerized datacenters it was using to increase efficiency, 12-volt battery packs were loaded onto its servers so that local failures could be eliminated. Google provided this information a decade ago, which indicates the gap in technology advancement between Google and its rivals.

With this innovation, and by making the information on the LES power supply-battery combination accessible to anyone, Microsoft is showing how datacenters can be run more efficiently by using many small distributed batteries instead of massive central ones.

Uninterruptible Power Supply (UPS)

Traditional datacenter designs deploy uninterruptible power supplies (UPS): giant banks of lead-acid batteries that provide power to the servers, storage and network facilities.

Use of the Panasonic 18650 lithium ion cell

Shaun Harris, the Microsoft principal hardware engineer who invented the Local Energy Storage (LES) unit, mentioned that the server power supply makes use of lithium-ion battery cells of the kind used in rechargeable hand tools, which appeared first in the construction industry and then spread into homes. Specifically, Microsoft uses the Panasonic 18650 lithium-ion cell, which according to Harris is a bit larger than the AA battery commonly used in consumer electronics.

According to Harris, the Panasonic 18650 lithium-ion cell carries UL certification, is priced as a commodity and is of high quality. These batteries cost around $8 apiece when bought in a four-pack, and Panasonic made over 100 million in 2014, when Microsoft bought millions for its datacenters.

The engineers at Microsoft innovated by reworking the switched-mode power supply used in its Open Cloud Server machines, embedding the battery in the existing power supply circuits without any extra cost. As a result, unlike the batteries in Google's 2009 servers, these batteries do not hang off to one side. More importantly, they do not lie in the power path between the electrical source and the server components. In the event of a failure in the main feeds, the batteries extend the life of the bulk capacitors used in the power supply.

Cost Savings through use of Local Energy Storage (LES) unit

Shaun Harris said that the end result was that the cost of providing battery backup power to Microsoft's storage and server fleet was cut by a factor of five, and the datacenters improved their power usage effectiveness (PUE) by 15 percent.

The LES approach results in cost savings because there is no need to build a separate room with special ventilation to house the large UPS batteries. According to Kushagra Vaid, general manager of server engineering for the Cloud and Enterprise Division at Microsoft, a typical 25-megawatt datacenter occupying 600,000 square feet (roughly the area of a dozen football fields) requires about 25 percent of its floor space (150,000 square feet) to accommodate the UPS gear. At $220 per square foot for construction, savings of around $31 million can be achieved for a 25-megawatt facility when there is no need to build a UPS room.
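
The arithmetic behind this estimate is easy to verify; the raw construction figure comes out slightly above the quoted "around $31 million", which presumably reflects rounding or netted-out costs:

```python
# Back-of-the-envelope check of the floor-space figures quoted above.
total_area_sqft = 600_000
ups_share = 0.25        # ~25 percent of floor space goes to UPS gear
cost_per_sqft = 220     # quoted construction cost

ups_area_sqft = total_area_sqft * ups_share
construction_savings = ups_area_sqft * cost_per_sqft
print(f"UPS room: {ups_area_sqft:,.0f} sq ft -> ~${construction_savings:,.0f} of construction avoided")
```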

Benefits of Local Energy Storage (LES) unit

In a datacenter that deploys a typical UPS, the input power is converted from AC to DC so that the batteries can be charged. The output power is then converted from DC back to AC and distributed among the various power distribution units, where it is stepped down to 120 volts for consumption by the servers.

By embedding batteries in the servers, Microsoft has to perform only one AC-to-DC power conversion, and 380 volts DC is supplied directly to the power supplies in the Open Cloud Server. This power is further stepped down to 12 volts and delivered to the server and storage nodes.

Shaun Harris mentioned that with batteries embedded inside the power supply, only about two percent of the power is lost to the overhead of battery charging. In a typical UPS scenario, around 8 percent of the energy is lost when the batteries are recharged, around 9 percent is lost in the AC-DC double conversion, and around 1 to 2 percent is lost in power management between the various systems and the UPS. This combined loss of around 19 percent can be brought down to around 2 percent by adopting the LES approach.
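
These loss figures can be tallied exactly as the article does, by simple addition (real conversion losses compound multiplicatively, but the additive approximation matches the quoted numbers):

```python
# Loss figures quoted above, in percent of input energy.
ups_losses_pct = {
    "battery recharge": 8,
    "AC-DC double conversion": 9,
    "power management": 2,  # quoted as "1 or 2 percent"; take the upper figure
}
ups_total = sum(ups_losses_pct.values())  # ~19 percent for a typical UPS
les_total = 2                             # battery-charge overhead only
print(f"typical UPS loses ~{ups_total}%, LES ~{les_total}%")
```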

Another benefit of the LES hybrid approach is that the backup battery capacity scales up automatically with the compute and storage in the racks. There is no need for datacenter-wide capacity planning of UPS gear, and the average time required to repair faults drops from days to minutes, because fixing a battery-backed power supply does not require the rules and protocols of a typical UPS installation. The LES installation is also more stable, as it involves fewer parts than a typical UPS installation.

According to Harris, hundreds of millions of dollars have been saved by adopting the LES approach and changing the architecture in the datacenters at Microsoft.

Moving to an Adiabatic Cooling Deployment

Normal datacenters have Computer Room Air Conditioners (CRACs): huge air conditioners that provide the ventilation and air cooling needed to keep the datacenter from overheating.

The CRAC system deploys water chillers, chilled-water pumps, cooling towers and condenser-water pumps, all of which draw water to feed the system. This deployment requires a lot of energy to function, and it is one of the prominent reasons why most traditional datacenters have a high PUE.
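
PUE (power usage effectiveness) is the standard metric being referred to here: total facility power divided by the power delivered to IT equipment, with 1.0 as the theoretical ideal. A one-line illustration:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

# Cooling and power-conversion overhead is what pushes PUE above 1.0.
print(pue(1500.0, 1000.0))  # 1.5
```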

In 2010, the engineers at Microsoft replaced the entire old-fashioned cooling deployment in its datacenters with an adiabatic cooling system. An adiabatic cooling deployment requires less water than a normal cooling tower; it functions by blowing hot air through streams of water so that evaporation removes the heat from the air. To keep the datacenter at a constant temperature, Microsoft uses outside-air economizers in conjunction with the adiabatic cooling system, sucking cool air in from outside the datacenter and blowing hot air out.

Microsoft has not released exact figures on how much money, energy and space were saved by moving to an adiabatic cooling deployment, but it is very likely that the savings are substantial.

Intelligent infrastructure for enterprise – part 1

A growing number of enterprises strive to improve the alignment between business operations and IT. Enterprises that achieve this improvement are highly flexible, able to change business strategies and adapt to economic and market trends that often shift quickly.

In order to have a well-managed network, it is essential to understand what network management is. Effective management involves two dimensions:

  • Quick and flexible infrastructure
  • Intelligent infrastructure

First and foremost, in order to adapt to fast-changing market conditions, successful enterprises require IT infrastructure that is quick and flexible.

Such businesses cannot tolerate downtime. As about one third of all downtime is the result of human error, it is imperative that a better-managed network be intelligent. A well-managed network is less reliant on human supervision and proactively alerts administrators to small issues before they develop into large problems.

A large number of tools are available for monitoring, reporting on and managing network devices. However, when it comes to cabling, we still depend on what we can trace with our hands or see with our own eyes.

One tool that network managers increasingly make use of is intelligent infrastructure. Typically consisting of hardware and software, it allows you to monitor, report on and manage physical cable connections.

The hardware is composed of patch panels which can sense when a patch cord has been plugged in or unplugged. This information is sent to the software via controllers embedded in the rack of patch panels. Some systems have the sensing embedded in the patch panel, while others embed it in the patch cords.

Based on the incoming data, the software sends alerts. Every change is noted in the documentation by the software, so users know when a change occurred and service can be restored quickly.

Along with monitoring and alerting, the software manages new connections via its work-scheduling functionality. It is also possible to schedule a disconnection for a future date, which ensures that only necessary ports remain active. This improves security and minimizes the number of unused ports that are left live.
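
The behavior described above can be modeled in a short sketch. The class, port and event names are invented; commercial intelligent-infrastructure systems expose this differently:

```python
import datetime

class IntelligentPanel:
    """Tracks patch-cord state per port, alerts on changes, schedules disconnects."""

    def __init__(self, alert_sink):
        self.ports = {}                  # port id -> connected?
        self.alert_sink = alert_sink
        self.scheduled_disconnects = {}  # port id -> date

    def on_event(self, port, connected):
        previous = self.ports.get(port)
        self.ports[port] = connected
        if previous is not None and previous != connected:
            state = "plugged in" if connected else "unplugged"
            self.alert_sink(f"port {port}: patch cord {state}")

    def schedule_disconnect(self, port, when):
        self.scheduled_disconnects[port] = when

    def run_schedule(self, today):
        # Deactivate ports whose scheduled disconnect date has arrived,
        # so only necessary ports stay active.
        for port, when in list(self.scheduled_disconnects.items()):
            if when <= today:
                self.on_event(port, connected=False)
                del self.scheduled_disconnects[port]

alerts = []
panel = IntelligentPanel(alerts.append)
panel.on_event("A1", connected=True)    # first observation: baseline, no alert
panel.on_event("A1", connected=False)   # state change: alert raised
panel.schedule_disconnect("B2", datetime.date(2025, 1, 1))
panel.run_schedule(datetime.date(2025, 6, 1))
print(alerts)  # ['port A1: patch cord unplugged']
```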


Adoption of hybrid cloud infrastructure over datacenters

Datacenter economics are being fundamentally changed by the spectacular rise of big data and the enormous amount of unstructured data created by sensors and connected devices. As a result, a number of organizations are considering greater IT colocation or virtualization as they move toward a hybrid cloud approach.

According to recent market research, spending on datacenter construction has increased. But as it becomes harder to justify the cost of building new datacenters, the focus has shifted to cloud services and colocation.

Driven by the Internet of Things (IoT) and big data, the enterprise shift to cloud services is also influenced by practical considerations such as reducing latency in the datacenter. Colocation service providers and market analysts also note that latency can be reduced by implementing hybrid and all-flash drives and storage.

Colocation providers have started to acquire additional capacity in anticipation of the shift away from corporate datacenters. As part of this strategy, colocation vendor VXchnge acquired eight datacenters from Sungard Availability Services, a market leader in this domain. As a result of this deal, the Tampa-based colocation provider has achieved a larger presence in close to 15 markets in North America.

In a blog post, VXchnge pointed out that pushing data closer to the businesses that analyze it will be an important factor in increasing speeds and reducing latency.

According to colocation providers like VXchnge, more organizations are looking to invest in colocation or virtualization rather than spend their money on new, expensive datacenters. They believe customers will conclude that the cloud provides a lower-cost and more reliable way to handle an increasing number of applications, IT infrastructure and workloads that run on top of open-source technologies.

VXchnge asserted that, in the future, the Internet of Things (IoT) and big data will be responsible for many changes in software, hardware and datacenters. Moving data closer to the people who use it will help improve enterprise performance.

In a quarterly survey released in July 2015, market researcher 451 Research found that 62 percent of respondents planned to consolidate their existing IT infrastructure to address power and space constraints, while 41 percent planned to make use of cloud service providers, including software-as-a-service vendors as well as hosted private clouds.

Of all the companies surveyed by 451 Research, only 25 percent said they would invest in building a new datacenter in 2015. According to the researcher's reports, datacenter consolidation would produce high-end, centralized datacenters.

Analysts say that if big data becomes a core enterprise capability, market economics will push enterprises to adopt hybrid cloud infrastructure rather than invest in their own datacenters.