Google Pins Gmail Outage on 'Routine Update' Gone Wrong



Monday’s Gmail outage didn’t last long (just 18 minutes, according to Google), but it disrupted an awful lot of users. When the popular webmail service went down around 9 a.m. PST, as many as 40 percent of all users were affected.


On Tuesday afternoon, Google published its explanation of what went wrong. The problem, the company said, was a “routine” update that caused Google’s load-balancing software — the code that helps split Google’s web workload between servers — to erroneously believe that some of Google’s data centers were unavailable.


Google operates nine of its own data centers all around the world and has built some of the most reliable computer systems on the planet. But as yesterday’s outage shows, even a big company like Google can mess up sometimes. Gmail seems to have been the hardest hit — between 8 percent and 40 percent of users were affected yesterday morning — but there were also problems with Google Drive, Google Chat, Google Calendar, and Google Play.


And remarkably, a combination of bugs in Google Sync and the company’s Chrome browser caused the browser itself to crash for many users.


“The Google load balancers have a failsafe mechanism to prevent this type of failure from causing Google-wide service degradation, and they continued to route user traffic,” Google said in an incident report, prepared yesterday for Google Apps customers, and published on Google’s website today. “As a result, most Google services, such as Google Search, Maps, and AdWords, were unaffected. However, some services, including Gmail, that require specific data center information to efficiently route users’ requests, experienced a partial outage.”


The buggy software was rolled out between 8:45 a.m. and 9:13 a.m. PST, Google said.


Monitoring software picked up the problems at 9:06 a.m., and within seven minutes, Google started rolling back to the non-buggy load-balancing software. That rollback was complete by 9:18 a.m.


In the future, Google plans to roll out load-balancing updates to a single data center before deploying them worldwide, but that’s a bit of a tricky process. “The unique nature of load balancing systems makes this more difficult than with other software,” Google said in its report.
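

For readers curious what a one-data-center-first rollout might look like in outline, here is a minimal Python sketch. It is not Google's code: the function `staged_rollout`, its `deploy`, `rollback`, and `is_healthy` callbacks, and the data center names are all hypothetical, invented only to illustrate the idea of trying an update in one place before sending it everywhere.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    data_centers: Sequence[str],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Update one data center first; continue only if it stays healthy.

    All names here are hypothetical. Returns the data centers left on
    the new version, or an empty list if the update was rolled back.
    """
    if not data_centers:
        return []

    # The first site acts as the trial run for the update.
    first, rest = data_centers[0], data_centers[1:]
    deploy(first)
    if not is_healthy(first):
        # Only one site was touched, so only one needs undoing.
        rollback(first)
        return []

    # The trial site looks fine, so update the remaining sites.
    for dc in rest:
        deploy(dc)
    return list(data_centers)
```

In this sketch a bad update touches a single site and is undone there, rather than reaching every data center before anyone notices. It sidesteps the hard part Google alludes to, which is that a load balancer running in just one data center still makes routing decisions about all the others.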


Google also plans to take another look at its internal processes “to ensure more timely updates to Google Apps Status Dashboard.” That’s the place where Google posts updates to customers when it has service outages such as yesterday’s.

