Netflix has completed moving its streaming-video operations onto Amazon Web Services after a seven-plus-year journey, and the massive project holds valuable lessons for any enterprise contemplating or undergoing a large-scale move to the cloud.
The company announced the milestone in a blog post this week that also provided plenty of background detail on why and how it all went down. Here are some of the highlights:
Our journey to the cloud at Netflix began in August of 2008, when we experienced a major database corruption and for three days could not ship DVDs to our members. That is when we realized that we had to move away from vertically scaled single points of failure, like relational databases in our datacenter, towards highly reliable, horizontally scalable, distributed systems in the cloud. We chose Amazon Web Services (AWS) as our cloud provider because it provided us with the greatest scale and the broadest set of services and features. The majority of our systems, including all customer-facing services, had been migrated to the cloud prior to 2015.
Since then, we've been taking the time necessary to figure out a secure and durable cloud path for our billing infrastructure as well as all aspects of our customer and employee data management. We are happy to report that in early January, 2016, after seven years of diligent effort, we have finally completed our cloud migration and shut down the last remaining data center bits used by our streaming service!
Netflix now accounts for more than a third of all North American Internet traffic during prime viewing hours. Overall viewing also grew by three orders of magnitude during the past eight years, according to the blog.
At the same time, Netflix's product changed dramatically in breadth and sophistication, which made a traditional data center strategy even less feasible:
Supporting such rapid growth would have been extremely difficult out of our own data centers; we simply could not have racked the servers fast enough. Elasticity of the cloud allows us to add thousands of virtual servers and petabytes of storage within minutes, making such an expansion possible.
Netflix recently launched in another 130 countries, and to support that expansion it is taking advantage of AWS' many regions around the world.
Keeping Those Streams Going
Availability is arguably the most important service metric for Netflix and its customers. Most viewers will put up with a lower-resolution stream of their favorite show, but "service not available" is anathema to binge-watchers everywhere. Netflix has managed to reach four nines of service uptime through deliberate architectural decisions and disaster recovery methods:
Failures are unavoidable in any large scale distributed system, including a cloud-based one. However, the cloud allows one to build highly reliable services out of fundamentally unreliable but redundant components. By incorporating the principles of redundancy and graceful degradation in our architecture, and being disciplined about regular production drills using Simian Army, it is possible to survive failures in the cloud infrastructure and within our own systems without impacting the member experience.
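To put "four nines" and the redundancy argument into numbers, here is a minimal arithmetic sketch. The function names and the 99 percent per-component figure are illustrative assumptions, not values from the Netflix blog, and the redundancy formula assumes components fail independently, which real systems only approximate.

```python
# Illustrative arithmetic only; the figures are assumptions, not Netflix data.
MINUTES_PER_YEAR = 365 * 24 * 60  # minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(component_availability: float, replicas: int) -> float:
    """Availability of a service that stays up while at least one of
    `replicas` independent, identical components is up."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    budget = downtime_minutes_per_year(0.9999)
    print(f"Four nines leaves about {budget:.1f} minutes of downtime per year")
    for n in (1, 2, 3):
        print(f"{n} replica(s) at 99% each: {redundant_availability(0.99, n):.6f}")
```

Under these assumptions, four nines allows a little under an hour of downtime per year, and two independent 99 percent components already reach that level on paper, which is the sense in which unreliable but redundant parts can add up to a reliable service.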
Chaos Monkey is one member of the Simian Army, Netflix's name for a collection of tools it built around system uptime and resiliency. The tool randomly shuts down production instances at Netflix, letting engineers examine the root causes of any resulting failure in a controlled manner.
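The idea behind random instance termination can be sketched in a few lines. This is a toy simulation, not Netflix's Chaos Monkey code or its API: the `Instance`, `ServiceGroup`, and `unleash_monkey` names are hypothetical, and nothing here touches real cloud infrastructure.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    """A hypothetical stand-in for one virtual server."""
    instance_id: str
    healthy: bool = True


@dataclass
class ServiceGroup:
    """A hypothetical pool of redundant instances behind one service."""
    name: str
    instances: List[Instance] = field(default_factory=list)

    def healthy_instances(self) -> List[Instance]:
        return [i for i in self.instances if i.healthy]

    def can_serve(self) -> bool:
        # The service keeps answering as long as one instance survives.
        return len(self.healthy_instances()) > 0


def unleash_monkey(group: ServiceGroup, rng: random.Random,
                   probability: float = 0.5) -> Optional[Instance]:
    """With the given probability, mark one random healthy instance as down.

    Returns the terminated instance, or None if nothing was terminated.
    """
    candidates = group.healthy_instances()
    if not candidates or rng.random() >= probability:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


if __name__ == "__main__":
    group = ServiceGroup("playback", [Instance(f"i-{n}") for n in range(3)])
    victim = unleash_monkey(group, random.Random(), probability=1.0)
    print("terminated:", victim.instance_id if victim else None)
    print("still serving:", group.can_serve())
```

The point of the drill is the check at the end: if the service cannot keep serving after losing one instance, that weakness surfaces during a controlled exercise rather than during a real outage.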
The Seven-Year Itch
For all the benefits of moving to the cloud—touted by both customers and providers—for Netflix the journey was not an easy one. It was also transformative:
Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data center and drop them in AWS. But in doing so, you end up moving all the problems and limitations of the data center along with it. Instead, we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company.
Instead of a single, monolithic application, Netflix today is composed of hundreds of microservices, the blog notes. Netflix also adopted a continuous delivery development model and empowered developers to make independent decisions, according to the blog. Many new systems had to be built, and new skills learned.
Analysis: Taking Measure of Netflix's Move to AWS
One of the most interesting aspects of the Netflix-AWS relationship is the fact that the companies battle so fiercely in the streaming video market, with Amazon offering its own Prime Video service.
Beyond the scope and intricacy of Netflix's AWS migration lies an important statement about that rivalry, says Constellation Research VP and principal analyst Holger Mueller.
"The 21st century has rewritten many of the rules of how businesses operate," he says. "While a 20th century rule was to never help the competitor, the cooperation and 'coopetition' between Netflix and AWS shows it has advantages to do so. It's like the renaissance of famous economist Ricardo's principles of comparative advantages from 200-plus years ago—rediscovered in the cloud."