Netflix is celebrating its 14th year of streaming, along with unprecedented subscriber growth, brought on in part by the global pandemic, that carried it past the 200m mark by the end of 2020. Yet through all this added pressure, it has remained an incredibly stable platform, able to meet the needs of its millions of users.
How do they manage? Part of the credit goes directly to chaos engineering.
From Chaos Monkeys to 200m+ Subscribers
We reached out to some of the experts who have led engineering efforts at Netflix for their input on how chaos engineering has continued to support Netflix’s wild success over the years.
“Chaos engineering was deliberately created at Netflix as a proactive discipline to understand and navigate complex systems,” says Casey Rosenthal, former chaos engineering manager at Netflix and co-founder and CEO at Verica. “It was born both out of Netflix’s pain moving from the datacenter to the cloud, and then from navigating the inherent uncertainty of building increasingly complex systems in the cloud.”
Netflix didn’t have a great track record for availability in those early days. Its stellar reputation today owes a great deal to its adoption of chaos engineering.
Casey explained that the key to this was the notion that chaos engineering is a form of experimentation on your systems, not simple testing. Experiments can tell you considerably more than a pass/fail test; they create new knowledge. We use experimentation all the time to deal with other complex systems, such as in medicine. Clinical trials are a form of experimentation on complex systems (humans) that simply cannot be left to testing in a lab.
Experiments propose a hypothesis, and as long as the hypothesis is not disproven, confidence grows in that hypothesis. If it is disproven, then you learn something new. This kicks off an inquiry to figure out why your hypothesis is wrong. In a complex system, the reason why something happens is often not obvious. Experimentation either builds confidence, or it teaches you new properties about your own system.
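To make the hypothesis-driven loop concrete, here is a minimal sketch in Python. It is a toy illustration, not Netflix's tooling: the names `run_experiment`, `make_service`, and `ExperimentResult`, the 5% tolerance, and the "primary with fallback" service are all assumptions invented for this example. The hypothesis is that the success rate stays within tolerance of the baseline when the primary dependency fails.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis_held: bool  # True if the steady state survived the fault
    baseline: float        # success rate of the control group
    observed: float        # success rate with the fault injected


def success_rate(call: Callable[[], bool], trials: int) -> float:
    """Return the fraction of calls that succeed."""
    return sum(call() for _ in range(trials)) / trials


def run_experiment(
    control: Callable[[], bool],
    with_fault: Callable[[], bool],
    tolerance: float = 0.05,  # hypothetical threshold for "steady state"
    trials: int = 1000,
) -> ExperimentResult:
    """Compare a control group against a group with a fault injected."""
    baseline = success_rate(control, trials)
    observed = success_rate(with_fault, trials)
    # The hypothesis holds if the fault costs no more than `tolerance`.
    return ExperimentResult(baseline - observed <= tolerance, baseline, observed)


def make_service(primary_up: bool, has_fallback: bool) -> Callable[[], bool]:
    """Toy service: a request succeeds via the primary, else via a fallback."""
    def handle_request() -> bool:
        if primary_up:
            return True
        return has_fallback
    return handle_request


# Hypothesis not disproven: the fallback absorbs the failure, confidence grows.
resilient = run_experiment(make_service(True, True), make_service(False, True))

# Hypothesis disproven: no fallback exists, which kicks off an inquiry.
fragile = run_experiment(make_service(True, False), make_service(False, False))
```

In this sketch `resilient.hypothesis_held` is true and `fragile.hypothesis_held` is false. The second outcome is the more interesting one, because a disproven hypothesis is where you learn something new about the system.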
“Better comprehension of systemic effects leads to better engineering in distributed systems, which improves reliability. Through years of consistent experimentation, Netflix has come to learn a great deal about its systems, gaining confidence in them even as they grow more complex by the day,” shared Casey. “Chaos engineering is now an integral part of Netflix’s engineering culture, supporting high velocity, experimentation, and confidence in teams and systems through empirical verification.”
You can learn how to bring this kind of stability to your own company systems straight from the top practitioners using chaos engineering at…
The event will cast a light on what chaos engineering is along with when and why you should consider it. We’ll move beyond the oft-used tropes of breaking things and instead show you how the brightest minds in chaos engineering are using the practice to learn and gain expertise at scale about their critical complex systems every day.