AWS Fault Injection Simulator: New Chaos Engineering as a Service Offering

Once practiced only by tech giants, chaos engineering is now being adopted by a growing number of companies to detect failures within their systems.

Recently, AWS introduced its chaos-engineering-as-a-service offering, AWS Fault Injection Simulator (AWS FIS), built to simplify the process of running chaos experiments in the cloud, as highlighted in this TechCrunch article.


We asked Adrian Cockcroft, AWS VP of cloud architecture strategy, about chaos engineering, the challenges that put people off adopting it, and how AWS’s new tool fits into the space:


How would you define chaos engineering?

Adrian: Experiment to ensure that the impact of failures is mitigated.

The last strand that breaks is not the cause of the failure. Build resilient systems like a rope, not a chain, but make sure you know how much margin you have and how “frayed” your system is.


Why is chaos engineering difficult?

Adrian: The need to test is difficult for a few reasons:

  • Stitching together tools and homemade scripts is hard.
  • You need a bunch of agents and libraries.
  • It’s difficult to ensure that the things you’re doing are safe: you can run a chaos test and actually cause a real-world outage. How do you make sure you do these things safely? The point is that you’re supposed to be testing your margin to absorb failures, not actually tipping the whole system over.
  • It’s difficult to reproduce events.


What separates AWS Fault Injection Simulator from other tools in the space?

Adrian: AWS FIS enables you to run experiments to ensure that both the availability and the performance impact of failures are mitigated. Most of the tools in the space look at just availability. We (AWS) decided we needed to look at performance impact as well.


What are some of the top upsides of AWS FIS?

Adrian: FIS is a fully managed chaos engineering service:

It’s easy to get started:

  • Fully managed service so you don’t have to integrate tools.
  • It uses standard interfaces that people use already when they’re using the cloud.
  • It uses pre-existing experiment templates so you can get started quickly.
  • You can share them with others.

It has real-world conditions:

  • You can run experiments in sequence or in parallel.
  • You can target host infrastructure or the network.
  • These are real faults injected at the service control plane level. If you tried to build this yourself or looked at a third-party tool, it could only operate against the APIs that exist. This tool is able to go beneath the APIs: it works on the hypervisor and the control plane, against internal capabilities that AWS has. It’s giving you an external API that lets you safely target and manage those things.
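
To make the points about sequencing and targeting concrete, here is a minimal Python sketch of what an experiment template might look like as a plain data structure. It is our illustration, not an example supplied by Adrian or AWS. The field names follow the shape of the FIS experiment template API as we understand it; the ARNs, tag values, action names (stopWeb, rebootWorkers), and the start_waves helper are hypothetical. An action that lists another action under startAfter runs after it, while actions without that field start together.

```python
import json


def build_template(role_arn: str, alarm_arn: str, sequential: bool) -> dict:
    """Return an FIS-style experiment template as a plain dict.

    Nothing is sent to AWS; all names and ARNs are placeholders.
    """
    actions = {
        "stopWeb": {
            "actionId": "aws:ec2:stop-instances",
            # Restart the stopped instances after two minutes (ISO 8601 duration).
            "parameters": {"startInstancesAfterDuration": "PT2M"},
            "targets": {"Instances": "webInstances"},
        },
        "rebootWorkers": {
            "actionId": "aws:ec2:reboot-instances",
            "targets": {"Instances": "workerInstances"},
        },
    }
    if sequential:
        # Run the reboot only after the stop action has completed.
        actions["rebootWorkers"]["startAfter"] = ["stopWeb"]
    return {
        "description": "Stop one web instance and reboot half of the workers",
        "roleArn": role_arn,
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "webInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"role": "web"},  # hypothetical tag
                "selectionMode": "COUNT(1)",
            },
            "workerInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"role": "worker"},  # hypothetical tag
                "selectionMode": "PERCENT(50)",
            },
        },
        "actions": actions,
    }


def start_waves(actions: dict) -> list:
    """Group action names into waves based on their startAfter dependencies.

    Illustrative helper only; it does not describe how FIS schedules internally.
    """
    remaining = dict(actions)
    done = set()
    waves = []
    while remaining:
        ready = sorted(
            name
            for name, action in remaining.items()
            if set(action.get("startAfter", [])) <= done
        )
        if not ready:
            raise ValueError("cyclic startAfter dependency")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


if __name__ == "__main__":
    template = build_template(
        "arn:aws:iam::123456789012:role/example-fis-role",
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-alarm",
        sequential=True,
    )
    print(json.dumps(template, indent=2))
    print(start_waves(template["actions"]))
```

With boto3 installed and credentials configured, a dict of this shape could plausibly be passed as keyword arguments to the fis client's create_experiment_template call along with a client token, but check that against the current API reference before relying on it.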

It has built-in safeguards:

  • “Stop condition” alarms.
  • Integration with Amazon CloudWatch for monitoring.
  • Built-in rollbacks.
  • Fine-grained IAM controls, so not everybody can do everything.
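
As a rough illustration of the safeguards listed above, the Python sketch below builds two pieces of configuration: a stop-condition list that ties an experiment to CloudWatch alarms, and an IAM policy document that separates people who may only view experiments from those who may start and stop them. This is our own sketch under stated assumptions, not AWS's recommended setup: the helper names (stop_conditions, fis_policy) and the alarm ARN are hypothetical, and a real policy would normally narrow "Resource" to specific experiment templates.

```python
import json


def stop_conditions(alarm_arns: list) -> list:
    """Build the stopConditions list for an experiment template.

    Refuses to build an empty list so that every experiment has at least
    one alarm able to halt it (a policy choice made for this sketch).
    """
    if not alarm_arns:
        raise ValueError("at least one CloudWatch alarm ARN is required")
    return [{"source": "aws:cloudwatch:alarm", "value": arn} for arn in alarm_arns]


def fis_policy(can_run: bool) -> dict:
    """Return an IAM policy document for viewing or running FIS experiments."""
    actions = [
        "fis:GetExperiment",
        "fis:ListExperiments",
        "fis:GetExperimentTemplate",
        "fis:ListExperimentTemplates",
    ]
    if can_run:
        # Only operators may start or stop experiments.
        actions += ["fis:StartExperiment", "fis:StopExperiment"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # "*" keeps the sketch short; scope this down in real use.
            {"Effect": "Allow", "Action": actions, "Resource": "*"}
        ],
    }


if __name__ == "__main__":
    alarm = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-alarm"
    print(json.dumps(stop_conditions([alarm]), indent=2))
    print(json.dumps(fis_policy(can_run=False), indent=2))
```

Attaching the viewer policy to most of the team and the operator policy to a small group is one way to read "not everybody can do everything".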