What is Chaos Engineering?

Chaos engineering is the practice of intentionally causing failures, under controlled conditions, in a production or pre-production environment to understand their impact and to plan a stronger defense posture and incident management strategy.

Every day brings a new opportunity for an organization’s critical applications or infrastructure to fail, potentially threatening its ability to deliver services to customers. Failures have many possible causes, including security breaches, misconfigurations and service disruptions. The likelihood of errors or disruptions can rise as more applications and data are hosted in the cloud, which can also increase security issues.

One way to address disruptions is chaos engineering. It is not a random process in which engineers terminate instances or services, or otherwise cause systems to fail, without any purpose. Instead, it is a deliberate process that identifies potential future issues, allowing engineering teams to solve problems proactively and avoid them in the live environment later.

Chaos engineering is important because an error or disruption can stall an organization’s momentum, forcing teams to spend precious time working out a solution on the fly while downtime increases. Netflix learned this lesson firsthand when it switched from on-premises infrastructure to the cloud¹: in 2008, it experienced an outage that interrupted service delivery for three days.

This outage predated the company’s transformation into a video streaming operation, where the same outage would have been far more costly. As a result, Netflix decided to do everything possible to minimize disruptions and began to introduce chaos engineering into its workflows. This practice allows the company to identify issues before they happen and to minimize the damage if and when an unavoidable failure occurs.

Netflix created Chaos Monkey², an open source tool that causes random incidents in IT services and infrastructure to expose weaknesses that can then be fixed or addressed through automatic recovery procedures. The company implemented Chaos Monkey when it moved from a private data center to Amazon Web Services (AWS), in response to unreliability in the cloud. Many organizations now use Chaos Monkey to run their chaos engineering experiments.
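The idea of pairing random failures with automatic recovery can be illustrated with a small simulation. The sketch below is not Chaos Monkey and does not call its API or any cloud provider. The `Fleet` class, its `reconcile` method and the `terminate_random_instance` helper are hypothetical names invented for this illustration, and the "instances" are only strings held in memory.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a group of service instances."""
    desired: int
    instances: set = field(default_factory=set)
    _next_id: int = 0

    def launch(self) -> str:
        # Simulate starting a new instance by registering a fresh name.
        name = f"i-{self._next_id:03d}"
        self._next_id += 1
        self.instances.add(name)
        return name

    def reconcile(self) -> int:
        """Automatic recovery: launch replacements until the desired size is met."""
        launched = 0
        while len(self.instances) < self.desired:
            self.launch()
            launched += 1
        return launched


def terminate_random_instance(fleet: Fleet, rng: random.Random) -> str:
    """Fault injection: remove one running instance chosen at random."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.instances.remove(victim)
    return victim


def run_experiment(fleet: Fleet, rounds: int, seed: int = 0) -> list:
    """Repeatedly inject a failure, then record whether recovery restored the fleet."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    log = []
    for _ in range(rounds):
        victim = terminate_random_instance(fleet, rng)
        replaced = fleet.reconcile()
        log.append((victim, replaced, len(fleet.instances)))
    return log


fleet = Fleet(desired=3)
fleet.reconcile()  # bring the simulated fleet up to its desired size
results = run_experiment(fleet, rounds=5)
for victim, replaced, size in results:
    print(f"terminated {victim}, launched {replaced} replacement(s), fleet size {size}")
```

In this toy setup the fleet returns to its desired size after every injected failure. In a real experiment, a round in which recovery did not restore the expected state would be the weakness the experiment is meant to surface.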

Chaos engineering is an important defense against infrastructure failures, outages or missing components in an organization’s production environment. It helps site reliability engineers (SREs) and other members of the DevOps team deliver services continuously by avoiding significant disruptions. Chaos engineering helps them understand their vulnerabilities better and shows them how to minimize the impact if a disruption occurs.

Even a small issue in code can have a catastrophic effect on the overall production environment because of the dependencies between programs. For instance, an error in a financial services firm’s transaction software can result in the loss of millions of dollars³.
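To make the point about dependencies concrete, the following sketch computes which services are affected when one component fails. The service names and the `DEPENDS_ON` map are hypothetical, and the `blast_radius` function is a plain breadth-first traversal written for this illustration rather than part of any chaos engineering tool.

```python
from collections import deque

# Hypothetical service graph: each service lists what it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": ["database"],
    "ledger": ["database"],
    "search": ["catalog"],
    "catalog": [],
    "database": [],
}


def blast_radius(failed: str, depends_on: dict) -> set:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


impacted = blast_radius("database", DEPENDS_ON)
print(sorted(impacted))
```

In this made-up graph, a failure in `database` reaches four other services, while a failure in `catalog` reaches only `search`. A map like this can help a team decide where an experiment is likely to teach the most.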

Organizations might be unable to avoid all IT incidents, but they can minimize the damage by using chaos engineering to understand likely scenarios and their best possible solutions.

Footnotes

¹ Chaos Engineering: System Resiliency in Practice, Casey Rosenthal, Nora Jones, 2020
² What is Chaos Monkey? Chaos engineering explained, InfoWorld, 13 May 2020
³ Knight Capital Says Trading Glitch Cost It $440 Million, New York Times, 2012
⁴ There Is No Resilience without Chaos, The New Stack, 13 Apr 2023
⁵ Incident Management in the Cloud Era, Constellation Research, 2023
⁶ ChAP: Chaos Automation Platform, Netflix Blog, 26 July 2017
⁷ The I&O Leader’s Guide to Chaos Engineering, Gartner, 28 October 2021