Let’s find out how you can keep your production reliable with the help of Chaos Engineering tools.

Chaos engineering is a discipline in which you experiment on your system or application to reveal its weaknesses and the ways it can fail, often ways you did not anticipate while building it. You cause failures on purpose to expose those weaknesses, fix them, and make your system and application more resilient.
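To make the idea concrete, here is a minimal Python sketch of a chaos experiment. It is a toy simulation, not the API of any tool listed below: `FlakyBackend`, `call_with_retry`, and the 95% success threshold are all hypothetical names and values chosen for illustration. The experiment injects faults into a dependency and checks whether a steady-state hypothesis (enough requests still succeed) holds.

```python
import random


class FlakyBackend:
    """Hypothetical dependency whose failure rate we raise on purpose."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def call(self):
        # Fail with probability failure_rate; this is the injected fault.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retry(backend, attempts):
    """The resilience mechanism under test: retry a failed call."""
    for _ in range(attempts):
        try:
            return backend.call()
        except ConnectionError:
            continue
    return None


def run_experiment(failure_rate, attempts, threshold=0.95, requests=1000, seed=42):
    """Inject faults and report (success ratio, whether the hypothesis held)."""
    backend = FlakyBackend(failure_rate, random.Random(seed))
    ok = sum(call_with_retry(backend, attempts) == "ok" for _ in range(requests))
    ratio = ok / requests
    return ratio, ratio >= threshold


if __name__ == "__main__":
    for attempts in (1, 3):
        ratio, held = run_experiment(failure_rate=0.2, attempts=attempts)
        print(f"attempts={attempts} success={ratio:.3f} hypothesis_held={held}")
```

Running it with one attempt versus three shows the kind of weakness an experiment surfaces: without retries, the injected failures push the success ratio below the threshold.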

Many popular organizations like Netflix, LinkedIn, and Facebook practice chaos engineering to better understand their microservices architectures and distributed systems. It helps them find new issues before real users complain and take the necessary action to correct them. That’s how these organizations can serve millions of users, increase their productivity, and save millions of dollars 🤑.

Benefits of Chaos Engineering:

  • Limits revenue loss by finding critical issues early
  • Reduces system and application failures
  • Improves user experience through less disruption and higher service availability
  • Helps you learn about the system and gain confidence in it

How confident are you about your production reliability? Is it really disaster-proof?

Let’s find out with the help of the following popular chaos testing tools.

Chaos Mesh

Chaos Mesh is a chaos engineering management solution that injects faults into every layer of a Kubernetes system, including pods, the network, system I/O, and the kernel. It can automatically kill Kubernetes pods, simulate latency, disrupt pod-to-pod communication, and simulate read/write errors. You can schedule the experiments and define their scope. Experiments are specified in YAML files.
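As a rough illustration of what such an experiment definition contains, the Python sketch below builds a pod-kill manifest as a dictionary and prints it as JSON, which `kubectl apply -f` accepts in the same way as YAML. The field names follow the `PodChaos` examples in the Chaos Mesh documentation as far as I know, so check them against your installed version. The experiment name, namespace, and label are hypothetical.

```python
import json


def pod_kill_experiment(name, namespace, app_label):
    """Build a PodChaos manifest that kills one pod matching the label.

    Field names are assumed from the Chaos Mesh PodChaos examples;
    verify them against the CRD version installed in your cluster.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",  # the fault to inject
            "mode": "one",         # pick a single pod from the matches
            "selector": {          # scope of the experiment
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    manifest = pod_kill_experiment("kill-one-web-pod", "demo", "web")
    print(json.dumps(manifest, indent=2))
```

The selector is what limits the scope: only pods in the given namespace carrying the given label are candidates.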

Chaos Mesh has a dashboard for viewing analytics on experiments. It runs on top of Kubernetes and supports most cloud platforms. It is open source and was recently accepted as a CNCF sandbox project. You can add Chaos Mesh to your DevOps workflow to build resilient applications using chaos engineering principles.

Use Chaos Engineering Tools to Check Production Reliability Cloud Computing Sysadmin

Chaos Mesh features:

  • Easily deployable on Kubernetes clusters with no modification in deployment logic
  • No unique dependencies required for deployment
  • Defines chaos objects using CustomResourceDefinitions (CRD)
  • Provides a dashboard to track all the experiments

Chaos ToolKit

Chaos ToolKit is a simple, open-source tool for automating chaos engineering experiments.

You integrate Chaos ToolKit with your system using a set of drivers or plugins. It supports AWS, Google Cloud, Slack, Prometheus, and more.

Chaos ToolKit features:

  • Provides a declarative, open API for creating chaos experiments independent of vendor or technology
  • Can be easily embedded in CI/CD pipelines for automation
  • Commercial and enterprise support is also available through ChaosIQ
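The declarative format mentioned above is a JSON (or YAML) document with a steady-state hypothesis, a method, and rollbacks. The Python sketch below generates such a document. The structure follows the Chaos Toolkit experiment format as I understand it, while the health-check URL and the command are hypothetical placeholders you would replace with your own.

```python
import json


def build_experiment(service_url, command, arguments):
    """Build a Chaos Toolkit style experiment as a plain dictionary."""
    return {
        "version": "1.0.0",
        "title": "Service stays available when one instance is removed",
        "description": "Illustrative experiment; URL and command are placeholders.",
        # What 'normal' looks like: the probe must return HTTP 200.
        "steady-state-hypothesis": {
            "title": "Service responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "service-is-healthy",
                    "tolerance": 200,
                    "provider": {"type": "http", "url": service_url},
                }
            ],
        },
        # The fault to inject, here by running an external process.
        "method": [
            {
                "type": "action",
                "name": "remove-one-instance",
                "provider": {
                    "type": "process",
                    "path": command,
                    "arguments": arguments,
                },
            }
        ],
        "rollbacks": [],
    }


if __name__ == "__main__":
    experiment = build_experiment(
        "http://localhost:8080/health",  # hypothetical endpoint
        "kubectl",
        "delete pod web-0",              # hypothetical pod name
    )
    print(json.dumps(experiment, indent=2))
```

Saved to a file such as `experiment.json`, a document like this would be executed with `chaos run experiment.json`, which is also how it slots into a CI/CD pipeline step.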

ChaosKube

As you can guess from the name, it is for Kubernetes.

Chaoskube is an open-source chaos tool that periodically kills random pods in a Kubernetes cluster. It helps you understand how your system reacts when a pod fails. By default, it kills a pod in any namespace every 10 minutes. You can filter the target pods by namespace, label, annotation, and so on. It can be easily installed using Helm.
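The selection logic described above can be sketched in a few lines of Python. This is a simulation of the idea only, not Chaoskube's code or its command-line interface: the `Pod` class, `pick_victim`, and the sample pods are hypothetical, and nothing here talks to a real cluster.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical stand-in for a Kubernetes pod."""
    name: str
    namespace: str
    labels: dict = field(default_factory=dict)


def pick_victim(pods, namespaces=None, labels=None, rng=random):
    """Return one random pod that passes the filters, or None if none match."""
    wanted = labels or {}
    candidates = [
        pod
        for pod in pods
        if (namespaces is None or pod.namespace in namespaces)
        and all(pod.labels.get(key) == value for key, value in wanted.items())
    ]
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    cluster = [
        Pod("web-1", "shop", {"app": "web"}),
        Pod("web-2", "shop", {"app": "web"}),
        Pod("cart-1", "shop", {"app": "cart"}),
        Pod("dns-1", "kube-system", {"app": "dns"}),
    ]
    # A real tool would repeat this on a timer (every 10 minutes by default
    # in Chaoskube) and delete the chosen pod through the Kubernetes API.
    victim = pick_victim(cluster, namespaces={"shop"}, labels={"app": "web"})
    print("would terminate:", victim.name)
```

Filtering before choosing is what keeps the blast radius under control: system namespaces or critical workloads can be excluded from the candidate list entirely.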

Chaos Monkey

Chaos Monkey checks the resilience of cloud systems by deliberately causing failures and observing how the systems react. Netflix created it to test the resiliency and recoverability of its AWS infrastructure. The name comes from the image of a wild, armed monkey let loose to wreak havoc on your systems.

Chaos Monkey also gave birth to the practice of chaos engineering itself. It was built on the principle that failing often in small, controlled ways is better than being caught off guard by one significant failure.

Chaos Monkey features:

  • Helps you prepare for random instance failures
  • Encourages redundancy to withstand unexpected failures
  • Uses Spinnaker to enable cross-cloud compatibility
  • Provides a configurable schedule for simulating failures
  • Integrates with govendor for adding new dependencies to Chaos Monkey

Simmy

Simmy is a fault-injection chaos tool that integrates with the Polly resilience project for .NET. It lets you create chaos-injection policies through Polly and execute your code within them. It offers different policies, such as an exception policy to inject exceptions into the system and a behavior policy to inject any new behavior. These policies are designed to inject the fault randomly.
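Simmy itself is a .NET library, so the sketch below is only a Python analogue of the idea behind an exception-injection policy, not Simmy's API. The decorator name `inject_exception` and its parameters are hypothetical. It wraps a function, raises a chosen exception on a random fraction of calls, and can be switched off through a callable, which is one simple way to control where chaos is active.

```python
import random
from functools import wraps


def inject_exception(fault_type, message, injection_rate,
                     enabled=lambda: True, rng=random):
    """Decorator that raises fault_type on roughly injection_rate of calls.

    injection_rate is a probability between 0.0 and 1.0.
    enabled is a callable checked on every call, so chaos can be
    switched off at runtime (for example from configuration).
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if enabled() and rng.random() < injection_rate:
                raise fault_type(message)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_exception(TimeoutError, "injected timeout", injection_rate=0.25)
def fetch_price(item):
    """Hypothetical downstream call that normally succeeds."""
    return {"item": item, "price": 10}


if __name__ == "__main__":
    failures = 0
    for _ in range(1000):
        try:
            fetch_price("book")
        except TimeoutError:
            failures += 1
    print(f"{failures} of 1000 calls failed by injection")
```

Wrapping the call site rather than changing the dependency means the fault can be removed by dropping the decorator or flipping the `enabled` switch.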

Simmy features:

  • Provides Monkey policies or Chaos policies to inject chaos
  • Easy to test any dependency failures
  • Helps you revert quickly to the working model and control the blast radius
  • Production-ready
  • Can also define failures based on external factors (for example, failures driven by global configuration)

Pystol

Pystol is a tool for injecting faults into cloud-native environments. It watches events in etcd through Kubernetes operators. When a fault-injection action is executed, the operators create the pods and run Ansible collections, so developers do not need to write their own actions.

Pystol provides ready-made actions to test the system. Still, if a developer wants to create a new action, it can be done using Go and Python.

It provides a continuous integration dashboard that summarizes all the job operations. You can run Pystol locally or deploy it in a container using its Docker image. Pystol offers two interfaces: a Web UI and a CLI. The Web UI is the more convenient of the two.

Muxy

Muxy is a proxy for testing your resilience and fault-tolerance patterns against real-world distributed system failures. It can tamper with traffic at the transport level (layer 4), the TCP session level (layer 5), and the HTTP protocol level (layer 7).

Muxy features:

  • Modular architecture that is easily extensible
  • Has an official Docker container
  • Easy to install, with no dependencies required
  • Ideal for continuous testing of resilience
  • Simulates network connectivity issues for distributed systems and mobile devices

Pumba

Pumba is a command-line tool that performs chaos testing for Docker containers. With Pumba, you deliberately crash the Docker containers running your application to see how the system reacts. You can also stress-test container resources such as CPU, memory, file system, and input/output.

You can also run Pumba on a Kubernetes cluster. You have to use DaemonSets to deploy Pumba on Kubernetes nodes. You can use multiple Pumba containers to run multiple Pumba commands in the same DaemonSet.

ChaosBlade

ChaosBlade is an open-source fault-injection tool from Alibaba. It draws on the failures Alibaba has faced over the last ten years and applies best practices to avoid them. It follows chaos engineering principles to check the fault tolerance of distributed systems.

ChaosBlade features:

  • Provides experimental scenarios for multiple resources such as CPU, network, memory, disk, etc.
  • Provides experimental scenarios for nodes, networks, and pods on the Kubernetes platform
  • Provides easy-to-use CLI commands to execute experiments

Litmus

Litmus follows cloud-native chaos engineering principles. Its mission is to deliver a complete framework for finding weaknesses in your Kubernetes systems and in the applications you run on Kubernetes.

It has a chaos operator and CRDs (CustomResourceDefinitions) built around it, which give it plug-and-play capability. You put your chaos logic into a Docker image, hand it to the Litmus framework, and the CRDs orchestrate it.

Litmus features:

  • Helps Site Reliability engineers and developers to find weaknesses in the Kubernetes system
  • Provides ready-to-use generic experiments
  • Provides Chaos API for chaos workflow management
  • Litmus SDK supports Go, Python, and Ansible to create your own experiments.

Gremlin

Gremlin helps engineers build more resilient software. It provides a platform to run chaos engineering experiments in a safe, secure, and straightforward way.

You can thoughtfully inject failure into hosts or containers with Gremlin regardless of where they run, whether that’s the public cloud or your own data center.

Gremlin features:

  • Installs a lightweight agent on your hosts or containers to inject failures
  • Provides 10 different infrastructure attack modes
  • State gremlins let you manipulate system time, shut down or restart hosts, and kill processes.
  • Network gremlins can inject latency, introduce packet loss, or drop traffic.
  • Gremlin’s Alfi library attacks can be configured, started, and stopped via the web app, API, or CLI.
  • Allows you to target the blast radius you want to attack precisely
  • Allows you to halt all attacks and roll the system back to a steady-state

Steadybit

Steadybit aims to proactively reduce downtime and provide visibility into system issues. You can run it on your own infrastructure or use it in the cloud as a service (SaaS).

To use Steadybit, you define the situation, simulate the experiments, execute them in production, and then automate them. It runs intelligent agents on your system to discover potential issues and weaknesses, and it integrates easily with multiple systems.

Conclusion

Go ahead and be brave enough to apply chaos engineering principles and test your production with the tools mentioned above. They will help you find unidentified weaknesses in your system and make it more resilient.