An Introduction to Chaos Engineering for Testers

One of the things I’ve been looking at recently is applying the principles of Chaos Engineering as part of testing my projects. It’s something that’s actually easy to do and can turn up some interesting results.

Chaos Engineering is the discipline of experimenting on your system to see how it handles turbulent conditions in production. The term was coined by Netflix, who wanted to know whether their systems could handle infrastructure failures, network failures, and application failures.

Basically, Chaos Engineering asks: what would happen if this thing failed?

Fig 1. DeeDee would be great at Chaos Engineering.

So what have I been doing?

I’ve been working on some greenfield prototype systems that have never been touched by users or deployed in the field before. We don’t know how they’ll respond to, or recover from, unexpected faults, so I’ve been running some Chaos Testing.

Network Throttling

  • Do requests from the front end just not make it to the back end?
  • Do web sockets recover after a network drop?
  • Does data get corrupted if there’s packet loss?
  • Do requests just plain old time out?

In the real world, network conditions might not be stable (especially on Wi-Fi and mobile carriers), so we need to know what happens when the network drops. Frequently this risk hasn’t been considered, so connections may never be re-established after a drop, or timeouts may not respect realistic network performance.
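To make that concrete, here’s a minimal sketch of the kind of timeout and drop handling we’re probing for, assuming the Python requests package and a hypothetical local endpoint (substitute your own):

# A minimal sketch, assuming `pip install requests`.
# The endpoint is hypothetical - point it at your own front end or API.
import requests

try:
    # (connect timeout, read timeout) - are these realistic for mobile networks?
    response = requests.get("http://localhost:3000/api/health", timeout=(3.05, 10))
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("Request timed out - does the UI tell the user, or hang forever?")
except requests.exceptions.ConnectionError:
    print("Connection dropped - is it re-established, or silently lost?")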

For UI-to-back-end traffic, we can use the Chrome browser dev tools to throttle the network. Head to the Network tab and find the throttling dropdown. We can use this to simulate mobile network conditions or even put the UI into offline mode.

Fig 2. Chrome’s network throttling dropdown.
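If you want to drive that same throttle from an automated check, Selenium’s Chrome driver exposes it programmatically. A minimal sketch, assuming Selenium 4, a local chromedriver, and a hypothetical app URL:

# A minimal sketch, assuming Selenium 4 with Chrome/chromedriver installed.
# The URL below is a hypothetical app under test.
from selenium import webdriver

driver = webdriver.Chrome()

# Same capability as the dev tools dropdown: simulate a slow, laggy connection
driver.set_network_conditions(
    offline=False,
    latency=500,                     # additional round-trip latency in ms
    download_throughput=250 * 1024,  # bytes per second
    upload_throughput=250 * 1024,    # bytes per second
)

driver.get("http://localhost:3000")  # does the UI still load and fail gracefully?

# Flip to offline mode to see how the front end copes with no network at all
driver.set_network_conditions(offline=True, latency=0,
                              download_throughput=0, upload_throughput=0)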

For back-end services, we can use tools like Fiddler to throttle and kill network connections across locally running back-end services. This lets us see what would happen if performance between our deployed services dropped and they had to recover.

Fig 3. Fiddler’s performance rules.

Alternatively, a personal favourite of mine is the tool Clumsy (Windows only), which allows you to simulate network latency, packet loss, and network throttling. I’ve used this to uncover issues with web sockets not recovering after a dropped connection, and to see at what point packet loss becomes a real problem for mobile apps.

Fig 4. Clumsy is an awesome network testing tool.
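While Clumsy (or Fiddler) is degrading the connection, a small watcher script makes drops and recovery visible. A minimal sketch, assuming the Python websockets package and a hypothetical endpoint under test:

# A minimal sketch, assuming `pip install websockets`.
# The URI is hypothetical - substitute your own web socket endpoint.
import asyncio
import websockets

async def watch_socket():
    uri = "ws://localhost:8080/updates"
    while True:  # run until interrupted with Ctrl+C
        try:
            async with websockets.connect(uri) as ws:
                print("connected")
                async for message in ws:
                    print("received:", message)
        except (websockets.exceptions.ConnectionClosed, OSError):
            # Clumsy has killed the connection - does our real client
            # reconnect like this, or does it silently stay dead?
            print("connection dropped, retrying in 2s...")
            await asyncio.sleep(2)

asyncio.run(watch_socket())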

Killing Services / Pods

  • What happens after a DDoS attack?
  • Do we have service failover and backups?
  • Can we recover from error states and failures?

Even with the best of intentions, sometimes our services might fail or error. This could be because of a hardware fault, a crash, or a memory issue. We need to understand what happens if our back-end services fail and have to recover at different points in the system’s usage.

If your system is deployed to Kubernetes then it’s easy to simulate service failures: we can delete a service’s pod and Kubernetes will automatically recreate it.

First we need to get a list of the pods available by running the command kubectl get pods, then after selecting one we can run the command kubectl delete pod <pod name> to force that service to fail and recover.
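If you’d rather script this as part of an automated chaos check, the official Kubernetes Python client can do the same thing. A minimal sketch, assuming pip install kubernetes, a working local kubeconfig, the default namespace, and a hypothetical pod name:

# A minimal sketch using the official kubernetes Python client.
# The pod name below is hypothetical - pick one from the listing.
from kubernetes import client, config

config.load_kube_config()  # reads your local kubeconfig
v1 = client.CoreV1Api()

# Equivalent of `kubectl get pods`
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name)

# Equivalent of `kubectl delete pod <pod name>` - forces a fail-and-recover
v1.delete_namespaced_pod(name="my-service-5d4f9c7b8-x2k4j", namespace="default")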

In a system that isn’t deployed on Kubernetes we may have to find ways to kill services manually. We can stop their processes, move them to different folders, or even delete the code while the service is running to see what happens. Work with your engineering team to see in what ways you can force failures, faults, and recovery.
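One crude but effective way to do that is to kill the service’s process directly. A minimal sketch using the Python psutil package, where "my-service" is a hypothetical process name to substitute with your own:

# A minimal sketch, assuming `pip install psutil`.
# "my-service" is a hypothetical process name - substitute your own.
import psutil

for proc in psutil.process_iter(["name"]):
    if proc.info["name"] == "my-service":
        print(f"Killing pid {proc.pid} - does anything notice and recover?")
        proc.kill()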

Specific Chaos Tooling

The best-known tool for Chaos Engineering is Chaos Monkey, which you can set up to randomly terminate services in your deployed system. This can be used to test systems in production (or pre-production deployments).

However, Chaos Monkey has no UI and can be fiddly to set up; it may be easier for you to manually shut things down or fail them using the methods above instead.

Fig 5. Chaos reigns on site, can we recover?
