Without Observability, there is no Chaos Engineering. I picked this line from a nice article on chaos, and I couldn't agree more. The very nature of voluntary chaos injection demands that we have the right monitoring aids to validate our experiment's hypothesis about application/microservice behavior. Typically, that can be mapped to the Four Golden Signals: latency, traffic, errors, and saturation.
Having said that, observability in itself has many facets, just as chaos engineering does. Chaos not only helps test resiliency in terms of service availability (HA); it is also a means to refine alerting and notification mechanisms, streamline the incident response structure, and measure key performance indicators (KPIs). These include the mean time to detect an anomaly (MTTD), the mean time to recovery, say, to optimal performance (MTTR), and sometimes even the time to resolve (another MTTR!), either via self-heal or manual effort, in cases where the chaos experiment is deliberately executed with a high blast radius. The other facet of observability that I mentioned earlier is tooling: there are several tools one could employ today to obtain and visualize this data. Some can even help with automated Root Cause Analysis. Check out this cool demo by folks from Zebrium, which demonstrates automated detection of incidents induced via Litmus Chaos experiments.
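To make the KPI arithmetic concrete, here is a minimal sketch of how detection and recovery times could be derived from timestamps around a chaos run. The timestamps, the helper names `elapsed` and `mean`, and the choice to measure recovery from the moment of injection are all assumptions for illustration; teams define these KPIs in slightly different ways.

```shell
#!/usr/bin/env bash
# All timestamps below are made-up epoch seconds, for illustration only.

# elapsed START END: seconds between two epoch timestamps.
elapsed() {
  echo $(( $2 - $1 ))
}

# mean N1 N2 ...: integer mean of the given values (rounded down).
mean() {
  local sum=0 count=0 value
  for value in "$@"; do
    sum=$(( sum + value ))
    count=$(( count + 1 ))
  done
  [ "$count" -gt 0 ] || return 1
  echo $(( sum / count ))
}

# One hypothetical chaos run.
chaos_injected=1600000000     # fault injected
alert_fired=1600000045        # first alert raised
service_recovered=1600000130  # metrics back within baseline

time_to_detect=$(elapsed "$chaos_injected" "$alert_fired")
time_to_recover=$(elapsed "$chaos_injected" "$service_recovered")
echo "time to detect:  ${time_to_detect}s"
echo "time to recover: ${time_to_recover}s"

# MTTD across three hypothetical runs of the same experiment.
echo "MTTD: $(mean "$time_to_detect" 60 30)s"
```

Repeating the same experiment several times and averaging is what turns a single measurement into a "mean time" figure.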
While there is a lot to discuss and learn about the whys & hows of observability with chaos engineering, in this blog we shall get started with a simple means of mapping application behavior to chaos activity, i.e., find a way to juxtapose application metrics with chaos events. To do that, we will make use of the de facto open-source monitoring stack of Prometheus & Grafana. This is intended to get you rocking on your chaos observability journey, which will get more exciting with continuous enhancements being added into the LitmusChaos framework.
What better than the sock-shop demo application to learn about microservices behavior? A quick set of commands should get you started. A Kubernetes cluster is all you need!
- Obtain the demo artefacts
git clone https://github.com/litmuschaos/chaos-observability.git
cd chaos-observability/sample-application/sock-shop
- Set up the Sock-Shop Microservices Application
kubectl create ns sock-shop
kubectl apply -f deploy/sock-shop/
- Verify that the sock-shop microservices are running
kubectl get pods -n sock-shop
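Rather than re-running the command above until everything shows Running, the check could be scripted. This is a sketch, not part of the original walkthrough: `wait_for_pods` is a hypothetical helper name and the 300s default timeout is an arbitrary assumption. It relies on `kubectl wait`, which blocks until the given condition holds or the timeout expires.

```shell
#!/usr/bin/env bash
# wait_for_pods NAMESPACE [TIMEOUT]
# Block until every pod in the namespace reports the Ready condition,
# or fail once the timeout (default 300s, an arbitrary choice) expires.
wait_for_pods() {
  local ns="$1" timeout="${2:-300s}"
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}
```

Calling `wait_for_pods sock-shop` returns a non-zero exit code if any pod is still not Ready when the timeout is reached, which makes it easy to gate the next steps in a script.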
- Set up the LitmusChaos Infrastructure
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.6.0.yaml
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.6.0?file=charts/generic/experiments.yaml" -n litmus
The LitmusChaos framework emits Kubernetes events against the ChaosEngine & ChaosResult custom resources at every stage, from the pre-chaos validation checks through chaos injection to the post-chaos health checks, so that the course of a chaos experiment can be traced. Converting these events into metrics is a great way to integrate with existing off-the-shelf application dashboards and gain a clear understanding of application behavior through chaos injection and revert actions.
In this exercise, we make use of Heptio's event router to convert the chaos events into metrics and then instrument the standard sock-shop application's Grafana dashboard with appropriate queries to achieve our goal.
Set up the Monitoring Infrastructure
- Step-1: Let's set up the event router with the HTTP sink to convert the kube cluster events into metrics.
kubectl apply -f deploy/litmus-metrics/01-event-router-cm.yaml
kubectl apply -f deploy/litmus-metrics/02-event-router.yaml
- Step-2: We will set up Prometheus & Grafana deployments with NodePort services (you could change them to LoadBalancer if you prefer)
kubectl apply -f deploy/monitoring/01-monitoring-ns.yaml
kubectl apply -f deploy/monitoring/02-prometheus-rbac.yaml
kubectl apply -f deploy/monitoring/03-prometheus-configmap.yaml
kubectl apply -f deploy/monitoring/04-prometheus-alert-rules.yaml
kubectl apply -f deploy/monitoring/05-prometheus-deployment.yaml
kubectl apply -f deploy/monitoring/06-prometheus-svc.yaml
kubectl apply -f deploy/monitoring/07-grafana-deployment.yaml
kubectl apply -f deploy/monitoring/08-grafana-svc.yaml
- Step-3: Access the Grafana dashboard via the NodePort (or LoadBalancer) service IP
kubectl get svc -n monitoring
Note: To change the service type to LoadBalancer, run kubectl edit svc prometheus -n monitoring and change the type: field from NodePort to LoadBalancer.
Default username/password credentials: admin/admin
- Step-4: Add the Prometheus datasource for Grafana via the Grafana Settings menu
- Step-5: Import the Grafana dashboard "Sock-Shop Performance" provided here
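For the service type change mentioned in the Step-3 note, a non-interactive alternative to `kubectl edit` is `kubectl patch`. The sketch below wraps it in a hypothetical helper, `set_service_type`; the helper name is an assumption, and whether a LoadBalancer service actually receives an external IP depends on your cluster provider.

```shell
#!/usr/bin/env bash
# set_service_type SERVICE NAMESPACE TYPE
# Patch the spec.type field of a Service (e.g. NodePort -> LoadBalancer)
# without opening an editor.
set_service_type() {
  local svc="$1" ns="$2" type="$3"
  kubectl patch svc "$svc" -n "$ns" -p "{\"spec\":{\"type\":\"${type}\"}}"
}
```

For example, `set_service_type prometheus monitoring LoadBalancer` applies the same change as the manual edit, after which `kubectl get svc -n monitoring` shows the new type.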
Execute the Chaos Experiments
For the sake of illustration, let us execute a CPU hog experiment on the catalogue microservice & a memory hog experiment on the orders microservice in a staggered manner.
kubectl apply -f chaos/catalogue/catalogue-cpu-hog.yaml
Wait for ~60s
kubectl apply -f chaos/orders/orders-memory-hog.yaml
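The staggered launch above can also be scripted so the delay is applied consistently between runs. This is only a sketch: `apply_staggered` is a hypothetical helper, and the 60 second gap simply mirrors the wait suggested in this post.

```shell
#!/usr/bin/env bash
# apply_staggered DELAY_SECONDS MANIFEST...
# Apply each manifest in order, sleeping DELAY_SECONDS between
# consecutive manifests (no sleep before the first one).
apply_staggered() {
  local delay="$1"
  shift
  local first=1 manifest
  for manifest in "$@"; do
    if [ "$first" -eq 0 ]; then
      sleep "$delay"
    fi
    kubectl apply -f "$manifest"
    first=0
  done
}
```

Running `apply_staggered 60 chaos/catalogue/catalogue-cpu-hog.yaml chaos/orders/orders-memory-hog.yaml` reproduces the two commands above with the pause in between.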
Verify execution of chaos experiments
kubectl describe chaosengine catalogue-cpu-hog -n litmus
kubectl describe chaosengine orders-memory-hog -n litmus
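Beyond describing the ChaosEngine, the experiment verdict can be read from the ChaosResult resource. The sketch below rests on assumptions you should verify against your Litmus version: that the ChaosResult is named `<engine>-<experiment>`, that the verdict lives at `.status.experimentstatus.verdict`, and that the experiments used here are called `pod-cpu-hog` and `pod-memory-hog`. `chaos_verdict` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# chaos_verdict ENGINE EXPERIMENT [NAMESPACE]
# Print the verdict recorded in the ChaosResult for the given engine and
# experiment. Assumes the result is named "<engine>-<experiment>" and that
# the verdict is stored under .status.experimentstatus.verdict.
chaos_verdict() {
  local engine="$1" experiment="$2" ns="${3:-litmus}"
  kubectl get chaosresult "${engine}-${experiment}" -n "$ns" \
    -o jsonpath='{.status.experimentstatus.verdict}'
}
```

Under those assumptions, `chaos_verdict catalogue-cpu-hog pod-cpu-hog` would print the verdict once the experiment has finished. If the command reports that the resource is not found, `kubectl get chaosresult -n litmus` lists the actual names.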
Visualize Chaos Impact
Observe the impact of chaos injection through increased latency & reduced QPS (queries per second) on the microservices under test.
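To check the chaos event metrics outside Grafana, you can query Prometheus directly through its HTTP API (`/api/v1/query`). The sketch below is hedged on several points: the metric name `heptio_eventrouter_normal_total`, its label names, and the `ChaosInject` event reason are assumptions about what the event router and Litmus expose, so confirm them in the Prometheus UI first. Both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# chaos_event_query ENGINE [REASON]
# Build a PromQL selector for events recorded against a ChaosEngine.
# The metric and label names are assumptions about the event router's output.
chaos_event_query() {
  local engine="$1" reason="${2:-ChaosInject}"
  printf 'heptio_eventrouter_normal_total{involved_object_kind="ChaosEngine",involved_object_name="%s",reason="%s"}' \
    "$engine" "$reason"
}

# query_prometheus BASE_URL PROMQL
# Run an instant query against the Prometheus HTTP API and print the JSON reply.
query_prometheus() {
  local base_url="$1" promql="$2"
  curl -sG "${base_url}/api/v1/query" --data-urlencode "query=${promql}"
}
```

A call such as `query_prometheus "http://<node-ip>:<node-port>" "$(chaos_event_query catalogue-cpu-hog)"` should return a sample once the experiment has injected chaos, with the address taken from `kubectl get svc -n monitoring`. The same selector can be used in a Grafana panel query to overlay chaos events on the application graphs.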
As you can see, this is an attempt to correlate application stats with the failure injected, and hence a good starting point in your chaos monitoring journey. Try this out & share your feedback! A lot more can be packed into the dashboards to make the visualization more intuitive. Join us in this effort and be part of SIG-Observability within LitmusChaos!
Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?
Join our community on Slack for detailed discussion, feedback & regular updates on Chaos Engineering for Kubernetes: https://kubernetes.slack.com/messages/CNXNB0ZTN (#litmus channel on the Kubernetes workspace)
Check out the Litmus Chaos GitHub repo and do share your feedback: https://github.com/litmuschaos/litmus
Submit a pull request if you identify any necessary changes.