A previous post introduced LitmusChaos as a cloud-native chaos engineering framework that provides both native, off-the-shelf chaos experiments and the ability to orchestrate experiments written in BYOC (bring-your-own-chaos) mode. You may also have tried your hand at the quick Litmus demo. Exciting as that already is, we have seen one more usage pattern evolve in the Litmus community: Chaos Workflows. Does this sound like wordplay on two popular dev(git)ops practices? Let me explain in detail.

Is it sufficient to just inject a failure?

One of the common reasons for injecting chaos (or, as it is commonly known, running a chaos experiment) in a microservices environment is to validate one’s hypothesis about system behavior under an unexpected failure. Today, this is a well-established practice, with a multitude of chaos injection tools built for the container (read: Kubernetes) ecosystem, enabling SREs to verify resilience in pre-production and production environments.

However, when simulating real-world failures via chaos injection on development/staging environments as part of a left-shifted, continuous validation strategy, it is preferable to construct a potential failure sequence, or chaos workflow, rather than execute standalone chaos injection actions. Often, this translates into failures under a certain workload condition (say, a given percentage load), multiple (parallel) failures of dependent and independent services, failures under (already) degraded infrastructure, and so on. The observations and inferences from these exercises are invaluable in determining the overall resilience of the applications/microservices in question.

LitmusChaos + Argo = Chaos Workflows

While this is already practiced in some form, manually, by developers & SREs via gamedays and similar methodologies, there is a need to automate this, thereby enabling repetition of these complex workflows with different variables (maybe a product fix, a change to deployment environment, etc.). One of the early adopters of the Litmus project, Intuit, used the container-native workflow engine, Argo, to execute their chaos experiments (in BYOC mode via chaostoolkit) orchestrated by LitmusChaos to achieve precisely this. The community recognized this as an extremely useful pattern, thereby giving rise to Chaos Workflows.

Using Chaos Workflows as an aid for benchmark tests

In this blog, let's look at one use case of chaos workflows. We shall examine how chaos impacts an Nginx server's performance characteristics, using a workflow that executes a standard benchmark job with a pod-kill chaos operation running in parallel.

Prepare the Chaos Environment

In the next few sections, we shall lay the base for executing this workflow by setting up the infrastructure components.

Install Argo Workflow Infrastructure

The Argo workflow infrastructure consists of the Argo workflow CRDs, the Workflow Controller, the associated RBAC, and the Argo CLI. The steps below install Argo in the standard cluster-wide mode, where the workflow controller operates on all namespaces. Ensure that you have the permissions required to create these resources.

Create argo namespace

root@demo:~/chaos-workflows# kubectl create ns argo
namespace/argo created

Create the CRDs, workflow controller deployment with associated RBAC

root@demo:~/chaos-workflows# kubectl apply -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/install.yaml -n argo

customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/argo created
serviceaccount/argo-server created
role.rbac.authorization.k8s.io/argo-role created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-admin configured
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-edit configured
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-view configured
clusterrole.rbac.authorization.k8s.io/argo-cluster-role configured
clusterrole.rbac.authorization.k8s.io/argo-server-cluster-role configured
rolebinding.rbac.authorization.k8s.io/argo-binding created
clusterrolebinding.rbac.authorization.k8s.io/argo-binding unchanged
clusterrolebinding.rbac.authorization.k8s.io/argo-server-binding unchanged
configmap/workflow-controller-configmap created
service/argo-server created
service/workflow-controller-metrics created
deployment.apps/argo-server created
deployment.apps/workflow-controller created
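
Before moving on, it can help to confirm that both Argo deployments have finished rolling out. Here is a minimal sketch, assuming the deployment names shown in the output above (argo-server and workflow-controller). The function name and the 120s timeout are illustrative choices, not part of the Argo install instructions.

```shell
# Hypothetical helper: block until both Argo deployments report a completed rollout.
# Deployment names come from the install output above; the timeout is an assumption.
wait_for_argo() {
  argo_ns="${1:-argo}"
  for argo_deploy in argo-server workflow-controller; do
    kubectl rollout status "deployment/${argo_deploy}" -n "${argo_ns}" --timeout=120s || return 1
  done
}

# Usage: wait_for_argo argo
```

The function returns a non-zero status as soon as one rollout fails or times out, so it can gate the remaining setup steps in a script.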

Install the argo CLI on the test harness machine (where the kubeconfig is available)

root@demo:~# curl -sLO https://github.com/argoproj/argo/releases/download/v2.8.0/argo-linux-amd64

root@demo:~# chmod +x argo-linux-amd64

root@demo:~# mv ./argo-linux-amd64 /usr/local/bin/argo

root@demo:~# argo version
argo: v2.8.0
BuildDate: 2020-05-11T22:55:16Z
GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
GitTreeState: clean
GitTag: v2.8.0
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64

Install a Sample Application: Nginx

Install a simple multi-replica, stateless Nginx deployment with its service exposed over a NodePort.

root@demo:~# kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/App/nginx.yaml

deployment.extensions/nginx created
root@demo:~# kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/App/service.yaml 
service/nginx created

Install Litmus Infrastructure

Apply the LitmusChaos Operator manifest:

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.5.0.yaml

Install the litmus-admin service account to be used by the chaos-operator while executing the experiment.

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml

Install the Chaos experiment of choice (in this example, we pick a pod-delete experiment)

kubectl apply -f "https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/pod-delete/experiment.yaml" -n litmus
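
A quick sanity check at this point is to confirm that the Litmus pods are listed and that the pod-delete experiment resource exists. This is a sketch: the function name is hypothetical, and it assumes the Litmus resources live in the litmus namespace, as in the rest of this walkthrough.

```shell
# Hypothetical helper: verify the Litmus namespace has pods and that the
# pod-delete ChaosExperiment was installed. Namespace defaults to "litmus".
check_litmus() {
  litmus_ns="${1:-litmus}"
  kubectl get pods -n "${litmus_ns}" || return 1
  kubectl get chaosexperiment pod-delete -n "${litmus_ns}" || return 1
}

# Usage: check_litmus litmus
```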

Create the Argo Access ServiceAccount

Create the service account and associated RBAC that the Argo workflow controller will use to execute the actions specified in the workflow. In our case, these are launching the Nginx benchmark job and creating the ChaosEngine that triggers the pod-delete chaos action. In our example, we place the service account in the namespace where the Litmus chaos resources reside, i.e., litmus.

root@demo:~# kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/Argo/argo-access.yaml -n litmus

serviceaccount/argo-chaos created
clusterrole.rbac.authorization.k8s.io/chaos-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/chaos-cluster-role-binding created
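
To check that the new service account can actually perform the two actions the workflow needs, you can ask the API server with `kubectl auth can-i`. This is a sketch: the function name is hypothetical, and it assumes the benchmark job is created in the default namespace and the ChaosEngine in litmus.

```shell
# Hypothetical helper: ask the API server whether the argo-chaos service account
# may create the benchmark Job and the ChaosEngine.
# Assumption: Job goes to "default", ChaosEngine goes to "litmus".
check_argo_chaos_rbac() {
  chaos_sa="system:serviceaccount:litmus:argo-chaos"
  kubectl auth can-i create jobs.batch --as="${chaos_sa}" -n default || return 1
  kubectl auth can-i create chaosengines.litmuschaos.io --as="${chaos_sa}" -n litmus || return 1
}

# Usage: check_argo_chaos_rbac
```

`kubectl auth can-i` prints yes or no and exits non-zero on no, so the function fails on the first missing permission.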

Nginx traffic characteristics during a non-chaotic benchmark run

Before proceeding with the chaos workflow, let us first look at how the benchmark run performs under normal circumstances and which properties are worth noting.

To achieve this:

Let us run a simple Kubernetes job that internally executes an apache-bench test on the Nginx service, configured for up to 10000000 requests with a 300s time limit. The run stops at whichever limit is reached first; here, that is the time limit.

root@demo:~# kubectl create -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/App/nginx-bench.yaml

job.batch/nginx-bench-c9m42 created

Observe the output after the 5-minute run and note the failed request count. Usually it is 0, i.e., there was no disruption in Nginx traffic.

root@demo:~# kubectl logs -f nginx-bench-zq689-6mnrm

2020/06/23 01:42:29 Running: ab -r -c10 -t300 -n 10000000 http://nginx.default.svc.cluster.local:80/
2020/06/23 01:47:35 This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking nginx.default.svc.cluster.local (be patient)
Finished 808584 requests


Server Software:        nginx/1.19.0
Server Hostname:        nginx.default.svc.cluster.local
Server Port:            80

Document Path:          /
Document Length:        612 bytes

Concurrency Level:      10
Time taken for tests:   300.001 seconds
Complete requests:      808584
Failed requests:        0
Total transferred:      683259395 bytes
HTML transferred:       494857692 bytes
Requests per second:    2695.27 [#/sec] (mean)
Time per request:       3.710 [ms] (mean)
Time per request:       0.371 [ms] (mean, across all concurrent requests)
Transfer rate:          2224.14 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.7      0      25
Processing:     0    3   2.0      3      28
Waiting:        0    3   1.9      2      28
Total:          0    4   2.2      3      33
WARNING: The median and mean for the initial connection time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%      3
  66%      4
  75%      5
  80%      5
  90%      7
  95%      8
  98%      9
  99%     11
 100%     33 (longest request)
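
Since we will compare this run against a chaotic one, it is convenient to pull just the headline numbers out of the ApacheBench report. This is a small sketch using awk; the function name and the output format are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: read ApacheBench output on stdin and print the
# completed-request count, failed-request count, and mean requests per second.
ab_summary() {
  awk -F': *' '
    /^Complete requests:/   { complete = $2 }
    /^Failed requests:/     { failed = $2 }
    /^Requests per second:/ { split($2, parts, " "); rps = parts[1] }
    END { printf "complete=%s failed=%s rps=%s\n", complete, failed, rps }
  '
}

# Usage: kubectl logs <benchmark-pod> | ab_summary
```

For the baseline run above, this prints `complete=808584 failed=0 rps=2695.27`.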

Formulating a Hypothesis

Typically, in most production deployments, the Nginx service is set up to guarantee specific SLAs in terms of tolerated errors and so on. While the server may perform as expected under normal circumstances, it is also necessary to gauge how much degradation is seen at different levels of failure and what the cascading impact may be on other applications. The results obtained by inducing chaos can give us an idea of how best to manage the deployment (improved high-availability configuration, resources allocated, replica counts, etc.) so that it continues to meet the SLA despite a certain degree of failure. That is an interesting topic for another day; we shall restrict the scope of this blog to demonstrating how workflows can be used.

In the next step, we shall execute a chaos workflow that runs the same benchmark job while a random pod-delete (Nginx replica failure) occurs, and observe the degradation in the attribute we noted: failed requests.

Create the Chaos Workflow

Applying the workflow manifest performs the following actions in parallel:

Starts an Nginx benchmark job for the specified duration (300s)
Triggers a random pod-kill of the Nginx replica by creating the chaosengine CR. Cleans up after chaos.
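
To make the parallel structure concrete, here is a simplified sketch of what such a workflow can look like. It is not the contents of argowf-native-pod-delete.yaml. The template names, the benchmark image (httpd:2.4-alpine, assumed to ship the ab binary), the ChaosEngine name, and the app=nginx label are all assumptions made for illustration, and the cleanup step is omitted.

```shell
# Write an illustrative workflow to a local file (a sketch, not the real manifest).
# Assumptions: template names, benchmark image, ChaosEngine name, app label.
cat > chaos-workflow-sketch.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: argowf-chaos-
spec:
  entrypoint: bench-with-chaos
  serviceAccountName: argo-chaos
  templates:
  - name: bench-with-chaos
    steps:
    # Items in the same inner list are started in parallel.
    - - name: run-benchmark
        template: run-benchmark
      - name: run-chaos
        template: run-chaos

  # Creates the benchmark Job and waits for it to succeed.
  - name: run-benchmark
    resource:
      action: create
      successCondition: status.succeeded > 0
      manifest: |
        apiVersion: batch/v1
        kind: Job
        metadata:
          generateName: nginx-bench-
          namespace: default
        spec:
          backoffLimit: 0
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: ab
                image: httpd:2.4-alpine
                command: ["ab", "-r", "-c10", "-t300", "-n", "10000000", "http://nginx.default.svc.cluster.local:80/"]

  # Creates the ChaosEngine that triggers the pod-delete experiment.
  - name: run-chaos
    resource:
      action: create
      manifest: |
        apiVersion: litmuschaos.io/v1alpha1
        kind: ChaosEngine
        metadata:
          name: nginx-chaos
          namespace: litmus
        spec:
          appinfo:
            appns: default
            applabel: app=nginx
            appkind: deployment
          annotationCheck: "false"
          engineState: active
          chaosServiceAccount: litmus-admin
          experiments:
          - name: pod-delete
EOF
```

The key design point is the nested list under `steps`: entries in the same inner list run side by side, which is what lets the benchmark and the chaos injection overlap.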

root@demo:~# argo submit https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/Argo/argowf-native-pod-delete.yaml -n litmus
Name:                argowf-chaos-sl2cn
Namespace:           litmus
ServiceAccount:      argo-chaos
Status:              Pending
Created:             Fri May 15 15:31:45 +0000 (now)
Parameters:
  appNamespace:      default
  adminModeNamespace: litmus
  appLabel:          nginx

Visualize the Chaos Workflow

You can visualize the progress of the chaos workflow via the Argo UI. Convert the argo-server service to type NodePort & view the dashboard at https://<node-ip>:<nodeport>

root@demo:~# kubectl patch svc argo-server -n argo -p '{"spec": {"type": "NodePort"}}'
service/argo-server patched
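
Once the service is patched, the assigned port can be read back from the service spec. This is a sketch: the function name is hypothetical, and you supply the node IP yourself.

```shell
# Hypothetical helper: print the Argo UI URL for a given node IP by reading
# the nodePort assigned to the argo-server service.
argo_ui_url() {
  ui_node_ip="$1"
  ui_port="$(kubectl get svc argo-server -n argo -o jsonpath='{.spec.ports[0].nodePort}')" || return 1
  echo "https://${ui_node_ip}:${ui_port}"
}

# Usage: argo_ui_url <node-ip>
```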

Observe the Nginx benchmark results

Observing the Nginx benchmark results over 300s with a single random pod kill shows an increased count of failed requests.

root@demo:~# kubectl logs -f nginx-bench-7pnvv

2020/06/23 07:00:34 Running: ab -r -c10 -t300 -n 10000000 http://nginx.default.svc.cluster.local:80/
2020/06/23 07:05:37 This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking nginx.default.svc.cluster.local (be patient)
Finished 802719 requests


Server Software:        nginx/1.19.0
Server Hostname:        nginx.default.svc.cluster.local
Server Port:            80

Document Path:          /
Document Length:        612 bytes

Concurrency Level:      10
Time taken for tests:   300.000 seconds
Complete requests:      802719
Failed requests:        866
   (Connect: 0, Receive: 289, Length: 289, Exceptions: 288)
Total transferred:      678053350 bytes
HTML transferred:       491087160 bytes
Requests per second:    2675.73 [#/sec] (mean)
Time per request:       3.737 [ms] (mean)
Time per request:       0.374 [ms] (mean, across all concurrent requests)
Transfer rate:          2207.20 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0  11.3      0    3044
Processing:     0    3  57.2      3   16198
Waiting:        0    3  54.2      2   16198
Total:          0    4  58.3      3   16199

Percentage of the requests served within a certain time (ms)
  50%      3
  66%      4
  75%      4
  80%      5
  90%      6
  95%      7
  98%      9
  99%     11
 100%  16199 (longest request)
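
To put the failed-request count in context, it helps to express it as a percentage of completed requests and compare it with a tolerated error rate. Here is a small sketch using awk. The function names and the thresholds in the usage lines are hypothetical, since this post does not state an SLA.

```shell
# Hypothetical helper: percentage of failed requests, to three decimal places.
# Args: <failed requests> <complete requests>
failure_rate_pct() {
  awk -v failed="$1" -v total="$2" \
    'BEGIN { if (total == 0) exit 2; printf "%.3f\n", (failed / total) * 100 }'
}

# Hypothetical helper: exit 0 when the failure rate is within the tolerated percentage.
# Args: <failed requests> <complete requests> <max tolerated percent>
within_error_budget() {
  awk -v failed="$1" -v total="$2" -v max="$3" \
    'BEGIN { if (total == 0) exit 2; exit !((failed / total) * 100 <= max) }'
}

# Usage with the numbers from the chaos run above:
#   failure_rate_pct 866 802719          # prints 0.108
#   within_error_budget 866 802719 0.5   # succeeds for an assumed 0.5% budget
#   within_error_budget 866 802719 0.1   # fails for an assumed 0.1% budget
```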

Further iterations of these tests with increased pod-kill instances over the benchmark period or an increased kill count (i.e., number of replicas killed at a time) can give more insights about the behavior of the service, in turn leading us to the mitigation procedures.

Note: To test with different variables, edit the ChaosEngine spec in the workflow manifest before re-submission.
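
As an illustration of the kind of variables you might change, the sketch below prints an `experiments` fragment for the ChaosEngine with a configurable chaos duration, interval, and kill count. The function name is hypothetical, and the env names (TOTAL_CHAOS_DURATION, CHAOS_INTERVAL, KILL_COUNT) are recalled from the Litmus 1.x pod-delete documentation, so treat them as assumptions and confirm them against the installed ChaosExperiment before relying on them.

```shell
# Hypothetical helper: print a ChaosEngine "experiments" fragment for pod-delete.
# Args: <total chaos duration (s)> <interval between kills (s)> <kill count>
# Assumption: env names as recalled from the Litmus 1.x pod-delete docs.
pod_delete_experiment() {
  cat <<EOF
experiments:
- name: pod-delete
  spec:
    components:
      env:
      - name: TOTAL_CHAOS_DURATION
        value: "$1"
      - name: CHAOS_INTERVAL
        value: "$2"
      - name: KILL_COUNT
        value: "$3"
EOF
}

# Usage: pod_delete_experiment 60 10 2
```

The usage line would describe a 60s chaos window with a kill every 10s, taking down two replicas at a time.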

Conclusion

You can use Argo with LitmusChaos to construct complex chaos workflows, with pre-conditioning & dependencies built-in. The parallel nature of execution can help you simulate multi-service/component failures to verify application behavior under worst-case scenarios. You can even sew in recovery procedures based on error conditions.

Do try this out & let us know what kind of workflows you would like to see being built within litmus!

Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?

Join our community on Slack for detailed discussion, feedback, and regular updates on chaos engineering for Kubernetes (#litmus channel on the Kubernetes workspace): https://kubernetes.slack.com/messages/CNXNB0ZTN

Check out the Litmus Chaos GitHub repo, share your feedback, and submit a pull request if you identify any necessary changes: https://github.com/litmuschaos/litmus

{% github litmuschaos/litmus %}