Deployment Modes in LitmusChaos

Introduction

Chaos engineering is steadily pervading the stages of the microservices development life cycle. Different personas (from app developers and SREs to DevOps functions maintaining CI pipelines and service owners) use it to ensure resilience, albeit with contextual differences. Chaos frameworks are therefore expected to operate in different modes that account for the permissions, security constraints & allowed blast radius these personas operate with.

In this blog, we discuss the deployment/operational modes within Litmus that take these personas into consideration, and provide steps to install it in a specific mode using helm.

Note: It is assumed that readers have tried out litmus chaos experiments and are aware of the chaos operator, chaos CRDs & the chaoshub. The subsequent sections are easier to follow with this knowledge.

Modes of Execution in LitmusChaos

The essential nature of the litmus chaos operator (its ability to orchestrate chaos) and of a given experiment (injecting a specific fault) remains the same across modes. What differs is their scope (cluster-wide vs. namespaced) and impact. As you may have guessed already, this corresponds to the way RBAC is set up for the litmus components and to the operator's watch range.

Admin Mode

Persona: An SRE or a cluster-admin with complete autonomy over the cluster who wants to centralize chaos operations and avoid placing the chaos components in multiple target namespaces. They are also the executor of the chaos, so they pull the experiment CRs from the hub, then tune & run them.

Operational Characteristics: The chaos operator is installed in a central admin ns (typically litmus) along with a superset cluster-wide RBAC that can execute all supported chaos experiments, including those against node resources. The operator is set up to watch for ChaosEngine CRs created in the same admin namespace, though the target application can reside in a different namespace (specified by .spec.appinfo.appns in the ChaosEngine CR). This ensures that all the chaos resources (chaos experiment CRs, runner/experiment/helper pods, chaosresult CRs) are maintained within the same admin namespace.
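
To make the split between the engine's namespace and the target's namespace concrete, here is a minimal sketch of an admin-mode ChaosEngine, written to a file from the shell. The engine name, the target namespace default, the app=nginx label, and the litmus-admin service account name are assumptions for illustration; substitute whatever exists on your cluster.

```shell
# Write a sample admin-mode ChaosEngine manifest (all names are illustrative).
cat > chaosengine-admin.yaml <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos                  # hypothetical engine name
  namespace: litmus                  # the engine lives in the admin namespace
spec:
  appinfo:
    appns: "default"                 # namespace of the target application (assumed)
    applabel: "app=nginx"            # label selecting the target (assumed)
    appkind: "deployment"
  annotationCheck: "false"           # do not require a chaos annotation on the target
  engineState: "active"
  chaosServiceAccount: litmus-admin  # assumed name of the admin service account
  experiments:
    - name: pod-delete
EOF

# Apply it once the operator and the experiment CRs are installed:
# kubectl apply -f chaosengine-admin.yaml
```

With a manifest like this, the runner, experiment, and helper pods as well as the ChaosResult would be expected in litmus, while the fault itself is injected into the application in default.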

Steps to Deploy: The litmus helm chart exposes a flag in values.yaml to specify the mode: operatorMode, which can be set to admin during install.

kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
# Helm 3 requires a release name; "litmus" here is an arbitrary choice
helm install litmus litmuschaos/litmus --version 1.7.0 --namespace litmus --set operatorMode=admin
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.7.0?file=charts/generic/experiments.yaml" -n litmus

Standard Mode

Persona: Chaos execution is delegated to service owners in staging/prod environments with varying degrees of permissions, while the operator continues to reside in a central/admin namespace. In this mode, the SRE/cluster-admin is expected to install the litmus infra components (operator, CRDs) beforehand and set up the operator to watch ChaosEngine CRs across namespaces. Individual service owners can then create chaos resources (ChaosExperiment CRs installed from the hub, ChaosEngine CRs) in the (app/service) namespaces where they have access.

Operational Characteristics: While technically quite similar to the admin mode (the operator continues to remain in a central ns & orchestrate chaos across apps), here the executor of chaos is different: the service owners are expected to use their own ChaosServiceAccounts (or the recommended per-experiment RBAC available on the chaoshub) for chaos execution. The chaos pods (runner/experiment/helper) are also created in the service namespaces, which gives better visibility and makes debugging easier.
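
As a rough illustration of what a service owner's own ChaosServiceAccount could look like, the sketch below writes a namespaced ServiceAccount, Role, and RoleBinding to a file. It is not the RBAC published on the chaoshub: the pod-delete-sa name, the default namespace, and the exact rule set are assumptions, and the hub's per-experiment RBAC should be preferred where available.

```shell
# Write a sample namespaced RBAC bundle for a service owner (illustrative only).
cat > pod-delete-rbac.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa        # hypothetical service account name
  namespace: default         # the service owner's namespace (assumed)
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: default
rules:
  # Assumed rule set: manage pods/jobs for the experiment and update chaos CRs.
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["create", "get", "list", "watch", "patch", "update", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: default
EOF

# Apply in the service namespace:
# kubectl apply -f pod-delete-rbac.yaml
```

The service owner would then reference this account through chaosServiceAccount in a ChaosEngine created in the same namespace. Because a Role rather than a ClusterRole is used, the permissions stay confined to that namespace.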

Steps to Deploy: The operatorMode should be set to standard during install.

kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
# Helm 3 requires a release name; "litmus" here is an arbitrary choice
helm install litmus litmuschaos/litmus --version 1.7.0 --namespace litmus --set operatorMode=standard
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.7.0?file=charts/generic/experiments.yaml" -n <service/app-namespace>

Namespaced Mode

Persona: All chaos operations (infra setup, orchestration, and chaos execution) are managed by developers/DevOps engineers in their respective namespaces in a strictly multi-tenant environment. Here the persona doesn't have access to cluster-wide resources like nodes and CRDs, and is generally restricted from running with elevated privileges, mounting hostpath/file volumes, etc. This mode of operation is especially useful in SaaS-based environments that provide Kubernetes namespaces for use, for example, Okteto Cloud.

Operational Characteristics: In this mode, the operator is installed in the user's namespace and is set up to target only the applications it co-resides with, i.e., those in the same namespace. The operator likewise expects the chaos CRs (ChaosExperiment & ChaosEngine) and creates the chaos pods (runner/experiment/helper) in that same namespace. However, namespaced mode requires the custom resource definitions (CRDs) to be pre-installed on the cluster by the admin before users attempt the operator installation.
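
Since a namespaced user typically cannot install or even read CRD objects, one way to confirm that the admin has pre-installed them is to ask the API server which resources it serves for the litmuschaos.io group. The following is a minimal sketch; it assumes kubectl is configured for the target cluster and that API discovery is permitted for the user.

```shell
# CRDs the operator relies on, in <plural>.<group> form.
required_crds="chaosengines.litmuschaos.io chaosexperiments.litmuschaos.io chaosresults.litmuschaos.io"

# List the resources served for the litmuschaos.io API group (empty on failure).
found="$(kubectl api-resources --api-group=litmuschaos.io -o name 2>/dev/null || true)"

missing=""
for crd in $required_crds; do
  # grep -x matches the whole line, so partial names do not count.
  if ! printf '%s\n' "$found" | grep -qx "$crd"; then
    missing="$missing $crd"
  fi
done

if [ -n "$missing" ]; then
  echo "Missing CRDs (ask the cluster admin to install them):$missing"
else
  echo "All LitmusChaos CRDs are served; the operator can be installed with --skip-crds"
fi
```

If any CRD is reported missing, the helm install in this mode would leave the operator without the resource types it watches, so it is worth running a check like this first.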

Steps to Deploy: The operatorMode should be set to namespaced during install with the --skip-crds flag used to ensure CRD install is not attempted.

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
# Helm 3 requires a release name; "litmus" here is an arbitrary choice
helm install litmus litmuschaos/litmus --version 1.7.0 --set operatorMode=namespaced --namespace <developer-namespace> --skip-crds

Note: In the namespaced mode, only pod-level chaos experiments can be executed.

Conclusion

As of today, all three modes see sizeable usage in the litmus community, with standard being the most commonly used one (which is the reason it is set as the default mode). Having said that, this is one aspect of the framework that we expect to keep changing, considering the rapid improvements in the Kubernetes ecosystem around role-based access and security, and the constant evolution of deployment practices. For now, though, we would love to get your feedback on the current options and which one you prefer most!

Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community on Slack for detailed discussions & regular updates on Chaos Engineering for Kubernetes.

Check out the LitmusChaos GitHub repo and do share your feedback. Submit a pull request if you identify any necessary changes.