Chaos Faults for Kubernetes
Introduction
Kubernetes faults disrupt the resources running on a Kubernetes cluster. They can be categorized into pod-level faults and node-level faults.
Docker service kill
Docker service kill makes the application unreachable on account of the node turning unschedulable (NotReady).
Kubelet service kill
Kubelet service kill makes the application unreachable on account of the node turning unschedulable (NotReady).
Node CPU hog
Node CPU hog exhausts the CPU resources on a Kubernetes node.
Node drain
Node drain drains the node of all the resources running on it.
Node IO stress
Node IO stress causes I/O stress on the Kubernetes node.
Node memory hog
Node memory hog causes memory resource exhaustion on the Kubernetes node.
Docker service kill
Docker service kill makes the application unreachable on account of the node turning unschedulable (NotReady).
- The Docker service is stopped (or killed) on a node to make it unschedulable for a specific duration defined by the TOTAL_CHAOS_DURATION environment variable (see the sketch below).
- The application node returns to its normal state and services resume after the chaos duration.
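The exact wiring depends on the chaos tooling in use. As a rough illustration, here is a minimal LitmusChaos-style ChaosEngine sketch, assuming the fault is registered as docker-service-kill and that a TARGET_NODE tunable selects the node; the node name, namespace, and service account are placeholders:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-chaos
  namespace: litmus
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin        # assumed service account with node-level permissions
  experiments:
    - name: docker-service-kill
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # chaos window in seconds
              value: "120"
            - name: TARGET_NODE            # hypothetical tunable: node to make unschedulable
              value: "worker-node-1"
```

The same skeleton applies to the other node-level faults; only the experiment name and the env entries change.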
Use cases
Kubelet service kill
Kubelet service kill makes the application unreachable on account of the node turning unschedulable (NotReady).
- The Kubelet service is stopped (or killed) on a node to make it unschedulable for a specific duration defined by the TOTAL_CHAOS_DURATION environment variable.
- The application node returns to its normal state and services resume after the chaos duration.
Use cases
Node CPU hog
Node CPU hog exhausts the CPU resources on a Kubernetes node. The CPU chaos is injected using a helper pod running the Linux stress tool (a workload generator). The chaos affects the application for a period defined by the TOTAL_CHAOS_DURATION environment variable.
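As a sketch, the experiment entry could look like the following, reusing the ChaosEngine skeleton shown under Docker service kill; NODE_CPU_CORE is an assumed tunable for the number of cores to load:

```yaml
  experiments:
    - name: node-cpu-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # duration of the CPU stress in seconds
              value: "60"
            - name: NODE_CPU_CORE          # assumed tunable: number of cores to stress on the target node
              value: "2"
```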
Use cases
Node drain
Node drain drains the node of all the resources running on it. As a result, services running on the target node should be rescheduled to run on other nodes.
Use cases
Node IO stress
Node IO stress causes I/O stress on the Kubernetes node. The amount of I/O stress is specified either as a percentage of the total free space available on the file system, using the FILESYSTEM_UTILIZATION_PERCENTAGE environment variable, or in gigabytes (GB), using the FILESYSTEM_UTILIZATION_BYTES environment variable. When both values are provided, FILESYSTEM_UTILIZATION_PERCENTAGE takes precedence. It tests application resiliency against replica evictions that occur due to I/O stress on the available disk space.
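A minimal experiment-entry sketch, again reusing the node-level ChaosEngine skeleton above; the values are illustrative:

```yaml
  experiments:
    - name: node-io-stress
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION               # duration of the I/O stress in seconds
              value: "120"
            - name: FILESYSTEM_UTILIZATION_PERCENTAGE  # percentage of free file system space to consume
              value: "10"
```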
Use cases
Node memory hog
Node memory hog causes memory resource exhaustion on the Kubernetes node. It is injected using a helper pod running the Linux stress-ng tool (a workload generator). The chaos affects the application for a duration specified by the TOTAL_CHAOS_DURATION environment variable.
Use cases
Node restart
Node restart disrupts the state of the node by restarting it. It tests deployment sanity (replica availability and uninterrupted service) and recovery workflows of the application pod.
Use cases
Node taint
Node taint taints (contaminates) the node by applying the desired effect. Only the resources that have the corresponding tolerations can bypass the taint.
Use cases
Container kill
Container kill is a Kubernetes pod-level chaos fault that causes container failure on specific (or random) replicas of an application resource.
- It tests an application's deployment sanity (replica availability and uninterrupted service) and recovery workflow.
- It tests the recovery of pods that possess sidecar containers.
Use cases
Disk fill
Disk fill is a Kubernetes pod-level chaos fault that applies disk stress by filling the pod's ephemeral storage on a node.
- It evicts the application pod if the pod's ephemeral storage consumption exceeds its ephemeral storage limit.
- It tests the ephemeral storage limits and ensures that the parameters are sufficient.
- It evaluates the application's resilience to disk stress (or replica) evictions.
Use cases
Pod autoscaler
Pod autoscaler is a Kubernetes pod-level chaos fault that determines whether nodes can accommodate multiple replicas of a given application pod.
- It examines the node auto-scaling feature by determining whether the pods are successfully rescheduled within a specified time frame when the existing nodes are running at their specified limits.
Use cases
Pod CPU hog exec
Pod CPU hog exec is a Kubernetes pod-level chaos fault that consumes excess CPU resources of the application container.
- It simulates conditions where the application pods experience CPU spikes due to expected (or undesired) processes, thereby testing the behaviour of the application stack.
Use cases
Pod CPU hog
Pod CPU hog is a Kubernetes pod-level chaos fault that excessively consumes CPU resources, resulting in a significant increase in the CPU resource usage of a pod.
- Simulates a situation where an application's CPU resource usage unexpectedly spikes.
Use cases
Pod delete
Pod delete is a Kubernetes pod-level chaos fault that causes specific (or random) replicas of an application resource to fail forcibly (or gracefully).
- It tests an application's deployment sanity (replica availability and uninterrupted service) and recovery workflow.
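For pod-level faults, the ChaosEngine also identifies the application under test. A minimal LitmusChaos-style sketch, assuming the fault is registered as pod-delete; the appinfo selectors are placeholders, and CHAOS_INTERVAL and FORCE are assumed tunables:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-chaos
  namespace: default
spec:
  engineState: "active"
  appinfo:                                 # identifies the application under test
    appns: "default"
    applabel: "app=nginx"                  # placeholder label selector
    appkind: "deployment"
  chaosServiceAccount: litmus-admin        # assumed service account with pod-level permissions
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # overall chaos window in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # assumed tunable: seconds between successive deletions
              value: "10"
            - name: FORCE                  # assumed tunable: "true" for forced, "false" for graceful deletion
              value: "false"
```

The remaining pod-level faults reuse this skeleton and differ only in the experiment name and the env entries.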
Use cases
Pod DNS error
Pod DNS error is a Kubernetes pod-level chaos fault that injects chaos to disrupt DNS resolution in pods.
- It removes access to services by blocking the DNS resolution of host names (or domains).
Use cases
Pod DNS spoof
Pod DNS spoof is a Kubernetes pod-level chaos fault that injects chaos into pods to mimic DNS resolution.
- It resolves DNS target host names (or domains) to other IPs, as specified in the SPOOF_MAP environment variable in the ChaosEngine configuration (see the sketch below).
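A sketch of the experiment entry, reusing the pod-level ChaosEngine skeleton shown under Pod delete; the SPOOF_MAP value format shown here (a JSON map from real host names to spoofed targets) is an assumption based on common LitmusChaos conventions:

```yaml
  experiments:
    - name: pod-dns-spoof
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # chaos window in seconds
              value: "60"
            - name: SPOOF_MAP              # host names to spoof and the targets they should resolve to
              value: '{"example.com": "spoofed.example.internal"}'
```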
Use cases
Pod HTTP latency
Pod HTTP latency is a Kubernetes pod-level chaos fault that injects HTTP response latency by starting a proxy server and redirecting the traffic through it.
- It injects the latency into the service whose port is specified using the TARGET_SERVICE_PORT environment variable (see the sketch below).
- It evaluates the application's resilience to lossy (or flaky) HTTP responses.
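A sketch of the experiment entry, extending the pod-level ChaosEngine skeleton shown under Pod delete; LATENCY (in milliseconds) is an assumed tunable name:

```yaml
  experiments:
    - name: pod-http-latency
      spec:
        components:
          env:
            - name: TARGET_SERVICE_PORT    # port of the service to intercept
              value: "80"
            - name: LATENCY                # assumed tunable: response delay in milliseconds
              value: "2000"
            - name: TOTAL_CHAOS_DURATION   # chaos window in seconds
              value: "60"
```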
Use cases
Pod HTTP modify body
Pod HTTP modify body is a Kubernetes pod-level chaos fault that injects chaos on the service whose port is provided using the TARGET_SERVICE_PORT environment variable.
- This is done by starting a proxy server and redirecting the traffic through it.
- It can overwrite the HTTP response body by providing the new body value in the RESPONSE_BODY environment variable (see the sketch below).
- It can test the application's resilience to erroneous or incorrect HTTP response bodies.
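A sketch of the experiment entry, extending the pod-level ChaosEngine skeleton shown under Pod delete; the body value is illustrative:

```yaml
  experiments:
    - name: pod-http-modify-body
      spec:
        components:
          env:
            - name: TARGET_SERVICE_PORT    # port of the service to intercept
              value: "80"
            - name: RESPONSE_BODY          # new body returned in place of the original response body
              value: '{"error": "injected by chaos experiment"}'
            - name: TOTAL_CHAOS_DURATION   # chaos window in seconds
              value: "60"
```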
Use cases
Pod HTTP modify header
Pod HTTP modify header is a Kubernetes pod-level chaos fault that injects chaos on the service whose port is provided using the TARGET_SERVICE_PORT environment variable.
- This is done by starting a proxy server and redirecting the traffic through it.
- It modifies the headers of requests and responses of the service, which can be used to test the service's resilience to incorrect or incomplete headers.
Use cases
Pod HTTP reset peer
Pod HTTP reset peer is a Kubernetes pod-level chaos fault that injects chaos on the service whose port is specified using the TARGET_SERVICE_PORT environment variable.
- It stops outgoing HTTP requests by resetting the TCP connection. This is done by starting a proxy server and redirecting the traffic through it.
- It can test the application's resilience to a lossy (or flaky) HTTP connection.
Use cases
Pod HTTP status code
Pod HTTP status code is a Kubernetes pod-level fault that injects chaos inside the pod by modifying the status code of the response from the application server to the desired status code provided by the user.
- The port of the target service is specified using the TARGET_SERVICE_PORT environment variable; the chaos is injected by starting a proxy server and redirecting the traffic through it (see the sketch below).
- It tests the application's resilience to error-code HTTP responses from the provided application server.
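A sketch of the experiment entry, extending the pod-level ChaosEngine skeleton shown under Pod delete; STATUS_CODE is an assumed tunable name for the code to return:

```yaml
  experiments:
    - name: pod-http-status-code
      spec:
        components:
          env:
            - name: TARGET_SERVICE_PORT    # port of the service to intercept
              value: "80"
            - name: STATUS_CODE            # assumed tunable: HTTP status code returned to the caller
              value: "500"
            - name: TOTAL_CHAOS_DURATION   # chaos window in seconds
              value: "60"
```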
Use cases
Pod IO stress
Pod I/O stress is a Kubernetes pod-level chaos fault that causes I/O stress on the application pod by spiking the number of input and output requests.
- It aims to verify the resiliency of applications that share the disk resource for ephemeral (or persistent) storage.
Use cases
Pod memory hog exec
Pod memory hog exec is a Kubernetes pod-level chaos fault that consumes memory resources on the application container in megabytes.
- It simulates conditions where application pods experience memory spikes due to expected (or undesired) processes, thereby testing how the overall application stack behaves when this occurs.
Use cases
This fault causes stress within the target container, which may result in the primary process in the container being constrained or eating up the available system memory on the node.
Pod memory hog
Pod memory hog is a Kubernetes pod-level chaos fault that consumes memory resources in excess, resulting in a significant spike in the memory usage of a pod.
- Simulates a condition where the memory usage of an application spikes up unexpectedly.
Use cases
This fault causes stress within the target container, which may result in the primary process in the container being constrained or eating up the available system memory on the node.
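A sketch of the experiment entry, extending the pod-level ChaosEngine skeleton shown under Pod delete; MEMORY_CONSUMPTION (in megabytes) is an assumed tunable name:

```yaml
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # duration of the memory stress in seconds
              value: "60"
            - name: MEMORY_CONSUMPTION     # assumed tunable: amount of memory to consume, in MB
              value: "500"
```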
Pod network corruption
Pod network corruption is a Kubernetes pod-level chaos fault that injects corrupted packets of data into the specified container by starting a traffic control (tc) process with netem rules to add egress packet corruption.
- It tests the application's resilience to a lossy (or flaky) network.
Use cases
Pod network duplication
Pod network duplication is a Kubernetes pod-level chaos fault that injects chaos to disrupt the network connectivity to Kubernetes pods.
- It injects chaos on the specific container by starting a traffic control (tc) process with netem rules to add egress packet duplication.
- It determines the application's resilience to duplicate network packets.
Use cases
Pod network latency
Pod network latency is a Kubernetes pod-level chaos fault that introduces latency (delay) to a specific container by initiating a traffic control (tc) process with netem rules to add egress delays.
- It tests the application's resilience to lossy (or flaky) networks.
Use cases
This can be resolved by using middleware that switches traffic based on certain SLOs or performance parameters. Another approach is to set up alerts and notifications to highlight a degradation so that it can be addressed and fixed. You can also assess the impact of the failure to determine the last point in the application stack before degradation.
The applications may stall or get corrupted while waiting endlessly for a packet. This fault limits the impact (blast radius) to only the traffic that you wish to test by specifying the IP addresses. This fault helps to improve the resilience of your services over time.
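A sketch of the experiment entry, extending the pod-level ChaosEngine skeleton shown under Pod delete; NETWORK_LATENCY (in milliseconds) and DESTINATION_IPS (to limit the blast radius to specific addresses) are assumed tunable names:

```yaml
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # chaos window in seconds
              value: "60"
            - name: NETWORK_LATENCY        # assumed tunable: egress delay in milliseconds
              value: "2000"
            - name: DESTINATION_IPS        # assumed tunable: only traffic to these IPs is delayed
              value: "10.0.0.5,10.0.0.6"
```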
Pod network loss
Pod network loss is a Kubernetes pod-level chaos fault that causes packet loss in a specific container by starting a traffic control (tc) process with netem rules to add egress or ingress loss.
- It tests the application's resilience to a lossy (or flaky) network.
Use cases
Pod network partition
Pod network partition is a Kubernetes pod-level fault that blocks 100% of the ingress and egress traffic of the target application by creating a network policy.
- It can test the application's resilience to a lossy (or flaky) network.
Use cases
Pod IO Latency
Pod IO latency is a Kubernetes pod-level fault that delays the system calls of files located within the mounted volume of the pod.
- It can test the application's resilience to latency in I/O operations.
Use cases
Pod IO Error
Pod IO error is a Kubernetes pod-level fault that returns an error on the system calls of files located within the mounted volume of the pod.
- It can test the application's resilience to errors in I/O operations.
Use cases
Pod IO Attribute Override
Pod IO attribute override is a Kubernetes pod-level fault that modifies the properties of files located within the mounted volume of the pod.
- It can test the application's resilience to different values of file properties.
Use cases
Time Chaos
Time Chaos is a Kubernetes pod-level fault that introduces controlled time offsets to disrupt the system time of the target pod.
- It can test the application's resilience to an invalid system time.
Use cases
Pod API latency
Pod API latency is a Kubernetes pod-level chaos fault that injects API request and response latency by starting a proxy server and redirecting the traffic through it.
- It injects the latency into the service whose port is specified using the TARGET_SERVICE_PORT environment variable (see the sketch below).
- It evaluates the application's resilience to lossy (or flaky) API requests and responses.
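A sketch of the experiment entry, extending the pod-level ChaosEngine skeleton shown under Pod delete; LATENCY (in milliseconds) is an assumed tunable name:

```yaml
  experiments:
    - name: pod-api-latency
      spec:
        components:
          env:
            - name: TARGET_SERVICE_PORT    # port of the service whose API traffic is intercepted
              value: "80"
            - name: LATENCY                # assumed tunable: delay added to API requests/responses, in ms
              value: "1000"
            - name: TOTAL_CHAOS_DURATION   # chaos window in seconds
              value: "60"
```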
Use cases
Pod API Status Code
Pod API status code is a Kubernetes pod-level chaos fault that changes the API response status code and, optionally, the API response body through path filtering.
- It overrides the API status code of the service whose port is specified using the TARGET_SERVICE_PORT environment variable.
- It evaluates the application's resilience to lossy (or flaky) responses.
Use cases
Pod API modify header
Pod API modify header is a Kubernetes pod-level chaos fault that overrides the header values of API requests and responses with the user-provided values for the given keys.
- It modifies the headers of the service whose port is specified using the TARGET_SERVICE_PORT environment variable.
- It evaluates the application's resilience to lossy (or flaky) requests and responses.
Use cases
Pod API modify body
Pod API modify body is a Kubernetes pod-level chaos fault that modifies the API request and response body by replacing any portions that match a specified regular expression with a provided value.
- It modifies the body of the service whose port is specified using the TARGET_SERVICE_PORT environment variable.
- It evaluates the application's resilience to lossy (or flaky) requests and responses.