As you may know, Chaos Monkey is one implementation of chaos engineering, a popular topic in computer science. Chaos engineering aims to simulate unexpected situations, such as lost connections, heavy network traffic, and network latency, in order to reduce risk in production. We live in the information era, and everything moves faster and faster. If a broken product reaches the production environment, it can cost us revenue, market share, and clients, who will show no sympathy. We should therefore stop being passive and proactively detect the problems that would otherwise appear only after release. Of course, there are many monkeys, such as doctor monkey (scans the product for security issues) and heavy monkey (overloads the system with traffic). Chaos monkey is a tool that simulates network issues to help engineers detect potential problems.
Our team has also implemented a simple chaos monkey, a tool that tests the robustness of the whole system by randomly breaking parts of it. The environment is deployed on VMs in Kubernetes, and we can see many pods in the cluster with active status. The tool helps us analyze the system's health: when errors occur, for example when some pods become inactive and the backend services they host stop responding, can the whole system still recover by itself?
The official Kubernetes website provides client libraries ("kubernetes-clients") that let us monitor or operate on the cluster at the pod level. kubernetes-client is available as a Python module, so we can use Python to run the commands we need to simulate failures.
To list all pods in the cluster, we can use the code below:
from kubernetes import client, config
# Configs can be set in Configuration class directly or using helper utility
config.load_kube_config()
v1 = client.CoreV1Api()
print("Listing pods with their IPs:")
ret = v1.list_pod_for_all_namespaces(watch=False)
for i in ret.items:
    print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
As you can see, the code is very short. Deleting a pod is just as brief:
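# Excerpt from our chaos monkey class. pod_list holds the matched pods, and
# self.client is expected to be a CoreV1Api instance like v1 above.
body = client.V1DeleteOptions()
one_pod = pod_list[0]  # take the first matched pod
namespace = one_pod.metadata.namespace
pod_name = one_pod.metadata.name
delete_pod_name_list.append(pod_name)  # remember what was deleted for the report
logging.info("start deleting the pod %s in namespace %s.", pod_name, namespace)
result = self.client.delete_namespaced_pod(name=pod_name, namespace=namespace, body=body)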
We define the pod list in an Excel file, which contains the following information for the pods we want to delete randomly: a cron expression, which sets how often the action runs; a label selector, which identifies the pods to operate on; and a strategy (one, random, or all), which determines which of the matched pods are operated on. If the strategy is "one" and more than one pod matches, only the first matched pod is chosen.
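To make the strategy idea concrete, here is a minimal sketch of how the matched pods could be narrowed down and then deleted. It is not our actual implementation: the function names select_pods and kill_pods are hypothetical, and we assume here that "random" means a single pod picked at random. The api argument is expected to be a CoreV1Api instance, and only list_pod_for_all_namespaces and delete_namespaced_pod are called on it.

```python
import random
from typing import Any, List, Optional, Sequence


def select_pods(pods: Sequence[Any], strategy: str,
                rng: Optional[random.Random] = None) -> List[Any]:
    """Pick the pods to delete from the matched pods (hypothetical helper)."""
    if not pods:
        return []
    if strategy == "one":
        # As described above: only the first matched pod is chosen.
        return [pods[0]]
    if strategy == "random":
        # Assumption: "random" means one pod drawn uniformly at random.
        return [(rng or random).choice(pods)]
    if strategy == "all":
        return list(pods)
    raise ValueError("unknown strategy: %r" % (strategy,))


def kill_pods(api: Any, label_selector: str, strategy: str,
              delete_options: Any = None) -> List[str]:
    """Delete the selected pods and return their names (hypothetical helper).

    api is expected to be a kubernetes.client.CoreV1Api instance.
    delete_options may be a client.V1DeleteOptions(); older client
    versions may require it instead of accepting None.
    """
    matched = api.list_pod_for_all_namespaces(
        label_selector=label_selector, watch=False).items
    deleted = []
    for pod in select_pods(matched, strategy):
        api.delete_namespaced_pod(name=pod.metadata.name,
                                  namespace=pod.metadata.namespace,
                                  body=delete_options)
        deleted.append(pod.metadata.name)
    return deleted
```

With a real cluster, a call such as kill_pods(v1, "app=backend", "one") would delete the first pod carrying that label; the label value here is only an example. Passing the API object in as a parameter also makes it easy to try the logic against a fake object before pointing it at a cluster.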
To run the chaos monkey test in any environment, we also provide a simple way to do so. We wrote a Dockerfile and built an image that contains the core features of our chaos monkey. To monitor the status of the cluster and present the test results, we use Flask. If someone asks you to test an environment, you just need to copy the deployment file, replace the value of the placeholder "hostname" with the current cluster's master node, and then run kubectl create/apply -f chaos-monkey.yaml. The pod will start, you will see some pods in the cluster restart randomly, and the report will be shown on the website.
If you want to disable the job, run kubectl delete -f chaos-monkey.yaml and the job will be stopped.
Currently, we have only implemented automatic pod killing. It is very useful for making our published products more reliable and for boosting our confidence. We will continue to enhance this tool (adding network latency, traffic overload, ...) in the future.