asobti / kube-monkey
An implementation of Netflix's Chaos Monkey for Kubernetes clusters
License: Apache License 2.0
I think at the moment the design is for schedule_immediate_kill: true to supersede everything else and start killing pods, even with dry_run = true. I think it makes more sense to honor dry-run as the first order of preference, so users can actually experiment without waiting hours. I would have sent a PR proposing a fix, but I'm not very familiar with Go :( My sample config, which I used for verification, is below.
dryRun: true
runHour: 8
startHour: 10
endHour: 16
blacklistedNamespaces: kube-system
whitelistedNamespaces:
timeZone: America/New_York
debug:
  enabled: true # set to true to enable debugging and have pods killed immediately
  schedule_immediate_kill: true
Please advise.
This is the config.toml sample from the docs:
[kubemonkey]
dry_run = true # Terminations are only logged
run_hour = 8 # Run scheduling at 8am on weekdays
start_hour = 10 # Don't schedule any pod deaths before 10am
end_hour = 16 # Don't schedule any pod deaths after 4pm
blacklisted_namespaces = ["kube-system"] # Critical apps live here
Per this sample, what goes on between 8am and 10am? What is the purpose of specifying run_hour in addition to start_hour?
Also, is there no provision to run this on weekends?
Hey there,
Wanted to give this project a quick spin, but I'm not sure what's the way to go.
I can't find a Docker image hosted on Docker Hub; is there one? Or should I build the Docker image myself? If so, can you provide some guidelines?
Thanks
Nice work. How about allowing MTBF in seconds rather than days? Since it is currently an integer number of days, you can't go below daily restarts.
You should also consider whitelisting namespaces.
Why are you setting mtbf on a label rather than in the config map?
When killing pods, it would be nice to know their replication scale to find out their kill frequency.
Hi,
Can you please let me know whether kube-monkey will work on OpenShift Origin, whose base system is just Kubernetes?
Normal Scheduled 79s default-scheduler Successfully assigned kube-monkey-kubemonkey-5b58fcc9d8-59nbf to ip-192-168-14-36.us-west-2.compute.internal
Normal SuccessfulMountVolume 79s kubelet, ip-192-168-14-36.us-west-2.compute.internal MountVolume.SetUp succeeded for volume "default-token-v8h9h"
Warning FailedMount 15s (x8 over 79s) kubelet, ip-192-168-14-36.us-west-2.compute.internal MountVolume.SetUp failed for volume "config-volume" : configmaps "kube-monkey-kubemonkey" not
I tried building and deploying the project to k8s, and I get this error in the pod logs:
Starting kube-monkey...
panic: open /usr/local/go/lib/time/zoneinfo.zip: no such file or directory
goroutine 1 [running]:
github.com/asobti/kube-monkey/config.Timezone(0x8)
/Users/pswens200/dev/gowork/src/github.com/asobti/kube-monkey/config/config.go:77 +0xc5
github.com/asobti/kube-monkey/kubemonkey.Run(0xc420537f68, 0x1)
/Users/pswens200/dev/gowork/src/github.com/asobti/kube-monkey/kubemonkey/kubemonkey.go:46 +0x5b
main.main()
/Users/pswens200/dev/gowork/src/github.com/asobti/kube-monkey/main.go:23 +0xa0
Any ideas?
I have configured kube-monkey in the kube-system namespace. The kube-monkey pod, deployment, replica-set and config are all seen in good health on the kubernetes dashboard.
$ sudo nano /etc/kube-monkey/config.toml
[kubemonkey]
dry_run = false
run_hour = 9
start_hour = 10
end_hour = 18
graceperiod_sec = 120
blacklisted_namespaces = ["kube-system","kube-chaos"]
time_zone = "Asia/Kolkata"
[debug]
enabled = true
schedule_delay=300
force_should_kill = true
schedule_immediate_kill = true
My deployment has the necessary labels to mark the pod for termination.
kube-chaos-tomcat-deployment.txt
Yet kube-monkey is unable to find any pods for termination.
Logs show :
I0703 18:16:03.623685 1 kubemonkey.go:19] Debug mode detected!
I0703 18:16:03.623705 1 kubemonkey.go:20] Status Update: Generating next schedule in 30 sec
I0703 18:16:33.623885 1 schedule.go:64] Status Update: Generating schedule for terminations
I0703 18:16:33.722343 1 schedule.go:57] Status Update: 0 terminations scheduled today
I0703 18:16:33.722398 1 kubemonkey.go:63] Status Update: Waiting to run scheduled terminations.
I0703 18:16:33.722410 1 kubemonkey.go:77] Status Update: All terminations done.
Could someone please assist me in debugging what my mistake could be?
Thanks.
Hi, I have spent some days trying to use kube-monkey to kill pods in a namespace, but none of my attempts have been successful. Every time I try a configuration, I receive the following message:
********** Today's schedule **********
No terminations scheduled
********** End of schedule **********
I don't understand what is happening. These are my configuration files:
deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-monkey
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        kube-monkey/enabled: enabled
        kube-monkey/identifier: ignite-10
        kube-monkey/mtbf: '2'
        kube-monkey/kill-mode: "fixed"
        kube-monkey/kill-value: "1"
    spec:
      containers:
        - name: kube-monkey
          command:
            - "/kube-monkey"
          args: ["-v=5", "-log_dir=/var/log/kube-monkey"]
          image: ayushsobti/kube-monkey:v0.2.3
          volumeMounts:
            - name: config-volume
              mountPath: "/etc/kube-monkey"
      volumes:
        - name: config-volume
          configMap:
            name: kube-monkey-config-map
And configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config-map
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    run_hour = 11
    start_hour = 12
    end_hour = 13
    blacklisted_namespaces = ["kube-system"]
    whitelisted_namespaces = [] # I have tried using "ignite" here, which is the namespace whose pods I want to kill
    time_zone = "Europe/Madrid"
Those are the pods that I have in the ignite namespace:
This is the kube-system namespace:
This is the log of the kube-monkey:
Please, could anyone help me?
viper should allow using environment variables
https://github.com/spf13/viper#working-with-environment-variables
I am trying to set the internal timezone of the scheduler to my own, like this:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config-map
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    run_hour = 8
    start_hour = 10
    end_hour = 16
    blacklisted_namespaces = ["kube-system"]
    timezone = "Europe/Amsterdam"
As I saw in the code, the option is there, but it is not working.
The log always shows:
I1207 15:47:29.649440 1 kubemonkey.go:26] Generating next schedule at 2017-12-07 08:00:00 -0800 PST
So the time is printed correctly, but the scheduler is not using my timezone.
Changing the default in the code and recompiling works, which is what I did here, but it would be nice to have this documented or, if it's a bug, fixed.
Thanks, awesome work!
Scenario:
Kube monkey generates the following schedule for the day:
At 10 AM the first attack against Deployment A fails with the error below, and the Docker container for kube-monkey dies.
Kubernetes restarted the container but (as expected) the attack on Deployment B did not run.
Error log:
{"log":"panic: runtime error: invalid memory address or nil pointer dereference\n","stream":"stderr","time":"2018-08-31T12:06:00.021385567Z"}
{"log":"[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xe4c072]\n","stream":"stderr","time":"2018-08-31T12:06:00.021395074Z"}
{"log":"\n","stream":"stderr","time":"2018-08-31T12:06:00.021398393Z"}
It would be useful for kube-monkey to have the ability to send events ("killed pod foo for app bar") to channels like Slack, HipChat etc.
In the event that an action taken by kube-monkey causes an issue, the ability to quickly correlate that could be valuable.
Can I set kube-monkey/mtbf = "0.01", or must mtbf be an integer value?
It would be great if kube-monkey could expose a metrics endpoint with values for things it has done.
Inspiration can be drawn from linki/chaoskube#23 :)
What is the purpose of the kube-monkey/identifier annotation?
Docs say:
A unique identifier for the k8 app (eg. the k8 app's name). This is used to identify the pods that belong to a k8 app as Pods inherit labels from their k8 app.
How is this annotation used? What is the guidance around what this annotation's value should be? Should it match any other label in the manifest?
What permissions does kube-monkey require to run? I am trying to run it in a cluster with RBAC enabled and have assigned a service account with the below cluster role to the container:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: kube-monkey
rules:
  - apiGroups: ["", "extensions", "apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "delete"]
kube-monkey runs, but fails to delete any pods as it is unable to see the running pods for any deployment:
I1204 12:21:18.610160 1 config.go:71] Successfully validated configs
I1204 12:21:18.610604 1 main.go:52] Starting kube-monkey with v logging level 5 and local log directory /var/log/kube-monkey
I1204 12:21:18.635426 1 kubemonkey.go:26] Generating next schedule at 2017-12-05 08:00:00 +0000 UTC
I1205 08:00:00.000198 1 schedule.go:43] Generating schedule for terminations
I1205 08:00:00.010612 1 schedule.go:28] ********** Today's schedule **********
I1205 08:00:00.010625 1 schedule.go:32] Deployment Termination time
I1205 08:00:00.010628 1 schedule.go:33] ---------- ----------------
I1205 08:00:00.010634 1 schedule.go:35] my-api 2017-12-05 11:26:00 +0000 UTC
I1205 08:00:00.010649 1 schedule.go:39] ********** End of schedule **********
I1205 08:00:00.010664 1 kubemonkey.go:64] Waiting for terminations to run
E1205 11:26:00.014078 1 kubemonkey.go:70] Failed to execute termination for deployment my-api Error: Deployment my-api has no running pods at the moment
I1205 11:26:00.014440 1 kubemonkey.go:77] All terminations done
I1205 11:26:00.014478 1 kubemonkey.go:26] Generating next schedule at 2017-12-06 08:00:00 +0000 UTC
This is in a 1.7.11 cluster; could it be due to incompatibilities with the client lib?
In chaos.go line 109, we check for KillValue and, in case of an error, we delete a single random pod. This happens even if KillType has an invalid value, since that check comes later.
Is there a possibility to add a whitelist to the kube-monkey configuration? It could be exclusive with blacklisting: you either use one or the other. I find whitelisting easier to deal with if you manage a lot of namespaces.
Thank you
Is it time to submit this Helm chart as an official Helm chart?
Hey there!
I've tried following the instructions on the README explicitly, but unfortunately I keep hitting an interesting blocker that I think may be tied to some circular dependencies in glog (I'm a super novice at Go, admittedly).
In my error log I get the following:
/kube-monkey flag redefined: log_dir
panic: /kube-monkey flag redefined: log_dir
goroutine 1 [running]:
flag.(*FlagSet).Var(0xc42004c060, 0x1d8fe00, 0xc42000efa0, 0x14c1722, 0x7, 0x14e3d51, 0x2f)
/usr/local/Cellar/go/1.9.3/libexec/src/flag/flag.go:793 +0x5e1
flag.(*FlagSet).StringVar(0xc42004c060, 0xc42000efa0, 0x14c1722, 0x7, 0x0, 0x0, 0x14e3d51, 0x2f)
/usr/local/Cellar/go/1.9.3/libexec/src/flag/flag.go:696 +0x8b
flag.(*FlagSet).String(0xc42004c060, 0x14c1722, 0x7, 0x0, 0x0, 0x14e3d51, 0x2f, 0xc4200f1f00)
/usr/local/Cellar/go/1.9.3/libexec/src/flag/flag.go:709 +0x8b
flag.String(0x14c1722, 0x7, 0x0, 0x0, 0x14e3d51, 0x2f, 0x0)
/usr/local/Cellar/go/1.9.3/libexec/src/flag/flag.go:716 +0x69
github.com/asobti/kube-monkey/vendor/k8s.io/client-go/1.5/vendor/github.com/golang/glog.init()
/Users/joe/go/src/github.com/asobti/kube-monkey/vendor/k8s.io/client-go/1.5/vendor/github.com/golang/glog/glog_file.go:41 +0x14a
github.com/asobti/kube-monkey/vendor/k8s.io/client-go/1.5/kubernetes.init()
<autogenerated>:1 +0x48
github.com/asobti/kube-monkey/deployments.init()
<autogenerated>:1 +0x53
github.com/asobti/kube-monkey/chaos.init()
<autogenerated>:1 +0x5d
github.com/asobti/kube-monkey/kubemonkey.init()
<autogenerated>:1 +0x53
main.init()
<autogenerated>:1 +0x5d
This is before even trying to run it in k8s, and just running it locally, but we saw the exact same issues when running this in k8s.
I've tried some attempts at resolving conflicts using glide install --strip-vendor, but to no avail.
Just wondering if you have any words of advice, or something stupid I may be doing as a Go noob (:
Additionally, do you think it's possible to push your image as a release to hub.docker.com?
Problem
Output of kubectl logs
Starting kube-monkey...
--
open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
panic: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
goroutine 1 [running]:
main.main()
/home/user/go/src/github.com/asobti/kube-monkey/main.go:24 +0x108
Environment
Trying to run kube-monkey in a kube cluster that is behind a firewall, hence no auth is configured.
This error could be because I'm running Kubernetes v1.5.2.
Is there any reason not to upgrade the kubernetes/client-go version?
[user@box] $ curl http://localhost:8080/api/
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "xxx.xxx.xxx.xxx:x443"
}
]
}
[user@box] $ kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"c55cf2b7d8bfeb947f77453415d775d7f71c89c2", GitTreeState:"clean", BuildDate:"2017-02-06T23:54:31Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"c55cf2b7d8bfeb947f77453415d775d7f71c89c2", GitTreeState:"clean", BuildDate:"2017-02-06T23:54:31Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Debugging
Tracing the error goes to
https://github.com/asobti/kube-monkey/blob/master/kubernetes/kubernetes.go#L11
which calls Google's Kubernetes API verification at
https://github.com/asobti/kube-monkey/blob/master/vendor/k8s.io/client-go/1.5/rest/config.go#L262
In the kubernetes docs, "/var/run/secrets/kubernetes.io/serviceaccount/token" should always be automatically generated. It's possible that this problem occurs if the service account admission controller is not enabled. Additionally "spec.serviceAccountName field has been automatically set" is not true for any deployed pods.
Our current configuration does not generate anything in the service account. i.e.
[user@box ~]$ kubectl describe sa
Name: default
Namespace: kube-system
Labels: <none>
Image pull secrets: <none>
Mountable secrets: <none>
Tokens: <none>
Reason for Issue
I'm primarily opening this issue
Failed to fetch eligible deployments for namespace due to error: deployments.apps is forbidden: User "system:serviceaccount:kube-system:default" cannot list deployments.apps at the cluster scope
Just curious.....
Hi,
We have low traffic on Sundays, so I would like to schedule the monkey to run only on Sundays.
Is there any way to do that with the current configuration options?
I want to fork this and add destroying entire Services to kube-monkey. Are you open to the idea?
This would be useful for testing an application's resiliency to a network partition of a particular dependency, for example:
Proposal
Ability to define a kill mode or kill value that is based on the pod disruption budget for that deployment.
The idea behind this is that if you define in a PDB that it's OK to have x% maximum unavailability, then it stands to reason that it's also safe for kube-monkey to kill a number of pods based on this percentage.
Example:
kill-mode: pdb-based
kill-value: equals
or
kill-mode: pdb-based
kill-value: +25%
I'm not sure how best to deal with scenarios where this kill mode is set but there is no PDB. I guess it could either do nothing, or use some default value. At this point I just wanted to gauge interest in this new feature. I may be able to contribute part of the code for this.
This might be an issue with my local environment since these failures do not show up when Travis CI runs tests (https://github.com/asobti/kube-monkey/commits/master).
Opening this issue to investigate. Might be a Go version thing.
$ go test -v ./victims
# github.com/asobti/kube-monkey/victims
victims/victims.go:284: Verbose.Infof format % is missing verb at end of string
victims/victims.go:288: Verbose.Infof format % is missing verb at end of string
FAIL github.com/asobti/kube-monkey/victims [build failed]
$ go version
go version go1.11 darwin/amd64
I am trying to use the debug mode, which can immediately kill the pod, but I'm not sure what I did wrong; the immediate kill never happened.
Here is the config.yaml I have:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config-map
  namespace: dmp-system
data:
  config.toml: |
    [kubemonkey]
    DryRun = false
    run_hour = 18
    start_hour = 19
    end_hour = 20
    blacklisted_namespaces = ["kube-system"]
    DebugEnabled = true
    DebugScheduleImmediateKill = true
The short name for Kubernetes is k8s, not k8. Please update the documentation throughout.
https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-does-kubernetes-mean-k8s
I'm currently attempting at getting kube-monkey up and running on a kubernetes cluster, and I keep getting an error stating:
panic: Unable to verify client connectivity to Kubernetes server
Any ideas or suggestions on what's going wrong and how to fix it?
Thanks.
I0221 03:49:36.201590 7 kubemonkey.go:24] Status Update: Generating next schedule at 2018-02-22 00:00:00 +0530 IST
I0221 18:30:00.000194 7 schedule.go:54] Status Update: Generating schedule for terminations
W0221 18:30:00.015033 7 factory.go:44] Failed to fetch eligible deployments for namespace cam-km due to error: User "system:serviceaccount:cam-km:default" cannot list deployments.extensions in the namespace "cam-km". (get deployments.extensions)
I0221 18:30:00.015728 7 schedule.go:47] Status Update: 0 terminations scheduled today
I0221 18:30:00.015775 7 kubemonkey.go:63] Status Update: Waiting to run scheduled terminations.
I0221 18:30:00.015789 7 kubemonkey.go:77] Status Update: All terminations done.
I0221 18:30:00.015892 7 kubemonkey.go:24] Status Update: Generating next schedule at 2018-02-23 00:00:00 +0530 IST
********** Today's schedule **********
No terminations scheduled
********** End of schedule **********
the values.yaml
replicaCount: 1
namespace: default
rbac:
  enabled: true
image:
  repository: ayushsobti/kube-monkey
  tag: v0.2.3
  pullPolicy: IfNotPresent
config:
  dryRun: false
  runHour: 8
  startHour: 10
  endHour: 16
  blacklistedNamespaces: kube-system
  whitelistedNamespaces:
  timeZone: Australia/Melbourne
  debug:
    enabled: true # set to true to enable debugging and have pods killed immediately
    schedule_immediate_kill: true
args:
  logLevel: 5
  logDir: /var/log/kube-monkey
the microservice's values.yaml:
kubemonkey:
  enabled: enabled # to disable set this to "disabled", to enable set it to "enabled"
  mtbf: 1
  kill-mode: fixed-percent
  kill-value: 80
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xe28cd2]
goroutine 21 [running]:
github.com/asobti/kube-monkey/chaos.(*Chaos).terminate(0xc42015da10, 0x114ba00, 0xc4200b23c0, 0x0, 0x0)
/Users/asobti/Coding/go/src/github.com/asobti/kube-monkey/chaos/chaos.go:127 +0x1a2
github.com/asobti/kube-monkey/chaos.(*Chaos).Execute(0xc42015da10, 0xc420084660)
/Users/asobti/Coding/go/src/github.com/asobti/kube-monkey/chaos/chaos.go:66 +0x202
github.com/asobti/kube-monkey/chaos.(*Chaos).Schedule(0xc42015da10, 0xc420084660)
/Users/asobti/Coding/go/src/github.com/asobti/kube-monkey/chaos/chaos.go:42 +0x51
created by github.com/asobti/kube-monkey/kubemonkey.ScheduleTerminations
/Users/asobti/Coding/go/src/github.com/asobti/kube-monkey/kubemonkey/kubemonkey.go:56 +0xb1
Looking at the code, it seems this only supports labelling Deployments as victims:
Any plans to support services/replica sets? Is there a workaround if we don't use deployments?
EDIT: I think I can just add the label selector to pods and have it work; is that correct?
kube-monkey/victims/victims.go
Line 113 in b42263d
Really very good job, sir.
I've done some testing in my Kubernetes cluster and found that kube-monkey only kills pods under a Deployment. I created some pods under an rc and tagged them with the same kube-monkey labels I used in my Deployment, but kube-monkey didn't seem to have any impact on the pods under the rc.
Sir, do you have a plan to support other k8s controllers (rc, daemonset, etc.)? That would make kube-monkey much more useful.
Many thanks.
Facing an issue while running kube-monkey on our kube cluster.
Error Log:
I0620 21:44:56.642419 1 main.go:52] Starting kube-monkey with v logging level 5 and local log directory /var/log/kube-monkey
E0620 21:44:56.643397 1 kubernetes.go:39] failed to obtain config from InClusterConfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
F0620 21:44:56.644008 1 main.go:55] Failed to generate NewInClusterClient: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
goroutine 1 [running]:
github.com/asobti/kube-monkey/vendor/github.com/golang/glog.stacks(0xc42024f500, 0xc420264000, 0xa5, 0xb6)
/go/src/github.com/asobti/kube-monkey/vendor/github.com/golang/glog/glog.go:766 +0xcf
github.com/asobti/kube-monkey/vendor/github.com/golang/glog.(*loggingT).output(0x1785600, 0xc400000003, 0xc4200ce580, 0x1716d4a, 0x7, 0x37, 0x0)
/go/src/github.com/asobti/kube-monkey/vendor/github.com/golang/glog/glog.go:717 +0x30f
github.com/asobti/kube-monkey/vendor/github.com/golang/glog.(*loggingT).printDepth(0x1785600, 0x7f8b00000003, 0x1, 0xc42014ff48, 0x1, 0x1)
/go/src/github.com/asobti/kube-monkey/vendor/github.com/golang/glog/glog.go:646 +0x129
github.com/asobti/kube-monkey/vendor/github.com/golang/glog.(*loggingT).print(0x1785600, 0x3, 0xc42014ff48, 0x1, 0x1)
/go/src/github.com/asobti/kube-monkey/vendor/github.com/golang/glog/glog.go:637 +0x5a
github.com/asobti/kube-monkey/vendor/github.com/golang/glog.Fatal(0xc42014ff48, 0x1, 0x1)
/go/src/github.com/asobti/kube-monkey/vendor/github.com/golang/glog/glog.go:1125 +0x53
main.main()
/go/src/github.com/asobti/kube-monkey/main.go:55 +0x1ea
Not sure whether this is a configuration-related problem or something else.
configmap.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config-map
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    host="http://localhost:8080"
    run_hour = 8
    start_hour = 10
    end_hour = 16
    blacklisted_namespaces = ["kube-system"]
    whitelisted_namespaces = [""]
    time_zone = "America/New_York"
deployment.yaml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-monkey
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: kube-monkey
    spec:
      containers:
        - name: kube-monkey
          command:
            - "/kube-monkey"
          args: ["-v=5", "-log_dir=/var/log/kube-monkey"]
          #image: andrewsrobertamary/kube-monkey:latest
          image: kube-monkey:ubuntu
          volumeMounts:
            - name: config-volume
              mountPath: "/etc/kube-monkey"
      volumes:
        - name: config-volume
          configMap:
            name: kube-monkey-config-map
We are using insecure bind address in api server.
KUBE_API_ADDRESS="--insecure-bind-address=0.0.0.0"
I even tried adding ServiceAccount to KUBE_ADMISSION_CONTROL, but that worsened the problem, so I rolled back the change.
In our cluster, /var/run/secrets/kubernetes.io/serviceaccount/token is never created.
We are using heapster with --source=kubernetes:http://localhost:8080?inClusterConfig=false
When the config file on the pod changes, KM should use the new configs. Referencing https://github.com/asobti/kube-monkey/blob/master/config/config.go#L46
[kubemonkey]
dry_run = true
run_hour = 8
start_hour = 10
end_hour = 16
blacklisted_namespaces = ["default","kube-system"]
[debug]
enabled = true
schedule_delay = 10
force_should_kill = true
schedule_immediate_kill = true
Have kube-monkey running and a test deployment/pod. ex:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: counter
  namespace: test-system
spec:
  template:
    metadata:
      labels:
        kube-monkey/enabled: enabled
        kube-monkey/identifier: monkey-victim-counter
        kube-monkey/mtbf: '1'
    spec:
      containers:
        - name: count
          image: busybox
          args: [/bin/sh, -c, 'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
kubectl edit cm/cm-config-map
or edit the config file locally and reload with kubectl create configmap cm-config-map --from-file=config.toml=cm-config-map.toml -o yaml --dry-run | kubectl replace -f -
[kubemonkey]
dry_run = false
run_hour = 8
start_hour = 10
end_hour = 16
blacklisted_namespaces = ["default","kube-system"]
[debug]
enabled = true
schedule_delay = 10
force_should_kill = true
schedule_immediate_kill = true
kubectl get pods --namespace=kube-system | grep -o '^kube-monkey[^[:space:]]*' | xargs -L 1 -I f watch kubectl exec -ti "f" cat /etc/kube-monkey/config.toml
kubectl get pods --namespace=test-system | grep -o '^counter[^[:space:]]*' | xargs -L 1 -I f watch kubectl logs "f" --namespace="test-system" --tail=20
KM should detect the config change, switch off dry-run and kill the counter pod.
kubectl get pods --namespace=test-system | grep -o '^counter[^[:space:]]*' | xargs -L 1 -I f watch kubectl logs "f" --namespace="test-system" --tail=20
If retrieving KillType returns an error, a single pod is terminated (see https://github.com/asobti/kube-monkey/blob/master/chaos/chaos.go#L103)
However, if KillType returns a value, but we do not recognize that value, we log an error and do nothing (see https://github.com/asobti/kube-monkey/blob/master/chaos/chaos.go#L128).
This is inconsistent. Error condition and invalid value should both trigger the same behavior. IMO, this behavior should be of logging an error and doing nothing.
A similar inconsistency exists with KillValue too. If KillValue returns an error, it defaults to killing a single pod (see https://github.com/asobti/kube-monkey/blob/master/chaos/chaos.go#L112), but if the value is incorrect (eg. greater than 100 or < 0), then we log an error and don't kill anything (see https://github.com/asobti/kube-monkey/blob/master/victims/victims.go#L255)
Hi,
I've tried to clone and compile kube-monkey but get this error:
> go get github.com/asobti/kube-monkey
[...]
> cd $GOPATH/src/github.com/asobti/kube-monkey
> glide install --strip-vendor
[...]
> make clean
[...]
> make build
rm -f kube-monkey
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o kube-monkey
# github.com/asobti/kube-monkey/vendor/golang.org/x/crypto/ssh/terminal
vendor/golang.org/x/crypto/ssh/terminal/util.go:30:12: undefined: unix.IoctlGetTermios
vendor/golang.org/x/crypto/ssh/terminal/util.go:38:18: undefined: unix.IoctlGetTermios
vendor/golang.org/x/crypto/ssh/terminal/util.go:54:12: undefined: unix.IoctlSetTermios
vendor/golang.org/x/crypto/ssh/terminal/util.go:64:18: undefined: unix.IoctlGetTermios
vendor/golang.org/x/crypto/ssh/terminal/util.go:75:9: undefined: unix.IoctlSetTermios
vendor/golang.org/x/crypto/ssh/terminal/util.go:80:13: undefined: unix.IoctlGetWinsize
vendor/golang.org/x/crypto/ssh/terminal/util.go:98:18: undefined: unix.IoctlGetTermios
vendor/golang.org/x/crypto/ssh/terminal/util.go:107:12: undefined: unix.IoctlSetTermios
vendor/golang.org/x/crypto/ssh/terminal/util.go:112:3: undefined: unix.IoctlSetTermios
make: *** [Makefile:9: build] Error 2
As a fix I've just changed the commit used for golang.org/x/sys (golang/sys@9a2e24c) to the latest one to date (golang/sys@fff93fa).
Not sure if it's something on my side, or just a wrong dependency.
Expected behavior:
I expect, just as when deleting one pod, to get the error message:
kubemonkey.go:68] Failed to execute termination for deployment fail-deploy. Error: Deployment fail-deploy has no running pods at the moment
Currently it deletes the counter pods and does not complain. Sample output:
********** Today's schedule **********
Deployment Termination time
---------- ----------------
counter 12/12/27127 128:1527:00 -0500 UTC
********** End of schedule **********
I1227 22:02:49.181134 1 chaos.go:120] Terminating ALL pods for deployment counter
I1227 22:02:49.189360 1 kubemonkey.go:70] Termination successfully executed for deployment counter
I1227 22:02:49.189371 1 kubemonkey.go:73] Status Update: 0 scheduled terminations left.
I1227 22:02:49.189375 1 kubemonkey.go:76] Status Update: All terminations done.
I1227 22:02:49.189416 1 kubemonkey.go:18] Debug mode detected!
I1227 22:02:49.189420 1 kubemonkey.go:19] Status Update: Generating next schedule in 60 sec
---
apiVersion: v1
kind: Namespace
metadata:
  name: test-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: counter
  namespace: test-system
spec:
  replicas: 2
  template:
    metadata:
      labels:
        kube-monkey/enabled: enabled
        kube-monkey/identifier: monkey-victim-counter
        kube-monkey/mtbf: "1"
        kube-monkey/kill-all: "kill-all"
    spec:
      containers:
        - args:
            - /bin/sh
            - -c
            - 'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done'
          image: DOESNOTEXIST:latest
          imagePullPolicy: Always
          name: test-counter
      restartPolicy: Always
kubectl get pods --namespace=test-system
NAME READY STATUS RESTARTS AGE
counter-9bcfdf745-fnr2r 0/1 InvalidImageName 0 22s
counter-9bcfdf745-vwmpz 0/1 InvalidImageName 0 22s
I believe there is a bug in:
Line 125 in e1052df
According to the README, killValue should be between 0 and 100. Therefore, the value passed to DeleteRandomPods will be between 0 and 10,000 (100*100/(0+1)).
Example:
https://play.golang.org/p/c0z6RijiR1D
I think one way to calculate the correct value is:
nPods := 3
maxPods := math.Round(float64(nPods*killValue) / 100)
toKill := r.Intn(int(maxPods+1))
Happy to submit a PR with a fix and some tests.
0.2.3 is the only tag available on Docker Hub:
The 0.1.0 tag is not available on Docker Hub. As a result, https://github.com/asobti/kube-monkey/blob/master/examples/deployment.yaml#L19 fails to start kube-monkey with the following error:
kube-monkey-d46b987b9-lrg67 0/1 ImagePullBackOff 0 1m
Also, the image's repo name needs to be specified. The updated deployment.yaml that works is:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-monkey
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: kube-monkey
    spec:
      containers:
        - name: kube-monkey
          command:
            - "/kube-monkey"
          args: ["-v=5", "-log_dir=/var/log/kube-monkey"]
          image: ayushsobti/kube-monkey:v0.2.3
          volumeMounts:
            - name: config-volume
              mountPath: "/etc/kube-monkey"
      volumes:
        - name: config-volume
          configMap:
            name: kube-monkey-config-map
I got an error: Failed to fetch eligible deployments for namespace testns-ns due to error: the server could not find the requested resource.
The configmap, deployment, pod and victims are all in namespace testns-ns.
Details of logs are shown below
I0604 22:07:37.873927 1 config.go:74] Successfully validated configs
I0604 22:07:37.873958 1 main.go:52] Starting kube-monkey with v logging level 5 and local log directory /var/log/kube-monkey
I0604 22:07:38.157628 1 kubemonkey.go:24] Status Update: Generating next schedule at 2018-06-04 16:00:00 -0700 PDT
LM-SJC-11012475:kube-monkey yxu2$ kubectl logs kube-monkey-6bbccb7699-t5wnw --namespace=testns-ns
I0604 22:07:37.873927 1 config.go:74] Successfully validated configs
I0604 22:07:37.873958 1 main.go:52] Starting kube-monkey with v logging level 5 and local log directory /var/log/kube-monkey
I0604 22:07:38.157628 1 kubemonkey.go:24] Status Update: Generating next schedule at 2018-06-04 16:00:00 -0700 PDT
I0604 23:00:00.000173 1 schedule.go:64] Status Update: Generating schedule for terminations
W0604 23:00:00.010772 1 factory.go:45] Failed to fetch eligible deployments for namespace testns-ns due to error: the server could not find the requested resource
********** Today's schedule **********
No terminations scheduled
********** End of schedule **********
I0604 23:00:00.011352 1 schedule.go:57] Status Update: 0 terminations scheduled today
I0604 23:00:00.011388 1 kubemonkey.go:63] Status Update: Waiting to run scheduled terminations.
I0604 23:00:00.011394 1 kubemonkey.go:77] Status Update: All terminations done.
I0604 23:00:00.011479 1 kubemonkey.go:24] Status Update: Generating next schedule at 2018-06-05 16:00:00 -0700 PDT
Details of the configmap are shown below:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  config.toml: |
    [kubemonkey]
    host="https://your-apiserver-url.com:apiport"
    run_hour = 16
    start_hour = 17
    end_hour = 18
    blacklisted_namespaces = ["kube-system"]
    whitelisted_namespaces = ["testns-ns"]
    time_zone = "America/Los_Angeles"
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"config.toml":"[kubemonkey]\nrun_hour = 12\nstart_hour = 12\nend_hour = 13\nblacklisted_namespaces = [\"kube-system\"]\nwhitelisted_namespaces = [\"testns-ns\"]\ntime_zone = \"America/Los_Angeles\"\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"kube-monkey-config-map","namespace":"testns-ns"}}
  creationTimestamp: 2018-06-01T18:09:18Z
  name: kube-monkey-config-map
  namespace: testns-ns
  resourceVersion: "2018008066"
  selfLink: /api/v1/namespaces/testns-ns/configmaps/kube-monkey-config-map
  uid: e6f8283a-65c6-11e8-9b2c-74dbd180d2d0
Ability to send an HTTP POST when attacks are scheduled and when an attack runs.
This is especially useful when running kube-monkey in a production environment. I'm not sure how other folks are monitoring the attacks, but I guess it's based on the logs.
This feature could work based on a ConfigMap, where users would define where to POST these messages, and possibly the format as well (we could have keywords such as $attack_time, $pod_name, etc.). When a new attack schedule is generated or when a new attack runs, a POST would be sent to the configured endpoint.
Example:
kind: ConfigMap
data:
  alerts_configs.json: |-
    {
      "report_schedules": false,
      "report_attacks": true,
      "endpoint": "http://somewhere:80/alerts"
    }
  attacks_message.json: |-
    {
      "pod": $pod_name,
      "foo": "$bar"
    }
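As a rough sketch of how such an alerter could work: the `AlertConfig` struct, the `renderMessage` helper, and the `$`-keyword substitution below are assumptions modeled on the example config above, not an existing kube-monkey API.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"strings"
)

// AlertConfig mirrors the hypothetical alerts_configs.json above.
type AlertConfig struct {
	ReportSchedules bool
	ReportAttacks   bool
	Endpoint        string
}

// renderMessage substitutes $-keywords (e.g. $pod_name) in the
// user-supplied message template with concrete values.
func renderMessage(template string, values map[string]string) string {
	out := template
	for key, val := range values {
		out = strings.ReplaceAll(out, "$"+key, val)
	}
	return out
}

// reportAttack POSTs the rendered message to the configured endpoint,
// skipping the call entirely when attack reporting is disabled.
func reportAttack(cfg AlertConfig, template string, values map[string]string) error {
	if !cfg.ReportAttacks {
		return nil
	}
	body := renderMessage(template, values)
	resp, err := http.Post(cfg.Endpoint, "application/json", bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("alert endpoint returned %s", resp.Status)
	}
	return nil
}

func main() {
	msg := renderMessage(`{"pod": "$pod_name"}`, map[string]string{"pod_name": "kube-monkey-abc"})
	fmt.Println(msg) // prints {"pod": "kube-monkey-abc"}
}
```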
Currently, the termination time is logged to microsecond granularity, along with the monotonic clock reading.
I1109 13:40:56.471721 11 schedule.go:28] ********** Today's schedule **********
I1109 13:40:56.471749 11 schedule.go:32] Deployment Termination time
I1109 13:40:56.471752 11 schedule.go:33] ---------- ----------------
I1109 13:40:56.471757 11 schedule.go:35] hello-world 2017-11-09 13:41:28.4717057 -0800 PST m=+62.093720077
See https://golang.org/pkg/time/#Time.String
Suggested fix, per the Go docs: "The returned string is meant for debugging; for a stable serialized representation, use t.MarshalText, t.MarshalBinary, or t.Format with an explicit format string."
Adding helm support would help adoption of this great project. Any work started here yet?
If a TerminationGracePeriod is defined in the pod spec and it is larger than the configured grace period, the TerminationGracePeriod value should be used.
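The requested behavior amounts to taking the larger of the two values; a sketch of that selection (`effectiveGracePeriod` is a hypothetical helper, not existing kube-monkey code):

```go
package main

import "fmt"

// effectiveGracePeriod returns the grace period (in seconds) to pass to the
// delete call: the pod spec's TerminationGracePeriodSeconds (a *int64, since
// it is optional in the spec) wins when it is set and larger than the
// configured value; otherwise the configured value is used.
func effectiveGracePeriod(podSpecSeconds *int64, configuredSeconds int64) int64 {
	if podSpecSeconds != nil && *podSpecSeconds > configuredSeconds {
		return *podSpecSeconds
	}
	return configuredSeconds
}

func main() {
	podValue := int64(60)
	fmt.Println(effectiveGracePeriod(&podValue, 30)) // pod spec wins: prints 60
	fmt.Println(effectiveGracePeriod(nil, 30))       // falls back to config: prints 30
}
```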
It would be nice if this tool could perform quick chaos over a short and immediate period of time. Our team wants to test the app's resiliency within 20 minutes to an hour, not wait a whole week before drawing conclusions.