twosigma / satellite Goto Github PK

View Code? Open in Web Editor NEW

143.0 143.0 18.0 379 KB

Satellite monitors, alerts on, and self-heals your Mesos cluster.

License: Apache License 2.0

Clojure 97.83% Shell 2.17%

satellite's People

Contributors

Stargazers

Watchers

Forkers

tnn1t1s icexelloss juicedm3 wyegelwel gzyl ch3lo solarisyan zhangwusheng adamdonahue welcomemandeep cadelaren ghosthamlet letslego guangyun1013 vt100 jiapei100

satellite's Issues

satellite-slave sample config, satellite-recipes not found

Hello,

The sample config for the slave process doesn't work well and says that satellite-recipes cannot be found.

riemann.codec.Event{:host "ip-10-66-84-35", :service "mesos/slave/", :state "critical", :description "command: ["satellite-recipes" "free-memory"]java.util.concurrent.ExecutionException: java.io.IOException: Cannot run program "satellite-recipes": error=2, No such file or directory", :metric nil, :tags nil, :time 1439514431, :ttl nil}

If riemann-config isn't specified, fails silently

The leader-watcher thread is never started, the logs say it's following even when it's the leader.

Long term: Provide Satellite packages for Linux distributions?

E.g. a .deb for Debian / Ubuntu.

This way, we could piggyback on existing dependency version management and init systems of various Linux distributions in order to ensure that Satellite is installed on a system with compatible dependencies and is always running when it needs to be.

Please use atomic filesystem operations to update the whitelist

It seems that the whitelist is written in such a way that the whitelist file is truncated then new whitelist is written to the same file. I determined this behavior by watching the whitelist and noticing that inode number never changes.

It would be better to create a new file, write the new whitelist contents there, then rename that new file to whitelist. This would be an atomic operation and there wouldn't be a time when the whitelist contained zero hosts which does happen now.

Thank you,

Rusty

Slave config: "mesos-work-dir" should default to mesos' default

According to mesos docs:

 --work_dir=VALUE                           Directory path to place framework work directories
                                             (default: /tmp/mesos)

Handle Mesos Master election

Based on the current configuration defaults a satellite-master points directly at a single Mesos master. Without a 1-to-1 mapping of satellite-masters to mesos-masters, this could be problematic in the event of a leader election for Mesos master.

It would be helpful if Satellite could watch zookeeper for a Mesos-master election and automatically point to the active master.

Rename satellite-slave to satellite-agent

This fits with Mesos language change from slave to agent

Ansible script to deploy satellite

Missing namespace satellite-slave.recipes.mesos referred to in sample config

Line 16

Provide docker images that launch satellite-master & satellite-slave

Could serve as a form of distribution, alternative to Linux system packages or DCOS packages.

Slave state.json metrics on influxdb

Hello.

I've managed to install satellite on my mesos cluster. Reimann is pushing data in a local influxdb for grafana, but I'd like to push slave state.json.

My riemann config is very simple:

(def inflxdb (graphite {:host "master1" :port 2003}))
(streams
  inflxdb
)

My slave config:

(def mesos-work-dir "/tmp/mesos")
(def settings
  {:satellites [{:host "master1"}]
  :service "mesos/slave/"
)

In influxdb, I see localhost.mesos/* with master info, and the consolidated informations for my frameworks, but no slave informations. Is it possible?

Thanks.

Not all slaves appear on whitelist

I have 2 Mesos clusters. One with 3 servers each running ZK, Mesos master, Mesos slave, Satellite master & Satellite slave. My 2nd cluster has 6 servers separating out masters and slaves. On both clusters all 3 Mesos slaves are active.

At one point, on the first cluster (3 servers), I was able to see all slaves registered in the whitelist. I had a high load on 2 of the servers. Those 2 servers dropped from the whitelist. Once the load came down, one slave re-registered with the whitelist and the other didn't.

On my 2nd cluster (6 servers), 1 of the slave servers never registers.

In both cases, I see mesos/slave events from all 3 satellite slaves reaching the satellite master leader.

Cluster 1:

CentOS 6.5
kernel 3.10.75-1.el6.elrepo.x86_64
OpenJDK 1.7.0_51
ZooKeeper 3.4.3
Mesos 0.22.0
Satellite 0.2.0

Cluster 2:

CentOS 7.1.1503
kernel 3.10.0-229.11.1.el7.x86_64
OpenJDK 1.8.0_51
ZooKeeper 3.4.6
Mesos 0.23.0
Satellite 0.2.0

/usr/sbin/mesos-master --zk=zk://jb5.example.com:2181,jb6.example.com:2181,jb7.example.com:2181/mesos --port=5050 --log_dir=/var/log/mesos --quorum=2 --whitelist=/etc/mesos/slave_list.txt --work_dir=/var/lib/mesos

On all 3 Mesos masters, the whitelist file starts out like this:

$ cat /etc/mesos/slave_list.txt
jb8.example.com
jb9.example.com
jb10.example.com

Each Satellite master has a satellite-config.clj file that looks similar to this. The only difference is the config will point to the server it's installed on. For example, on jb6, all the jb5s are replaced with jb6.

(def settings
  (merge settings
         {:mesos-master-url (url/url "http://jb5.example.com:5050")
          :sleep-time 5000
          :zookeeper "jb5.example.com:2181"
          :local-whitelist-path "/etc/mesos/slave_list.txt"
          :riemann-tcp-server-options {:host "jb5.example.com" }
          :service-host "jb5.example.com"}))

I made no changes to the satellite-config.clj file. My slave-config.clj file looks like this:

(def mesos-work-dir "/var/lib/mesos")

(def settings
  {:satellites [{:host "jb5.example.com"}
                {:host "jb6.example.com"}
                {:host "jb7.example.com"}]
   :service "mesos/slave/"
   :comets [(satellite-slave.recipes/free-memory 50 (-> 60 t/seconds))
            (satellite-slave.recipes/free-swap   50 (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 "/tmp" (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 "/var" (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 mesos-work-dir
                                                     (-> 60 t/seconds))
            (satellite-slave.recipes/num-uninterruptable-processes 10 (-> 60 t/seconds))
            (satellite-slave.recipes/load-average 4.5 (-> 60 t/seconds))
            {:command ["echo" "17"]
             :schedule (every (-> 60 t/seconds))
             :output (fn [{:keys [out err exit]}]
                       (let [v (-> out
                                   (clojure.string/split #"\s+")
                                   first
                                   (Integer/parseInt))]
                         [{:state "ok"
                           :metric v
                           :ttl 300
                           :description "example test -- number of files/dirs in cwd"}]))}]})

So the only change I made was to add my satellite masters. If I query the whitelist API I see the following results:

{"jb10.np.ev1.yellowpages.com":
  {"managed-events":{},"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb8.np.ev1.yellowpages.com":
  {"managed-events":{},"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb9.np.ev1.yellowpages.com":
  {"managed-events":{},"manual-events":{},"managed-flag":null,"manual-flag":null}}

In my slave-config.clj, I changed the mesos-work-dir from /tmp/mesos to /var/lib/mesos. I restarted the slaves. This caused expired events for /tmp/mesos. That expired event ended up changing the whitelist and now jb9 & jb10 are 'on' and jb8 is 'off'.

{"jb10.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076213197E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb10.example.com"}
    },
    "manual-events":{},"managed-flag":"on","manual-flag":null
  },
"jb8.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076258248E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb8.example.com"}
    },
    "manual-events":{},"managed-flag":"off","manual-flag":null
  },
"jb9.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076198146E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb9.example.com"}
    },
    "manual-events":{},"managed-flag":"on","manual-flag":null
  }
}

And /metrics/snapshot also says 2 out of 3:

{"prop-available-hosts":0.6666666666666667,"num-available-hosts":3,"num-hosts-up":2}

Add a license

Here's a great place to get started. http://choosealicense.com

Limit number of hosts that can be disabled via black hole detector

The concern here is that if a host becomes considered a black hole, there is no way for it to get automatically re-added to the whitelist (because once removed from the whitelist, it will no longer be assigned tasks, so it will never get a chance to prove that it is now capable once again of completing tasks!). Thus, if something happens that blackholes a very large number of hosts, extensive manual intervention would be required in order to re-enable hosts.

Suggested safeguards:

configurable limit of # of hosts that can be removed via black hole detection per minute
configurable maximum % of total hosts that can be considered black holes at any given time

If a host would be "black holed", but this would cause one of the safeguards above to be violated, log the issue and just move on (take no other action). If the host continues to fail a lot of tasks, it will be re-evaluated as a black hole soon enough (and maybe at that point there will be enough room in the black hole dungeon).

joda time "seconds" object can't be multiplied

The sample config passes in a joda time object, then the recipe tries to multiply it by 5. Not sure which type of object is expected by whatever code processes this data (riemann?).

Exception in thread "main" java.lang.ClassCastException: org.joda.time.Seconds cannot be cast to java.lang.Number, compiling:(sample.clj:8:13)
    at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3558)
    at clojure.lang.Compiler$VectorExpr.eval(Compiler.java:3101)
    at clojure.lang.Compiler$MapExpr.eval(Compiler.java:2931)
    at clojure.lang.Compiler$DefExpr.eval(Compiler.java:417)
    at clojure.lang.Compiler.eval(Compiler.java:6708)
    at clojure.lang.Compiler.load(Compiler.java:7130)
    at clojure.lang.Compiler.loadFile(Compiler.java:7086)
    at clojure.lang.RT$3.invoke(RT.java:318)
    at satellite_slave.config$include.invoke(config.clj:75)
    at satellite_slave.core$_main.doInvoke(core.clj:212)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at satellite_slave.core.main(Unknown Source)
Caused by: java.lang.ClassCastException: org.joda.time.Seconds cannot be cast to java.lang.Number
    at clojure.lang.Numbers.multiply(Numbers.java:146)
    at clojure.lang.Numbers.multiply(Numbers.java:3659)
    at satellite_slave.recipes$free_memory.invoke(recipes.clj:22)
    at clojure.lang.AFn.applyToHelper(AFn.java:156)
    at clojure.lang.AFn.applyTo(AFn.java:144)
    at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3553)
    ... 11 more

Cannot start satellite-master, dependency problem?

Hello,

I've tried to build satellite and start it using the sample config:
java -jar satellite-master-0.2.0-SNAPSHOT-standalone.jar config/satellite-config.clj

The error is this:
Exception in thread "main" java.lang.RuntimeException: Unable to resolve symbol: ensure-all-tests-on-whitelisted-host-pass in this context, compiling:(/home/ubuntu/satellite/satellite-master/config/riemann-config.clj:35:18)
at clojure.lang.Compiler.analyze(Compiler.java:6464)
....and so on...

I've dug into lein abit and there are similar results that state it may be a dependency problem, so I ran lein deps :tree and saw that there was some overrides and recommendations around the reimann verision used but I'm not sure how much to trust it due to my lack of knowledge on this subject. Any ideas or anyway I can assist troubleshooting?

Run tests on travis CI

We do have some automated tests, and I'd like to add more as well. It would be nice for CI to run the whole suite for all branches associated with PR's.

We should run the test suite using Travis CI. I think that for now it should be as simple as setting travis to run "lein test" on both satellite-master and satellite-slave (which doesn't yet have any tests, but this may change).

Delay writing state to Zookeeper causes hosts to expire

When Zookeeper has latency in saving events, hosts will be added to the blacklist due to their events timing out and being expired. Disabling the persistence of events prevents this problem and allows hosts to appear on the whitelist.

The way to diagnose this is to look at the time when an event is fully processed compared to the time stored in the event. If this value is greater than five minutes this problem will occur.

There should be a way to warn about this problem and detect it rather than comparing the TTL of events manually with the time that they were processed.

Thank you,

Rusty

Document issue where riemann tcp config needs to be specified

Riemann should be listening on the master process on 0.0.0.0:5555 (or whatever the desired interface is). I'm not sure yet where this needs to be specified.

Consider a way to automatically re-enable black hole hosts

It would be nice if somehow, manual intervention weren't always required to bring a host back out of its black hole status and get it re-added to the whitelist.

Need to think about a specific strategy for this.

Adding new recipes should not require a jar be re-deployed to satellite-slaves

Currently there are 3 components to generating a notification for issues raised by satellite:

Grab a metric specified in recipes.clj
Publish a condition if the metric meets some threshold criteria specified in a json file (called a comet I think)
Determine how to handle the condition (notification mechanisms, automated resolution actions, etc.) in riemann-config.clj on the satellite-master.

Unfortunately, adding a new metric (step 1 above) requires that we change the recipes.clj file in the satellite-slave. In order to deploy the change we need to recompile the jar, which is not ideal. It would be nice if the recipes could be stored outside of the jar.

Provide a user interface for Satellite

This will be used alongside the existing Mesos user interface. It will not serve as a replacement for the Mesos UI.

In general, the Satellite-specific UI would go into detail about hosts, whereas the existing Mesos UI would be the definitive source for more information about Tasks.

Adevertise host

Hi,

I recently deployed 3 mesos/master (master<1|2|3>) and 2 mesos/slave (slave<1|2>) each one with satellite/master and satellite/slave respectively. Each server has 2 interfaces for internal (10.10.10.X) and external traffic (x.x.x.x).

I'm using this config in master:

satellite:

(ns satellite.core)
(def settings
  (merge settings
         {:mesos-master-url (url/url "http://master3.internal.domain.com:5050")
          :sleep-time 5000
          :zookeeper "10.10.10.1:2181,10.10.10.2:2181,10.10.10.3:2181"
          :local-whitelist-path "/etc/mesos/whitelist"
          :riemann-config "./config/riemann-debug.clj"
          :riemann-tcp-server-options {:host "0.0.0.0" }
          :service-host "0.0.0.0"}))

riemann:

(streams prn)

Here is my satellite/slave config:

(def mesos-work-dir "/tmp/mesos")

(def settings
  {:satellites [{:host "master1.internal.domain.com"}
                {:host "master2.internal.domain.com"}
                {:host "master3.internal.domain.com"}]
   :service "mesos/slave/"
   :comets [(satellite-slave.recipes/free-memory 50 (-> 60 t/seconds))
            (satellite-slave.recipes/free-swap   50 (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 "/tmp" (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 "/var" (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 mesos-work-dir
                                                     (-> 60 t/seconds))
            (satellite-slave.recipes/num-uninterruptable-processes 10 (-> 60 t/seconds))
            (satellite-slave.recipes/load-average 30 (-> 60 t/seconds))

            {:command ["echo" "17"]
             :schedule (every (-> 60 t/seconds))
             :output (fn [{:keys [out err exit]}]
                       (let [v (-> out
                                   (clojure.string/split #"\s+")
                                   first
                                   (Integer/parseInt))]
                         [{:state "ok"
                           :metric v
                           :ttl 300
                           :description "example test -- number of files/dirs in cwd"}]))}]})

I can see metrics output in my slaves and satellite/leader. When I restart mesos the satellite/leader linked with mesos/leader changes his verbosity and I can see more logs in the new satellite/leader. I can think it's the desired behavior.

In zookeeper I can see my whitelist with 2 entries (my 2 slaves), but they are registered with my external domain.

I configured the internal interface but in satellite/leader I see logs with external.domain.com and internal.domain.com. Why I have this behavior? How can I force his hostname.

Regards

Stuck getting the slaves to report to the master

I'm stuck and could use a little guidance.

I have the master runnign on the mesos-masters. The two masters that are not on the mesos master node are following each other (according to the logs). The leader master comes up with

2015-11-26 10:40:21,977 INFO  satellite.core [main] - Starting Satellite
2015-11-26 10:40:21,978 INFO  satellite.core [main] - Reading config from file: /etc/satellite/satellite-config.clj
2015-11-26 10:40:22,397 INFO  satellite.riemann.services.leader [clojure-agent-send-off-pool-0] - Starting leader service
2015-11-26 10:40:22,401 INFO  satellite.riemann.services.leader [clojure-agent-send-off-pool-0] - Leader service started
2015-11-26 10:40:22,402 INFO  satellite.riemann.services.whitelist [clojure-agent-send-off-pool-0] - Starting whitelist service
2015-11-26 10:40:22,429 INFO  satellite.riemann.services.http [clojure-agent-send-off-pool-1] - Starting http server on port 5001
2015-11-26 10:40:22,429 INFO  riemann.core [main] - Hyperspace core online
2015-11-26 10:40:22,441 INFO  satellite.riemann.services.curator [clojure-agent-send-off-pool-3] - Starting curator service
2015-11-26 10:40:22,898 INFO  org.apache.curator.framework.imps.CuratorFrameworkImpl [clojure-agent-send-off-pool-3] - Starting
2015-11-26 10:40:22,931 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2015-11-26 10:40:22,931 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:host.name=s-mesos-master-2
2015-11-26 10:40:22,932 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.version=1.8.0_66-internal
2015-11-26 10:40:22,932 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.vendor=Oracle Corporation
2015-11-26 10:40:22,932 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2015-11-26 10:40:22,932 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.class.path=satellite-master.jar
2015-11-26 10:40:22,932 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2015-11-26 10:40:22,932 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.io.tmpdir=/tmp
2015-11-26 10:40:22,933 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.compiler=<NA>
2015-11-26 10:40:22,933 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:os.name=Linux
2015-11-26 10:40:22,933 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:os.arch=amd64
2015-11-26 10:40:22,933 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:os.version=3.13.0-63-generic
2015-11-26 10:40:22,933 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:user.name=root
2015-11-26 10:40:22,933 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:user.home=/root
2015-11-26 10:40:22,933 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:user.dir=/opt/satellite/satellite-master
2015-11-26 10:40:22,934 INFO  org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Initiating client connection, connectString=10.0.137.248:2181 sessionTimeout=180000 watcher=org.apache.curator.ConnectionState@4279f493
2015-11-26 10:40:22,937 INFO  riemann.transport.tcp [clojure-agent-send-off-pool-4] - TCP server 10.0.137.248 5555 online
2015-11-26 10:40:22,944 INFO  satellite.riemann.monitor [clojure-agent-send-off-pool-4] - Tracking follower http://10.0.137.248:5050
2015-11-26 10:40:22,999 INFO  org.apache.zookeeper.ClientCnxn [clojure-agent-send-off-pool-3-SendThread(10.0.137.248:2181)] - Opening socket connection to server 10.0.137.248/10.0.137.248:2181. Will not attempt to authenticate using SASL (unknown error)
2015-11-26 10:40:23,012 INFO  org.apache.zookeeper.ClientCnxn [clojure-agent-send-off-pool-3-SendThread(10.0.137.248:2181)] - Socket connection established to 10.0.137.248/10.0.137.248:2181, initiating session
2015-11-26 10:40:23,026 INFO  org.apache.zookeeper.ClientCnxn [clojure-agent-send-off-pool-3-SendThread(10.0.137.248:2181)] - Session establishment complete on server 10.0.137.248/10.0.137.248:2181, sessionid = 0x251398577df45d4, negotiated timeout = 40000
2015-11-26 10:40:23,031 INFO  org.apache.curator.framework.state.ConnectionStateManager [clojure-agent-send-off-pool-3-EventThread] - State change: CONNECTED
2015-11-26 10:40:24,045 INFO  satellite.riemann.services.curator [clojure-agent-send-off-pool-3] - Curator service started
2015-11-26 10:40:24,089 INFO  satellite.riemann.services.whitelist [clojure-agent-send-off-pool-0] - Whitelist service started
2015-11-26 10:40:24,105 INFO  satellite.riemann.services.http [clojure-agent-send-off-pool-1] - Http server started
2015-11-26 10:40:24,121 INFO  org.eclipse.jetty.server.Server [clojure-agent-send-off-pool-0] - jetty-9.2.z-SNAPSHOT
2015-11-26 10:40:24,150 INFO  org.eclipse.jetty.server.ServerConnector [clojure-agent-send-off-pool-0] - Started ServerConnector@56b79007{HTTP/1.1}{10.0.137.248:5001}
2015-11-26 10:40:24,151 INFO  org.eclipse.jetty.server.Server [clojure-agent-send-off-pool-0] - Started @11995ms

All the slaves output:

2015-11-26 10:41:06,483 INFO  satellite-slave.core [main] - Starting Satellite-Slave
2015-11-26 10:41:06,484 INFO  satellite-slave.core [main] - Reading config from file: /etc/satellite/slave-config.clj

My riemann config is simply (as per the debugging section in the docs)

(stream prn)

The above is all I ever see the the logs.

I used the riemann-cli tool to send messages directly from the slave node to the master node, and those DO print out in the master log. But nothing seems to get there from the satellite slaves.

I also ran the slave in --dry-run and it DID emit messages for all checks on the intervals specified in the config.

I'm not sure what else I can do before I fork the repo and start adding log statements.

How to Install it without Ansible

How can I install satellite if I do not have Ansible?

Long term: create a DCOS package for Satellite?

Mesosphere appears to be pushing DCOS hard.

If we made a DCOS package for Satellite, would that provide the ideal installation experience for people who run Mesos via DCOS?

How to send riemann metrics to other Riemann host?

Guys I am trying to send satellite Riemann metrics to my Riemann server, but it's not working, every time I start satellite master it start Riemann server also

"riemann-tcp-server-options": { "host": "monitor-000-staging.abhishekamralkar.com" },

I am making changes in satellite-settings.json file on Satellite Master servers.

Can start two master process on same node, neither one shows any errors

yaml configuration options

Users who are not comfortable w/ Clojure have asked for a yaml or json based configuration file option satellite. we can discuss design and details here.

Expired events should be reaped after some time period

I have an expired event from yesterday and it still shows in the whitelist response. The expired event would have gone away if I continued to monitor /tmp/mesos, assuming the state changed back to ok. But I changed the recipe to monitor /var/lib/mesos and this old event is just hanging around. The Satellite masters were restarted yesterday after the event occurred. I haven't tried restarting the slaves.

{"jb10.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076213197E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb10.example.com"}
    },
  "manual-events":{},"managed-flag":"on","manual-flag":null},
"jb8.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076258248E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb8.example.com"}
    },
    "manual-events":{},"managed-flag":"on","manual-flag":null},
"jb9.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076198146E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb9.example.com"}
    },
    "manual-events":{},"managed-flag":"on","manual-flag":null}
}

I could probably send a test event with the same service and a state of "ok" to clean this up, but it would be nice if Satellite could do some housekeeping.

Improve docs for getting started

User reported these confusions from lack of docs--we shouldn't need to trace through the code to understand these:

Riemann was using default port 5555, which I did not have exposed via docker nor open in my aws security group. To find about this port I had to trace into the riemann code. It would be nice if the examples for :riemann-tcp-server-options and :satellites included the port. While now it obvious, for a bit it was not clear to me that the master was listening on two different ports (rest-api and riemann).

I was using (stream prn) vs (streams prn) for a bit there.

Configuration management mechanism for satellite-slaves

In order to update the rules run by a slave I need to deploy either a .clj or a .json file to each satellite-slave host.

It would be nice if I could manage configurations on a satellite-master and be able to "reload" the satellite-slaves after making changes. Even better if they periodically reload themselves (would require syntactic validation on the master so that a slave doesn't attempt to load an invalid configuration).

Upgrade clojure/schema/plumbing dependencies

Looks like the AOT issue with Plumbing has been resolved:

plumatic/plumbing#74

...so we should be free to upgrade now as long as we're on Clojure 1.7.

Currently Satellite is at 1.6...
Clojure 1.7 became the stable release this past June 30

force config to chroot zookeeper

Clarify the satellite landing page

Currently, it's not obvious how there's lots of docs in the master & slave subfolders. We should make the main README.md clearer, by imitating the one from Cook.

Blacklist?

Is there way to use the manual whitelist to simulate forcing a node off the whitelist. For draining a node that is going down or into maintenance?

Deprecating /stats.json endpoints

Hello,

This project looks very interesting but its now not compatible with the last version of Mesos (0.23) :
https://issues.apache.org/jira/browse/MESOS-2058

Indicating satellite-masters to the slaves?

It seems that the only way to indicate the masters is to explicitly list them in the slaves config. I really don't want to have to update all the slave config files if a master is lost/added/removed. Our masters will be registered with consul so I can provide the dns in the slave config. But then only one master would be resolved by the slave.

Has any thought been given about master discover for this scenario?

cache-state function broken for riak integration

With the changes to specify comets in json files it looks like the comet specification for cache-state needs to be updated for users who store mesos task state in Riak via Satellite.

(defn cache-state
  "Return a map which could be used as input for core/run-test.

   Args:
       threshold: Long, specifies the how many failures would trigger a riemann event with critical state.
       period: Seconds, specifies the period this task / test expected to run.
       conifg: Map, specifies the config of this task / test which is expected to have keys :slave (optional), :riak-url, and :bucket.
   Output:
       Return a map which could be used as input for core/run-test."
  [period config]
    {:command ["echo" "cache-state"]
     :schedule (every period)
     :output (fn [& params]
               (let [{threshold :threshold
                      riak-url :riak-url
                      bucket   :bucket} config
                      failures (try
                                 (cache/cache-state-and-count-failures (:slave-host config/settings) riak-url bucket)
                                 (catch Throwable e
                                   (log/error e "Failed to cache state.")
                                   ;; Return a value strictly larger that the threshold.
                                   (inc threshold)))]
                 [{:ttl (* 5 (.getSeconds period))
                  :service "cache state"
                  :state (if (> failures threshold) "critical" "ok")
                  :metric failures}]))})

Provide a CLI

Satellite should expose a Command Line Interface, more full-featured than the web UI (which would be read-only). We should leverage the DCOS CLI in order to implement this.

Boolean inversion in whitelist hostname-malformed check

satellite/satellite-master/src/satellite/services/whitelist.clj

Line 63 in bb39fe0

(.isReachable addr 1000)

(.isReachable addr 1000)

Should probably say:
(not (.isReachable addr 1000))

Detect black-hole hosts and automatically disable them

We should be able to detect when an agent is failing some # of tasks per minute, and if that failure rate exceeds a target, we should automatically disable that agent until it can be investigated.

satellite-slave won't start using json specification of comets

When trying to start a satellite-slave using the new json method for specifying comets I'm getting an unexpected stacktrace. Could you take a look and provide any insight into what's happening?

main clojure config file
`
(require 'satellite-slave.mesos.recipes)
(def mesos-work-dir "/some/path/to/mesos/work")

(def settings
(-> "/some/path/to/comets.json"
clojure.java.io/reader
(cheshire/parse-stream true)
))
`

comets.json
{ "satellites": [ {"host": "fakehostname.fakedomain.fake"}, {"host": "fakehostname.fakedomain.fake"}, {"host": "fakehostname.fakedomain.fake"} ], "service": "mesos/slave/", "comets": [ {"name": "satellite-slave.recipes/free-memory", "period": "60s", "params": {"threshold": 1024}}, {"name": "satellite-slave.recipes/free-swap", "period": "60s", "params": {"threshold": 50}}, {"name": "satellite-slave.recipes/percentage-used", "period": "60s", "params": {"threshold": 90, "path": "/tmp"}}, {"name": "satellite-slave.recipes/percentage-used", "period": "60s", "params": {"threshold": 90, "path": "/var"}}, {"name": "satellite-slave.recipes/percentage-used", "period": "60s", "params": {"threshold": 90, "path": "/some/path/to/work/dir"}}, {"name": "satellite-slave.recipes/num-uninterruptable-processes", "period": "60s", "params": {"threshold": 10}}, {"name": "satellite-slave.recipes/load-average", "period": "60s", "params": {"threshold": 128}} ] }

Resultant stacktrace:
Exception in thread "async-dispatch-3" Exception in thread "async-dispatch-5" Exception in thread "async-dispatch-1" java.lang.IllegalArgumentException: No implementation of method: :before? of protocol: #'clj-time.core/DateTimeProtocol found for class: nil at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:544) at clj_time.core$fn__115$G__56__122.invoke(core.clj:94) at chime$ms_between.invoke(chime.clj:8) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232$fn__6234.invoke(chime.clj:41) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232.invoke(chime.clj:37) at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940) at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944) at chime$chime_ch$fn__6231.invoke(chime.clj:37) at clojure.lang.AFn.run(AFn.java:22) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Exception in thread "async-dispatch-7" java.lang.IllegalArgumentException: No implementation of method: :before? of protocol: #'clj-time.core/DateTimeProtocol found for class: nil at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:544) at clj_time.core$fn__115$G__56__122.invoke(core.clj:94) at chime$ms_between.invoke(chime.clj:8) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232$fn__6234.invoke(chime.clj:41) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232.invoke(chime.clj:37) at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940) at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944) at chime$chime_ch$fn__6231.invoke(chime.clj:37) at clojure.lang.AFn.run(AFn.java:22) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.lang.IllegalArgumentException: No implementation of method: :before? of protocol: #'clj-time.core/DateTimeProtocol found for class: nil at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:544) at clj_time.core$fn__115$G__56__122.invoke(core.clj:94) at chime$ms_between.invoke(chime.clj:8) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232$fn__6234.invoke(chime.clj:41) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232.invoke(chime.clj:37) at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940) at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944) at chime$chime_ch$fn__6231.invoke(chime.clj:37) at clojure.lang.AFn.run(AFn.java:22) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Exception in thread "async-dispatch-16" Exception in thread "async-dispatch-13" java.lang.IllegalArgumentException: No implementation of method: :before? of protocol: #'clj-time.core/DateTimeProtocol found for class: nil at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:544) at clj_time.core$fn__115$G__56__122.invoke(core.clj:94) at chime$ms_between.invoke(chime.clj:8) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232$fn__6234.invoke(chime.clj:41) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232.invoke(chime.clj:37) at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940) at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944) at chime$chime_ch$fn__6231.invoke(chime.clj:37) at clojure.lang.AFn.run(AFn.java:22) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Exception in thread "async-dispatch-19" java.lang.IllegalArgumentException: No implementation of method: :before? of protocol: #'clj-time.core/DateTimeProtocol found for class: nil at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:544) at clj_time.core$fn__115$G__56__122.invoke(core.clj:94) at chime$ms_between.invoke(chime.clj:8) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232$fn__6234.invoke(chime.clj:41) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232.invoke(chime.clj:37) at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940) at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944) at chime$chime_ch$fn__6231.invoke(chime.clj:37) at clojure.lang.AFn.run(AFn.java:22) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.lang.IllegalArgumentException: No implementation of method: :before? of protocol: #'clj-time.core/DateTimeProtocol found for class: nil at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:544) at clj_time.core$fn__115$G__56__122.invoke(core.clj:94) at chime$ms_between.invoke(chime.clj:8) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232$fn__6234.invoke(chime.clj:41) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232.invoke(chime.clj:37) at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940) at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944) at chime$chime_ch$fn__6231.invoke(chime.clj:37) at clojure.lang.AFn.run(AFn.java:22) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.lang.IllegalArgumentException: No implementation of method: :before? of protocol: #'clj-time.core/DateTimeProtocol found for class: nil at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:544) at clj_time.core$fn__115$G__56__122.invoke(core.clj:94) at chime$ms_between.invoke(chime.clj:8) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232$fn__6234.invoke(chime.clj:41) at chime$chime_ch$fn__6231$state_machine__4346__auto____6232.invoke(chime.clj:37) at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940) at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944) at chime$chime_ch$fn__6231.invoke(chime.clj:37) at clojure.lang.AFn.run(AFn.java:22) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

Recipe which calculates free memory excludes buffers/cache

Hosts are being de-whitelisted because the memory considered to be "free" is under our threshold. This is problematic because there is plenty of memory in buffers and caches which could be offered by the mesos slave to run jobs.

We want to consider all memory in (free|buffer|cache) to be "free". I believe the change should go in the arguments for "sed" in satellite/satellite-slave/bin/satellite-recipes:

"free-memory") free -m | tr -s ' '| cut -d' ' -f4 | sed -n '3p'

Satellite-slave configuration management

In order to update comets run by a satellite-slave, a file must be updated on each satellite-slave host and the process restarted. It would be helpful to have a central location (possibly on the masters) for managing the slave configurations. It would also be helpful if it were possible to trigger a "reload" of a given slave's comet configuration.

Instructions unclear

Couple of things are missing from the documentation:

Instruction to run the project
What to expect when you run it

Can someone post screenshots or execution command of the same?

Add integration tests for satellite using minimesos

http://minimesos.org/

This would establish some confidence that satellite-master and satellite-slave are behaving correctly when working together.

twosigma / satellite Goto Github PK

satellite's People

Contributors

Stargazers

Watchers

Forkers

satellite's Issues

Recommend Projects

Recommend Topics

Recommend Org