twosigma / satellite
Satellite monitors, alerts on, and self-heals your Mesos cluster.
License: Apache License 2.0
I have two Mesos clusters. The first has 3 servers, each running ZooKeeper, a Mesos master, a Mesos slave, a Satellite master, and a Satellite slave. The second has 6 servers, with masters and slaves on separate hosts. On both clusters, all 3 Mesos slaves are active.
At one point, on the first cluster (3 servers), I was able to see all slaves registered in the whitelist. Then I had a high load on 2 of the servers, and those 2 dropped from the whitelist. Once the load came down, one slave re-registered with the whitelist and the other didn't.
On my second cluster (6 servers), one of the slave servers never registers.
In both cases, I see mesos/slave events from all 3 satellite slaves reaching the satellite master leader.
/usr/sbin/mesos-master --zk=zk://jb5.example.com:2181,jb6.example.com:2181,jb7.example.com:2181/mesos --port=5050 --log_dir=/var/log/mesos --quorum=2 --whitelist=/etc/mesos/slave_list.txt --work_dir=/var/lib/mesos
On all 3 Mesos masters, the whitelist file starts out like this:
$ cat /etc/mesos/slave_list.txt
jb8.example.com
jb9.example.com
jb10.example.com
Each Satellite master has a satellite-config.clj file that looks like this. The only difference is that each config points to the server it's installed on; for example, on jb6, all the jb5s are replaced with jb6.
(def settings
(merge settings
{:mesos-master-url (url/url "http://jb5.example.com:5050")
:sleep-time 5000
:zookeeper "jb5.example.com:2181"
:local-whitelist-path "/etc/mesos/slave_list.txt"
:riemann-tcp-server-options {:host "jb5.example.com" }
:service-host "jb5.example.com"}))
I made no changes to the satellite-config.clj file. My slave-config.clj file looks like this:
(def mesos-work-dir "/var/lib/mesos")
(def settings
{:satellites [{:host "jb5.example.com"}
{:host "jb6.example.com"}
{:host "jb7.example.com"}]
:service "mesos/slave/"
:comets [(satellite-slave.recipes/free-memory 50 (-> 60 t/seconds))
(satellite-slave.recipes/free-swap 50 (-> 60 t/seconds))
(satellite-slave.recipes/percentage-used 90 "/tmp" (-> 60 t/seconds))
(satellite-slave.recipes/percentage-used 90 "/var" (-> 60 t/seconds))
(satellite-slave.recipes/percentage-used 90 mesos-work-dir
(-> 60 t/seconds))
(satellite-slave.recipes/num-uninterruptable-processes 10 (-> 60 t/seconds))
(satellite-slave.recipes/load-average 4.5 (-> 60 t/seconds))
{:command ["echo" "17"]
:schedule (every (-> 60 t/seconds))
:output (fn [{:keys [out err exit]}]
(let [v (-> out
(clojure.string/split #"\s+")
first
(Integer/parseInt))]
[{:state "ok"
:metric v
:ttl 300
:description "example test -- number of files/dirs in cwd"}]))}]})
So the only change I made was to add my satellite masters. If I query the whitelist API I see the following results:
{"jb10.np.ev1.yellowpages.com":
{"managed-events":{},"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb8.np.ev1.yellowpages.com":
{"managed-events":{},"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb9.np.ev1.yellowpages.com":
{"managed-events":{},"manual-events":{},"managed-flag":null,"manual-flag":null}}
In my slave-config.clj, I changed the mesos-work-dir from /tmp/mesos to /var/lib/mesos. I restarted the slaves. This caused expired events for /tmp/mesos. That expired event ended up changing the whitelist and now jb9 & jb10 are 'on' and jb8 is 'off'.
{"jb10.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076213197E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb10.example.com"}
},
"manual-events":{},"managed-flag":"on","manual-flag":null
},
"jb8.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076258248E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb8.example.com"}
},
"manual-events":{},"managed-flag":"off","manual-flag":null
},
"jb9.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076198146E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb9.example.com"}
},
"manual-events":{},"managed-flag":"on","manual-flag":null
}
}
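As an aside, the "time" values in the payload above are Unix epoch timestamps in seconds, printed in scientific notation. A quick conversion (Python here purely for illustration):

```python
from datetime import datetime, timezone

# "time":1.444076213197E9 from the payload above is Unix epoch seconds.
ts = 1.444076213197e9
when = datetime.fromtimestamp(ts, tz=timezone.utc)
print(when.isoformat())  # a UTC date in early October 2015
```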
And /metrics/snapshot also says 2 out of 3:
{"prop-available-hosts":0.6666666666666667,"num-available-hosts":3,"num-hosts-up":2}
This:
(.isReachable addr 1000)
should probably be:
(not (.isReachable addr 1000))
I have an expired event from yesterday and it still shows in the whitelist response. The expired event would have gone away if I had continued to monitor /tmp/mesos, assuming the state changed back to ok. But I changed the recipe to monitor /var/lib/mesos instead, so this old event is just hanging around. The Satellite masters were restarted yesterday after the event occurred. I haven't tried restarting the slaves.
{"jb10.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076213197E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb10.example.com"}
},
"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb8.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076258248E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb8.example.com"}
},
"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb9.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076198146E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb9.example.com"}
},
"manual-events":{},"managed-flag":"on","manual-flag":null}
}
I could probably send a test event with the same service and a state of "ok" to clean this up, but it would be nice if Satellite could do some housekeeping.
In order to update comets run by a satellite-slave, a file must be updated on each satellite-slave host and the process restarted. It would be helpful to have a central location (possibly on the masters) for managing the slave configurations. It would also be helpful if it were possible to trigger a "reload" of a given slave's comet configuration.
The sample config passes in a Joda-Time object, and then the recipe tries to multiply it by 5. I'm not sure which type of object is expected by whatever code processes this data (riemann?).
Exception in thread "main" java.lang.ClassCastException: org.joda.time.Seconds cannot be cast to java.lang.Number, compiling:(sample.clj:8:13)
at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3558)
at clojure.lang.Compiler$VectorExpr.eval(Compiler.java:3101)
at clojure.lang.Compiler$MapExpr.eval(Compiler.java:2931)
at clojure.lang.Compiler$DefExpr.eval(Compiler.java:417)
at clojure.lang.Compiler.eval(Compiler.java:6708)
at clojure.lang.Compiler.load(Compiler.java:7130)
at clojure.lang.Compiler.loadFile(Compiler.java:7086)
at clojure.lang.RT$3.invoke(RT.java:318)
at satellite_slave.config$include.invoke(config.clj:75)
at satellite_slave.core$_main.doInvoke(core.clj:212)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at satellite_slave.core.main(Unknown Source)
Caused by: java.lang.ClassCastException: org.joda.time.Seconds cannot be cast to java.lang.Number
at clojure.lang.Numbers.multiply(Numbers.java:146)
at clojure.lang.Numbers.multiply(Numbers.java:3659)
at satellite_slave.recipes$free_memory.invoke(recipes.clj:22)
at clojure.lang.AFn.applyToHelper(AFn.java:156)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3553)
... 11 more
Currently, it's not obvious that there are lots of docs in the master & slave subfolders. We should make the main README.md clearer by imitating the one from Cook.
How can I install satellite if I do not have Ansible?
Could serve as a form of distribution, alternative to Linux system packages or DCOS packages.
Hi there.
Thanks for satellite. Is there more documentation somewhere or an irc channel to discuss deeper details with this project?
Thanks in advance!
Riemann should be listening on the master process on 0.0.0.0:5555 (or whatever the desired interface is). I'm not sure yet where this needs to be specified.
Hello.
I've managed to install satellite on my mesos cluster. Riemann is pushing data into a local influxdb for grafana, but I'd also like to push slave state.json.
My riemann config is very simple:
(def inflxdb (graphite {:host "master1" :port 2003}))
(streams
inflxdb
)
My slave config:
(def mesos-work-dir "/tmp/mesos")
(def settings
  {:satellites [{:host "master1"}]
   :service "mesos/slave/"})
In influxdb, I see localhost.mesos/* with master info and consolidated information for my frameworks, but no slave information. Is this possible?
Thanks.
Hello,
The sample config for the slave process doesn't work; it says that satellite-recipes cannot be found.
In order to update the rules run by a slave I need to deploy either a .clj or a .json file to each satellite-slave host.
It would be nice if I could manage configurations on a satellite-master and be able to "reload" the satellite-slaves after making changes. Even better if they periodically reload themselves (would require syntactic validation on the master so that a slave doesn't attempt to load an invalid configuration).
Here's a great place to get started. http://choosealicense.com
A couple of things are missing from the documentation:
Can someone post screenshots or the execution command for these?
Currently there are 3 components to generating a notification for issues raised by satellite:
Unfortunately, adding a new metric (step 1 above) requires changing the recipes.clj file in the satellite-slave. To deploy the change we need to recompile the jar, which is not ideal. It would be nice if the recipes could be stored outside the jar.
Hi,
I recently deployed 3 mesos/master (master<1|2|3>) and 2 mesos/slave (slave<1|2>) each one with satellite/master and satellite/slave respectively. Each server has 2 interfaces for internal (10.10.10.X) and external traffic (x.x.x.x).
I'm using this config in master:
satellite:
(ns satellite.core)
(def settings
(merge settings
{:mesos-master-url (url/url "http://master3.internal.domain.com:5050")
:sleep-time 5000
:zookeeper "10.10.10.1:2181,10.10.10.2:2181,10.10.10.3:2181"
:local-whitelist-path "/etc/mesos/whitelist"
:riemann-config "./config/riemann-debug.clj"
:riemann-tcp-server-options {:host "0.0.0.0" }
:service-host "0.0.0.0"}))
riemann:
(streams prn)
Here is my satellite/slave config:
(def mesos-work-dir "/tmp/mesos")
(def settings
{:satellites [{:host "master1.internal.domain.com"}
{:host "master2.internal.domain.com"}
{:host "master3.internal.domain.com"}]
:service "mesos/slave/"
:comets [(satellite-slave.recipes/free-memory 50 (-> 60 t/seconds))
(satellite-slave.recipes/free-swap 50 (-> 60 t/seconds))
(satellite-slave.recipes/percentage-used 90 "/tmp" (-> 60 t/seconds))
(satellite-slave.recipes/percentage-used 90 "/var" (-> 60 t/seconds))
(satellite-slave.recipes/percentage-used 90 mesos-work-dir
(-> 60 t/seconds))
(satellite-slave.recipes/num-uninterruptable-processes 10 (-> 60 t/seconds))
(satellite-slave.recipes/load-average 30 (-> 60 t/seconds))
{:command ["echo" "17"]
:schedule (every (-> 60 t/seconds))
:output (fn [{:keys [out err exit]}]
(let [v (-> out
(clojure.string/split #"\s+")
first
(Integer/parseInt))]
[{:state "ok"
:metric v
:ttl 300
:description "example test -- number of files/dirs in cwd"}]))}]})
I can see metrics output in my slaves and the satellite/leader. When I restart mesos, the satellite/leader linked with the mesos/leader changes its verbosity and I can see more logs in the new satellite/leader. I think this is the desired behavior.
In zookeeper I can see my whitelist with 2 entries (my 2 slaves), but they are registered with my external domain.
I configured the internal interface, but in the satellite/leader logs I see both external.domain.com and internal.domain.com. Why do I see this behavior? How can I force the hostname?
Regards
A user reported these points of confusion caused by the lack of docs; we shouldn't need to trace through the code to understand these:
Riemann was using default port 5555, which I did not have exposed via docker nor open in my aws security group. To find out about this port I had to trace into the riemann code. It would be nice if the examples for :riemann-tcp-server-options and :satellites included the port. While it's obvious now, for a while it was not clear to me that the master was listening on two different ports (rest-api and riemann).
I was also using (stream prn) instead of (streams prn) for a while.
According to mesos docs:
--work_dir=VALUE Directory path to place framework work directories
(default: /tmp/mesos)
It would be nice if manual intervention weren't always required to bring a host back out of its black hole status and get it re-added to the whitelist.
Need to think about a specific strategy for this.
This will be used alongside the existing Mesos user interface. It will not serve as a replacement for the Mesos UI.
In general, the Satellite-specific UI would go into detail about hosts, whereas the existing Mesos UI would be the definitive source for more information about Tasks.
When trying to start a satellite-slave using the new json method for specifying comets I'm getting an unexpected stacktrace. Could you take a look and provide any insight into what's happening?
main clojure config file
(require 'satellite-slave.mesos.recipes)
(def mesos-work-dir "/some/path/to/mesos/work")
(def settings
(-> "/some/path/to/comets.json"
clojure.java.io/reader
(cheshire/parse-stream true)
))
comets.json
{ "satellites": [ {"host": "fakehostname.fakedomain.fake"}, {"host": "fakehostname.fakedomain.fake"}, {"host": "fakehostname.fakedomain.fake"} ], "service": "mesos/slave/", "comets": [ {"name": "satellite-slave.recipes/free-memory", "period": "60s", "params": {"threshold": 1024}}, {"name": "satellite-slave.recipes/free-swap", "period": "60s", "params": {"threshold": 50}}, {"name": "satellite-slave.recipes/percentage-used", "period": "60s", "params": {"threshold": 90, "path": "/tmp"}}, {"name": "satellite-slave.recipes/percentage-used", "period": "60s", "params": {"threshold": 90, "path": "/var"}}, {"name": "satellite-slave.recipes/percentage-used", "period": "60s", "params": {"threshold": 90, "path": "/some/path/to/work/dir"}}, {"name": "satellite-slave.recipes/num-uninterruptable-processes", "period": "60s", "params": {"threshold": 10}}, {"name": "satellite-slave.recipes/load-average", "period": "60s", "params": {"threshold": 128}} ] }
Resultant stacktrace:
Exception in thread "async-dispatch-3" java.lang.IllegalArgumentException: No implementation of method: :before? of protocol: #'clj-time.core/DateTimeProtocol found for class: nil
    at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:544)
    at clj_time.core$fn__115$G__56__122.invoke(core.clj:94)
    at chime$ms_between.invoke(chime.clj:8)
    at chime$chime_ch$fn__6231$state_machine__4346__auto____6232$fn__6234.invoke(chime.clj:41)
    at chime$chime_ch$fn__6231$state_machine__4346__auto____6232.invoke(chime.clj:37)
    at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940)
    at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944)
    at chime$chime_ch$fn__6231.invoke(chime.clj:37)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
(The identical trace repeats for threads "async-dispatch-1", "-5", "-7", "-13", "-16", and "-19".)
This fits with Mesos language change from slave to agent
We should be able to detect when an agent is failing some # of tasks per minute, and if that failure rate exceeds a target, we should automatically disable that agent until it can be investigated.
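As a sketch of what this auto-disable logic could look like (the class name, thresholds, and window size are all illustrative assumptions, not Satellite code):

```python
import collections
import time

class FailureRateMonitor:
    """Sliding-window task-failure counter per agent: a sketch of the
    proposed behavior of disabling an agent whose failure rate exceeds
    a target number of failures per minute."""

    def __init__(self, max_failures_per_minute=5, window_s=60):
        self.max_failures = max_failures_per_minute
        self.window_s = window_s
        self.failures = collections.defaultdict(collections.deque)

    def _prune(self, agent, now):
        # Drop failure timestamps that fell out of the window.
        q = self.failures[agent]
        while q and q[0] < now - self.window_s:
            q.popleft()
        return q

    def record_failure(self, agent, now=None):
        now = time.time() if now is None else now
        self.failures[agent].append(now)
        self._prune(agent, now)

    def should_disable(self, agent, now=None):
        now = time.time() if now is None else now
        return len(self._prune(agent, now)) > self.max_failures
```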
We do have some automated tests, and I'd like to add more as well. It would be nice for CI to run the whole suite for all branches associated with PR's.
We should run the test suite using Travis CI. I think that for now it should be as simple as setting travis to run "lein test" on both satellite-master and satellite-slave (which doesn't yet have any tests, but this may change).
Hello,
This project looks very interesting, but it's now not compatible with the latest version of Mesos (0.23):
https://issues.apache.org/jira/browse/MESOS-2058
The concern here is that if a host becomes considered a black hole, there is no way for it to get automatically re-added to the whitelist (because once removed from the whitelist, it will no longer be assigned tasks, so it will never get a chance to prove that it is now capable once again of completing tasks!). Thus, if something happens that blackholes a very large number of hosts, extensive manual intervention would be required in order to re-enable hosts.
Suggested safeguards:
If a host would be "black holed", but this would cause one of the safeguards above to be violated, log the issue and just move on (take no other action). If the host continues to fail a lot of tasks, it will be re-evaluated as a black hole soon enough (and maybe at that point there will be enough room in the black hole dungeon).
It seems that the only way to indicate the masters is to explicitly list them in the slaves' config. I really don't want to have to update all the slave config files if a master is lost/added/removed. Our masters will be registered with consul, so I can provide the DNS name in the slave config, but then only one master would be resolved by the slave.
Has any thought been given to master discovery for this scenario?
Based on the current configuration defaults a satellite-master points directly at a single Mesos master. Without a 1-to-1 mapping of satellite-masters to mesos-masters, this could be problematic in the event of a leader election for Mesos master.
It would be helpful if Satellite could watch zookeeper for a Mesos-master election and automatically point to the active master.
E.g. a .deb for Debian / Ubuntu.
This way, we could piggyback on existing dependency version management and init systems of various Linux distributions in order to ensure that Satellite is installed on a system with compatible dependencies and is always running when it needs to be.
Satellite should expose a Command Line Interface, more full-featured than the web UI (which would be read-only). We should leverage the DCOS CLI in order to implement this.
Guys, I am trying to send Satellite's Riemann metrics to my own Riemann server, but it's not working; every time I start the Satellite master it starts a Riemann server as well.
"riemann-tcp-server-options": { "host": "monitor-000-staging.abhishekamralkar.com" },
I am making these changes in the satellite-settings.json file on the Satellite Master servers.
It seems that the whitelist is written in such a way that the whitelist file is truncated and then the new whitelist is written to the same file. I determined this by watching the whitelist and noticing that the inode number never changes.
It would be better to create a new file, write the new whitelist contents there, and then rename that new file over the whitelist. The rename is an atomic operation, so there would never be a moment when the whitelist contained zero hosts, which does happen now.
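A minimal sketch of the write-to-temp-then-rename approach (in Python; the function name and temp-file prefix are illustrative):

```python
import os
import tempfile

def write_whitelist_atomically(path, hosts):
    """Write the new whitelist to a temp file in the same directory,
    then rename it over the old file. rename(2) is atomic on POSIX
    when source and destination are on the same filesystem, so readers
    never observe an empty or partially written whitelist."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".whitelist-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(hosts) + "\n")
            f.flush()
            os.fsync(f.fileno())   # make sure contents hit disk before the rename
        os.rename(tmp_path, path)  # atomic replace; the inode changes each write
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Note that under this scheme the whitelist's inode changes on every write, which is exactly the observable difference from the truncate-and-rewrite behavior described above.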
Thank you,
Rusty
Line 16
With the changes to specify comets in json files it looks like the comet specification for cache-state needs to be updated for users who store mesos task state in Riak via Satellite.
(defn cache-state
"Return a map which could be used as input for core/run-test.
Args:
threshold: Long, specifies how many failures trigger a riemann event with critical state.
period: Seconds, specifies the period this task / test is expected to run.
config: Map, specifies the config of this task / test, which is expected to have keys :slave (optional), :riak-url, and :bucket.
Output:
Return a map which could be used as input for core/run-test."
[period config]
{:command ["echo" "cache-state"]
:schedule (every period)
:output (fn [& params]
(let [{threshold :threshold
riak-url :riak-url
bucket :bucket} config
failures (try
(cache/cache-state-and-count-failures (:slave-host config/settings) riak-url bucket)
(catch Throwable e
(log/error e "Failed to cache state.")
;; Return a value strictly larger that the threshold.
(inc threshold)))]
[{:ttl (* 5 (.getSeconds period))
:service "cache state"
:state (if (> failures threshold) "critical" "ok")
:metric failures}]))})
Users who are not comfortable with Clojure have asked for a yaml- or json-based configuration file option for Satellite. We can discuss design and details here.
Hosts are being de-whitelisted because the memory considered to be "free" is under our threshold. This is problematic because there is plenty of memory in buffers and caches which could be offered by the mesos slave to run jobs.
We want to consider all memory in (free|buffer|cache) to be "free". I believe the change should go in the arguments for "sed" in satellite/satellite-slave/bin/satellite-recipes:
"free-memory") free -m | tr -s ' '| cut -d' ' -f4 | sed -n '3p'
I'm stuck and could use a little guidance.
I have the master running on the mesos-masters. The two masters that are not on the mesos master node are following each other (according to the logs). The leader master comes up with:
2015-11-26 10:40:21,977 INFO satellite.core [main] - Starting Satellite
2015-11-26 10:40:21,978 INFO satellite.core [main] - Reading config from file: /etc/satellite/satellite-config.clj
2015-11-26 10:40:22,397 INFO satellite.riemann.services.leader [clojure-agent-send-off-pool-0] - Starting leader service
2015-11-26 10:40:22,401 INFO satellite.riemann.services.leader [clojure-agent-send-off-pool-0] - Leader service started
2015-11-26 10:40:22,402 INFO satellite.riemann.services.whitelist [clojure-agent-send-off-pool-0] - Starting whitelist service
2015-11-26 10:40:22,429 INFO satellite.riemann.services.http [clojure-agent-send-off-pool-1] - Starting http server on port 5001
2015-11-26 10:40:22,429 INFO riemann.core [main] - Hyperspace core online
2015-11-26 10:40:22,441 INFO satellite.riemann.services.curator [clojure-agent-send-off-pool-3] - Starting curator service
2015-11-26 10:40:22,898 INFO org.apache.curator.framework.imps.CuratorFrameworkImpl [clojure-agent-send-off-pool-3] - Starting
2015-11-26 10:40:22,931 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2015-11-26 10:40:22,931 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:host.name=s-mesos-master-2
2015-11-26 10:40:22,932 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.version=1.8.0_66-internal
2015-11-26 10:40:22,932 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.vendor=Oracle Corporation
2015-11-26 10:40:22,932 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2015-11-26 10:40:22,932 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.class.path=satellite-master.jar
2015-11-26 10:40:22,932 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2015-11-26 10:40:22,932 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.io.tmpdir=/tmp
2015-11-26 10:40:22,933 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:java.compiler=<NA>
2015-11-26 10:40:22,933 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:os.name=Linux
2015-11-26 10:40:22,933 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:os.arch=amd64
2015-11-26 10:40:22,933 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:os.version=3.13.0-63-generic
2015-11-26 10:40:22,933 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:user.name=root
2015-11-26 10:40:22,933 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:user.home=/root
2015-11-26 10:40:22,933 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Client environment:user.dir=/opt/satellite/satellite-master
2015-11-26 10:40:22,934 INFO org.apache.zookeeper.ZooKeeper [clojure-agent-send-off-pool-3] - Initiating client connection, connectString=10.0.137.248:2181 sessionTimeout=180000 watcher=org.apache.curator.ConnectionState@4279f493
2015-11-26 10:40:22,937 INFO riemann.transport.tcp [clojure-agent-send-off-pool-4] - TCP server 10.0.137.248 5555 online
2015-11-26 10:40:22,944 INFO satellite.riemann.monitor [clojure-agent-send-off-pool-4] - Tracking follower http://10.0.137.248:5050
2015-11-26 10:40:22,999 INFO org.apache.zookeeper.ClientCnxn [clojure-agent-send-off-pool-3-SendThread(10.0.137.248:2181)] - Opening socket connection to server 10.0.137.248/10.0.137.248:2181. Will not attempt to authenticate using SASL (unknown error)
2015-11-26 10:40:23,012 INFO org.apache.zookeeper.ClientCnxn [clojure-agent-send-off-pool-3-SendThread(10.0.137.248:2181)] - Socket connection established to 10.0.137.248/10.0.137.248:2181, initiating session
2015-11-26 10:40:23,026 INFO org.apache.zookeeper.ClientCnxn [clojure-agent-send-off-pool-3-SendThread(10.0.137.248:2181)] - Session establishment complete on server 10.0.137.248/10.0.137.248:2181, sessionid = 0x251398577df45d4, negotiated timeout = 40000
2015-11-26 10:40:23,031 INFO org.apache.curator.framework.state.ConnectionStateManager [clojure-agent-send-off-pool-3-EventThread] - State change: CONNECTED
2015-11-26 10:40:24,045 INFO satellite.riemann.services.curator [clojure-agent-send-off-pool-3] - Curator service started
2015-11-26 10:40:24,089 INFO satellite.riemann.services.whitelist [clojure-agent-send-off-pool-0] - Whitelist service started
2015-11-26 10:40:24,105 INFO satellite.riemann.services.http [clojure-agent-send-off-pool-1] - Http server started
2015-11-26 10:40:24,121 INFO org.eclipse.jetty.server.Server [clojure-agent-send-off-pool-0] - jetty-9.2.z-SNAPSHOT
2015-11-26 10:40:24,150 INFO org.eclipse.jetty.server.ServerConnector [clojure-agent-send-off-pool-0] - Started ServerConnector@56b79007{HTTP/1.1}{10.0.137.248:5001}
2015-11-26 10:40:24,151 INFO org.eclipse.jetty.server.Server [clojure-agent-send-off-pool-0] - Started @11995ms
All the slaves output:
2015-11-26 10:41:06,483 INFO satellite-slave.core [main] - Starting Satellite-Slave
2015-11-26 10:41:06,484 INFO satellite-slave.core [main] - Reading config from file: /etc/satellite/slave-config.clj
My riemann config is simply (as per the debugging section in the docs):
(stream prn)
The above is all I ever see in the logs.
I used the riemann-cli tool to send messages directly from the slave node to the master node, and those DO print out in the master log. But nothing seems to get there from the satellite slaves.
I also ran the slave in --dry-run and it DID emit messages for all checks on the intervals specified in the config.
I'm not sure what else I can do before I fork the repo and start adding log statements.
Is there a way to use the manual whitelist to simulate forcing a node off the whitelist, e.g. for draining a node that is going down or into maintenance?
This would establish some confidence that satellite-master and satellite-slave are behaving correctly when working together.
When Zookeeper has latency in saving events, hosts will be added to the blacklist because their events time out and are expired. Disabling the persistence of events prevents this problem and allows hosts to appear on the whitelist.
The way to diagnose this is to compare the time when an event is fully processed against the timestamp stored in the event. If the difference is greater than five minutes, this problem will occur.
There should be a way to detect and warn about this problem automatically, rather than manually comparing the TTL of events against the time they were processed.
Thank you,
Rusty
Mesosphere appears to be pushing DCOS hard.
If we made a DCOS package for Satellite, would that provide the ideal installation experience for people who run Mesos via DCOS?
Hello,
I've tried to build satellite and start it using the sample config:
java -jar satellite-master-0.2.0-SNAPSHOT-standalone.jar config/satellite-config.clj
The error is this:
Exception in thread "main" java.lang.RuntimeException: Unable to resolve symbol: ensure-all-tests-on-whitelisted-host-pass in this context, compiling:(/home/ubuntu/satellite/satellite-master/config/riemann-config.clj:35:18)
at clojure.lang.Compiler.analyze(Compiler.java:6464)
....and so on...
I've dug into lein a bit, and similar reports suggest it may be a dependency problem, so I ran lein deps :tree and saw some overrides and recommendations around the riemann version used, but I'm not sure how much to trust that given my lack of knowledge on the subject. Any ideas, or any way I can assist in troubleshooting?
The leader-watcher thread is never started; the logs say it's following even when it's the leader.
Looks like the AOT issue with Plumbing has been resolved:
...so we should be free to upgrade now, as long as we're on Clojure 1.7.
Satellite is currently on 1.6.
Clojure 1.7 became the stable release this past June 30.