Comments (14)
@janisz @tomez I am noticing that we get periodic connection refused errors on our Marathon cluster nodes. From the log timestamps I can see that marathon-consul attempts to connect very quickly (the first couple of attempts apparently within a second), which appears to fall inside the window of time in which marathon refuses connections.
Perhaps what is needed is a configurable retry time interval for marathon connection failures?
from marathon-consul.
@janisz This works great! I deployed the patch and either marathon-consul recovers from loss of connectivity or exits after four failed tries. Since I am running marathon-consul as a service, it simply restarts and continues working. Thanks for implementing this patch so quickly.
from marathon-consul.
Thanks for reporting. I'll grep our logs to see if we experience similar problems. I think we see a similar error when upgrading marathon and when we have network maintenance issues.
marathon-consul restarts five times--each time logging that it is getting a connection refused from marathon--eventually staying in a connection refused state and therefore failing to capture marathon events from the event stream.
Did marathon-consul stay in the failing state? After 5 retries it should shut down. With a supervisor configured, it should be restarted and fail again if the problem wasn't fixed.
from marathon-consul.
@janisz It restarted 4-5 times (I forget) and then remained up with the last logging statement being connection refused by marathon.
from marathon-consul.
It shouldn't work like this. Can you share the logs? I'll try to reproduce it in a test and fix it.
from marathon-consul.
Sure thing, will do so when I get back in the office tomorrow
from marathon-consul.
I've checked our logs and we have this logged:
Leader poll failed. Check marathon and previous errors. Exiting
AFAIR we haven't had a situation where marathon-consul hangs in an unconnected state. @tomez Do you recall such a situation?
from marathon-consul.
Marathon-consul definitely shouldn't stay in a disconnected state.
@janisz I can't remember that situation ever happening, and I am pretty sure we would have noticed it, because in that case our registrations would simply stop working after any marathon downtime (e.g. an update).
After janisz's investigation I can't add anything more right now.
@hokiegeek2 if you can reproduce it and provide the log info, we will investigate further.
from marathon-consul.
@janisz Here are the log messages of interest:
time="2017-06-16T19:50:14Z" level=error msg="Error when parsing the event" error=EOF
time="2017-06-16T19:50:14Z" level=error msg="This should never happen. Not handled event type" EventType= error="Unsuported event type: "
time="2017-06-16T19:50:14Z" level=fatal msg="Unable to recover streamer" error="Get http://localhost:8080/v2/events?event_type=status_update_event&event_type=health_status_changed_event: dial tcp 127.0.0.1:8080: getsockopt: connection refused"
Usual startup message
time="2017-06-16T19:50:14Z" level=warning msg="Error on http request" Location="localhost:8080" Protocol=http error="Get http://localhost:8080/v2/events?embed=apps.tasks&label=consul: dial tcp 127.0.0.1:8080: getsockopt: connection refused" statusCode="???"
time="2017-06-16T19:50:14Z" level=error msg="An error occured while performing sync" error="Can't get Marathon apps: Get http://localhost:8080/v2/events?embed=apps.tasks&label=consul: dial tcp 127.0.0.1:8080: getsockopt: connection refused"
time="2017-06-16T19:50:15Z" level=debug msg="Leader detection disable"
time="2017-06-16T19:50:15Z" level=info msg=Listening Port=":4000"
time="2017-06-16T19:50:15Z" level=error msg="Unable to start streamer" error="Get http://localhost:8080/v2/events?embed=apps.tasks&label=consul: dial tcp 127.0.0.1:8080: getsockopt: connection refused"
Usual startup message plus bolded logging block repeated 3 times, and then on the third time logging stops at:
time="2017-06-16T19:50:20Z" level=error msg="Unable to start streamer" error="Get http://localhost:8080/v2/events?embed=apps.tasks&label=consul: dial tcp 127.0.0.1:8080: getsockopt: connection refused"
No further log messages, app is hung, no marathon events for consul-tagged Marathon tasks are captured
from marathon-consul.
@tomez Thanks for investigating! I just provided the salient log messages and restarted marathon-consul, and I'll confirm whether the same scenario repeats.
from marathon-consul.
@hokiegeek2 Thanks for the logs. I can confirm there is a bug. The log you provided comes from sse/sse_handler.go; a couple of lines above it we have a similar log, but with Fatal.
The problem you described does not occur when Marathon Master detection is enabled, because the detector can't connect to marathon and so the streamer is never created. In your case the streamer is created but can't connect to marathon, so it waits forever for events to handle, and those events never come because it's not connected.
We should not swallow the connection error but propagate it to the caller and handle it in main.go. We will provide a patch for this shortly.
from marathon-consul.
@janisz That's awesome, thanks!
from marathon-consul.
@hokiegeek2 Could you please create a separate ticket for the configurable retry time interval / reconnect policy?
from marathon-consul.
@janisz will do!
from marathon-consul.