
Comments (14)

hokiegeek2 commented on June 18, 2024

@janisz @tomez So I am noticing that we will get periodic connection refused errors in our Marathon cluster nodes. I see from the logging timing that marathon-consul attempts to connect very quickly--the first couple apparently within a second, which appears to be within the window of time that marathon refuses connections.

Perhaps what is needed is a configurable retry time interval for marathon connection failures?
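For illustration, here is a minimal Go sketch of what a configurable retry interval might look like. The `retry` helper, its parameters, and `errRefused` are hypothetical names chosen for this example; they are not marathon-consul's actual API or configuration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRefused stands in for the "connection refused" error (hypothetical name).
var errRefused = errors.New("connection refused")

// retry calls op up to attempts times and sleeps for interval between
// failed tries. It returns nil on the first success, or the last error
// wrapped with the attempt count once all tries are used up.
func retry(attempts int, interval time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1 // always try at least once
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(interval) // the configurable pause between tries
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated Marathon that refuses the first two connections.
	connect := func() error {
		calls++
		if calls < 3 {
			return errRefused
		}
		return nil
	}
	err := retry(5, 10*time.Millisecond, connect)
	fmt.Println("calls:", calls, "err:", err) // prints "calls: 3 err: <nil>"
}
```

With a longer interval, the later attempts would fall outside the short window in which Marathon refuses connections, which is the behavior this comment asks for.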

from marathon-consul.

hokiegeek2 commented on June 18, 2024

@janisz This works great! I deployed the patch and either marathon-consul recovers from loss of connectivity or exits after four failed tries. Since I am running marathon-consul as a service, it simply restarts and continues working. Thanks for implementing this patch so quickly.


janisz commented on June 18, 2024

Thanks for reporting. I'll grep our logs to see if we experience similar problems. I think we get a similar error when upgrading marathon and when we have network maintenance issues.

marathon-consul restarts five times--each time logging that it is getting a connection refused from marathon--eventually staying in a connection refused state and therefore failing to capture marathon events from the event stream.

Did marathon-consul stay in a failing state? After 5 retries it should shut down. With a supervisor configured, it should be restarted and fail again if the problem wasn't fixed.


hokiegeek2 commented on June 18, 2024

@janisz It restarted 4-5 times (I forget) and then remained up with the last logging statement being connection refused by marathon.


janisz commented on June 18, 2024

It shouldn't work like this. Can you share the logs? I'll try to reproduce it in a test and fix it.


hokiegeek2 commented on June 18, 2024

Sure thing, will do so when I get back in the office tomorrow


janisz commented on June 18, 2024

I've checked our logs and we have logged

Leader poll failed. Check marathon and previous errors. Exiting

AFAIR we haven't had a situation where marathon-consul hangs in an unconnected state. @tomez Do you recall this situation?


tomez commented on June 18, 2024

Marathon-consul definitely shouldn't stay in a disconnected state.

@janisz I can't remember that situation ever happening, and I am pretty sure we would have noticed it, because in that case our registrations would simply stop working after any marathon downtime (e.g. an update).

After janisz's investigation I can't add anything more right now.
@hokiegeek2 if you could reproduce it and provide log info, we will investigate further.


hokiegeek2 commented on June 18, 2024

@janisz Here are the log messages of interest:

time="2017-06-16T19:50:14Z" level=error msg="Error when parsing the event" error=EOF
time="2017-06-16T19:50:14Z" level=error msg="This should never happen. Not handled event type" EventType= error="Unsuported event type: "
time="2017-06-16T19:50:14Z" level=fatal msg="Unable to recover streamer" error="Get http://localhost:8080/v2/events?event_type=status_update_event&event_type=health_status_changed_event: dial tcp 127.0.0.1:8080: getsockopt: connection refused"

Usual startup message

time="2017-06-16T19:50:14Z" level=warning msg="Error on http request" Location="localhost:8080" Protocol=http error="Get http://localhost:8080/v2/events?embed=apps.tasks&label=consul: dial tcp 127.0.0.1:8080: getsockopt: connection refused" statusCode="???"
time="2017-06-16T19:50:14Z" level=error msg="An error occured while performing sync" error="Can't get Marathon apps: Get http://localhost:8080/v2/events?embed=apps.tasks&label=consul: dial tcp 127.0.0.1:8080: getsockopt: connection refused"
time="2017-06-16T19:50:15Z" level=debug msg="Leader detection disable"
time="2017-06-16T19:50:15Z" level=info msg=Listening Port=":4000"
time="2017-06-16T19:50:15Z" level=error msg="Unable to start streamer" error="Get http://localhost:8080/v2/events?embed=apps.tasks&label=consul: dial tcp 127.0.0.1:8080: getsockopt: connection refused"

Usual startup message plus bolded logging block repeated 3 times, and then on the third time logging stops at:

time="2017-06-16T19:50:20Z" level=error msg="Unable to start streamer" error="Get http://localhost:8080/v2/events?embed=apps.tasks&label=consul: dial tcp 127.0.0.1:8080: getsockopt: connection refused"

No further log messages; the app is hung, and no marathon events for consul-tagged Marathon tasks are captured.


hokiegeek2 commented on June 18, 2024

@tomez Thanks for investigating! I just provided the salient log messages and restarted marathon-consul; I'll confirm whether the same scenario repeats.


janisz commented on June 18, 2024

@hokiegeek2 Thanks for the logs. I can confirm there is a bug. The log you provided comes from sse/sse_handler.go; a couple of lines above it we have a similar log, but with Fatal.

The problem you described does not occur when Marathon Master detection is enabled, because the detector can't connect to marathon and so the streamer is never created. In your case the streamer is created but can't connect to marathon, so it waits forever for events to handle, and the events never come because it's not connected.

We should not swallow the connection error but propagate it to the caller and handle it in main.go. We will provide a patch for this shortly.
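A rough Go sketch of that idea follows. It is only an illustration under stated assumptions: `startStreamer`, `dialFunc`, `run`, and `errConnRefused` are hypothetical names and are not taken from sse/sse_handler.go or main.go.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errConnRefused stands in for the error seen in the logs (hypothetical name).
var errConnRefused = errors.New("dial tcp 127.0.0.1:8080: connection refused")

// dialFunc is a hypothetical hook that opens the Marathon event stream.
type dialFunc func() error

// startStreamer returns the connection error to its caller instead of
// logging it and carrying on with a streamer that will never see events.
func startStreamer(dial dialFunc) error {
	if err := dial(); err != nil {
		return fmt.Errorf("unable to start streamer: %w", err)
	}
	return nil
}

// run plays the role of the top-level caller: it decides what a failed
// connection means for the process and reports an exit code.
func run(dial dialFunc) int {
	if err := startStreamer(dial); err != nil {
		fmt.Fprintln(os.Stderr, "fatal:", err)
		return 1
	}
	return 0
}

func main() {
	// Simulate Marathon refusing the connection. A real main would pass
	// the code to os.Exit so that a supervisor can restart the process.
	code := run(func() error { return errConnRefused })
	fmt.Println("exit code:", code)
}
```

Because the error reaches the top-level caller, the process can exit rather than wait forever, and a supervisor can then restart it.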


hokiegeek2 commented on June 18, 2024

@janisz That's awesome, thanks!


janisz commented on June 18, 2024

@hokiegeek2 Could you please create a separate ticket for the configurable retry time interval / reconnect policy?


hokiegeek2 commented on June 18, 2024

@janisz will do!

