
Comments (4)

kuujo commented on May 24, 2024

Ditto.

@madjam here's my experience: I've been looking at this for a few days and am struggling to find any issue in Copycat. The interesting thing is, all tests seem to pass 100% of the time when run individually, but when the cluster tests are run together, one or more frequently fail. What I see in most failures is random leader elections: the leader is blocked for several seconds and the followers' election timeouts expire. I spent several hours trying to figure out what was blocking, but I couldn't find any culprit; it seemed to block at random points that don't do any I/O.

I did do some profiling and noticed that some clients aren't being closed in test cleanup. Indeed, until recently the Atomix tests were failing because threads weren't being cleaned up, and that could also be the case here. But my brief efforts to make sure clients were closed were unsuccessful. YourKit still shows some client threads continuing.

On Dec 17, 2015, at 11:01 PM, Madan Jampani [email protected] wrote:

I observe intermittent unit test failures ( > 50%) while running local builds of Copycat. ClusterTest::testMany is the one that fails.

I'm at commit 310383b

Tests run: 84, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 476.549 sec <<< FAILURE!
testMany(io.atomix.copycat.test.ClusterTest) Time elapsed: 40.868 sec <<< FAILURE!
java.util.concurrent.TimeoutException: Test timed out while waiting for an expected result
at net.jodah.concurrentunit.Waiter.await(Waiter.java:168)
at net.jodah.concurrentunit.Waiter.await(Waiter.java:142)
at net.jodah.concurrentunit.ConcurrentTestCase.await(ConcurrentTestCase.java:105)
at io.atomix.copycat.test.ClusterTest.testMany(ClusterTest.java:135)

I haven't done a lot of debugging yet. Thought I would log an issue in case someone runs into the same thing or has already figured out what's causing it.




kuujo commented on May 24, 2024

I found the bug here. After many days of debugging, I discovered that retries in the client are not working properly. Fortunately, the tests were complex enough to expose this bug by killing leaders and thereby forcing some commands submitted to the cluster to be lost. When a leader is killed in a test, or a leader change otherwise results in a command not being received and logged, some commands are never retried. For linearizability to work, leaders need to eventually see all commands, and they use a sequence number in each command to ensure that happens. But since a lost command never gets resubmitted, a permanent gap appears in the sequence numbers, which means the leader can no longer handle any commands for that session. I think this will require a careful rewrite of the client tonight to ensure command retries are handled properly, but that effort should bring total stability to Copycat, and it should then be ready for a release.

We may also want some mechanism to handle missing sequence numbers. My thought is: if a command request fails for reasons that are not obvious (perhaps serialization or other issues), we should resubmit a special no-op command to fill in the sequence and fail the original command. Under those circumstances, Copycat can't guarantee whether or not the command has been applied to the state machine.
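To make the failure mode above concrete, here is a minimal, hypothetical sketch (these class and method names are illustrative, not Copycat's actual API) of a leader applying a session's commands strictly in sequence order. A lost command that is never retried leaves a permanent gap, and every later command stalls behind it; a no-op resubmitted with the missing sequence number would unblock the session.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative sketch only: a per-session sequencer that applies commands
// strictly in sequence order, as the comment above describes.
public class SessionSequencer {
    private long nextSequence = 1;                        // next sequence the session may apply
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private final List<String> applied = new ArrayList<>();

    // Buffer the command, then apply any now-contiguous prefix.
    // A missing sequence number stalls everything after it.
    public void submit(long sequence, String command) {
        pending.put(sequence, command);
        while (pending.containsKey(nextSequence)) {
            applied.add(pending.remove(nextSequence));
            nextSequence++;
        }
    }

    public List<String> applied() { return applied; }
    public long nextSequence() { return nextSequence; }
}
```

If commands 1 and 3 arrive but 2 is lost, only command 1 is applied; submitting a filler (no-op) with sequence 2 lets command 3 through.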


kuujo commented on May 24, 2024

Note, though, that this is actually more an issue with the LocalTransport used in tests, which doesn't implement timeouts for requests. When a server is closed in a test, the server's executor throws exceptions and LocalTransport futures are never completed. The NettyTransport does have a timeout mechanism for requests and thus should still work properly. I added timeouts to the LocalTransport and all tests passed. Nevertheless, the client still needs some work to ensure the various strategies are used properly, that commands can be retried across sessions, and that failed commands cannot leave gaps in a session's sequence numbers (which would prevent the client from submitting any further commands).
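The timeout fix described above can be sketched roughly like this (a hypothetical illustration, not the actual LocalTransport code): wrap each request future with a scheduled timeout so that a closed server can never leave the future incomplete forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: a timeout wrapper for transport request futures.
public class TimeoutFutures {
    private static final ScheduledExecutorService SCHEDULER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "timeout-scheduler");
            t.setDaemon(true);                            // don't keep the JVM alive
            return t;
        });

    // Complete the future exceptionally if no response arrives in time.
    // completeExceptionally is a no-op if a response already arrived.
    public static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> future, long millis) {
        SCHEDULER.schedule(
            () -> future.completeExceptionally(new TimeoutException("request timed out")),
            millis, TimeUnit.MILLISECONDS);
        return future;
    }
}
```

With this in place, killing a server just causes the pending request futures to fail with a TimeoutException, which the client's retry logic can then observe instead of waiting forever.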


kuujo commented on May 24, 2024

These failures seem to be fixed by all the bug-fix work.

