
Comments (4)

kuujo commented on May 24, 2024

Ditto.

@madjam here's my experience: I've been looking at this for a few days and am struggling to find any issue in Copycat. The interesting thing is, all tests seem to pass 100% of the time when run individually, but when the cluster tests are run together, one or more frequently fail. What I see in most failures is random leader elections: the leader is blocked for several seconds and the followers' election timeouts expire. I spent several hours trying to figure out what was blocking, but I couldn't find any culprit; it seemed to block at random points that don't do any I/O.

I did do some profiling and noticed that some clients aren't being closed in test cleanup. Indeed, until recently the Atomix tests were failing because threads weren't being cleaned up, and that could also be the case here. But my brief efforts to make sure clients were closed were unsuccessful. YourKit still shows some client threads continuing.

On Dec 17, 2015, at 11:01 PM, Madan Jampani [email protected] wrote:

I observe intermittent unit test failures ( > 50%) while running local builds of Copycat. ClusterTest::testMany is the one that fails.

I'm at commit 310383b

Tests run: 84, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 476.549 sec <<< FAILURE!
testMany(io.atomix.copycat.test.ClusterTest) Time elapsed: 40.868 sec <<< FAILURE!
java.util.concurrent.TimeoutException: Test timed out while waiting for an expected result
at net.jodah.concurrentunit.Waiter.await(Waiter.java:168)
at net.jodah.concurrentunit.Waiter.await(Waiter.java:142)
at net.jodah.concurrentunit.ConcurrentTestCase.await(ConcurrentTestCase.java:105)
at io.atomix.copycat.test.ClusterTest.testMany(ClusterTest.java:135)

I haven't done a lot of debugging yet. Thought I would log an issue in case someone runs into the same thing or has already figured out what's causing it.




kuujo commented on May 24, 2024

I found the bug here. After many days of debugging, I discovered that retries in the client are not working properly. Fortunately, the tests were complex enough to expose this bug by killing leaders and thereby forcing some commands submitted to the cluster to be lost. When a leader is killed in a test, or a leader change otherwise results in a command not being received and logged, some commands are never retried. For linearizability to work, leaders need to eventually see all commands, and they use a sequence number in each command to ensure that happens. But since a lost command never gets resubmitted, a permanent gap appears in the sequence numbers, which means the leader can no longer handle any commands for that session. I think this will require a careful rewrite of the client tonight to ensure command retries are handled properly, but that effort should bring total stability to Copycat, and it should then be ready for a release.

We may also want some mechanism to handle missing sequence numbers. My thought is: if a command request fails for reasons that are not obvious (perhaps serialization or other issues), we should resubmit a special no-op command to fill in the sequence and fail the original command. Under those circumstances, Copycat can't guarantee whether or not the command has been applied to the state machine.
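To make the failure mode above concrete, here is a minimal, hypothetical sketch (these class and method names are illustrative, not Copycat's actual API) of a leader applying a session's commands strictly in sequence order. A lost command that is never retried leaves a permanent gap, and every later command stalls behind it; a no-op resubmitted with the missing sequence number would unblock the session.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative sketch only: a per-session sequencer that applies commands
// strictly in sequence order, as the comment above describes.
public class SessionSequencer {
    private long nextSequence = 1;                        // next sequence the session may apply
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private final List<String> applied = new ArrayList<>();

    // Buffer the command, then apply any now-contiguous prefix.
    // A missing sequence number stalls everything after it.
    public void submit(long sequence, String command) {
        pending.put(sequence, command);
        while (pending.containsKey(nextSequence)) {
            applied.add(pending.remove(nextSequence));
            nextSequence++;
        }
    }

    public List<String> applied() { return applied; }
    public long nextSequence() { return nextSequence; }
}
```

If commands 1 and 3 arrive but 2 is lost, only command 1 is applied; submitting a filler (no-op) with sequence 2 lets command 3 through.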


kuujo commented on May 24, 2024

Note, though, that this is actually more an issue with the LocalTransport used in tests, which doesn't implement timeouts for requests. When a server is closed in a test, the server's executor throws exceptions and LocalTransport futures are never completed. The NettyTransport does have a timeout mechanism for requests and thus should still work properly. I added timeouts to the LocalTransport and all tests passed. Nevertheless, the client still needs some work to ensure the various strategies are used properly, that commands can be retried across sessions, and that failed commands cannot leave gaps in a session's sequence numbers (which would prevent the client from submitting any further commands).
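The timeout fix described above can be sketched roughly like this (a hypothetical illustration, not the actual LocalTransport code): wrap each request future with a scheduled timeout so that a closed server can never leave the future incomplete forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: a timeout wrapper for transport request futures.
public class TimeoutFutures {
    private static final ScheduledExecutorService SCHEDULER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "timeout-scheduler");
            t.setDaemon(true);                            // don't keep the JVM alive
            return t;
        });

    // Complete the future exceptionally if no response arrives in time.
    // completeExceptionally is a no-op if a response already arrived.
    public static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> future, long millis) {
        SCHEDULER.schedule(
            () -> future.completeExceptionally(new TimeoutException("request timed out")),
            millis, TimeUnit.MILLISECONDS);
        return future;
    }
}
```

With this in place, killing a server just causes the pending request futures to fail with a TimeoutException, which the client's retry logic can then observe instead of waiting forever.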


kuujo commented on May 24, 2024

These failures seem to be fixed by all the bug-fix work.

