elem-azar-unis / crdt-redis Goto Github PK

CRDTs implemented in Redis

License: BSD 3-Clause "New" or "Revised" License

CMake 0.04% C 80.83% C++ 1.31% Makefile 0.71% Shell 0.60% Perl 2.11% M4 1.02% XSLT 0.01% HTML 3.35% Roff 0.09% CSS 0.02% Batchfile 0.01% Lua 0.14% Ruby 0.24% Tcl 8.12% Smarty 0.01% Python 0.40% TLA 0.99%

crdt crdt-implementations redis redis-server redis-test

crdt-redis's People

Contributors

Stargazers

Watchers

Forkers

dukeduke2002 happycs-gu meiqizhang jackyhong help-lixin hippocampi jasagiri adofsauron lintianshi bugwz preresearch-labs hondrytravis fullstackwebdev

crdt-redis's Issues

CRDT Lists sometimes allow clients to add elements after the element that does not exist. They shouldn't.

Overview

Affected component: RemoveWin List and RWF List.
Bug result: Wrong list semantics.
Found by: Model checking-driven explorative testing - MET.
Root cause: Missing element-existence check.

How it was found

This bug is found using our tool MET. Yet we found that both Remove-win List and RWF List have this issue. The script for MET testing RWF List to trigger this bug is:

2
1 1 Add 0 1 10 ; 2 1 Add 2 2 10 ; 1 2 effect ; 3 2 Rmv 1 ; 4 2 Add 1 2 10 ; 3 1 effect
1 -1 0 null 0,1

The execution trace and the server metadata outputted by our MET testing are:

[op_table]:
  - rwflinsert list1 null 1 e 10 0 0 0 ; rwflinsert list1 null 1 e 10 0 0 0 1,0,0 0,0|0
  - rwflinsert list1 2 2 e 10 0 0 0 ;
  - rwflrem list1 1 ; rwflrem list1 1 0,1|1
  - rwflinsert list1 1 2 e 10 0 0 0 ; rwflinsert list1 1 2 e 10 0 0 0 2,1,0 0,0|1
[steps]:
  - FULL 0 0
  - FULL 1 0
  - EFFECT 0 1
  - FULL 2 1
  - FULL 3 1
  - EFFECT 2 0
[script state] -- 1 -2 0 null [0,1] ;
[server state] -- 1 -1 10 null [0,1] ;
                  1 -1 10 null [0,1] ; 2 1 10 null [] ;

Process of debugging

The CRDT Lists allow the client to add a new element after the element that has been removed from the list. This is confusing and should not happen.

Server 2 accepted the operation of adding element 2 after element 1, while element 1 has been removed before. This should not happen. Such add requests from clients should have been rejected.

Fixed

The bug is fixed in the commit a4f7d68. The check for such cases is added.

Sometimes the elements in RWF List miss some basic information while reading.

Overview

Affected component: RWF List.
Bug result: Missing element information while reading.
Found by: Model checking-driven explorative testing - MET.
Root cause: Element initialization too late.

How it was found

This bug is found using our tool MET. The script for MET testing RWF List to trigger this bug is:

2
1 1 Add 0 1 10 ; 2 1 Rmv 1 ; 2 2 effect ; 1 2 effect
1 -1 0 null 0,1

Process of debugging

Add and remove the element on server 1. The remove is then synchronized to server 2 before the add operation. This causes server 2 not to synchronize the basic information of the element, such as the content of the element, from the add operation that gets synchronized later.

Fixed

This is fixed in the commit b0601f3. We let the basic info of elements always get initialized while synchronizing add operations.

The RWF CRDTs fail to implement preconditions of designs correctly.

Overview

Affected component: RWF RPQ and RWF List.
Bug result: Have strange behaviors. Wrong semantics.
Found by: Model checking-driven explorative testing - MET.
Root cause: Fail to implement "the existence of elements" correctly.

How it was found

This bug is found using our tool of model-based exhaustive testing (MET). The script for MET testing RWF RPQ to trigger this bug is:

2
1 1 Add 10 ; 2 1 Add 10 ; 3 1 Inc -3 ; 3 2 effect ; 4 2 Add 20 ; 1 2 effect ;
1 10 -3 n ; 1 10 -3 n ;

Process of debugging

We find that the elements in the list may have an order that is not consistent with the order that they are designated when inserting them.

The reason for the bug is that we didn't perfectly implement the "existence" semantics according to the RWF algorithm. It affects both the RWF RPQ and RWF List. Then the existence of the elements becomes odd under certain execution traces, like the trace found by the MET tool.
There is a chance that the element should have not existed at some point, but it accepted some client requests (it shouldn't). This causes the semantics of the CRDT to become undefined.

An example here.

Client_0 first adds element "a" and then changes the value of "a" (say, it changes the font of character "a" in a string in a collaborative editing scenario). Then client_1 queries the value of the string and obtains string "[a]". After reading the value of the string, client_1 inserts "b" after "a" into the string. However, in our buggy implementation, client_0 may read string "[b,a]".

This bug can be triggered by the following adversarial pattern of events. Server_0 first receives the operation "changing the font of a" (without adding a first). Then server_0 will receive the operation "adding b after a". At the design level, this "adding b after a" operation is quite safe, since it is protected by the precondition that "before adding b after a, a must exist". The design does not explicitly state what to do if the precondition does not hold. The precondition implicitly states that, if the element does not exist, nothing should be done. However, in our implementation, an illegal position is assigned to b, and then to a. In this way, client_0 gets "[b,a]".

Fixed

The bug is fixed in the commit b1adb29.
We implemented the more accurate "existence" for elements in the CRDTs using RWF.

The RemoveWin CRDTs require causal delivery. This is not provided by the underlying network.

Overview

Affected component: RemoveWin RPQ and RemoveWin List.
Bug result: The remove-win CRDTs get stuck.
Found by: The manual MET approach - experiment/redis_test/test.py.
Root cause: Error in the patch to provide causal delivery for RemoveWin CRDTs.

How it was found

The bug is found by our manual MET approach:

Choose the execution traces we think are interesting. The execution traces include the client request and the server synchronization events. This is the test input.
Manually deduce what the data and the metadata should be in the server each step the server executes the trace. This is the test oracle.
Run the CRDT-Redis system under control. Deterministically replay the trace on the system. Check the test oracle.

The Rmv-Win List and Rmv-Win RPQ get stuck sometimes. They will not synchronize with other servers after executing certain synchronizations.

Process of debugging

The trace to trigger the bug is shown here:

Server2 does not synchronize b and c correctly when it receives a.Then when server2 later receives d, it gets stuck since the required delivery of b and c are not available.

The bug lies in the part of our implementation of causal delivery for the remove-win CRDTs. The part where we should notify all waiting messages but we notified one.

Fixed

The bug is fixed in 2b1eb9b. Change notify one to notify all.

The Remove-Win CRPQ do not always record new add & rmv operations.

Overview

Affected component: Remove-Win CRPQ.
Bug result: Wrong semantics.
Found by: Model checking Remove-Win CRPQ specification - MET/TLA/rmvwin_crpq.
Root cause: Wrong CRPQ design, missing the design that let add delete old add records.

How it was found

This is found during our process of model checking the Remove-Win CRPQ.

Use TLA+ to model the design of Remove-Win CRPQ according to its algorithm.
Use TLC to model-check the design. Here the model checking failed at the correctness property Rcd_Concurrent at model scale 2-4.

Process of debugging

With the help of the traces of model checking, we get the trace that trigger the violation:

The server0 here will record the old add operation of server1, together with the newest add operation initiated by client0. The two add operations are not concurrent. This violates the correctness property that the recorded operations should be concurrent. The old add record should have been removed.

We add the logic for the Remove-Win CRPQ that the add operations should clear the old add record. The bug is fixed.

Fixed

The specification was fixed before it was implemented into the CRDT-Redis system.

CRDT Lists. Sometimes the order of elements is not as expected.

Overview

Affected component: RemoveWin List and RWF List.
Bug result: Unexpected element order in the list.
Found by: Model checking-driven explorative testing - MET.
Root cause: Buggy position ID generator.

How it was found

This bug is found using our tool MET. Yet we found that both Remove-win List and RWF List have this issue. The script for MET testing RWF List to trigger this bug is:

1
1 1 Add 0 1 10 ; 2 1 Add 1 3 10 ; 3 1 Add 1 4 10 ; 4 1 Add 1 5 10
1 1 10 null 0 ; 5 1 10 null 0 ; 4 1 10 null 0 ; 3 1 10 null 0

For example, sometimes I add an element A after B, the list tells me it becomes "AB", while it should be "BA".

Process of debugging

The order of elements is not as expected. The reason for the bug lies in our implementation of the position ID for the elements in the list. The CRDT list algorithms require a position ID generator that generate the IDs that have the following property:

Total ordered.
Dense.
Consistent with the element order indicated by add operations

The algorithms don't specify how to implement the generators. Our implementation of the ID generator works fine for most of the cases, yet sometimes fails property 3.

Fixed

The bug is fixed in the commit cf44f6f. We fixed the bug in the position ID generator.

The specification of readd operation of two CRDT list needs to be separated from the add operation and treated as an independent operation.

Overview

Affected component: RemoveWin List and RWF List.
Bug result: Vague semantics and specification of readd operation.
Found by: Model checking RWF List specification - MET/TLA/rwf_list.
Root cause: Mixed specification of readd and add.

How it was found

This is found during our process of model-based testing of RWF List. And it further affects the Remove-win List.

We first use TLA+ to model the design of RWF List according to its algorithm. Then we use TLC to model-check the design. Here the model checking failed. We encountered an old bug of issue #1.
We regard it as a corner case of the model. We try to restrict the form of client requests to bypass this bug that has been fixed.
Our attempt of bypassing the bug made our TLA+ model very complicated. It is unnatural and is against our intention of model checking. We realize that the flaw actually comes from the vagueness of the readd operation of our List designs.

Process of debugging

With the help of the traces of model checking, we analyzed our CRDT List designs and refined the readd specifications as follows.

Note that the insert operation should always insert a unique new element
with a designated position (i.e. *prev* element), or a previously added element
with "readd" being its *prev*. Otherwise, it is undefined behavior.

This is reasonable that for collaborative text editing, the newly inserted element (characters, words, ...)
should always be unique. And the "readd" semantics, which may be used for undo/redo scenarios,
means that you have seen the element. The element has been added at least from a global perspective.

Fixed

The specification is fixed in the commit f6e7687.
The implementation is changed accordingly in the commit 99220aa.

The CRDT Lists replicas may not converge in the stress testing.

Overview

Affected component: RemoveWin List and RWF List.
Bug result: Fails eventual consistency.
Found by: The random stress testing tool - experiment/bench.
Root cause: Readd operation. Ambiguous semantics.

How it was found

The steps of the random testing that triggers this bug:

Generate client requests to different servers randomly, with concurrent conflict operations.
Pause the clients. Wait for the servers to resolve conflicts.
Check if the servers have converged to the same state.

There are some circumstances where the servers won't converge to the same state. The time of triggering the bug is about 2 hours on average.

Process of debugging

We first locate the root cause of the bug.

We look at the final states of two replicas, and identify the list elements with divergent positions. As specified in the RWF-List protocol, the position of one element is decided when the element is added to the list for the first time.
We go backtracking through the execution trace, in order to find the point when the list replicas first diverge, and the point when the elements with divergent positions are first added to the list.
We find that the bug is caused by the element which is in the list but has an undefined position. It is triggered by the read operation.

Through in-depth investigation, it was found that this bug was triggered by the readd operation. The readd operation is the reverse of the remove operation. It is meant to undo the remove or redo the add of some element in the list.

A minimized example of this bug is shown here.

According to our design of the Rwf-List, the location of list element 𝑒 is decided when it is first added on 𝑠𝑒𝑟𝑣𝑒𝑟0. If we remove and then re-add 𝑒 (such as undoing the removal of words in collaborative editing), its position will not change. The client may send the 𝑟𝑒-𝑎𝑑𝑑(𝑒) operation to 𝑠𝑒𝑟𝑣𝑒𝑟1, which has not yet gotten the previous 𝑎𝑑𝑑(𝑒) operation from 𝑠𝑒𝑟𝑣𝑒𝑟0. 𝑆𝑒𝑟𝑣𝑒𝑟1 should reject this 𝑟𝑒-𝑎𝑑𝑑 operation or buffer this operation for processing later. But in our buggy design, 𝑠𝑒𝑟𝑣𝑒𝑟1 accepts this request and assigns an illegal position to 𝑒. When 𝑠𝑒𝑟𝑣𝑒𝑟1 receives the delayed 𝑎𝑑𝑑(𝑒) operation, it will ignore this operation, which causes the divergence among replicas.

Fixed

This is fixed in the commit 99e2fe6 by adding a check for this condition. The invalid readd is rejected now.

elem-azar-unis / crdt-redis Goto Github PK

crdt-redis's People

Contributors

Stargazers

Watchers

Forkers

crdt-redis's Issues

Overview

How it was found

Process of debugging

Fixed

Overview

How it was found

Process of debugging

Fixed

Overview

How it was found

Process of debugging

Fixed

Overview

How it was found

Process of debugging

Fixed

Overview

How it was found

Process of debugging

Fixed

Overview

How it was found

Process of debugging

Fixed

Overview

How it was found

Process of debugging

Fixed

Overview

How it was found

Process of debugging

Fixed

Recommend Projects

Recommend Topics

Recommend Org