jepsen-io / knossos

Verifies the linearizability of experimentally accessible histories.
I use the pthread_self function for the :process id of events. However, these ids are 64-bit pointers and so don't fit some of the ways Clojure uses them. I think this might actually be a common use case, but maybe there is too much of a performance problem or something.
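A workaround sketch, assuming the goal is simply a stable small integer per thread: compact the raw 64-bit thread ids into sequential ids at log time. This id-compaction scheme is my own suggestion, not part of knossos.

```python
# Hedged sketch: instead of logging raw pthread_self() values (64-bit
# pointers) as :process ids, assign each thread a small sequential id
# the first time it is seen.
class ProcessIds:
    def __init__(self):
        self._ids = {}

    def get(self, thread_id):
        # setdefault assigns the next sequential id on first sight,
        # and returns the existing one thereafter.
        return self._ids.setdefault(thread_id, len(self._ids))

pids = ProcessIds()
print(pids.get(0x7f8f9200e100))  # 0
print(pids.get(0x7f8f92010200))  # 1
print(pids.get(0x7f8f9200e100))  # 0 again: stable per thread
```

The same table could be kept in the C logger (a hash map guarded by a lock on the slow path), so the emitted history already uses small ids.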
Hi,
Every now and then, running knossos after Jepsen tests throws the exception below. I copied the history as well. Am I doing something stupid?
Thanks,
Balint
In the Jepsen test I'm using the simple cas-register model:
:model (model/cas-register 42)
:checker (checker/compose
{:perf (checker/perf)
:linear checker/linearizable})
The exception:
java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: i
at java.util.concurrent.FutureTask.report (FutureTask.java:122)
java.util.concurrent.FutureTask.get (FutureTask.java:192)
clojure.core$deref_future.invoke (core.clj:2186)
clojure.core$future_call$reify__6736.deref (core.clj:6683)
clojure.core$deref.invoke (core.clj:2206)
clojure.core$map$fn__4553.invoke (core.clj:2624)
clojure.lang.LazySeq.sval (LazySeq.java:40)
clojure.lang.LazySeq.seq (LazySeq.java:49)
clojure.lang.Cons.next (Cons.java:39)
clojure.lang.RT.next (RT.java:674)
clojure.core/next (core.clj:64)
clojure.core.protocols/fn (protocols.clj:170)
clojure.core.protocols$fn__6478$G__6473__6487.invoke (protocols.clj:19)
clojure.core.protocols$seq_reduce.invoke (protocols.clj:31)
clojure.core.protocols/fn (protocols.clj:101)
clojure.core.protocols$fn__6452$G__6447__6465.invoke (protocols.clj:13)
clojure.core$reduce.invoke (core.clj:6519)
clojure.core$into.invoke (core.clj:6600)
jepsen.checker$compose$reify__6600.check (checker.clj:257)
jepsen.checker$check_safe.invoke (checker.clj:39)
jepsen.core$run_BANG_$fn__6821.invoke (core.clj:421)
jepsen.core$run_BANG_.invoke (core.clj:373)
jepsen.influxdb_test/fn (influxdb_test.clj:13)
clojure.test$test_var$fn__7670.invoke (test.clj:704)
clojure.test$test_var.invoke (test.clj:704)
clojure.test$test_vars$fn__7692$fn__7697.invoke (test.clj:722)
clojure.test$default_fixture.invoke (test.clj:674)
clojure.test$test_vars$fn__7692.invoke (test.clj:722)
clojure.test$default_fixture.invoke (test.clj:674)
clojure.test$test_vars.invoke (test.clj:718)
clojure.test$test_all_vars.invoke (test.clj:728)
clojure.test$test_ns.invoke (test.clj:747)
clojure.core$map$fn__4553.invoke (core.clj:2624)
clojure.lang.LazySeq.sval (LazySeq.java:40)
clojure.lang.LazySeq.seq (LazySeq.java:49)
clojure.lang.Cons.next (Cons.java:39)
clojure.lang.RT.boundedLength (RT.java:1735)
clojure.lang.RestFn.applyTo (RestFn.java:130)
clojure.core$apply.invoke (core.clj:632)
clojure.test$run_tests.doInvoke (test.clj:762)
clojure.lang.RestFn.applyTo (RestFn.java:137)
clojure.core$apply.invoke (core.clj:630)
user$eval85$fn__144$fn__175.invoke (form-init3009603524082798247.clj:1)
user$eval85$fn__144$fn__145.invoke (form-init3009603524082798247.clj:1)
user$eval85$fn__144.invoke (form-init3009603524082798247.clj:1)
user$eval85.invoke (form-init3009603524082798247.clj:1)
clojure.lang.Compiler.eval (Compiler.java:6782)
clojure.lang.Compiler.eval (Compiler.java:6772)
clojure.lang.Compiler.load (Compiler.java:7227)
clojure.lang.Compiler.loadFile (Compiler.java:7165)
clojure.main$load_script.invoke (main.clj:275)
clojure.main$init_opt.invoke (main.clj:280)
clojure.main$initialize.invoke (main.clj:308)
clojure.main$null_opt.invoke (main.clj:343)
clojure.main$main.doInvoke (main.clj:421)
clojure.lang.RestFn.invoke (RestFn.java:421)
clojure.lang.Var.invoke (Var.java:383)
clojure.lang.AFn.applyToHelper (AFn.java:156)
clojure.lang.Var.applyTo (Var.java:700)
clojure.main.main (main.java:37)
Caused by: java.lang.AssertionError: Assert failed: i
at knossos.linear.report$time_coords$fn__6396.invoke (report.clj:179)
clojure.core$map$fn__4553.invoke (core.clj:2624)
clojure.lang.LazySeq.sval (LazySeq.java:40)
clojure.lang.LazySeq.seq (LazySeq.java:49)
clojure.lang.RT.seq (RT.java:507)
clojure.core/seq (core.clj:137)
clojure.core.protocols$seq_reduce.invoke (protocols.clj:30)
clojure.core.protocols/fn (protocols.clj:101)
clojure.core.protocols$fn__6452$G__6447__6465.invoke (protocols.clj:13)
clojure.core$reduce.invoke (core.clj:6519)
clojure.core$into.invoke (core.clj:6600)
knossos.linear.report$time_coords.invoke (report.clj:185)
knossos.linear.report$learnings.invoke (report.clj:427)
knossos.linear.report$render_analysis_BANG_.invoke (report.clj:632)
jepsen.checker$reify__6560.check (checker.clj:55)
jepsen.checker$compose$reify__6600$fn__6602.invoke (checker.clj:256)
clojure.core$pmap$fn__6744$fn__6745.invoke (core.clj:6729)
clojure.core$binding_conveyor_fn$fn__4444.invoke (core.clj:1916)
clojure.lang.AFn.call (AFn.java:18)
java.util.concurrent.FutureTask.run (FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
java.lang.Thread.run (Thread.java:745)
The history of operations is the following (a simple mix of read and write ops):
0 :invoke :read nil
0 :ok :read 42.0
0 :invoke :read nil
0 :ok :read 42.0
1 :invoke :read nil
1 :ok :read 42.0
2 :invoke :read nil
2 :ok :read 42.0
0 :invoke :read nil
0 :ok :read 42.0
1 :invoke :write 4.0
1 :ok :write 4.0
0 :invoke :write 4.0
0 :ok :write 4.0
2 :invoke :write 1.0
2 :ok :write 1.0
2 :invoke :write 3.0
2 :ok :write 3.0
0 :invoke :write 1.0
0 :ok :write 1.0
1 :invoke :read nil
1 :ok :read 1.0
0 :invoke :write 1.0
0 :ok :write 1.0
0 :invoke :read nil
0 :ok :read 1.0
2 :invoke :write 0.0
2 :ok :write 0.0
2 :invoke :read nil
0 :invoke :write 3.0
2 :ok :read 0.0
0 :ok :write 3.0
1 :invoke :read nil
1 :ok :read 3.0
0 :invoke :write 0.0
0 :ok :write 0.0
2 :invoke :write 2.0
2 :ok :write 2.0
1 :invoke :read nil
1 :ok :read 2.0
0 :invoke :write 4.0
0 :ok :write 4.0
2 :invoke :write 0.0
2 :ok :write 0.0
2 :invoke :write 2.0
2 :ok :write 2.0
1 :invoke :write 4.0
1 :ok :write 4.0
0 :invoke :read nil
0 :ok :read 4.0
1 :invoke :write 0.0
1 :ok :write 0.0
2 :invoke :read nil
2 :ok :read 0.0
0 :invoke :read nil
0 :ok :read 0.0
1 :invoke :write 1.0
1 :ok :write 1.0
2 :invoke :read nil
2 :ok :read 1.0
0 :invoke :read nil
0 :ok :read 1.0
1 :invoke :write 0.0
1 :ok :write 0.0
2 :invoke :read nil
2 :ok :read 0.0
0 :invoke :read nil
0 :ok :read 0.0
1 :invoke :write 1.0
1 :ok :write 1.0
2 :invoke :read nil
2 :ok :read 1.0
0 :invoke :write 2.0
0 :ok :write 2.0
1 :invoke :write 3.0
1 :ok :write 3.0
2 :invoke :read nil
2 :ok :read 3.0
With the 'register' model:
1 0 :invoke :write 1
2 0 :ok :write 1
3 0 :invoke :write 3
4 0 :ok :write 3
5 0 :invoke :write 2
6 0 :ok :write 2
7 0 :invoke :read nil
8 0 :ok :read 2
9 0 :invoke :write 4
10 0 :ok :write 4
11 0 :invoke :read nil
12 0 :ok :read 4
13 0 :invoke :read nil
14 0 :ok :read 4
15 0 :invoke :write 4
16 0 :info :write 4 :timeout
17 1 :invoke :read nil
18 1 :ok :read 4
19 1 :invoke :write 0
20 1 :info :write 0 :timeout
21 2 :invoke :read nil
22 2 :ok :read 4
23 2 :invoke :write 0
24 2 :info :write 0 :timeout
25 3 :invoke :read nil
26 3 :ok :read 0
27 3 :invoke :read nil
28 3 :ok :read 0
29 3 :invoke :read nil
30 3 :ok :read 0
31 3 :invoke :read nil
32 3 :ok :read 0
33 3 :invoke :write 0
34 3 :info :write 0 :timeout
35 4 :invoke :read nil
36 4 :ok :read 4
37 4 :invoke :write 1
38 4 :info :write 1 :timeout
39 5 :invoke :read nil
40 5 :ok :read 0
The reads on lines 26, 28, 30, and 32 should have gotten 4 instead of 0.
The read on line 40 should have gotten 4 instead of 0, although the interim read on line 36 correctly got 4.
Hello,
I am currently working on a concurrent spin-lock algorithm for many-core machines, similar to the MCS and CLH queue lock algorithms but one that can replace the queue with a stack (for more performance) or another data structure. See https://sstewartgallus.com/git?p=qlock.git;a=tree for the code. I modelled the algorithm in TLA+, checked it for a small number of threads, and found and fixed some deadlocks. However, TLA+ does not model weak memory orderings, and I'm concerned my algorithm may not use acquire and release atomics appropriately.
To address this, I created a mode that writes a history of operations to a log (although I still need to make sure that writing to the log doesn't itself cause synchronization that would prevent memory reorderings), in a format like:
{:process 36, :type :invoke, :f :try_acquire, :value 0x7f8f9200e100}
{:process 36, :type :ok, :f :try_acquire, :value true}
{:process 40, :type :invoke, :f :pop, :value 0x7f8f9200e080}
{:process 40, :type :ok, :f :pop, :value 0x55bea84a3400}
{:process 41, :type :invoke, :f :release, :value 0x7f8f9200e100}
{:process 41, :type :ok, :f :release, :value (list)}
{:process 42, :type :invoke, :f :empty, :value 0x7f8f9200e080 }
{:process 42, :type :ok, :f :empty, :value true }
{:process 43, :type :invoke, :f :try_acquire, :value 0x7f8f9200e100}
{:process 43, :type :ok, :f :try_acquire, :value true}
{:process 44, :type :invoke, :f :pop, :value 0x7f8f9200e080}
As I understand it, Knossos should let me check whether such a log is linearizable, and thereby experimentally determine that I am using weak memory orderings correctly. However, I am finding it very difficult to get started.
Is there any extra documentation about getting started with Knossos to check a history for linearizability? As far as I am aware, most documentation focuses on Jepsen, which is heavily built around distributed and networked systems and doesn't just take a preexisting history.
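One practical snag with the log format above: the raw hex pointer literals may not survive an EDN read, since EDN proper has no hex integer syntax. Quoting them as strings before handing the file to the checker is one workaround; a minimal sketch (the regex and approach are my own, not a knossos utility):

```python
import re

# Hedged sketch: wrap hex pointer literals like 0x7f8f9200e100 in quotes
# so an EDN reader treats them as opaque strings rather than numbers.
def quote_pointers(line):
    return re.sub(r"0x[0-9a-fA-F]+", lambda m: '"%s"' % m.group(0), line)

print(quote_pointers("{:process 40, :type :invoke, :f :pop, :value 0x7f8f9200e080}"))
# {:process 40, :type :invoke, :f :pop, :value "0x7f8f9200e080"}
```

Run over every line of the log before feeding the result to the checker; the values only need to be comparable, not numeric, for linearizability checking.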
The Jepsen linearizable test fails for the attached history, but I don't see anything wrong.
history.txt
results.edn.txt
0 :invoke :cas [4 0]
0 :ok :cas [4 0]
0 :invoke :write 4
0 :ok :write 4
3 :invoke :cas [4 2]
1 :invoke :write 0
1 :ok :write 0
3 :ok :cas [4 2]
3 :invoke :read nil
3 :ok :read 2
2 :invoke :write 1
2 :ok :write 1
:nemesis :info :start nil
2 :invoke :read nil
1 :invoke :read nil
2 :ok :read 1
Knossos says that process 3 "can't read 2 from register 0".
I read your blog post with great interest. I had a few questions about how Knossos compares with other related work on linearizability not mentioned in the post, and so I set out to experiment with a variant of the Wing and Gong algorithm, including later extensions by Lowe. This is work in progress, but the results seem promising. You seem to enjoy fun problems, so I only wanted to share that link in case you feel like bouncing around ideas or providing feedback, questions, etc.
PS: Thanks for the Jepsen infrastructure, by the way.
Assume you have this history.edn file:
[ {:process 1, :type :invoke, :f :cas, :value [2 0]}
{:process 1, :type :info, :f :cas, :value [2 0]}
{:process 1, :type :invoke, :f :read, :value nil}]
Running the above history against Knossos gives me the following result:
java.lang.RuntimeException: Process 1 already running #knossos.op.Op{:process 1, :type :invoke, :f :cas, :value [2 0], :index 0}, yet attempted to invoke #knossos.op.Op{:process 1, :type :invoke, :f :read, :value nil, :index 2} concurrently
It seems to me that after an info response, a process cannot do anything else. Looking at other histories in the knossos repository, it does indeed seem that once a process gets an info message, it never invokes another operation after it. Is this indeed the case? Maybe the creator of that issue had a similar problem.
Reading the Knossos concepts here:
Every logical transition in Knossos is represented by a pair of operations: an invoke, when the op begins, and a completion: either ok if the operation took place, fail if the operation did not take place, or info if you're not sure what happened, e.g. the operation timed out.
An info response seems like what is adequate here: we do not know whether the cas succeeded or not.
It seems to me this does not have to be the case. For example, I could imagine a client c getting an operation timeout and hence logging an info message, yet c might still want to invoke operations later in the execution. The easy fix would be that after every timed-out operation, client c starts logging operations (i.e., invocations and responses) under a different client identifier c'. Is there a better way to solve this issue in Knossos?
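One way to apply the workaround described above — my own sketch, not a knossos utility — is to rewrite the history before checking it, retiring a process's id after each :info completion:

```python
# Hedged sketch: after a process logs an :info (crashed/timed-out)
# completion, rename its subsequent operations to a fresh process id,
# so the checker never sees the "already running" conflict.
# Field names follow the EDN histories above; the scheme is my own.
def remap_processes(history):
    next_id = max(op["process"] for op in history) + 1
    alias = {}                      # original id -> current alias
    out = []
    for op in history:
        p = op["process"]
        out.append(dict(op, process=alias.get(p, p)))
        if op["type"] == "info":    # this process is now "lost"; retire its id
            alias[p] = next_id
            next_id += 1
    return out

h = [{"process": 1, "type": "invoke", "f": "cas", "value": [2, 0]},
     {"process": 1, "type": "info",   "f": "cas", "value": [2, 0]},
     {"process": 1, "type": "invoke", "f": "read", "value": None}]
print(remap_processes(h)[-1]["process"])  # 2: the later invoke runs as a new process
```

Note that the :info op itself keeps the original id, so the checker still treats that cas as possibly in flight forever, which is the semantics an uncertain outcome requires.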
I get the following when trying to run the example in the README. I'm on Ubuntu 17.10 with Oracle JDK 9. I also tried on OS X previously and got the same thing.
knossos.cli=> (def h (read-history "data/cas-register/bad/rethink-fail-minimal.edn"))
#'knossos.cli/h
knossos.cli=> (pprint h)
[{:process 0, :type :invoke, :f :write, :value 0, :index nil}
{:process 0, :type :ok, :f :write, :value 0, :index nil}
{:process 1, :type :invoke, :f :read, :value nil, :index nil}
{:process 2, :type :invoke, :f :write, :value 4, :index nil}
{:process 1, :type :ok, :f :read, :value 3, :index nil}
{:process 2, :type :ok, :f :write, :value 4, :index nil}
{:process 3, :type :invoke, :f :read, :value nil, :index nil}
{:process 3, :type :ok, :f :read, :value 4, :index nil}]
nil
knossos.cli=> (def a (competition/analysis (model/cas-register) h))
#'knossos.cli/a
knossos.cli=> a
{:valid? false, :configs ({:model #knossos.model.CASRegister{:value 0}, :last-op {:process 0, :type :ok, :f :write, :value 0, :index 1}, :pending [{:process 1, :type :invoke, :f :read, :value 3, :index 2} {:process 2, :type :invoke, :f :write, :value 4, :index 3}]}), :final-paths #{[{:op {:process 0, :type :ok, :f :write, :value 0, :index 1}, :model #knossos.model.CASRegister{:value 0}} {:op {:process 2, :type :invoke, :f :write, :value 4, :index 3}, :model #knossos.model.CASRegister{:value 4}} {:op {:process 1, :type :ok, :f :read, :value 3, :index 4}, :model #knossos.model.Inconsistent{:msg "can't read 3 from register 4"}}] [{:op {:process 0, :type :ok, :f :write, :value 0, :index 1}, :model #knossos.model.CASRegister{:value 0}} {:op {:process 1, :type :ok, :f :read, :value 3, :index 4}, :model #knossos.model.Inconsistent{:msg "can't read 3 from register 0"}}]}, :previous-ok {:process 0, :type :ok, :f :write, :value 0, :index 1}, :last-op {:process 0, :type :ok, :f :write, :value 0, :index 1}, :op {:process 1, :type :ok, :f :read, :value 3, :index 4}}
knossos.cli=> (pprint a)
{:valid? false,
:configs
({:model {:value 0},
:last-op {:process 0, :type :ok, :f :write, :value 0, :index 1},
:pending
[{:process 1, :type :invoke, :f :read, :value 3, :index 2}
{:process 2, :type :invoke, :f :write, :value 4, :index 3}]}),
:final-paths
#{[{:op {:process 0, :type :ok, :f :write, :value 0, :index 1},
:model {:value 0}}
{:op {:process 2, :type :invoke, :f :write, :value 4, :index 3},
:model {:value 4}}
{:op {:process 1, :type :ok, :f :read, :value 3, :index 4},
:model {:msg "can't read 3 from register 4"}}]
[{:op {:process 0, :type :ok, :f :write, :value 0, :index 1},
:model {:value 0}}
{:op {:process 1, :type :ok, :f :read, :value 3, :index 4},
:model {:msg "can't read 3 from register 0"}}]},
:previous-ok {:process 0, :type :ok, :f :write, :value 0, :index 1},
:last-op {:process 0, :type :ok, :f :write, :value 0, :index 1},
:op {:process 1, :type :ok, :f :read, :value 3, :index 4}}
nil
knossos.cli=> (require '[knossos.linear.report :as report])
nil
knossos.cli=> (report/render-analysis! h a "linear.svg")
AssertionError Assert failed: No invocation for op {:process 0, :type :ok, :f :write, :value 0, :index 1}
inv knossos.linear.report/time-coords/fn--5148 (report.clj:186)
knossos.cli=>
Hi @aphyr - thank you for the amazing framework you've given us to test our databases.
I have a working repo with Jepsen testing our blockchain Go implementation at Orbs, and have encountered something weird: while verifying the history, Knossos fires many Java assertion errors like the one attached here.
The weird thing is that the history check appears to be valid, so these errors make it quite confusing to know whether the test has passed or not.
jepsen.log.zip
Would love your feedback. I am attaching a small fragment of what comes out on stdout:
WARN [2018-11-06 13:46:47,969] clojure-agent-send-pool-2 - jepsen.checker Error while checking history: java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i
  at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_181]
  at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_181]
  at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na]
  at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
  at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na]
  at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na]
  at clojure.core$map$fn__4785.invoke(core.clj:2646) ~[clojure-1.8.0.jar:na]
  at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na]
  at clojure.lang.LazySeq.seq(LazySeq.java:56) ~[clojure-1.8.0.jar:na]
  at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na]
  at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
  at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
  at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na]
  at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na]
  at jepsen.checker$compose$reify__3952.check(checker.clj:88) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.independent$checker$reify__2302$fn__2303.invoke(independent.clj:272) [na:na]
  at dom_top.core$bounded_pmap$launcher__218$fn__219.invoke(core.clj:129) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na]
  at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_181]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_181]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181]
  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Caused by: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i
  at knossos.history$complete_fold_op.invokeStatic(history.clj:187) ~[knossos-0.3.3.jar:na]
  at knossos.history$complete_fold_op.invoke(history.clj:160) ~[knossos-0.3.3.jar:na]
  at clojure.lang.ArrayChunk.reduce(ArrayChunk.java:58) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6750.invokeStatic(protocols.clj:136) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6750.invoke(protocols.clj:124) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6710$G__6705__6719.invoke(protocols.clj:19) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:31) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
  at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
  at clojure.core$reduce.invoke(core.clj:6527) ~[clojure-1.8.0.jar:na]
  at knossos.history$complete.invokeStatic(history.clj:241) ~[knossos-0.3.3.jar:na]
  at knossos.history$complete.invoke(history.clj:221) ~[knossos-0.3.3.jar:na]
  at jepsen.checker.timeline$html$reify__2386.check(timeline.clj:163) ~[na:na]
  at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.checker$compose$reify__3952$fn__3954.invoke(checker.clj:87) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
  at clojure.core$pmap$fn__6970$fn__6971.invoke(core.clj:6736) ~[clojure-1.8.0.jar:na]
  ... 6 common frames omitted
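The invariant that the assertion enforces — every :ok/:fail/:info completion must be preceded by a still-open :invoke from the same process — can be checked ahead of time to locate the offending ops. A minimal sketch (my own, not knossos code; field names follow the op maps in the log above):

```python
# Hedged sketch: scan a history for invoke/completion pairing violations.
def pairing_errors(history):
    open_invokes = set()  # processes with an invocation in flight
    errors = []
    for op in history:
        p, t = op["process"], op["type"]
        if t == "invoke":
            if p in open_invokes:
                errors.append(("double invoke", op))
            open_invokes.add(p)
        else:  # ok / fail / info completes the pending invocation
            if p not in open_invokes:
                errors.append(("completion without invocation", op))
            open_invokes.discard(p)
    return errors

bad = [{"process": 9, "type": "ok", "f": "read", "value": None}]
print(pairing_errors(bad))  # flags the completion with no prior invoke
```

In practice this kind of error usually points at the client code emitting the history (e.g. two workers sharing a process id), not at the checker itself.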
I am also attaching the full log of the run.
Thanks and hope to hear from you soon!
-- Itamar Arjuan
Well, I want to test the linearizability of my own distributed file system, which is similar to HDFS. I have seen there are some useful models already, such as cas-register and multi-register, but I don't know how to apply them to a file system, which is a little different from a key-value store. For example, in a file system, writes and reads happen at a certain offset, while a key-value store writes and reads a key's whole value. Can I use the existing models here, or do I have to write my own?
Since I'm not a native English speaker, my wording above may not be accurate; sorry in advance, and looking forward to your reply!
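For illustration only: one way to adapt the multi-register idea to a file is to treat each offset as its own register. This analogy and the op shape below are my own, not an existing knossos model:

```python
# Hedged sketch: model a file as a map from offset to value, like a
# multi-register keyed by offset. A step returns the next model state,
# or None when the transition is inconsistent.
def step(model, op):
    """model: dict mapping offset -> value; op: (f, offset, value)."""
    f, offset, value = op
    if f == "write":
        next_model = dict(model)        # persistent-style copy-on-write
        next_model[offset] = value
        return next_model
    if f == "read":
        return model if model.get(offset) == value else None
    raise ValueError("unknown op: %r" % (f,))

m = step({}, ("write", 0, "a"))
print(step(m, ("read", 0, "a")) == m)   # True: the read observes the write
print(step(m, ("read", 0, "b")))        # None: inconsistent read
```

A real knossos model would be a Clojure defrecord implementing the Model protocol with the same transition logic; multi-byte reads and writes spanning several offsets would need the op to carry a range rather than a single offset.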
Hey!
Using and loving Jepsen and Knossos.
Lately, I've run into a few situations where :ok and :fail do not allow enough information to propagate forward into the checking phase.
As an example that will apply to most situations, a :fail-ed :cas actually provides more information than :fail makes available to the checker - it constrains the register to a value not equal to the compared value at some time between the :invoke and :fail times.
As a more prominent example, a :fail-ed :cas in Cassandra LWT gives us the value at read-time that did not match v in our :cas :value [v v']. It is possible for a :fail-ed :cas to be able to provide as much information to the checker as an :ok-ed :read in this circumstance.
I have two main ideas on strategies to allow this.
Thoughts on any approach you might be interested in integrating into Knossos/Jepsen? I'd PR.
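The constraint described above could be expressed as a model transition rule; a minimal sketch of the idea (the shape is my own, not knossos's Model protocol):

```python
# Hedged sketch: a register model where a :fail-ed :cas [v v'] adds
# information -- it requires the register NOT to equal v at that point.
def step(value, op):
    """value: current register; op: (f, payload, status).
    Returns the next value, or None if the transition is inconsistent."""
    f, payload, status = op
    if f == "cas":
        v, v2 = payload
        if status == "ok":
            return v2 if value == v else None     # ok cas: register must equal v
        if status == "fail":
            return value if value != v else None  # failed cas: register must differ from v
    if f == "write" and status == "ok":
        return payload
    if f == "read" and status == "ok":
        return value if value == payload else None
    return value  # ops we don't model leave the register unchanged

print(step(3, ("cas", (4, 5), "fail")))  # 3: the failure is explainable
print(step(4, ("cas", (4, 5), "fail")))  # None: register 4 contradicts the failed cas
```

A search that applies this rule prunes exactly the linearizations where the failed cas would have succeeded, which is the extra constraint the Cassandra LWT case provides.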
Out of curiosity, I tried the command at https://github.com/jepsen-io/knossos#quickly to see what happens, but it fails with multiple OpenJDK versions:
With OpenJDK 8.0:
$ lein run data/cas-register/bad/bad-analysis.edn
Unrecognized VM option 'UnlockCommercialFeatures'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
With OpenJDK 9.0.4:
$ lein run data/cas-register/bad/bad-analysis.edn
OpenJDK 64-Bit Server VM warning: Ignoring option UseFastAccessorMethods; support was removed in 9.0
Unrecognized VM option 'UnlockCommercialFeatures'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
With JDK 10.0.2:
$ lein run data/cas-register/bad/bad-analysis.edn
Unrecognized VM option 'UseParNewGC'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
With JDK 11.0.6:
$ lein run data/cas-register/bad/bad-analysis.edn
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Unrecognized VM option 'UseParNewGC'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Any instructions about that?
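For context, a likely cause (I haven't verified the repository's exact project.clj contents): the errors come from JVM flags that newer JDKs rejected. UnlockCommercialFeatures only ever existed on Oracle JDK, and UseParNewGC was removed in JDK 10, so trimming the option list to flags current JDKs accept should let the process start. A hypothetical fragment:

```clojure
;; Hypothetical project.clj fragment: keep only flags current JDKs accept.
;; -XX:+UnlockCommercialFeatures (Oracle-JDK-only) and -XX:+UseParNewGC
;; (removed in JDK 10) are dropped.
:jvm-opts ["-server" "-XX:+UseG1GC"]
```

Alternatively, overriding :jvm-opts via a Leiningen profile avoids editing the checked-in project.clj.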
As I understand it, the idea of the knossos.competition checker is to run the linear and wgl checkers in parallel. But linear/start-analysis actually blocks the calling thread on the knossos.model.memo/memo function call, which seems to do almost the whole analysis. As a result, mem-watch and reporter in knossos.search/run also don't work, because linear/start-analysis needs to complete before mem-watch and reporter are started.
I read:
See knossos.core for the linearizability checker, and knossos.redis for a model that can generate histories verifiable with Knossos.
I was confused by not finding knossos.redis or any variation of it, until I dug into the history and found 08166a4. What's an example model to get started with now?
Have you considered applying partial order reduction as a technique for reducing the runtime of knossos?
Basic idea: if two messages are concurrent (formally: if one message does not happen-before
the other), they must be independent, therefore you do not need to consider both interleavings of those messages.
In a distributed system, you wouldn't need to modify the systems under test to infer concurrent messages; you could infer concurrent messages by attaching vector clocks to events in the network. cf. Modist and shivector.
Sidenote: For a nice overview of state space reduction techniques, check out section 2.2 of this paper.
Hi there,
I am not sure if I should open this issue to this repo or the jepsen repo. I am giving a shot here and can port the issue to the jepsen repo if requested.
I have a very simple Jepsen test that takes 20 seconds and injects no faults into the environment. It is a mutex test for now, but I will use it for testing a re-entrant mutex in the future once I make this version work.
In my test, I have 5 clients. Each client acquires and releases the same distributed lock object repeatedly.
Here is my generator:
:generator (->> [{:type :invoke, :f :acquire}
                 ;(gen/sleep 1)
                 {:type :invoke, :f :release}
                 (gen/sleep 1)]
                cycle
                gen/seq
                gen/each
                (gen/stagger 1/10))
Here is my custom model:
(declare createReentrantMutex)

(defrecord ReentrantMutex [owner lockCount]
  Model
  (step [this op]
    (if (nil? (:value op))
      (knossos.model/inconsistent "no owner!")
      (condp = (:f op)
        :acquire (if (or (nil? owner) (= owner (:value op)))
                   (createReentrantMutex this (:value op) (inc lockCount) op)
                   (do
                     ;(info (str "cannot acquire owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot acquire")))
        :release (if (or (nil? owner) (not= owner (:value op)))
                   (do
                     ;(info (str "cannot release owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot release"))
                   (if (= lockCount 1)
                     (createReentrantMutex this nil 0 op)
                     (createReentrantMutex this owner (dec lockCount) op))))))

  Object
  (toString [this] (str "owner: " owner ", lockCount: " lockCount)))

(defn createEmptyReentrantMutex []
  (let [initial (ReentrantMutex. nil 0)
        ;_ (info (str "initial model: " (System/identityHashCode initial)))
        ]
    initial))

(defn createReentrantMutex [prev owner lockCount op]
  (let [newModel (ReentrantMutex. owner lockCount)
        ;_ (info (str "owner: " owner ", count: " lockCount ", op: " op " new hash: " (System/identityHashCode newModel) " prev owner: " (:owner prev) " prev lock count: " (:lockCount prev) " prev hash:" (System/identityHashCode prev)))
        ]
    newModel))
My model is the ReentrantMutex defrecord. Initially it is ReentrantMutex(nil, 0), meaning nobody holds the lock. For instance, if n1 acquires the lock, it becomes ReentrantMutex("n1", 1).
When I run this test for 20 seconds without introducing any failure in the system, I get a history of operations.
I am sharing my history files and the timeline file here.
history.txt
history.edn.txt
timeline.html.txt
Sorry for the file extensions. Please rename "history.edn.txt" to "history.edn" and "timeline.html.txt" to "timeline.html".
In short, the history of operations is this one: N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ => N4_REL => N1_ACQ => N1_REL => N5_ACQ => N5_REL => N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ => N4_REL => N1_ACQ => N1_REL => N5_ACQ => N5_REL => N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ
So the lock is held by N3, N2, N4, N1, N5, N3, N2, N4, N1, N5, N3, N2, N4
in order. You can verify this by checking the history files and the timeline file.
The thing is, when the analyzer starts running with my model, it never terminates (with the info logs above commented out). I cannot say I fully understand how it works internally, but I see that it walks through possible histories based on the concurrency of operations, each path failing at one of the knossos.model/inconsistent returns in my model. Moreover, when I run it with the logs printed, I see states like ReentrantMutex("n2", 555). Probably I am misunderstanding something, but I don't expect to see such states, because the acquire and release operations of the same client are sequential. Since the history is short, I expect the analyzer to complete the verification step quickly. Therefore, I suspect I am doing something wrong in my model, or using the tool in some wrong way.
I have tried many things, but couldn't make this model work. Could you help me with this issue? Any hint or pointer would be very helpful.
Thanks in advance.
I've run into an issue with the register model where Knossos says a certain history is linearizable when, in fact, it looks non-linearizable.
The history in question is shown in etcd_005.log. Under certain assumptions (clarified below), the history ceases to be linearizable at the following entry on line 131:
INFO jepsen.util - 16 :ok :read 4
In inspecting this history log, I made two assumptions. First, I assumed that failed CAS instructions have no effect on the state of the model. For example, at the beginning of the log, we have:
INFO jepsen.util - 1 :invoke :cas [0 1]
INFO jepsen.util - 1 :fail :cas [0 1]
I ignored these kinds of failed CAS instructions. Second, I ignored timed-out entries. For example, I ignored
INFO jepsen.util - 0 :invoke :write 4
INFO jepsen.util - 0 :info :write :timed-out
on line 83 and line 89, respectively.
Under those assumptions, the above history is non-linearizable. I wanted to double check if this is expected, or not. If my assumptions are too strong, I'd appreciate if you could clarify what Knossos assumes about failed CAS and timeouts.
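The second assumption is likely the crux: a timed-out (:info) operation may still have taken effect, so a checker cannot simply drop it. A toy brute-force register check illustrating why (my own sketch, which ignores real-time ordering for brevity; a real checker must also respect invoke/completion windows):

```python
from itertools import permutations, chain, combinations

def linearizable(history, init=0):
    """Brute-force register check over tiny histories. Every :ok op must
    appear in the linearization; each :info op may or may not have taken
    effect, so we try every subset of them. Op shape: (f, value, status)."""
    ok   = [op for op in history if op[2] == "ok"]
    info = [op for op in history if op[2] == "info"]
    subsets = chain.from_iterable(combinations(info, k) for k in range(len(info) + 1))
    for sub in subsets:
        for order in permutations(ok + list(sub)):
            v, good = init, True
            for f, val, _ in order:
                if f == "write":
                    v = val
                elif f == "read" and v != val:
                    good = False
                    break
            if good:
                return True
    return False

h = [("write", 4, "info"),   # timed out -- may or may not have happened
     ("read", 4, "ok")]      # this read only makes sense if the write did
print(linearizable(h))                    # True: the :info write explains the read
print(linearizable([("read", 4, "ok")]))  # False: nothing wrote 4
```

Under this reading, the reads of 0 that look wrong when timeouts are dropped can be explained by a timed-out write 0 that actually landed, which would be why Knossos accepts the history.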
Hi,
I'm using Knossos to verify linearizability on a history generated by a distributed system that is, for now, modelling a register. The checker is reporting that the history is not linearizable, but when I ask for a rendering of the problem sections, I get the error
Execution error (AssertionError) at knossos.linear.report/time-coords$fn (report.clj:184).
Assert failed: Expected index but got nil for op #knossos.op.Op{:process nil, :type nil, :f nil, :value nil, :index nil}
Given that I'm writing out a history "by hand", I'm assuming that I've done something wrong, but the current error being reported isn't helping me determine what exactly I'm doing wrong.
This is the history file I'm using, and this is what I'm doing to try and render it (which has worked with other broken histories in the past, so I'm pretty sure it's capable of working):
knossos.cli=> (require '[knossos.model :as model] '[knossos.competition :as competition])
nil
knossos.cli=> (require '[knossos.linear.report :as report])
nil
knossos.cli=> (def h (read-history "edn.txt"))
#'knossos.cli/h
knossos.cli=> (def a (competition/analysis (model/register) h))
Jun 25, 2020 5:51:51 PM clojure.tools.logging$eval1047$fn__1050 invoke
INFO: More than 1024 reachable models; not memoizing models for this search
#'knossos.cli/a
knossos.cli=> Jun 25, 2020 5:51:51 PM clojure.tools.logging$eval1047$fn__1050 invoke
INFO: More than 1024 reachable models; not memoizing models for this search
knossos.cli=> (report/render-analysis! h a "linear.svg")
Execution error (AssertionError) at knossos.linear.report/time-coords$fn (report.clj:185).
Assert failed: Expected index but got nil for op #knossos.op.Op{:process nil, :type nil, :f nil, :value nil, :index nil}
Any assistance appreciated.