jepsen-io / knossos

Verifies the linearizability of experimentally accessible histories.
I use the pthread_self function for the :process id of events. However, these ids are 64-bit pointers and so don't fit some of the ways Clojure uses them. I think this might actually be a common use case, but maybe there is too much of a performance problem or something.
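A workaround sketch, assuming the goal is simply a stable small integer per thread: compact the raw 64-bit thread ids into sequential ids at log time. This id-compaction scheme is my own suggestion, not part of knossos.

```python
# Hedged sketch: instead of logging raw pthread_self() values (64-bit
# pointers) as :process ids, assign each thread a small sequential id
# the first time it is seen.
class ProcessIds:
    def __init__(self):
        self._ids = {}

    def get(self, thread_id):
        # setdefault assigns the next sequential id on first sight,
        # and returns the existing one thereafter.
        return self._ids.setdefault(thread_id, len(self._ids))

pids = ProcessIds()
print(pids.get(0x7f8f9200e100))  # 0
print(pids.get(0x7f8f92010200))  # 1
print(pids.get(0x7f8f9200e100))  # 0 again: stable per thread
```

The same table could be kept in the C logger (a hash map guarded by a lock on the slow path), so the emitted history already uses small ids.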
Hi,
Every now and then, running knossos after Jepsen tests throws the exception below. I copied the history as well. Am I doing something stupid?
Thanks,
Balint
In the Jepsen test I'm using the simple cas-register model:
:model (model/cas-register 42)
:checker (checker/compose
{:perf (checker/perf)
:linear checker/linearizable})
The exception:
java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: i
at java.util.concurrent.FutureTask.report (FutureTask.java:122)
java.util.concurrent.FutureTask.get (FutureTask.java:192)
clojure.core$deref_future.invoke (core.clj:2186)
clojure.core$future_call$reify__6736.deref (core.clj:6683)
clojure.core$deref.invoke (core.clj:2206)
clojure.core$map$fn__4553.invoke (core.clj:2624)
clojure.lang.LazySeq.sval (LazySeq.java:40)
clojure.lang.LazySeq.seq (LazySeq.java:49)
clojure.lang.Cons.next (Cons.java:39)
clojure.lang.RT.next (RT.java:674)
clojure.core/next (core.clj:64)
clojure.core.protocols/fn (protocols.clj:170)
clojure.core.protocols$fn__6478$G__6473__6487.invoke (protocols.clj:19)
clojure.core.protocols$seq_reduce.invoke (protocols.clj:31)
clojure.core.protocols/fn (protocols.clj:101)
clojure.core.protocols$fn__6452$G__6447__6465.invoke (protocols.clj:13)
clojure.core$reduce.invoke (core.clj:6519)
clojure.core$into.invoke (core.clj:6600)
jepsen.checker$compose$reify__6600.check (checker.clj:257)
jepsen.checker$check_safe.invoke (checker.clj:39)
jepsen.core$run_BANG_$fn__6821.invoke (core.clj:421)
jepsen.core$run_BANG_.invoke (core.clj:373)
jepsen.influxdb_test/fn (influxdb_test.clj:13)
clojure.test$test_var$fn__7670.invoke (test.clj:704)
clojure.test$test_var.invoke (test.clj:704)
clojure.test$test_vars$fn__7692$fn__7697.invoke (test.clj:722)
clojure.test$default_fixture.invoke (test.clj:674)
clojure.test$test_vars$fn__7692.invoke (test.clj:722)
clojure.test$default_fixture.invoke (test.clj:674)
clojure.test$test_vars.invoke (test.clj:718)
clojure.test$test_all_vars.invoke (test.clj:728)
clojure.test$test_ns.invoke (test.clj:747)
clojure.core$map$fn__4553.invoke (core.clj:2624)
clojure.lang.LazySeq.sval (LazySeq.java:40)
clojure.lang.LazySeq.seq (LazySeq.java:49)
clojure.lang.Cons.next (Cons.java:39)
clojure.lang.RT.boundedLength (RT.java:1735)
clojure.lang.RestFn.applyTo (RestFn.java:130)
clojure.core$apply.invoke (core.clj:632)
clojure.test$run_tests.doInvoke (test.clj:762)
clojure.lang.RestFn.applyTo (RestFn.java:137)
clojure.core$apply.invoke (core.clj:630)
user$eval85$fn__144$fn__175.invoke (form-init3009603524082798247.clj:1)
user$eval85$fn__144$fn__145.invoke (form-init3009603524082798247.clj:1)
user$eval85$fn__144.invoke (form-init3009603524082798247.clj:1)
user$eval85.invoke (form-init3009603524082798247.clj:1)
clojure.lang.Compiler.eval (Compiler.java:6782)
clojure.lang.Compiler.eval (Compiler.java:6772)
clojure.lang.Compiler.load (Compiler.java:7227)
clojure.lang.Compiler.loadFile (Compiler.java:7165)
clojure.main$load_script.invoke (main.clj:275)
clojure.main$init_opt.invoke (main.clj:280)
clojure.main$initialize.invoke (main.clj:308)
clojure.main$null_opt.invoke (main.clj:343)
clojure.main$main.doInvoke (main.clj:421)
clojure.lang.RestFn.invoke (RestFn.java:421)
clojure.lang.Var.invoke (Var.java:383)
clojure.lang.AFn.applyToHelper (AFn.java:156)
clojure.lang.Var.applyTo (Var.java:700)
clojure.main.main (main.java:37)
Caused by: java.lang.AssertionError: Assert failed: i
at knossos.linear.report$time_coords$fn__6396.invoke (report.clj:179)
clojure.core$map$fn__4553.invoke (core.clj:2624)
clojure.lang.LazySeq.sval (LazySeq.java:40)
clojure.lang.LazySeq.seq (LazySeq.java:49)
clojure.lang.RT.seq (RT.java:507)
clojure.core/seq (core.clj:137)
clojure.core.protocols$seq_reduce.invoke (protocols.clj:30)
clojure.core.protocols/fn (protocols.clj:101)
clojure.core.protocols$fn__6452$G__6447__6465.invoke (protocols.clj:13)
clojure.core$reduce.invoke (core.clj:6519)
clojure.core$into.invoke (core.clj:6600)
knossos.linear.report$time_coords.invoke (report.clj:185)
knossos.linear.report$learnings.invoke (report.clj:427)
knossos.linear.report$render_analysis_BANG_.invoke (report.clj:632)
jepsen.checker$reify__6560.check (checker.clj:55)
jepsen.checker$compose$reify__6600$fn__6602.invoke (checker.clj:256)
clojure.core$pmap$fn__6744$fn__6745.invoke (core.clj:6729)
clojure.core$binding_conveyor_fn$fn__4444.invoke (core.clj:1916)
clojure.lang.AFn.call (AFn.java:18)
java.util.concurrent.FutureTask.run (FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
java.lang.Thread.run (Thread.java:745)
The history of operations is the following (a simple mix of read and write ops):
0 :invoke :read nil
0 :ok :read 42.0
0 :invoke :read nil
0 :ok :read 42.0
1 :invoke :read nil
1 :ok :read 42.0
2 :invoke :read nil
2 :ok :read 42.0
0 :invoke :read nil
0 :ok :read 42.0
1 :invoke :write 4.0
1 :ok :write 4.0
0 :invoke :write 4.0
0 :ok :write 4.0
2 :invoke :write 1.0
2 :ok :write 1.0
2 :invoke :write 3.0
2 :ok :write 3.0
0 :invoke :write 1.0
0 :ok :write 1.0
1 :invoke :read nil
1 :ok :read 1.0
0 :invoke :write 1.0
0 :ok :write 1.0
0 :invoke :read nil
0 :ok :read 1.0
2 :invoke :write 0.0
2 :ok :write 0.0
2 :invoke :read nil
0 :invoke :write 3.0
2 :ok :read 0.0
0 :ok :write 3.0
1 :invoke :read nil
1 :ok :read 3.0
0 :invoke :write 0.0
0 :ok :write 0.0
2 :invoke :write 2.0
2 :ok :write 2.0
1 :invoke :read nil
1 :ok :read 2.0
0 :invoke :write 4.0
0 :ok :write 4.0
2 :invoke :write 0.0
2 :ok :write 0.0
2 :invoke :write 2.0
2 :ok :write 2.0
1 :invoke :write 4.0
1 :ok :write 4.0
0 :invoke :read nil
0 :ok :read 4.0
1 :invoke :write 0.0
1 :ok :write 0.0
2 :invoke :read nil
2 :ok :read 0.0
0 :invoke :read nil
0 :ok :read 0.0
1 :invoke :write 1.0
1 :ok :write 1.0
2 :invoke :read nil
2 :ok :read 1.0
0 :invoke :read nil
0 :ok :read 1.0
1 :invoke :write 0.0
1 :ok :write 0.0
2 :invoke :read nil
2 :ok :read 0.0
0 :invoke :read nil
0 :ok :read 0.0
1 :invoke :write 1.0
1 :ok :write 1.0
2 :invoke :read nil
2 :ok :read 1.0
0 :invoke :write 2.0
0 :ok :write 2.0
1 :invoke :write 3.0
1 :ok :write 3.0
2 :invoke :read nil
2 :ok :read 3.0
With the 'register' model:
1 0 :invoke :write 1
2 0 :ok :write 1
3 0 :invoke :write 3
4 0 :ok :write 3
5 0 :invoke :write 2
6 0 :ok :write 2
7 0 :invoke :read nil
8 0 :ok :read 2
9 0 :invoke :write 4
10 0 :ok :write 4
11 0 :invoke :read nil
12 0 :ok :read 4
13 0 :invoke :read nil
14 0 :ok :read 4
15 0 :invoke :write 4
16 0 :info :write 4 :timeout
17 1 :invoke :read nil
18 1 :ok :read 4
19 1 :invoke :write 0
20 1 :info :write 0 :timeout
21 2 :invoke :read nil
22 2 :ok :read 4
23 2 :invoke :write 0
24 2 :info :write 0 :timeout
25 3 :invoke :read nil
26 3 :ok :read 0
27 3 :invoke :read nil
28 3 :ok :read 0
29 3 :invoke :read nil
30 3 :ok :read 0
31 3 :invoke :read nil
32 3 :ok :read 0
33 3 :invoke :write 0
34 3 :info :write 0 :timeout
35 4 :invoke :read nil
36 4 :ok :read 4
37 4 :invoke :write 1
38 4 :info :write 1 :timeout
39 5 :invoke :read nil
40 5 :ok :read 0
The reads on lines 26, 28, 30, and 32 should have gotten 4 instead of 0.
The read on line 40 should have gotten 4 instead of 0, although the interim read on line 36 correctly got 4.
Hello,
I am currently working on a concurrent spin-lock algorithm for many-core machines, similar to the MCS and CLH queue lock algorithms but one that can replace the queue with a stack (for more performance) or another data structure. See https://sstewartgallus.com/git?p=qlock.git;a=tree for the code. I modelled the algorithm in TLA+, checked it for a small number of threads, and found and fixed some deadlocks. However, TLA+ does not model weak memory orderings, and I'm concerned my algorithm may not use acquire and release atomics appropriately.
To address this, I created a mode that writes a history of operations to a log (although I still need to make sure that writing to the log doesn't itself cause synchronization that would prevent memory reorderings), in a format like:
{:process 36, :type :invoke, :f :try_acquire, :value 0x7f8f9200e100}
{:process 36, :type :ok, :f :try_acquire, :value true}
{:process 40, :type :invoke, :f :pop, :value 0x7f8f9200e080}
{:process 40, :type :ok, :f :pop, :value 0x55bea84a3400}
{:process 41, :type :invoke, :f :release, :value 0x7f8f9200e100}
{:process 41, :type :ok, :f :release, :value (list)}
{:process 42, :type :invoke, :f :empty, :value 0x7f8f9200e080 }
{:process 42, :type :ok, :f :empty, :value true }
{:process 43, :type :invoke, :f :try_acquire, :value 0x7f8f9200e100}
{:process 43, :type :ok, :f :try_acquire, :value true}
{:process 44, :type :invoke, :f :pop, :value 0x7f8f9200e080}
As I understand it, Knossos should let me check whether such a log is linearizable, and thereby experimentally determine that I am using weak memory orderings correctly. However, I am finding it very difficult to get started.
Is there any extra documentation about getting started with Knossos to check a history for linearizability? As far as I am aware, most documentation focuses on Jepsen, which is heavily built around distributed and networked systems and doesn't just take a preexisting history.
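One practical snag with the log format above: the raw hex pointer literals may not survive an EDN read, since EDN proper has no hex integer syntax. Quoting them as strings before handing the file to the checker is one workaround; a minimal sketch (the regex and approach are my own, not a knossos utility):

```python
import re

# Hedged sketch: wrap hex pointer literals like 0x7f8f9200e100 in quotes
# so an EDN reader treats them as opaque strings rather than numbers.
def quote_pointers(line):
    return re.sub(r"0x[0-9a-fA-F]+", lambda m: '"%s"' % m.group(0), line)

print(quote_pointers("{:process 40, :type :invoke, :f :pop, :value 0x7f8f9200e080}"))
# {:process 40, :type :invoke, :f :pop, :value "0x7f8f9200e080"}
```

Run over every line of the log before feeding the result to the checker; the values only need to be comparable, not numeric, for linearizability checking.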
The Jepsen linearizable test fails for the attached history, but I don't see anything wrong.
history.txt
results.edn.txt
0 :invoke :cas [4 0]
0 :ok :cas [4 0]
0 :invoke :write 4
0 :ok :write 4
3 :invoke :cas [4 2]
1 :invoke :write 0
1 :ok :write 0
3 :ok :cas [4 2]
3 :invoke :read nil
3 :ok :read 2
2 :invoke :write 1
2 :ok :write 1
:nemesis :info :start nil
2 :invoke :read nil
1 :invoke :read nil
2 :ok :read 1
Knossos says that process 3 "can't read 2 from register 0".
I read your blog post with great interest. I had a few questions about how Knossos compares with other related work on linearizability not mentioned in the post, and so I set out to experiment with a variant of the Wing and Gong algorithm, including later extensions by Lowe. This is work in progress, but the results seem promising. You seem to enjoy fun problems, so I only wanted to share that link in case you feel like bouncing around ideas or providing feedback, questions, etc.
PS: Thanks for the Jepsen infrastructure, by the way.
Assume you have this history.edn file:
[ {:process 1, :type :invoke, :f :cas, :value [2 0]}
{:process 1, :type :info, :f :cas, :value [2 0]}
{:process 1, :type :invoke, :f :read, :value nil}]
Running the above history against Knossos gives me the following result:
java.lang.RuntimeException: Process 1 already running #knossos.op.Op{:process 1, :type :invoke, :f :cas, :value [2 0], :index 0}, yet attempted to invoke #knossos.op.Op{:process 1, :type :invoke, :f :read, :value nil, :index 2} concurrently
It seems to me that after an info response, a process cannot do anything else. Looking at other histories in the knossos repository, it does indeed seem that once a process gets an info message, it never invokes another operation after it. Is this indeed the case? Maybe the creator of that issue had a similar problem.
Reading the Knossos concepts here:
Every logical transition in Knossos is represented by a pair of operations: an invoke, when the op begins, and a completion: either ok if the operation took place, fail if the operation did not take place, or info if you're not sure what happened, e.g. the operation timed out.
An info response seems like what is adequate here: we do not know whether the cas succeeded or not.
It seems to me this does not have to be the case. For example, I could imagine a client c getting an operation timeout and hence logging an info message, yet c might still want to invoke operations later in the execution. The easy fix would be that after every timed-out operation, client c starts logging operations (i.e., invocations and responses) under a different client identifier c'. Is there a better way to solve this issue in Knossos?
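One way to apply the workaround described above — my own sketch, not a knossos utility — is to rewrite the history before checking it, retiring a process's id after each :info completion:

```python
# Hedged sketch: after a process logs an :info (crashed/timed-out)
# completion, rename its subsequent operations to a fresh process id,
# so the checker never sees the "already running" conflict.
# Field names follow the EDN histories above; the scheme is my own.
def remap_processes(history):
    next_id = max(op["process"] for op in history) + 1
    alias = {}                      # original id -> current alias
    out = []
    for op in history:
        p = op["process"]
        out.append(dict(op, process=alias.get(p, p)))
        if op["type"] == "info":    # this process is now "lost"; retire its id
            alias[p] = next_id
            next_id += 1
    return out

h = [{"process": 1, "type": "invoke", "f": "cas", "value": [2, 0]},
     {"process": 1, "type": "info",   "f": "cas", "value": [2, 0]},
     {"process": 1, "type": "invoke", "f": "read", "value": None}]
print(remap_processes(h)[-1]["process"])  # 2: the later invoke runs as a new process
```

Note that the :info op itself keeps the original id, so the checker still treats that cas as possibly in flight forever, which is the semantics an uncertain outcome requires.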
I get the following when trying to run the example in the README. I'm on Ubuntu 17.10 with Oracle JDK 9. I also tried on OS X previously and got the same thing.
knossos.cli=> (def h (read-history "data/cas-register/bad/rethink-fail-minimal.edn"))
#'knossos.cli/h
knossos.cli=> (pprint h)
[{:process 0, :type :invoke, :f :write, :value 0, :index nil}
{:process 0, :type :ok, :f :write, :value 0, :index nil}
{:process 1, :type :invoke, :f :read, :value nil, :index nil}
{:process 2, :type :invoke, :f :write, :value 4, :index nil}
{:process 1, :type :ok, :f :read, :value 3, :index nil}
{:process 2, :type :ok, :f :write, :value 4, :index nil}
{:process 3, :type :invoke, :f :read, :value nil, :index nil}
{:process 3, :type :ok, :f :read, :value 4, :index nil}]
nil
knossos.cli=> (def a (competition/analysis (model/cas-register) h))
#'knossos.cli/a
knossos.cli=> a
{:valid? false, :configs ({:model #knossos.model.CASRegister{:value 0}, :last-op {:process 0, :type :ok, :f :write, :value 0, :index 1}, :pending [{:process 1, :type :invoke, :f :read, :value 3, :index 2} {:process 2, :type :invoke, :f :write, :value 4, :index 3}]}), :final-paths #{[{:op {:process 0, :type :ok, :f :write, :value 0, :index 1}, :model #knossos.model.CASRegister{:value 0}} {:op {:process 2, :type :invoke, :f :write, :value 4, :index 3}, :model #knossos.model.CASRegister{:value 4}} {:op {:process 1, :type :ok, :f :read, :value 3, :index 4}, :model #knossos.model.Inconsistent{:msg "can't read 3 from register 4"}}] [{:op {:process 0, :type :ok, :f :write, :value 0, :index 1}, :model #knossos.model.CASRegister{:value 0}} {:op {:process 1, :type :ok, :f :read, :value 3, :index 4}, :model #knossos.model.Inconsistent{:msg "can't read 3 from register 0"}}]}, :previous-ok {:process 0, :type :ok, :f :write, :value 0, :index 1}, :last-op {:process 0, :type :ok, :f :write, :value 0, :index 1}, :op {:process 1, :type :ok, :f :read, :value 3, :index 4}}
knossos.cli=> (pprint a)
{:valid? false,
:configs
({:model {:value 0},
:last-op {:process 0, :type :ok, :f :write, :value 0, :index 1},
:pending
[{:process 1, :type :invoke, :f :read, :value 3, :index 2}
{:process 2, :type :invoke, :f :write, :value 4, :index 3}]}),
:final-paths
#{[{:op {:process 0, :type :ok, :f :write, :value 0, :index 1},
:model {:value 0}}
{:op {:process 2, :type :invoke, :f :write, :value 4, :index 3},
:model {:value 4}}
{:op {:process 1, :type :ok, :f :read, :value 3, :index 4},
:model {:msg "can't read 3 from register 4"}}]
[{:op {:process 0, :type :ok, :f :write, :value 0, :index 1},
:model {:value 0}}
{:op {:process 1, :type :ok, :f :read, :value 3, :index 4},
:model {:msg "can't read 3 from register 0"}}]},
:previous-ok {:process 0, :type :ok, :f :write, :value 0, :index 1},
:last-op {:process 0, :type :ok, :f :write, :value 0, :index 1},
:op {:process 1, :type :ok, :f :read, :value 3, :index 4}}
nil
knossos.cli=> (require '[knossos.linear.report :as report])
nil
knossos.cli=> (report/render-analysis! h a "linear.svg")
AssertionError Assert failed: No invocation for op {:process 0, :type :ok, :f :write, :value 0, :index 1}
inv knossos.linear.report/time-coords/fn--5148 (report.clj:186)
knossos.cli=>
Hi @aphyr - thank you for the amazing framework you've given us to test our databases.
I have a working repo with Jepsen testing our blockchain Go implementation at Orbs, and have encountered something weird: while verifying the history, Knossos fires many Java assertion errors like the one attached here.
The weird thing is that the history check appears to be valid, so these errors make it quite confusing to know whether the test has passed or not.
jepsen.log.zip
Would love your feedback. I am attaching a small fragment of what comes out on stdout:
WARN [2018-11-06 13:46:47,969] clojure-agent-send-pool-2 - jepsen.checker Error while checking history: java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i
  at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_181]
  at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_181]
  at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na]
  at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
  at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na]
  at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na]
  at clojure.core$map$fn__4785.invoke(core.clj:2646) ~[clojure-1.8.0.jar:na]
  at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na]
  at clojure.lang.LazySeq.seq(LazySeq.java:56) ~[clojure-1.8.0.jar:na]
  at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na]
  at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
  at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
  at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na]
  at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na]
  at jepsen.checker$compose$reify__3952.check(checker.clj:88) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.independent$checker$reify__2302$fn__2303.invoke(independent.clj:272) [na:na]
  at dom_top.core$bounded_pmap$launcher__218$fn__219.invoke(core.clj:129) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na]
  at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_181]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_181]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181]
  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Caused by: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i
  at knossos.history$complete_fold_op.invokeStatic(history.clj:187) ~[knossos-0.3.3.jar:na]
  at knossos.history$complete_fold_op.invoke(history.clj:160) ~[knossos-0.3.3.jar:na]
  at clojure.lang.ArrayChunk.reduce(ArrayChunk.java:58) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6750.invokeStatic(protocols.clj:136) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6750.invoke(protocols.clj:124) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6710$G__6705__6719.invoke(protocols.clj:19) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:31) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
  at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
  at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
  at clojure.core$reduce.invoke(core.clj:6527) ~[clojure-1.8.0.jar:na]
  at knossos.history$complete.invokeStatic(history.clj:241) ~[knossos-0.3.3.jar:na]
  at knossos.history$complete.invoke(history.clj:221) ~[knossos-0.3.3.jar:na]
  at jepsen.checker.timeline$html$reify__2386.check(timeline.clj:163) ~[na:na]
  at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na]
  at jepsen.checker$compose$reify__3952$fn__3954.invoke(checker.clj:87) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
  at clojure.core$pmap$fn__6970$fn__6971.invoke(core.clj:6736) ~[clojure-1.8.0.jar:na]
  ... 6 common frames omitted
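The invariant that the assertion enforces — every :ok/:fail/:info completion must be preceded by a still-open :invoke from the same process — can be checked ahead of time to locate the offending ops. A minimal sketch (my own, not knossos code; field names follow the op maps in the log above):

```python
# Hedged sketch: scan a history for invoke/completion pairing violations.
def pairing_errors(history):
    open_invokes = set()  # processes with an invocation in flight
    errors = []
    for op in history:
        p, t = op["process"], op["type"]
        if t == "invoke":
            if p in open_invokes:
                errors.append(("double invoke", op))
            open_invokes.add(p)
        else:  # ok / fail / info completes the pending invocation
            if p not in open_invokes:
                errors.append(("completion without invocation", op))
            open_invokes.discard(p)
    return errors

bad = [{"process": 9, "type": "ok", "f": "read", "value": None}]
print(pairing_errors(bad))  # flags the completion with no prior invoke
```

In practice this kind of error usually points at the client code emitting the history (e.g. two workers sharing a process id), not at the checker itself.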
I am also attaching the full log of the run.
Thanks and hope to hear from you soon!
-- Itamar Arjuan
Well, I want to test the linearizability of my own distributed file system, which is similar to HDFS. I have seen there are some useful models already, such as cas-register and multi-register, but I don't know how to apply them to a file system, which is a little different from a key-value store. For example, in a file system, writes and reads happen at a certain offset, while a key-value store writes and reads a key's whole value. Can I use the existing models here, or do I have to write my own?
Since I'm not a native English speaker, my wording above may not be accurate; sorry in advance, and looking forward to your reply!
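For illustration only: one way to adapt the multi-register idea to a file is to treat each offset as its own register. This analogy and the op shape below are my own, not an existing knossos model:

```python
# Hedged sketch: model a file as a map from offset to value, like a
# multi-register keyed by offset. A step returns the next model state,
# or None when the transition is inconsistent.
def step(model, op):
    """model: dict mapping offset -> value; op: (f, offset, value)."""
    f, offset, value = op
    if f == "write":
        next_model = dict(model)        # persistent-style copy-on-write
        next_model[offset] = value
        return next_model
    if f == "read":
        return model if model.get(offset) == value else None
    raise ValueError("unknown op: %r" % (f,))

m = step({}, ("write", 0, "a"))
print(step(m, ("read", 0, "a")) == m)   # True: the read observes the write
print(step(m, ("read", 0, "b")))        # None: inconsistent read
```

A real knossos model would be a Clojure defrecord implementing the Model protocol with the same transition logic; multi-byte reads and writes spanning several offsets would need the op to carry a range rather than a single offset.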
Hey!
Using and loving Jepsen and Knossos.
Lately, I've run into a few situations where :ok and :fail do not allow enough information to propagate forward into the checking phase.
As an example that will apply to most situations, a :fail-ed :cas actually provides more information than :fail makes available to the checker - it constrains the register to a value not equal to the compared value at some time between the :invoke and :fail times.
As a more prominent example, a :fail-ed :cas in Cassandra LWT gives us the value at read-time that did not match v in our :cas :value [v v']. It is possible for a :fail-ed :cas to be able to provide as much information to the checker as an :ok-ed :read in this circumstance.
I have two main ideas on strategies to allow this.
Thoughts on any approach you might be interested in integrating into Knossos/Jepsen? I'd PR.
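The constraint described above could be expressed as a model transition rule; a minimal sketch of the idea (the shape is my own, not knossos's Model protocol):

```python
# Hedged sketch: a register model where a :fail-ed :cas [v v'] adds
# information -- it requires the register NOT to equal v at that point.
def step(value, op):
    """value: current register; op: (f, payload, status).
    Returns the next value, or None if the transition is inconsistent."""
    f, payload, status = op
    if f == "cas":
        v, v2 = payload
        if status == "ok":
            return v2 if value == v else None     # ok cas: register must equal v
        if status == "fail":
            return value if value != v else None  # failed cas: register must differ from v
    if f == "write" and status == "ok":
        return payload
    if f == "read" and status == "ok":
        return value if value == payload else None
    return value  # ops we don't model leave the register unchanged

print(step(3, ("cas", (4, 5), "fail")))  # 3: the failure is explainable
print(step(4, ("cas", (4, 5), "fail")))  # None: register 4 contradicts the failed cas
```

A search that applies this rule prunes exactly the linearizations where the failed cas would have succeeded, which is the extra constraint the Cassandra LWT case provides.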
Out of curiosity, I tried the command at https://github.com/jepsen-io/knossos#quickly to see what happens, but it fails with multiple OpenJDK versions:
With OpenJDK 8.0:
$ lein run data/cas-register/bad/bad-analysis.edn
Unrecognized VM option 'UnlockCommercialFeatures'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
With OpenJDK 9.0.4:
$ lein run data/cas-register/bad/bad-analysis.edn
OpenJDK 64-Bit Server VM warning: Ignoring option UseFastAccessorMethods; support was removed in 9.0
Unrecognized VM option 'UnlockCommercialFeatures'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
With JDK 10.0.2:
$ lein run data/cas-register/bad/bad-analysis.edn
Unrecognized VM option 'UseParNewGC'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
With JDK 11.0.6:
$ lein run data/cas-register/bad/bad-analysis.edn
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Unrecognized VM option 'UseParNewGC'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Any instructions about that?
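For context, a likely cause (I haven't verified the repository's exact project.clj contents): the errors come from JVM flags that newer JDKs rejected. UnlockCommercialFeatures only ever existed on Oracle JDK, and UseParNewGC was removed in JDK 10, so trimming the option list to flags current JDKs accept should let the process start. A hypothetical fragment:

```clojure
;; Hypothetical project.clj fragment: keep only flags current JDKs accept.
;; -XX:+UnlockCommercialFeatures (Oracle-JDK-only) and -XX:+UseParNewGC
;; (removed in JDK 10) are dropped.
:jvm-opts ["-server" "-XX:+UseG1GC"]
```

Alternatively, overriding :jvm-opts via a Leiningen profile avoids editing the checked-in project.clj.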
As I understand it, the idea of the knossos.competition checker is to run the linear and wgl checkers in parallel. But linear/start-analysis actually blocks the calling thread on the knossos.model.memo/memo function call, which seems to do almost the whole analysis. As a result, mem-watch and reporter in knossos.search/run also don't work, because linear/start-analysis needs to complete before mem-watch and reporter are started.
I read:
See knossos.core for the linearizability checker, and knossos.redis for a model that can generate histories verifiable with Knossos.
I was confused by not finding knossos.redis or any variation of it, until I dug into the history and found 08166a4. What's an example model to get started with now?
Have you considered applying partial order reduction as a technique for reducing the runtime of knossos?
Basic idea: if two messages are concurrent (formally: if one message does not happen-before
the other), they must be independent, therefore you do not need to consider both interleavings of those messages.
In a distributed system, you wouldn't need to modify the systems under test to infer concurrent messages; you could infer concurrent messages by attaching vector clocks to events in the network. cf. Modist and shivector.
Sidenote: For a nice overview of state space reduction techniques, check out section 2.2 of this paper.
Hi there,
I am not sure if I should open this issue to this repo or the jepsen repo. I am giving a shot here and can port the issue to the jepsen repo if requested.
I have a very simple Jepsen test that takes 20 seconds and injects no faults into the environment. It is a mutex test for now, but I will use it for testing a re-entrant mutex in the future once I make this version work.
In my test, I have 5 clients. Each client acquires and releases the same distributed lock object repeatedly.
Here is my generator:
:generator (->> [{:type :invoke, :f :acquire}
                 ;(gen/sleep 1)
                 {:type :invoke, :f :release}
                 (gen/sleep 1)]
                cycle
                gen/seq
                gen/each
                (gen/stagger 1/10))
Here is my custom model:
(declare createReentrantMutex)

(defrecord ReentrantMutex [owner lockCount]
  Model
  (step [this op]
    (if (nil? (:value op))
      (knossos.model/inconsistent "no owner!")
      (condp = (:f op)
        :acquire (if (or (nil? owner) (= owner (:value op)))
                   (createReentrantMutex this (:value op) (inc lockCount) op)
                   (do
                     ;(info (str "cannot acquire owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot acquire")))
        :release (if (or (nil? owner) (not= owner (:value op)))
                   (do
                     ;(info (str "cannot release owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot release"))
                   (if (= lockCount 1)
                     (createReentrantMutex this nil 0 op)
                     (createReentrantMutex this owner (dec lockCount) op))))))

  Object
  (toString [this] (str "owner: " owner ", lockCount: " lockCount)))

(defn createEmptyReentrantMutex []
  (let [initial (ReentrantMutex. nil 0)
        ;_ (info (str "initial model: " (System/identityHashCode initial)))
        ]
    initial))

(defn createReentrantMutex [prev owner lockCount op]
  (let [newModel (ReentrantMutex. owner lockCount)
        ;_ (info (str "owner: " owner ", count: " lockCount ", op: " op " new hash: " (System/identityHashCode newModel) " prev owner: " (:owner prev) " prev lock count: " (:lockCount prev) " prev hash:" (System/identityHashCode prev)))
        ]
    newModel))
My model is the ReentrantMutex defrecord. Initially it is ReentrantMutex(nil, 0), meaning nobody holds the lock. For instance, if n1 acquires the lock, it becomes ReentrantMutex("n1", 1).
When I run this test for 20 seconds without introducing any failure in the system, I get a history of operations.
I am sharing my history files and the timeline file here.
history.txt
history.edn.txt
timeline.html.txt
Sorry for the file extensions. Please rename "history.edn.txt" to "history.edn" and "timeline.html.txt" to "timeline.html".
In short, the history of operations is this one: N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ => N4_REL => N1_ACQ => N1_REL => N5_ACQ => N5_REL => N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ => N4_REL => N1_ACQ => N1_REL => N5_ACQ => N5_REL => N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ
So the lock is held by N3, N2, N4, N1, N5, N3, N2, N4, N1, N5, N3, N2, N4
in order. You can verify this by checking the history files and the timeline file.
The thing is, when the analyzer starts running with my model, it never terminates (with the info logs above commented out). I cannot say I fully understand how it works internally, but I see that it walks through possible histories based on the concurrency of operations, each path failing at one of the knossos.model/inconsistent returns in my model. Moreover, when I run it with the logs printed, I see states like ReentrantMutex("n2", 555). Probably I am misunderstanding something, but I don't expect to see such states, because the acquire and release operations of the same client are sequential. Since the history is short, I expect the analyzer to complete the verification step quickly. Therefore, I suspect I am doing something wrong in my model, or using the tool in some wrong way.
I have tried many things, but couldn't make this model work. Could you help me with this issue? Any hint or pointer would be very helpful.
Thanks in advance.
I've run into an issue with the register model where Knossos says a certain history is linearizable when, in fact, it looks non-linearizable.
The history in question is shown in etcd_005.log. Under certain assumptions (clarified below), the history ceases to be linearizable at the following entry on line 131:
INFO jepsen.util - 16 :ok :read 4
In inspecting this history log, I made two assumptions. First, I assumed that failed CAS instructions have no effect on the state of the model. For example, at the beginning of the log, we have:
INFO jepsen.util - 1 :invoke :cas [0 1]
INFO jepsen.util - 1 :fail :cas [0 1]
I ignored these kinds of failed CAS instructions. Second, I ignored timed-out entries. For example, I ignored
INFO jepsen.util - 0 :invoke :write 4
INFO jepsen.util - 0 :info :write :timed-out
on line 83 and line 89, respectively.
Under those assumptions, the above history is non-linearizable. I wanted to double check if this is expected, or not. If my assumptions are too strong, I'd appreciate if you could clarify what Knossos assumes about failed CAS and timeouts.
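The second assumption is likely the crux: a timed-out (:info) operation may still have taken effect, so a checker cannot simply drop it. A toy brute-force register check illustrating why (my own sketch, which ignores real-time ordering for brevity; a real checker must also respect invoke/completion windows):

```python
from itertools import permutations, chain, combinations

def linearizable(history, init=0):
    """Brute-force register check over tiny histories. Every :ok op must
    appear in the linearization; each :info op may or may not have taken
    effect, so we try every subset of them. Op shape: (f, value, status)."""
    ok   = [op for op in history if op[2] == "ok"]
    info = [op for op in history if op[2] == "info"]
    subsets = chain.from_iterable(combinations(info, k) for k in range(len(info) + 1))
    for sub in subsets:
        for order in permutations(ok + list(sub)):
            v, good = init, True
            for f, val, _ in order:
                if f == "write":
                    v = val
                elif f == "read" and v != val:
                    good = False
                    break
            if good:
                return True
    return False

h = [("write", 4, "info"),   # timed out -- may or may not have happened
     ("read", 4, "ok")]      # this read only makes sense if the write did
print(linearizable(h))                    # True: the :info write explains the read
print(linearizable([("read", 4, "ok")]))  # False: nothing wrote 4
```

Under this reading, the reads of 0 that look wrong when timeouts are dropped can be explained by a timed-out write 0 that actually landed, which would be why Knossos accepts the history.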
Hi,
I'm using Knossos to verify linearizability on a history generated by a distributed system that is, for now, modelling a register. The checker is reporting that the history is not linearizable, but when I ask for a rendering of the problem sections, I get the error
Execution error (AssertionError) at knossos.linear.report/time-coords$fn (report.clj:184).
Assert failed: Expected index but got nil for op #knossos.op.Op{:process nil, :type nil, :f nil, :value nil, :index nil}
Given that I'm writing out a history "by hand", I'm assuming that I've done something wrong, but the current error being reported isn't helping me determine what exactly I'm doing wrong.
This is the history file I'm using, and this is what I'm doing to try and render it (which has worked with other broken histories in the past, so I'm pretty sure it's capable of working):
knossos.cli=> (require '[knossos.model :as model] '[knossos.competition :as competition])
nil
knossos.cli=> (require '[knossos.linear.report :as report])
nil
knossos.cli=> (def h (read-history "edn.txt"))
#'knossos.cli/h
knossos.cli=> (def a (competition/analysis (model/register) h))
Jun 25, 2020 5:51:51 PM clojure.tools.logging$eval1047$fn__1050 invoke
INFO: More than 1024 reachable models; not memoizing models for this search
#'knossos.cli/a
knossos.cli=> Jun 25, 2020 5:51:51 PM clojure.tools.logging$eval1047$fn__1050 invoke
INFO: More than 1024 reachable models; not memoizing models for this search
knossos.cli=> (report/render-analysis! h a "linear.svg")
Execution error (AssertionError) at knossos.linear.report/time-coords$fn (report.clj:185).
Assert failed: Expected index but got nil for op #knossos.op.Op{:process nil, :type nil, :f nil, :value nil, :index nil}
Any assistance appreciated.