jepsen-io / maelstrom
A workbench for writing toy implementations of distributed systems.
License: Eclipse Public License 1.0
I had a typo in my code that caused every operation in my txn_ok responses to use "append" as its function, even for reads. For example:
INFO [2021-04-13 23:32:29,761] jepsen worker 0 - jepsen.util 0 :invoke :txn [[:append 224 6] [:r 226 nil] [:r 227 nil] [:r 226 nil]]
INFO [2021-04-13 23:32:29,761] jepsen worker 0 - jepsen.util 0 :ok :txn [[:append 224 6] [:append 226 [1 2 3 4 5 6 7 8 9]] [:append 227 [1 2 3 4 5 6 8 9]] [:append 226 [1 2 3 4 5 6 7 8 9]]]
However, in this case maelstrom considers the run valid:
INFO [2021-04-13 23:32:29,998] jepsen worker 1 - jepsen.util 1 :invoke :txn [[:r 229 nil] [:r 229 nil]]
INFO [2021-04-13 23:32:29,998] jepsen worker 1 - jepsen.util 1 :ok :txn [[:append 229 [1 3 5 6]] [:append 229 [1 3 5 6]]]
INFO [2021-04-13 23:32:30,005] jepsen node n1 - maelstrom.db Tearing down n1
INFO [2021-04-13 23:32:30,005] jepsen node n0 - maelstrom.db Tearing down n0
...
:workload {:valid? true},
:valid? true}
Everything looks good! ヽ(‘ー`)ノ
I noticed this when I ran a multi-node test after finishing only the single-node part, and found it odd that the test passed. After I fixed the typo, maelstrom correctly reports that my single-node code fails the multi-node test.
Is this a deeper issue in Jepsen/Elle, or is it intended behaviour?
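Until the checker rejects such responses, a node author can guard against this class of typo locally. The following is a minimal sketch, not part of Maelstrom or its Go library: the `Op` type and the `mismatch` helper are hypothetical names, and the check only compares the shape of the response (operation count, function, key) with the request.

```go
package main

import "fmt"

// Op is a hypothetical, simplified view of one transaction micro-op:
// its function ("r" or "append") and the key it touches.
type Op struct {
	F   string
	Key int
}

// mismatch describes the first place where a txn_ok response fails to
// mirror the request, or returns "" when the shapes agree.
func mismatch(req, resp []Op) string {
	if len(req) != len(resp) {
		return fmt.Sprintf("request has %d ops, response has %d", len(req), len(resp))
	}
	for i := range req {
		if req[i].F != resp[i].F {
			return fmt.Sprintf("op %d: function %q became %q", i, req[i].F, resp[i].F)
		}
		if req[i].Key != resp[i].Key {
			return fmt.Sprintf("op %d: key %d became %d", i, req[i].Key, resp[i].Key)
		}
	}
	return ""
}

func main() {
	// The request and response shapes from the first log lines above.
	req := []Op{{"append", 224}, {"r", 226}, {"r", 227}, {"r", 226}}
	resp := []Op{{"append", 224}, {"append", 226}, {"append", 227}, {"append", 226}}
	fmt.Println(mismatch(req, resp))
}
```

Calling such a check just before sending a reply would have flagged the typo on the first transaction containing a read.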
I found a minor issue: when running maelstrom against a binary without explicitly setting the workload with the -w option, the test throws a NullPointerException.
The default workload is probably set here:
(def test-opt-spec
"Options for single tests."
[[nil "--bin FILE" "Path to binary which runs a node"
:missing "Expected a --bin PATH_TO_BINARY to test"]
["-w" "--workload NAME" "What workload to run."
:default "lin-kv"
:parse-fn keyword
:validate [workloads (cli/one-of workloads)]]
])
but it does not appear to set workload-name here:
workload-name (:workload opts)
workload ((workloads workload-name)
(assoc opts :nodes nodes, :net net))
Reproduction Steps:
Execute the command: ./maelstrom test --bin demobin
The error message, including the main details:
ERROR [2023-09-09 21:22:36,041] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.lang.NullPointerException: Cannot invoke "clojure.lang.IFn.invoke(Object)" because the return va at maelstrom.core$maelstrom_test.invokeStatic(core.clj:62)
at maelstrom.core$maelstrom_test.invoke(core.clj:53)
at jepsen.cli$single_test_cmd$fn__13951.invoke(cli.clj:396)
at jepsen.cli$run_BANG_.invokeStatic(cli.clj:329)
at jepsen.cli$run_BANG_.invoke(cli.clj:258)
at maelstrom.core$_main.invokeStatic(core.clj:269)
at maelstrom.core$_main.doInvoke(core.clj:267)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at maelstrom.core.main(Unknown Source)
Expected Behavior:
The default workload, "lin-kv", should be used when the -w option is not explicitly provided. When I run the test with an explicit -w lin-kv argument, it works, so the problem appears to lie either in the default value or in how it is handled.
We can either fix the default or remove lin-kv as the default. I am happy to contribute a fix through a pull request, based on the direction you would like to take on this issue.
If maelstrom gets a binary that ignores all the kafka commands, for example this modified demo/clojure/echo.clj, then maelstrom fails with a NullPointerException rather than a useful error.
For other workloads, e.g. broadcast or g-counter, maelstrom fails with timeout exceptions that are easier to understand.
Is this the correct way to fail for the kafka workload?
$ lein run -- test -w kafka --bin demo/clojure/lenient_echo.clj # see gist
[...]
INFO [2023-09-13 10:24:40,229] jepsen node n1 - maelstrom.net Shutting down Maelstrom network
INFO [2023-09-13 10:24:40,233] jepsen test runner - jepsen.core Analyzing...
INFO [2023-09-13 10:24:40,294] clojure-agent-send-off-pool-5 - jepsen.tests.kafka Wrote /Users/ndr/coding/maelstrom-src/maelstrom/store/kafka/20230913T102325.079+0100/consume-counts.edn
WARN [2023-09-13 10:24:40,295] clojure-agent-send-off-pool-5 - jepsen.checker Error while checking history:
java.lang.NullPointerException: Cannot invoke "java.lang.Number.doubleValue()" because "x" is null
at clojure.lang.Numbers.divide(Numbers.java:3899)
at jepsen.util$nanos__GT_secs.invokeStatic(util.clj:386)
at jepsen.util$nanos__GT_secs.invoke(util.clj:386)
at clojure.core$update.invokeStatic(core.clj:6231)
at clojure.core$update.invoke(core.clj:6223)
at jepsen.tests.kafka$checker$reify__19272.check(kafka.clj:2080)
at jepsen.checker$check_safe.invokeStatic(checker.clj:86)
at jepsen.checker$check_safe.invoke(checker.clj:79)
at jepsen.checker$compose$reify__10709$fn__10711.invoke(checker.clj:102)
at clojure.core$pmap$fn__8552$fn__8553.invoke(core.clj:7089)
at clojure.core$binding_conveyor_fn$fn__5823.invoke(core.clj:2047)
at clojure.lang.AFn.call(AFn.java:18)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1623)
INFO [2023-09-13 10:24:40,428] jepsen test runner - jepsen.core Analysis complete
INFO [2023-09-13 10:24:40,434] jepsen results - jepsen.store Wrote /Users/ndr/coding/maelstrom-src/maelstrom/store/kafka/20230913T102325.079+0100/results.edn
INFO [2023-09-13 10:24:40,456] jepsen test runner - jepsen.core {:perf {:latency-graph {:valid? true},
:rate-graph {:valid? true},
:valid? true},
:timeline {:valid? true},
:exceptions {:valid? true},
:stats {:valid? false,
:count 67,
:ok-count 0,
:fail-count 11,
:info-count 56,
:by-f {:assign {:valid? false,
:count 11,
:ok-count 0,
:fail-count 11,
:info-count 0},
:crash {:valid? false,
:count 7,
:ok-count 0,
:fail-count 0,
:info-count 7},
:poll {:valid? false,
:count 20,
:ok-count 0,
:fail-count 0,
:info-count 20},
:send {:valid? false,
:count 29,
:ok-count 0,
:fail-count 0,
:info-count 29}}},
:availability {:valid? true, :ok-fraction 0.0},
:net {:all {:send-count 70,
:recv-count 70,
:msg-count 70,
:msgs-per-op 1.0447761},
:clients {:send-count 70, :recv-count 70, :msg-count 70},
:servers {:send-count 0,
:recv-count 0,
:msg-count 0,
:msgs-per-op 0.0},
:valid? true},
:workload {:valid? :unknown,
:error "java.lang.NullPointerException: Cannot invoke \"java.lang.Number.doubleValue()\" because \"x\" is null\n at clojure.lang.Numbers.divide (Numbers.java:3899)\n jepsen.util$nanos__GT_secs.invokeStatic (util.clj:386)\n jepsen.util$nanos__GT_secs.invoke (util.clj:386)\n clojure.core$update.invokeStatic (core.clj:6231)\n clojure.core$update.invoke (core.clj:6223)\n jepsen.tests.kafka$checker$reify__19272.check (kafka.clj:2080)\n jepsen.checker$check_safe.invokeStatic (checker.clj:86)\n jepsen.checker$check_safe.invoke (checker.clj:79)\n jepsen.checker$compose$reify__10709$fn__10711.invoke (checker.clj:102)\n clojure.core$pmap$fn__8552$fn__8553.invoke (core.clj:7089)\n clojure.core$binding_conveyor_fn$fn__5823.invoke (core.clj:2047)\n clojure.lang.AFn.call (AFn.java:18)\n java.util.concurrent.FutureTask.run (FutureTask.java:317)\n java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1144)\n java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:642)\n java.lang.Thread.run (Thread.java:1623)\n"},
:valid? false}
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
I used the async RPC method in the broadcast challenge. One of the issues I faced was keeping track of the status of calls.
I wanted to maintain a list of messages I had sent. If a message had been delivered successfully, I would delete it from the list. If not, I would retry it.
Once I sent a message, I wanted to know whether I had received a response or the call had failed. The RPC method assigns an auto-incremented message ID but does not return it to the caller, even though the reply carries that ID. This made tracking painful.
Here is what I did:
Sample code:
// this function gives a stateful HandlerFunc
// which stores the msgId with it
//
// when this method is called, then it knows for which
// message it was called
rpcHandler := func(msgId string) maelstrom.HandlerFunc {
    return func(msg maelstrom.Message) error {
        // do something with the msgId
        // store.Remove(msgId)
        return nil
    }
}
// generate a msgId for this RPC call
msgId, _ := uuid.NewRandom()
// store this msgId in a map
// store.Save(msgId)
if err := n.RPC(nodId, msgBody, rpcHandler(msgId.String())); err != nil {
    return false
}
We can improve this a great deal by making the RPC method return the msgId of the message it sent. The change in the Go library is minimal, and user code would look something like the following:
// the handler function no longer needs to be stateful
someHandlerFunc := func(msg maelstrom.Message) error {
    var body map[string]any
    if err := json.Unmarshal(msg.Body, &body); err != nil {
        return err
    }
    // body: map[in_reply_to:1 type:broadcast_ok]
    if msgId, ok := body["in_reply_to"]; ok {
        // do something with the msgId
        // store.Remove(msgId)
    }
    return nil
}
// msgId is declared outside the if statement so it stays in scope below
msgId, err := n.RPC(nodId, msgBody, someHandlerFunc)
if err != nil {
    return false
}
// store this msgId in a map
// store.Save(msgId)
The change in the library:
- func (n *Node) RPC(dest string, body any, handler HandlerFunc) error { ... }
+ func (n *Node) RPC(dest string, body any, handler HandlerFunc) (int, error) { ... }
It is a breaking change, but the developer experience this change provides makes it worth it. Are you open to considering such a change?
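To make the bookkeeping side of this proposal concrete, here is a minimal, self-contained sketch of the store that the commented-out store.Save and store.Remove calls allude to. The `Pending` type and the `replyTo` helper are hypothetical names for illustration; nothing here is part of the Maelstrom Go library, and the ID passed to Save stands in for the value the proposed RPC signature would return.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// Pending tracks IDs of messages that were sent but not yet acknowledged.
type Pending struct {
	mu  sync.Mutex
	ids map[int]struct{}
}

func NewPending() *Pending { return &Pending{ids: make(map[int]struct{})} }

// Save records that a message with this ID is awaiting a reply.
func (p *Pending) Save(id int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.ids[id] = struct{}{}
}

// Remove forgets an ID once its reply has arrived.
func (p *Pending) Remove(id int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.ids, id)
}

// Outstanding returns how many messages still await a reply.
func (p *Pending) Outstanding() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.ids)
}

// replyTo extracts in_reply_to from a raw JSON message body.
func replyTo(body []byte) (int, error) {
	var b struct {
		InReplyTo *int `json:"in_reply_to"`
	}
	if err := json.Unmarshal(body, &b); err != nil {
		return 0, err
	}
	if b.InReplyTo == nil {
		return 0, errors.New("body has no in_reply_to field")
	}
	return *b.InReplyTo, nil
}

func main() {
	p := NewPending()
	p.Save(1) // stands in for the ID the proposed RPC would return

	id, err := replyTo([]byte(`{"type":"broadcast_ok","in_reply_to":1}`))
	if err != nil {
		fmt.Println("bad reply:", err)
		return
	}
	p.Remove(id)
	fmt.Println("outstanding:", p.Outstanding())
}
```

Decoding into a struct with an int field avoids the float64 that a map[string]any would yield for JSON numbers, so the ID can be used directly as a map key.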
Hi Kyle, I managed to use Jepsen to run some simple tests on my distributed system.
I was wondering whether it is possible to run maelstrom on my Jepsen test results to produce some of the visualisations described here: https://github.com/jepsen-io/maelstrom/blob/main/doc/results.md#common-files
Hi, I am trying to use your workbench for a small project and started by implementing the echo service.
I got an assertion error saying that a destination is invalid:
java.lang.AssertionError: Assert failed: Invalid dest for message {:dest "c0", :id 0, :body {:type "init_ok", :msg_id 1, :in_reply_to 1}, :src "n1"}
After several attempts I am stuck. In the response I use the src from the init message as the dest.
Here is a log from the run.
Initialisation:
2021-03-10 22:12:50,547{GMT} INFO [jepsen test runner] jepsen.core: Test version 8c8d9286e436a5ae9c23e6e8e490520582418823 (plus uncommitted changes)
2021-03-10 22:12:50,549{GMT} INFO [jepsen test runner] jepsen.core: Command line:
lein run test -w echo --bin testnode --nodes n1 --time-limit 5 --log-net-send --log-net-recv --log-stderr
2021-03-10 22:12:50,761{GMT} INFO [jepsen test runner] jepsen.core: Running test:
{:remote #jepsen.control.SSHRemote{:session nil}
:log-net-send true
:node-count nil
:max-txn-length 4
:concurrency 1
:db
#object[maelstrom.db$db$reify__16747
"0x4c683777"
"maelstrom.db$db$reify__16747@4c683777"]
:max-writes-per-key 16
:leave-db-running? false
:name "echo"
:logging-json? false
:net-journal #object[clojure.lang.Atom "0x28bf7bea" {:status :ready, :val []}]
:start-time
#object[org.joda.time.DateTime "0x44881fcc" "2021-03-10T22:12:50.000+01:00"]
:nemesis-interval 10
:net
#object[maelstrom.net$jepsen_adapter$reify__15560
"0xe7423bb"
"maelstrom.net$jepsen_adapter$reify__15560@e7423bb"]
:client
#object[maelstrom.workload.echo$client$reify__17341
"0x270663e2"
"maelstrom.workload.echo$client$reify__17341@270663e2"]
:barrier
#object[java.util.concurrent.CyclicBarrier
"0x3987bfc8"
"java.util.concurrent.CyclicBarrier@3987bfc8"]
:log-stderr true
:pure-generators true
:ssh {:dummy? true}
:rate 5
:checker
#object[jepsen.checker$compose$reify__8612
"0x31084f37"
"jepsen.checker$compose$reify__8612@31084f37"]
:argv
("test"
"-w"
"echo"
"--bin"
"testnode"
"--nodes"
"n1"
"--time-limit"
"5"
"--log-net-send"
"--log-net-recv"
"--log-stderr")
:nemesis
(jepsen.nemesis.ReflCompose
{:fm {:start-partition 0,
:stop-partition 0,
:kill 1,
:start 1,
:pause 1,
:resume 1},
:nemeses [#unprintable "jepsen.nemesis.combined$partition_nemesis$reify__17019@1ececb56"
#unprintable "jepsen.nemesis.combined$db_nemesis$reify__17000@a43a21e"]})
:active-histories
#object[clojure.lang.Atom "0xebe3561" {:status :ready, :val #{}}]
:nodes ["n1"]
:test-count 1
:latency {:mean 0, :dist :constant}
:bin "testnode"
:generator
(jepsen.generator.TimeLimit
{:limit 5000000000,
:cutoff nil,
:gen (jepsen.generator.Any
{:gens [(jepsen.generator.OnThreads {:f #{:nemesis}, :gen nil})
(jepsen.generator.OnThreads
{:f #object[clojure.core$complement$fn__5654
"0x3343997b"
"clojure.core$complement$fn__5654@3343997b"],
:gen (jepsen.generator.Stagger
{:dt 400000000,
:next-time nil,
:gen (jepsen.generator.EachThread
{:fresh-gen #object[maelstrom.workload.echo$workload$fn__17360
"0x14b3976"
"maelstrom.workload.echo$workload$fn__17360@14b3976"],
:gens {}})})})]})})
:log-net-recv true
:os
#object[jepsen.os$reify__2490 "0x79de9f90" "jepsen.os$reify__2490@79de9f90"]
:time-limit 5
:workload :echo
:consistency-models [:strict-serializable]
:topology :grid}
Log messages:
2021-03-10 22:12:50,800{GMT} INFO [jepsen test runner] jepsen.db: Tearing down DB
2021-03-10 22:12:50,813{GMT} INFO [jepsen test runner] jepsen.db: Setting up DB
2021-03-10 22:12:50,830{GMT} INFO [jepsen node n1] maelstrom.service: Starting services: (lin-kv lin-tso lww-kv seq-kv)
2021-03-10 22:12:50,837{GMT} INFO [jepsen node n1] maelstrom.db: Setting up n1
2021-03-10 22:12:50,840{GMT} INFO [jepsen node n1] maelstrom.process: launching testnode nil
2021-03-10 22:12:50,877{GMT} INFO [jepsen node n1] maelstrom.net: :send {:dest "n1", :body {:type "init", :node_id "n1", :node_ids ["n1"], :msg_id 1}, :src "c0", :id 0}
2021-03-10 22:12:50,884{GMT} INFO [n1 stdin] maelstrom.net: :recv {:dest "n1", :body {:type "init", :node_id "n1", :node_ids ["n1"], :msg_id 1}, :src "c0", :id 0}
2021-03-10 22:12:50,918{GMT} INFO [n1 stderr] maelstrom.process: 2021-03-10T22:12:50+0100 debug error : Get JSON {"dest":"n1","body":{"type":"init","node_id":"n1","node_ids":["n1"],"msg_id":1},"src":"c0","id":0}
2021-03-10 22:12:50,918{GMT} INFO [n1 stderr] maelstrom.process: 2021-03-10T22:12:50+0100 debug error : Response JSON {"dest":"c0","id":0,"body":{"type":"init_ok","msg_id":1,"in_reply_to":1},"src":"n1"}
2021-03-10 22:13:00,943{GMT} INFO [jepsen node n1] maelstrom.db: Tearing down n1
2021-03-10 22:13:00,970{GMT} WARN [n1 stdout] maelstrom.process: Error!
java.lang.AssertionError: Assert failed: Invalid dest for message {:dest "c0", :id 0, :body {:type "init_ok", :msg_id 1, :in_reply_to 1}, :src "n1"}
(get queues (:dest m))
Hello, I hope you are doing well. I just started the Gossip Glomers Echo Challenge, but I am running into an error.
I have installed Maelstrom and its prerequisites, but when I run the command ./maelstrom test -w echo --bin ~/go/bin/maelstrom-echo --node-count 1 --time-limit 10, I get the following error: Error: Unable to access jarfile lib/maelstrom.jar
I downloaded the echo.rb file and am running:
maelstrom test -w echo --bin echo.rb --time-limit 5
INFO [2023-02-23 18:21:56,703] jepsen node n1 - maelstrom.db Setting up n1
INFO [2023-02-23 18:21:56,703] jepsen node n1 - maelstrom.process launching echo.rb []
INFO [2023-02-23 18:21:57,715] jepsen node n1 - maelstrom.net Shutting down Maelstrom network
WARN [2023-02-23 18:21:57,716] jepsen test runner - jepsen.core Test crashed!
java.io.IOException: Cannot run program "/home/barry/programming/haskell/maelstromHs/echo.rb" (in directory "/tmp"): error=13, Permission denied
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
I also tried sudo ... and I also tried executing it directly from /tmp (same error).
I don't get permission denied from, e.g.,
cd /tmp
touch example.file
Not really sure what to do here.
I'm working through chapter 2 in the Guide, implementing the echo server in Go.
I'm getting an Invalid dest error when my echo server sends a reply to the init message.
I'm using the src value from the init message as the dest value in the reply, so this error is unexpected.
INFO [2022-09-14 10:55:14,768] jepsen node n1 - maelstrom.net Starting Maelstrom network
INFO [2022-09-14 10:55:14,769] jepsen test runner - jepsen.db Tearing down DB
INFO [2022-09-14 10:55:14,770] jepsen test runner - jepsen.db Setting up DB
INFO [2022-09-14 10:55:14,772] jepsen node n1 - maelstrom.service Starting services: (lin-kv lin-tso lww-kv seq-kv)
INFO [2022-09-14 10:55:14,774] jepsen node n1 - maelstrom.db Setting up n1
INFO [2022-09-14 10:55:14,775] jepsen node n1 - maelstrom.process launching ../chapter2-echo/cmd/echo/echo []
INFO [2022-09-14 10:55:14,794] n1 stderr - maelstrom.process Received init message from Maelstrom: {c0 n1 {1 init n1 [n1]}}
INFO [2022-09-14 10:55:14,795] n1 stderr - maelstrom.process Sending init message reply to Maelstrom: {n1 c0 {1 1 init_ok}}
INFO [2022-09-14 10:55:24,794] jepsen node n1 - maelstrom.db Tearing down n1
INFO [2022-09-14 10:55:24,795] n1 stderr - maelstrom.process Starting echo server with node id: n1 and message counter: 1
WARN [2022-09-14 10:55:24,808] n1 stdout - maelstrom.process Error!
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 1, :src "n1", :dest "c0", :body {:msg_id 1, :in_reply_to 1, :type "init_ok"}}
(get queues (:dest m))
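In both Invalid dest reports above, the node logs sending init_ok about ten seconds before the assertion fires at teardown, which suggests the reply reached Maelstrom late. One possible explanation, stated here as an assumption and not a diagnosis of either program, is a reply that was not written as a complete newline-terminated line or was held in a buffer. Maelstrom reads one JSON message per line from the node's stdout. A minimal sketch of writing such a line follows; the `Message` type and `encodeLine` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"os"
)

// Message is a hypothetical envelope type used only in this sketch.
type Message struct {
	Src  string         `json:"src"`
	Dest string         `json:"dest"`
	Body map[string]any `json:"body"`
}

// encodeLine returns msg as compact JSON followed by exactly one newline.
func encodeLine(msg Message) ([]byte, error) {
	b, err := json.Marshal(msg)
	if err != nil {
		return nil, err
	}
	return append(b, '\n'), nil
}

func main() {
	reply := Message{
		Src:  "n1",
		Dest: "c0",
		Body: map[string]any{"type": "init_ok", "in_reply_to": 1},
	}
	line, err := encodeLine(reply)
	if err != nil {
		os.Exit(1)
	}
	// os.Stdout is unbuffered in Go, so this write is handed to the OS
	// immediately. If stdout is wrapped in a bufio.Writer, Flush must be
	// called after each message.
	if _, err := os.Stdout.Write(line); err != nil {
		os.Exit(1)
	}
}
```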
Hi,
When working on the txn-rw-register workload I noticed that my responses to the txn operation did not conform to the protocol, because I was leaving out some of the transaction's operations. An example:
[2023-04-06T05:34:52.648Z]: "[recv][from:c6][type:txn][msg_id:1] {\"id\":5,\"src\":\"c6\",\"dest\":\"n0\",\"body\":{\"txn\":[[\"r\",8,null],[\"w\",8,1],[\"r\",9,null]],\"type\":\"txn\",\"msg_id\":1}}"
[2023-04-06T05:34:52.648Z]: "[send] {\"dest\":\"c6\",\"src\":\"n0\",\"body\":{\"msg_id\":1,\"type\":\"txn_ok\",\"txn\":[[\"r\",8,null]],\"in_reply_to\":1}}"
In the example above, I replied to message 1 with only the first "read" operation. However, maelstrom reported that all was good at the end of the execution. Is this a bug, or is the checker lenient?
Hello.
I'm attempting to vet my solution for Fly's distributed system challenge #5a but maelstrom seems stuck analyzing: it's been doing so for 45 minutes now and has produced 1.3GB of log output.
It's dumping what looks to be internal state relevant to plotting; you can see it at the tail end of the gist at https://gist.github.com/chronoslynx/6e8ac74b11159ab07d6ad7ca1304e5ef. The log was full of output like
INFO [2023-03-05 11:26:42,178] clojure-agent-send-off-pool-80 - jepsen.tests.kafka :row [:g [:title ok send by process 2
[[:send "9" [2 2]]]] [:text {:x 48, :y 14, :style background: #f4ce90;} 2]]
I've replicated the issue with:
I can reliably reproduce this issue with my code at https://github.com/chronoslynx/gossip-glomers/blob/main/kafka/main.go
I'm working through the Maelstrom Guide. For better or worse I decided to implement each exercise in Go rather than just use the Ruby code supplied in docs. I'm currently implementing the logic from Chapter 5 'Database As Value'.
I'm invoking maelstrom with these arguments:
./maelstrom test -w txn-list-append --bin "${GOPATH}/bin/datomic" --time-limit 10 --node-count 2 --rate 100 --log-stderr
I'm running into an error in results.edn that I'm not sure how to troubleshoot:
java.lang.IllegalArgumentException: No matching clause: [:ok :invoke]
My first thought was that this indicated an :invoke without a corresponding :ok. I found one missing :ok for each node, with a :net :timeout instead. That seems to match the behavior described in the Database As Value section.
Any suggestions for how to troubleshoot this?
{:perf {:latency-graph {:valid? true},
:rate-graph {:valid? true},
:valid? true},
:timeline {:valid? true},
:exceptions {:valid? true},
:stats {:valid? true,
:count 853,
:ok-count 851,
:fail-count 0,
:info-count 2,
:by-f {:txn {:valid? true,
:count 853,
:ok-count 851,
:fail-count 0,
:info-count 2}}},
:net {:all {:send-count 5120,
:recv-count 5120,
:msg-count 5120,
:msgs-per-op 6.0023446},
:clients {:send-count 1708, :recv-count 1708, :msg-count 1708},
:servers {:send-count 3412,
:recv-count 3412,
:msg-count 3412,
:msgs-per-op 4.0},
:valid? true},
:workload {:valid? :unknown,
:error "java.lang.IllegalArgumentException: No matching clause: [:ok :invoke]\n at elle.list_append$dirty_update_cases$fn__18117$fn__18122.invoke (list_append.clj:404)\n clojure.lang.PersistentVector.reduce (PersistentVector.java:343)\n clojure.core$reduce.invokeStatic (core.clj:6829)\n clojure.core$reduce.invoke (core.clj:6812)\n elle.list_append$dirty_update_cases$fn__18117.invoke (list_append.clj:397)\n clojure.core$map$fn__5884.invoke (core.clj:2759)\n clojure.lang.LazySeq.sval (LazySeq.java:42)\n clojure.lang.LazySeq.seq (LazySeq.java:51)\n clojure.lang.RT.seq (RT.java:535)\n clojure.core$seq__5419.invokeStatic (core.clj:139)\n clojure.core$apply.invokeStatic (core.clj:662)\n clojure.core$mapcat.invokeStatic (core.clj:2787)\n clojure.core$mapcat.doInvoke (core.clj:2787)\n clojure.lang.RestFn.invoke (RestFn.java:423)\n elle.list_append$dirty_update_cases.invokeStatic (list_append.clj:395)\n elle.list_append$dirty_update_cases.invoke (list_append.clj:386)\n elle.list_append$check.invokeStatic (list_append.clj:883)\n elle.list_append$check.invoke (list_append.clj:848)\n jepsen.tests.cycle.append$checker$reify__18396.check (append.clj:19)\n jepsen.checker$check_safe.invokeStatic (checker.clj:81)\n jepsen.checker$check_safe.invoke (checker.clj:74)\n jepsen.checker$compose$reify__9656$fn__9658.invoke (checker.clj:97)\n clojure.core$pmap$fn__8485$fn__8486.invoke (core.clj:7024)\n clojure.core$binding_conveyor_fn$fn__5772.invoke (core.clj:2034)\n clojure.lang.AFn.call (AFn.java:18)\n java.util.concurrent.FutureTask.run (FutureTask.java:264)\n java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)\n java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)\n java.lang.Thread.run (Thread.java:833)\n"},
:valid? :unknown}
I'm trying to package maelstrom for myself using Nix but using the packaged version always crashes with the following backtrace:
WARN [2023-04-14 10:39:06,433] jepsen test runner - jepsen.core Test crashed!
java.nio.file.NoSuchFileException: store/echo/20230414T103906.375+0200/test.jepsen
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
at jepsen.store.format$open.invokeStatic(format.clj:407)
at jepsen.store.format$open.invoke(format.clj:402)
at jepsen.core$run_BANG_.invokeStatic(core.clj:388)
at jepsen.core$run_BANG_.invoke(core.clj:318)
at jepsen.cli$single_test_cmd$fn__13951.invoke(cli.clj:396)
at jepsen.cli$run_BANG_.invokeStatic(cli.clj:329)
at jepsen.cli$run_BANG_.invoke(cli.clj:258)
at maelstrom.core$_main.invokeStatic(core.clj:269)
at maelstrom.core$_main.doInvoke(core.clj:267)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at maelstrom.core.main(Unknown Source)
ERROR [2023-04-14 10:39:06,434] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.nio.file.NoSuchFileException: store/echo/20230414T103906.375+0200/test.jepsen
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
at jepsen.store.format$open.invokeStatic(format.clj:407)
at jepsen.store.format$open.invoke(format.clj:402)
at jepsen.core$run_BANG_.invokeStatic(core.clj:388)
at jepsen.core$run_BANG_.invoke(core.clj:318)
at jepsen.cli$single_test_cmd$fn__13951.invoke(cli.clj:396)
at jepsen.cli$run_BANG_.invokeStatic(cli.clj:329)
at jepsen.cli$run_BANG_.invoke(cli.clj:258)
at maelstrom.core$_main.invokeStatic(core.clj:269)
at maelstrom.core$_main.doInvoke(core.clj:267)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at maelstrom.core.main(Unknown Source)
I can't figure out which file is actually missing: I can't find a reference to the given path anywhere, and strace on a successful run doesn't mention it either.
Packaging maelstrom creates /bin/maelstrom with the following content (nix store paths have been minimised to what matters):
#!/bin/bash
programDir="/lib"
cd "$programDir"
exec "/bin/java" -jar "$programDir/maelstrom.jar" "$@"
lib/maelstrom.jar is moved to /lib/maelstrom.jar.
Any ideas?
Hey! Sorry if I'm blind, but it looks like v0.2.1 doesn't have a pre-built jar, but the other versions do.
0.2.1: https://github.com/jepsen-io/maelstrom/releases/tag/0.2.1 (no maelstrom.tar.bz2, just the sources)
0.2.0: https://github.com/jepsen-io/maelstrom/releases/tag/0.2.0 (has a jar in maelstrom.tar.bz2)
Any chance you'd be willing to toss one up? I don't have a Clojure dev environment set up. Thanks!
Hi @aphyr. Someone recently posted a graph of messages against the seq-kv service where two separate identity CAS operations succeed:
n0 for value 121
n1 for value 119
Is this legal under sequential consistency? My understanding from the section in Services was that updates could only interact with past states if the total order is not violated.
tl;dr just wondering if this is a bug or if I'm misunderstanding something. Thanks!
On teardown, maelstrom raises an AssertionError when one of the nodes attempts to send to a node that has already been torn down.
It seems this assertion error shouldn't be raised or printed.
INFO [2023-04-13 01:55:14,033] jepsen node n2 - maelstrom.db Tearing down n2
WARN [2023-04-13 01:55:14,033] n4 stdout - maelstrom.process Error!
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 518901, :src "n4", :dest "n1", :body {:msg_id 108978, :type "broadcast", :message 36}}
Hi, I am trying to create 5K nodes:
./maelstrom test -w echo --bin maelstrom-echo --time-limit 300 --node-count 5000 --rate 5000 --concurrency 1n
I was running plain maelstrom-echo and hit resource limits. Is there an estimate of the resources needed to run this many node instances, and what configuration is required?
Here are the logs and information about the system.
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:798)
at dom_top.core$real_pmap_helper.invokeStatic(core.clj:186)
at dom_top.core$real_pmap_helper.invoke(core.clj:150)
at jepsen.util$real_pmap.invokeStatic(util.clj:76)
at jepsen.util$real_pmap.invoke(util.clj:69)
at jepsen.control$on_nodes.invokeStatic(control.clj:311)
at jepsen.control$on_nodes.invoke(control.clj:299)
at jepsen.control$on_nodes.invokeStatic(control.clj:304)
at jepsen.control$on_nodes.invoke(control.clj:299)
at jepsen.core$run_BANG_$fn__13071$fn__13074.invoke(core.clj:395)
at jepsen.core$run_BANG_$fn__13071.invoke(core.clj:394)
at jepsen.core$run_BANG_.invokeStatic(core.clj:392)
at jepsen.core$run_BANG_.invoke(core.clj:318)
at jepsen.cli$single_test_cmd$fn__13951.invoke(cli.clj:396)
at jepsen.cli$run_BANG_.invokeStatic(cli.clj:329)
at jepsen.cli$run_BANG_.invoke(cli.clj:258)
at maelstrom.core$_main.invokeStatic(core.clj:269)
at maelstrom.core$_main.doInvoke(core.clj:267)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at maelstrom.core.main(Unknown Source)
I have modified the maelstrom command as follows:
exec java -Xss256M -Xms54G -Xmx54G -XX:ThreadStackSize=136
System specs:
Ubuntu 20.04
RAM 60GB
Cores 12
Disk space 400G
/proc/sys/kernel# cat threads-max
470681
root@e2e-30-74:~# ulimit -aH
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 235340
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 235340
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Hello,
First off, thanks for all your work.
I'm working on Gossip Glomers, challenge 3c. The main change is that we add --nemesis partition to test network partitions. When I test it against my solution for 3b, which doesn't handle network partitions, I'm getting an Everything looks good! result.
That led me to think that perhaps --nemesis partition wasn't working correctly since, again, the test should have failed because my solution doesn't handle network partitions.
Is this because I'm populating msg_id in the reply body?
2023-01-10 10:06:45,020{GMT} INFO [n0 stderr] maelstrom.process: 2023/01/10 10:06:45 Failed to process txn request {c4 n0 {txn 35 [[append 6 5] [append 11 13] [append 5 2] [r 6 <nil>]]}}: [[append 6 5] [append 11 13] [append 5 2] [r 6 [1 2 3 4 5]]], RPC txn-conflict error (due to service response code 22)
2023-01-10 10:06:45,020{GMT} INFO [n0 stderr] maelstrom.process: 2023/01/10 10:06:45 Wrote error reply: {"src":"n0","dest":"c4","body":{"type":"error","msg_id":99,"in_reply_to":35,"code":30,"text":"CAS failed"}}
2023-01-10 10:06:45,022{GMT} INFO [jepsen worker 0] jepsen.util: 0 :info :txn [[:append 6 5] [:append 11 13] [:append 5 2] [:r 6 nil]] [:unknown "CAS failed"]
Maelstrom currently uses Jepsen version 0.3.1, but there have since been a number of performance and other improvements, up to version 0.3.4.
I have already tried this locally, and it seems to work without any noticeable problems.
Maelstrom does not detect G0 (write cycle) for me, as in the following history:
t3: ["r",0,1],["w",0,10], ["r",0,3],["w",0,11]
t2: ["w",0,1],["r",0,1], ["r",0,10],["w",0,2],["w",0,3],["r",0,3]
So t2 reads the value 3 at operation index 5 (0-indexed), just after writing 2 and 3 on top of t3's conflicting write of 10 (a ww conflict). This is a G0, and maelstrom ignored it.
$ go build && ~/.../maelstrom test -w txn-rw-register --bin ./maelstrom-txn --node-count 1 --time-limit 20 --rate 10 --concurrency 10n --consistency-models read-uncommitted --max-txn-length 20 --max-writes-per-key 10000000 --key-count 1
2023/03/29 12:11:11 Received {c3 n0 {"txn":[["w",0,1],["r",0,null],["r",0,null],["w",0,2],["w",0,3],["r",0,null],["w",0,4],["r",0,null],["r",0,null],["r",0,null],["w",0,5],["w",0,6],["r",0,null],["w",0,7],["w",0,8],["r",0,null],["w",0,9]],"type":"txn","msg_id":1}}
910670418 + 0 = 1 [#2] - write key 0, value 1, tx 2
991888336 - 0 = 0 [#2]
2023/03/29 12:11:11 Received {c4 n0 {"txn":[["r",0,null],["w",0,10],["r",0,null],["w",0,11]],"type":"txn","msg_id":1}}
997560527 - 0 = 0 [#3]
8713188 + 0 = 10 [#3] - write key 0, value 10, tx 3 (ww-edge)
17116025 - 0 = 0 [#2]
60262526 + 0 = 2 [#2] - write key 0, value 2, tx 2 (ww-edge - cycle with 3)
77438615 + 0 = 3 [#2] - write key 0, value 3, tx 2 (ww-edge - cycle with 3)
90639108 - 0 = 0 [#2] - read key 0, value 3, tx 2 (wr-edge - cycle with 3)
...
2023/03/29 12:11:12 Sent {"src":"n0","dest":"c4","body":{"in_reply_to":1,"txn":[["r",0,1],["w",0,10],["r",0,3],["w",0,11]],"type":"txn_ok"}}
2023/03/29 12:11:12 Sent {"src":"n0","dest":"c3","body":{"in_reply_to":1,"txn":[["w",0,1],["r",0,1],["r",0,10],["w",0,2],["w",0,3],["r",0,3],["w",0,4],["r",0,11],["r",0,17],["r",0,19],["w",0,5],["w",0,6],["r",0,14],["w",0,7],["w",0,8],["r",0,34],["w",0,9]],"type":"txn_ok"}}
Here we have many cycles between tx 2 and tx 3: w2(1), w3(10), w2(2), w2(3). Maelstrom could detect the G0 from the received responses alone, given that (1) every written value is unique and (2) all writes in those two transactions are covered by reads.
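The inference described above can be sketched in a few lines of Python. This is only an illustration, not how Maelstrom or Elle actually checks histories: it assumes that written values are unique per key and that each transaction observes versions of a key in version order, and the function names (`writers`, `ww_edges`, `write_cycles`) are made up for this example.

```python
def writers(history):
    """Map (key, value) -> name of the transaction that wrote it."""
    out = {}
    for name, txn in history.items():
        for op, key, value in txn:
            if op == "w":
                out[(key, value)] = name
    return out


def ww_edges(history):
    """Infer write-write edges from the order in which each txn saw versions."""
    w = writers(history)
    edges = set()
    for txn in history.values():
        last = {}  # key -> most recent value this txn read or wrote
        for _op, key, value in txn:
            if value is None:
                continue  # read that returned nothing
            prev = last.get(key)
            if prev is not None and prev != value:
                a, b = w.get((key, prev)), w.get((key, value))
                if a and b and a != b:
                    edges.add((a, b))  # a's write precedes b's write
            last[key] = value
    return edges


def write_cycles(edges):
    """Pairs of transactions that overwrite each other in both directions."""
    return {frozenset(e) for e in edges if (e[1], e[0]) in edges}


# The two completed transactions from the report above.
history = {
    "t3": [["r", 0, 1], ["w", 0, 10], ["r", 0, 3], ["w", 0, 11]],
    "t2": [["w", 0, 1], ["r", 0, 1], ["r", 0, 10],
           ["w", 0, 2], ["w", 0, 3], ["r", 0, 3]],
}
cycles = write_cycles(ww_edges(history))
print(cycles)
```

Under those assumptions, t2 and t3 end up with ww edges in both directions, which is the cycle described in the report.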
This concerns https://github.com/jepsen-io/maelstrom/blob/main/doc/05-datomic/02-shared-state.md#database-as-value specifically, and I'm mostly wondering whether it should be mentioned in the tutorial (you might also encounter this issue when running trainings :D).
I'm using Rust, and since the built-in HashMap type can reorder keys as it grows, it can cause lots of failures when used blindly in CAS.
For example, more than half of my transactions failed even with very few operations:
:stats {:valid? true,
:count 22,
:ok-count 7,
:fail-count 15,
For longer runs it's even worse so I didn't bother to finish any. Looking at the message diagram points me to believe this is due to ordering (of HashMap keys):
Luckily there's also a built-in type which is sorted by its keys and it solved the issue for me, so I would get results like this: (same with yet another type)
:stats {:valid? true,
:count 2045,
:ok-count 1924,
:fail-count 121,
Which is more in-line with the results in the tutorial.
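The ordering problem can be shown in Python, where dicts keep insertion order and so stand in for a map whose iteration order is not tied to its contents. This is a sketch under an assumption: that the database value is sent to the KV service as a list of [key, value] pairs, so two equal maps with different iteration orders look like different values to CAS. `as_pairs` is a hypothetical helper name.

```python
def as_pairs(state):
    """Serialize a map as [key, value] pairs in sorted key order."""
    return [[key, value] for key, value in sorted(state.items())]


# Same contents, different insertion (iteration) order.
a = {2: [1], 1: [3]}
b = {1: [3], 2: [1]}

unordered_a = [[k, v] for k, v in a.items()]
unordered_b = [[k, v] for k, v in b.items()]

print(unordered_a == unordered_b)    # pairs differ when order is not fixed
print(as_pairs(a) == as_pairs(b))    # sorted pairs agree
```

Sorting before serializing plays the same role as switching to a map type that is ordered by its keys.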
I'm experimenting with partitions using the Raft protocol. Can anyone help me induce partial/simplex faults, where communication is possible in only one direction?
I see the option --nemesis FAULTS #{}, a comma-separated list of faults to inject.
How is this option used, and what should I put in the comma-separated list? Usually "--nemesis partition" is provided to induce faults.
Hi there! Thanks so much for making Maelstrom. I just ran a small workshop at a systems reading group that I organize at Harvard, and we all worked on the Gossip Glomers challenges together on a shared remote computer. It was a lot of fun :)
I wrote a Python client library for Maelstrom (via asyncio) for the purposes of the workshop, since I did a quick poll, and most of the members of our reading group didn't know Go or JavaScript but were still interested in learning about distributed systems. I thought it might be helpful for others who are more familiar with Python as well. Would you be open to a pull request where I add this library to the demo/ folder for others to use?
https://gist.github.com/ekzhang/65310f0260c03d4ec340e0daf0ccb9d5#file-_maelstrom-py
It seems that nodes are sent a SIGKILL right away at termination (via Process.destroyForcefully). This makes it a little tricky to use things like in-process sampling profilers, because the process is killed before the profile can be written.
It would be nicer if the node had its stdin closed, or were sent a SIGINT, and were then given a few seconds to quit before being forcibly killed.
It is pretty easy for a Maelstrom implementation in any language to watch for stdin closure and break out of Run().
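A node-side sketch of the proposed behaviour, in Python: the main loop ends when stdin reaches EOF, and a shutdown hook (for example, writing out a profile) runs before the process exits. This only illustrates the node's half of the proposal; `run`, `handle` and `on_shutdown` are hypothetical names, not part of any Maelstrom library.

```python
import io
import json


def run(stdin, handle, on_shutdown):
    """Process one JSON message per line until stdin is closed, then clean up."""
    for line in stdin:
        line = line.strip()
        if line:
            handle(json.loads(line))
    # Reaching this point means stdin hit EOF: flush profiles, logs, etc.
    on_shutdown()


# Demonstration with an in-memory stream standing in for stdin.
seen = []
flushed = []
run(io.StringIO('{"body": {"type": "echo"}}\n'),
    seen.append,
    lambda: flushed.append(True))
```

In a real node the call would be `run(sys.stdin, handler, write_profile)`, with `handler` and `write_profile` supplied by the implementation.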
First of all,
The docs and this project are a marvel of engineering. I finished reading the docs this weekend;
I will give it some time and read them again. They read like a novel, and I wish all full-blown
books on distributed systems were written like this. Good job, sir!
I was wondering, what is the main usage of this? Do you expect it to be used by those who write
and test replicated state machines with this text-based protocol? Or is there a way that I can wire up a
DB client and test its advertised isolation levels?
Appreciate your hard work,
Good luck!
The docs seem to suggest Java 1.8 is supported, but I get errors when attempting to use the pre-built binary with openjdk 1.8.0_362. This stackoverflow post suggests that perhaps the true minimum Java version is Java 11?
$ java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-b08)
OpenJDK 64-Bit Server VM (build 25.362-b08, mixed mode)
$ ./maelstrom
Exception in thread "main" java.lang.UnsupportedClassVersionError: ch/qos/logback/classic/spi/LogbackServiceProvider has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
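The class file versions in that message map directly to Java releases: for Java 5 and later, the release number is the class file major version minus 44, so 52 is Java 8 and 55 is Java 11. A small Python sketch that decodes the error text; the function names are made up for this example.

```python
import re


def java_release(class_major):
    """Java release for a class file major version (valid for Java 5 and later)."""
    if class_major < 49:
        raise ValueError("formula only holds for class file version 49 and above")
    return class_major - 44


def decode_error(text):
    """Return (release the class was compiled for, newest release the runtime accepts)."""
    majors = [int(m) for m in
              re.findall(r"class file versions?(?: up to)? (\d+)\.\d+", text)]
    compiled, supported = majors
    return java_release(compiled), java_release(supported)


error = ("has been compiled by a more recent version of the Java Runtime "
         "(class file version 55.0), this version of the Java Runtime only "
         "recognizes class file versions up to 52.0")
needed, available = decode_error(error)
print(needed, available)
```

For the error above this yields 11 and 8, consistent with the guess that the pre-built binary needs at least Java 11.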
Run Command: ./maelstrom test -w echo --bin demo/js/echo.js --time-limit 5
Error:
WARN [2023-12-05 15:35:38,456] jepsen test runner - jepsen.core Test crashed!
java.io.IOException: Cannot run program "/Users/user1/Workspace/maelstrom/demo/js/echo.js" (in directory "/var/folders/x5/jmppkfcx71zd921s1dtytdyw0000gn/T"): error=2, No such file or directory
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
at maelstrom.process$start_node_BANG_.invokeStatic(process.clj:199)
at maelstrom.process$start_node_BANG_.invoke(process.clj:168)
at maelstrom.db$db$reify__16142.setup_BANG_(db.clj:34)
at jepsen.db$fn__8729$G__8723__8733.invoke(db.clj:12)
at jepsen.db$fn__8729$G__8722__8738.invoke(db.clj:12)
at clojure.core$partial$fn__5908.invoke(core.clj:2642)
at jepsen.control$on_nodes$fn__8599.invoke(control.clj:314)
at clojure.lang.AFn.applyToHelper(AFn.java:154)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invokeStatic(core.clj:671)
at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at dom_top.core$real_pmap_helper$build_thread__211$fn__212.invoke(core.clj:163)
at clojure.lang.AFn.applyToHelper(AFn.java:152)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
at clojure.lang.RestFn.invoke(RestFn.java:425)
at clojure.lang.AFn.applyToHelper(AFn.java:156)
at clojure.lang.RestFn.applyTo(RestFn.java:132)
at clojure.core$apply.invokeStatic(core.clj:671)
at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
at clojure.lang.RestFn.invoke(RestFn.java:397)
at clojure.lang.AFn.run(AFn.java:22)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.IOException: error=2, No such file or directory
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
... 31 common frames omitted
ERROR [2023-12-05 15:35:38,461] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.io.IOException: Cannot run program "/Users/user1/Workspace/maelstrom/demo/js/echo.js" (in directory "/var/folders/x5/jmppkfcx71zd921s1dtytdyw0000gn/T"): error=2, No such file or directory
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
at maelstrom.process$start_node_BANG_.invokeStatic(process.clj:199)
at maelstrom.process$start_node_BANG_.invoke(process.clj:168)
at maelstrom.db$db$reify__16142.setup_BANG_(db.clj:34)
at jepsen.db$fn__8729$G__8723__8733.invoke(db.clj:12)
at jepsen.db$fn__8729$G__8722__8738.invoke(db.clj:12)
at clojure.core$partial$fn__5908.invoke(core.clj:2642)
at jepsen.control$on_nodes$fn__8599.invoke(control.clj:314)
at clojure.lang.AFn.applyToHelper(AFn.java:154)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invokeStatic(core.clj:671)
at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at dom_top.core$real_pmap_helper$build_thread__211$fn__212.invoke(core.clj:163)
at clojure.lang.AFn.applyToHelper(AFn.java:152)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
at clojure.lang.RestFn.invoke(RestFn.java:425)
at clojure.lang.AFn.applyToHelper(AFn.java:156)
at clojure.lang.RestFn.applyTo(RestFn.java:132)
at clojure.core$apply.invokeStatic(core.clj:671)
at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
at clojure.lang.RestFn.invoke(RestFn.java:397)
at clojure.lang.AFn.run(AFn.java:22)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.IOException: error=2, No such file or directory
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
... 31 common frames omitted
I tried the "echo" challenge from the fly-io site. It's a good idea to have a simple echo exercise to sort out the basic setup.
So one program I tried was the following Racket code:
#lang racket/base
(require json)
(require racket/match)
(require racket/port)
(define (main)
  (for ([h (in-producer read-json eof)])
    (match h
      [(hash-table ('dest dest)
                   ('src src)
                   ('body body)
                   (k v) ...)
       (match body
         [(hash-table ('type "echo")
                      ('msg_id msg_id)
                      (bk bv))
          (write-json
           (hash
            'src dest
            'dest src
            'body (hash 'type "echo_ok"
                        'msg_id msg_id
                        'in_reply_to msg_id
                        bk bv)))]
         [(hash-table ('type "init")
                      ('msg_id msg_id)
                      (bk bv) ...)
          (write-json
           (hash 'src dest
                 'dest src
                 'body
                 (hash 'type "init_ok"
                       'in_reply_to msg_id)))]
         [_
          #f])]
      [_
       #f])
    (flush-output)))
(module+ main
  (main))
The above code results in the following error:
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 1, :src "n0", :dest "c0", :body {:in_reply_to 1, :type "init_ok"}}
I sorta guessed by reading the spec that making my code emit a newline after each message would follow the stated spec better and might just work. And it did! Simply adding (displayln "") before (flush-output) in the above code makes it work and pass the tests.
The way to run the above code is, assuming Racket 8.8 is installed, to generate an executable using raco exe echo.rkt, then run the resulting echo binary with maelstrom.
PS I don't feel too strongly about this thing and wouldn't mind if the issue is closed as unimportant
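For comparison, here is a minimal Python sketch of the framing that made the Racket version pass: every outgoing message is a single JSON object followed by a newline. The `encode` and `reply_to` helpers are hypothetical names, and only the init and echo message types from the code above are handled.

```python
import json


def encode(message):
    """Serialize one message as a single line of JSON terminated by a newline."""
    return json.dumps(message) + "\n"


def reply_to(request):
    """Build the reply for an init or echo request, or None for anything else."""
    body = request["body"]
    if body["type"] == "init":
        reply = {"type": "init_ok"}
    elif body["type"] == "echo":
        reply = {"type": "echo_ok", "echo": body["echo"]}
    else:
        return None
    reply["in_reply_to"] = body["msg_id"]
    return {"src": request["dest"], "dest": request["src"], "body": reply}


requests = [
    {"src": "c0", "dest": "n0", "body": {"type": "init", "msg_id": 1}},
    {"src": "c1", "dest": "n0", "body": {"type": "echo", "msg_id": 2, "echo": "hi"}},
]
output = "".join(encode(reply_to(r)) for r in requests)
print(output, end="")
```

A real node would write each encoded line to stdout and flush after every message.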
Hello 👋 ,
I'm checking out the Fly distributed systems challenges, and for the "Echo" demo, after compiling the Go code and trying to run the maelstrom tests, I get this error:
[...]
INFO [2023-02-25 14:58:30,921] jepsen node n0 - maelstrom.net Starting Maelstrom network
INFO [2023-02-25 14:58:30,922] jepsen test runner - jepsen.db Tearing down DB
INFO [2023-02-25 14:58:30,923] jepsen test runner - jepsen.db Setting up DB
INFO [2023-02-25 14:58:30,925] jepsen node n0 - maelstrom.service Starting services: (lin-kv lin-tso lww-kv seq-kv)
INFO [2023-02-25 14:58:30,925] jepsen node n0 - maelstrom.db Setting up n0
INFO [2023-02-25 14:58:30,926] jepsen node n0 - maelstrom.process launching /home/johndoe/path/to/fly-dist-sys/1_echo []
INFO [2023-02-25 14:58:31,930] jepsen node n0 - maelstrom.net Shutting down Maelstrom network
WARN [2023-02-25 14:58:31,933] jepsen test runner - jepsen.core Test crashed!
java.io.IOException: Cannot run program "/home/johndoe/path/to/fly-dist-sys/1_echo" (in directory "/tmp"): error=13, Permission denied
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
at maelstrom.process$start_node_BANG_.invokeStatic(process.clj:199)
at maelstrom.process$start_node_BANG_.invoke(process.clj:168)
at maelstrom.db$db$reify__16142.setup_BANG_(db.clj:34)
at jepsen.db$fn__8729$G__8723__8733.invoke(db.clj:12)
at jepsen.db$fn__8729$G__8722__8738.invoke(db.clj:12)
at clojure.core$partial$fn__5908.invoke(core.clj:2642)
at jepsen.control$on_nodes$fn__8599.invoke(control.clj:314)
at clojure.lang.AFn.applyToHelper(AFn.java:154)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invokeStatic(core.clj:671)
at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at dom_top.core$real_pmap_helper$build_thread__211$fn__212.invoke(core.clj:163)
at clojure.lang.AFn.applyToHelper(AFn.java:152)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
at clojure.lang.RestFn.invoke(RestFn.java:425)
at clojure.lang.AFn.applyToHelper(AFn.java:156)
at clojure.lang.RestFn.applyTo(RestFn.java:132)
at clojure.core$apply.invokeStatic(core.clj:671)
at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
at clojure.lang.RestFn.invoke(RestFn.java:397)
at clojure.lang.AFn.run(AFn.java:22)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.IOException: error=13, Permission denied
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
... 31 common frames omitted
ERROR [2023-02-25 14:58:31,938] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.io.IOException: Cannot run program "/home/johndoe/path/to/fly-dist-sys/1_echo" (in directory "/tmp"): error=13, Permission denied
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
at maelstrom.process$start_node_BANG_.invokeStatic(process.clj:199)
at maelstrom.process$start_node_BANG_.invoke(process.clj:168)
at maelstrom.db$db$reify__16142.setup_BANG_(db.clj:34)
at jepsen.db$fn__8729$G__8723__8733.invoke(db.clj:12)
at jepsen.db$fn__8729$G__8722__8738.invoke(db.clj:12)
at clojure.core$partial$fn__5908.invoke(core.clj:2642)
at jepsen.control$on_nodes$fn__8599.invoke(control.clj:314)
at clojure.lang.AFn.applyToHelper(AFn.java:154)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invokeStatic(core.clj:671)
at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at dom_top.core$real_pmap_helper$build_thread__211$fn__212.invoke(core.clj:163)
at clojure.lang.AFn.applyToHelper(AFn.java:152)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
at clojure.lang.RestFn.invoke(RestFn.java:425)
at clojure.lang.AFn.applyToHelper(AFn.java:156)
at clojure.lang.RestFn.applyTo(RestFn.java:132)
at clojure.core$apply.invokeStatic(core.clj:671)
at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
at clojure.lang.RestFn.invoke(RestFn.java:397)
at clojure.lang.AFn.run(AFn.java:22)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.IOException: error=13, Permission denied
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
... 31 common frames omitted
The compiled Go binary, as well as maelstrom, belongs to the johndoe user and group.
I'm running maelstrom like this: ./maelstrom test -w echo --bin ~/path/to/fly-dist-sys/1_echo --node-count 1 --time-limit 10
I'm on Linux (Fedora).
Am I doing anything wrong? Should permissions be set differently?
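One quick thing to rule out is the file itself. The Python sketch below checks two common causes of these launch errors: the path does not point at a regular file (compare error=2), or the file has no execute bit set (compare error=13). It is only a partial diagnostic and does not prove either is the cause here; other causes, such as a filesystem mounted noexec or a missing script interpreter, are not covered. `launch_problems` is a made-up helper name.

```python
import os
import stat
import tempfile


def launch_problems(path):
    """List obvious reasons why the OS might refuse to execute `path`."""
    if not os.path.isfile(path):
        return ["not a regular file, or it does not exist"]
    problems = []
    mode = os.stat(path).st_mode
    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        problems.append("no execute bit set; try chmod +x")
    return problems


# Demonstration on a temporary file standing in for the node binary.
with tempfile.TemporaryDirectory() as tmp:
    binary = os.path.join(tmp, "1_echo")
    with open(binary, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(binary, 0o644)
    before = launch_problems(binary)   # execute bit missing
    os.chmod(binary, 0o755)
    after = launch_problems(binary)    # execute bit present
    missing = launch_problems(os.path.join(tmp, "nope"))
```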
I was interested in modelling Apache BookKeeper, to compare the experience with my modelling of it in TLA+. One slight difficulty with BK is that there are 3 node types: metadata_store (zk), bookie (storage node) and bk_client (serving layer where all the consensus logic is). The bk_clients are the only externally visible nodes of the protocol; all requests go to the bk_clients, which in turn interact with the metadata_store and bookies.
Currently all nodes are treated the same by Maelstrom, meaning I will need to use proxying of messages, and use node ID to determine node type. When a node initializes I could ensure that the first is a metadata_store, the second two are bk_clients and the rest are bookies. Metadata_store and bookie nodes will need to proxy maelstrom messages to bk_client nodes. But this would mean adding some communication between nodes that does not exist in the real protocol, so each node would always be able to forward messages to the right node.
An alternative is to allow node specialization, or at least to mark which nodes form the public API surface of the protocol, in order to avoid this proxying. For my purposes, simply restricting the public API calls to particular nodes would be enough. Using node ID to determine node type is not an issue for me.
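The ID-based role assignment described above (first node is the metadata_store, the next two are bk_clients, the rest are bookies) could look roughly like this in Python. It is a sketch, assuming node IDs of the form "n0", "n1", and so on, as in the logs elsewhere on this page; `role_for` is a hypothetical helper.

```python
def role_for(node_id, node_ids):
    """Derive a node's role from its position among all node IDs."""
    ordered = sorted(node_ids, key=lambda n: int(n[1:]))  # "n10" sorts after "n2"
    index = ordered.index(node_id)
    if index == 0:
        return "metadata_store"
    if index <= 2:
        return "bk_client"
    return "bookie"


cluster = ["n0", "n1", "n2", "n3", "n4"]
roles = {node: role_for(node, cluster) for node in cluster}
print(roles)
```

Each node can compute this locally from the node list it receives at initialization, so no extra coordination is needed to agree on roles.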
I'm getting this error when trying to run the JAR from releases:
Exception in thread "main" java.lang.ExceptionInInitializerError
[...] here long clj stacktrace [...]
Caused by: java.lang.ClassNotFoundException: javax.xml.bind.DatatypeConverter
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at clojure.lang.RT.classForName(RT.java:2168)
at clojure.lang.RT.classForNameNonLoading(RT.java:2181)
at org.httpkit.server$loading__5569__auto____6381.invoke(server.clj:1)
at org.httpkit.server__init.load(Unknown Source)
at org.httpkit.server__init.<clinit>(Unknown Source)
... 98 more
According to SO the problem is removal of this library in new Java versions: In Java 11 they are completely removed from the JDK.
I'm doing the gossip glomers g-counter challenge at the moment in which you implement g-counter using seq-kv. Below is a spoiler of my solution, so just a warning for anyone who doesn't wish to be spoiled.
The solution I went for is that each node maintains its own sum under its own key in seq-kv, that means that if we have 3 nodes, then we will have 3 keys in seq-kv each storing the sum of its own node. On a background thread, each node repeatedly issues read requests for all 3 keys in seq-kv, and so my thinking was that after enough reads they should eventually get the latest value. This approach is also mentioned as one that should work in the other issue here.
What I am observing is that the most recent write or cas operation that occurs against any key in the seq-kv will never be read by other nodes in the system. I have attached at the bottom of the issue a log of the maelstrom output with --log-net-send turned on to show proof of this happening. If there is a better format to debug this data that you want me to upload, let me know. In this log I ran the g-counter workload against 2 nodes with a rate of 20 and a time limit of 1. The most recent cas/write message that occurs is: {:id 90, :src "n0", :dest "seq-kv", :body {:type "cas", :msg_id 28, :key "n0", :from 10, :to 14}}. After this, n1 sends a read for key "n0" 19 times, all of which return a value of 10 despite the fact that the cas sent from n0 changed it to 14.
I know according to the definition of sequential consistency that this is an allowed outcome, but I would have expected that eventually it would return the latest value. Sending noop cas's (e.g. from current value to current value) from all the nodes doesn't seem to fix this either. The solution I ended up getting to work was to set up another background thread on each node that just writes a random number to an unrelated key in seq-kv which seems to cause the reads to eventually return the latest values for each node.