maelstrom's People

Contributors

aalekhpatel07, aphyr, aroussel-data, avinassh, benbjohnson, bgreenlee, bilalsaad, dependabot[bot], drpacman, dvguruprasad, ekzhang, igorperikov, jemc, llimllib, metanivek, mkcp, nezteb, nrkirby, nuno-faria, philippgille, pjambet, rhishikesh-helpshift, rhishikeshj, simbo1905, sitano, stevexuereb, thelortex, viktaur, vjuranek, ziyaddin

maelstrom's Issues

`txn-list-append` workload fails to detect invalid return for read operations

I had a typo in my code that caused every operation in a txn_ok response to be reported as "append", including reads. For example:

INFO [2021-04-13 23:32:29,761] jepsen worker 0 - jepsen.util 0	:invoke	:txn	[[:append 224 6] [:r 226 nil] [:r 227 nil] [:r 226 nil]]
INFO [2021-04-13 23:32:29,761] jepsen worker 0 - jepsen.util 0	:ok	:txn	[[:append 224 6] [:append 226 [1 2 3 4 5 6 7 8 9]] [:append 227 [1 2 3 4 5 6 8 9]] [:append 226 [1 2 3 4 5 6 7 8 9]]]

However, maelstrom still considers the run valid:

INFO [2021-04-13 23:32:29,998] jepsen worker 1 - jepsen.util 1	:invoke	:txn	[[:r 229 nil] [:r 229 nil]]
INFO [2021-04-13 23:32:29,998] jepsen worker 1 - jepsen.util 1	:ok	:txn	[[:append 229 [1 3 5 6]] [:append 229 [1 3 5 6]]]
INFO [2021-04-13 23:32:30,005] jepsen node n1 - maelstrom.db Tearing down n1
INFO [2021-04-13 23:32:30,005] jepsen node n0 - maelstrom.db Tearing down n0
...
 :workload {:valid? true},
 :valid? true}


Everything looks good! ヽ(‘ー`)ノ

I noticed this when I ran a multi-node test after finishing only the single-node part, and found it odd that the test passed. With the typo fixed, maelstrom correctly reports that my single-node code fails the multi-node test.

Is this a deeper issue in Jepsen/Elle, or is it intended behaviour?
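On the node side, one way to rule out this class of typo is to copy the function name of each completed operation from the request instead of hard-coding it. The following is a minimal single-node sketch under that assumption; the `Op` type and `applyTxn` function are hypothetical names, not part of Maelstrom or its Go library, and the sketch says nothing about what the checker itself should accept.

```go
package main

import "fmt"

// Op is one micro-operation of a transaction: a function name ("r" or
// "append"), a key, and a value. The type and its fields are hypothetical.
type Op struct {
	F     string
	Key   int
	Value any
}

// applyTxn executes ops against an in-memory store and returns the completed
// ops. Each completed op copies F from its request op, so a read can never be
// reported back as an append.
func applyTxn(store map[int][]int, ops []Op) ([]Op, error) {
	out := make([]Op, 0, len(ops))
	for _, op := range ops {
		switch op.F {
		case "r":
			// Copy the list so that later appends do not mutate the reply.
			vals := append([]int(nil), store[op.Key]...)
			out = append(out, Op{F: op.F, Key: op.Key, Value: vals})
		case "append":
			v, ok := op.Value.(int)
			if !ok {
				return nil, fmt.Errorf("append to key %d: value %v is not an int", op.Key, op.Value)
			}
			store[op.Key] = append(store[op.Key], v)
			out = append(out, Op{F: op.F, Key: op.Key, Value: v})
		default:
			return nil, fmt.Errorf("unknown op %q", op.F)
		}
	}
	return out, nil
}

func main() {
	store := map[int][]int{}
	res, err := applyTxn(store, []Op{{"append", 224, 6}, {"r", 224, nil}})
	if err != nil {
		panic(err)
	}
	fmt.Println(res)
}
```

Values decoded from JSON into `any` arrive as `float64` in Go, so a real node would need to convert them before calling a function like this.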

Default workload name not being set when not explicitly passed

Found a minor issue: when running maelstrom against a binary without explicitly setting the workload via the -w option, the test throws a NullPointerException.

The default workload is presumably set here:

(def test-opt-spec
  "Options for single tests."
  [[nil "--bin FILE"        "Path to binary which runs a node"
    :missing "Expected a --bin PATH_TO_BINARY to test"]

   ["-w" "--workload NAME" "What workload to run."
    :default "lin-kv"
    :parse-fn keyword
    :validate [workloads (cli/one-of workloads)]]
   ])

but it does not end up as the workload-name here:

        workload-name (:workload opts)
        workload      ((workloads workload-name)
                       (assoc opts :nodes nodes, :net net))

Reproduction Steps:

Execute the command: ./maelstrom test --bin demobin
The relevant part of the error output:

ERROR [2023-09-09 21:22:36,041] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.lang.NullPointerException: Cannot invoke "clojure.lang.IFn.invoke(Object)" because the return va
        at maelstrom.core$maelstrom_test.invokeStatic(core.clj:62)
        at maelstrom.core$maelstrom_test.invoke(core.clj:53)
        at jepsen.cli$single_test_cmd$fn__13951.invoke(cli.clj:396)
        at jepsen.cli$run_BANG_.invokeStatic(cli.clj:329)
        at jepsen.cli$run_BANG_.invoke(cli.clj:258)
        at maelstrom.core$_main.invokeStatic(core.clj:269)
        at maelstrom.core$_main.doInvoke(core.clj:267)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at maelstrom.core.main(Unknown Source)

Expected Behavior:
The default workload, "lin-kv", should be used when the -w option is not provided. When I run the test with an explicit -w lin-kv argument, it works, so the problem appears to lie in how the default value is applied or read.

We can either fix the default or remove lin-kv as a default setting. I am happy to contribute a fix through a pull request, based on whichever direction you prefer.

Kafka workflow crashes with NullPointerException

If maelstrom is given a binary that ignores all the kafka commands (for example, this modified demo/clojure/echo.clj), it fails with a NullPointerException rather than a useful error.

For other workloads, e.g. broadcast or g-counter, maelstrom fails with timeout exceptions that are easier to understand.

Is this the correct way to fail for the kafka workload?

$ lein run -- test -w kafka --bin demo/clojure/lenient_echo.clj # see gist
[...]
INFO [2023-09-13 10:24:40,229] jepsen node n1 - maelstrom.net Shutting down Maelstrom network
INFO [2023-09-13 10:24:40,233] jepsen test runner - jepsen.core Analyzing...
INFO [2023-09-13 10:24:40,294] clojure-agent-send-off-pool-5 - jepsen.tests.kafka Wrote /Users/ndr/coding/maelstrom-src/maelstrom/store/kafka/20230913T102325.079+0100/consume-counts.edn
WARN [2023-09-13 10:24:40,295] clojure-agent-send-off-pool-5 - jepsen.checker Error while checking history:
java.lang.NullPointerException: Cannot invoke "java.lang.Number.doubleValue()" because "x" is null
        at clojure.lang.Numbers.divide(Numbers.java:3899)
        at jepsen.util$nanos__GT_secs.invokeStatic(util.clj:386)
        at jepsen.util$nanos__GT_secs.invoke(util.clj:386)
        at clojure.core$update.invokeStatic(core.clj:6231)
        at clojure.core$update.invoke(core.clj:6223)
        at jepsen.tests.kafka$checker$reify__19272.check(kafka.clj:2080)
        at jepsen.checker$check_safe.invokeStatic(checker.clj:86)
        at jepsen.checker$check_safe.invoke(checker.clj:79)
        at jepsen.checker$compose$reify__10709$fn__10711.invoke(checker.clj:102)
        at clojure.core$pmap$fn__8552$fn__8553.invoke(core.clj:7089)
        at clojure.core$binding_conveyor_fn$fn__5823.invoke(core.clj:2047)
        at clojure.lang.AFn.call(AFn.java:18)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1623)
INFO [2023-09-13 10:24:40,428] jepsen test runner - jepsen.core Analysis complete
INFO [2023-09-13 10:24:40,434] jepsen results - jepsen.store Wrote /Users/ndr/coding/maelstrom-src/maelstrom/store/kafka/20230913T102325.079+0100/results.edn
INFO [2023-09-13 10:24:40,456] jepsen test runner - jepsen.core {:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? false,
         :count 67,
         :ok-count 0,
         :fail-count 11,
         :info-count 56,
         :by-f {:assign {:valid? false,
                         :count 11,
                         :ok-count 0,
                         :fail-count 11,
                         :info-count 0},
                :crash {:valid? false,
                        :count 7,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 7},
                :poll {:valid? false,
                       :count 20,
                       :ok-count 0,
                       :fail-count 0,
                       :info-count 20},
                :send {:valid? false,
                       :count 29,
                       :ok-count 0,
                       :fail-count 0,
                       :info-count 29}}},
 :availability {:valid? true, :ok-fraction 0.0},
 :net {:all {:send-count 70,
             :recv-count 70,
             :msg-count 70,
             :msgs-per-op 1.0447761},
       :clients {:send-count 70, :recv-count 70, :msg-count 70},
       :servers {:send-count 0,
                 :recv-count 0,
                 :msg-count 0,
                 :msgs-per-op 0.0},
       :valid? true},
 :workload {:valid? :unknown,
            :error "java.lang.NullPointerException: Cannot invoke \"java.lang.Number.doubleValue()\" because \"x\" is null\n at clojure.lang.Numbers.divide (Numbers.java:3899)\n    jepsen.util$nanos__GT_secs.invokeStatic (util.clj:386)\n    jepsen.util$nanos__GT_secs.invoke (util.clj:386)\n    clojure.core$update.invokeStatic (core.clj:6231)\n    clojure.core$update.invoke (core.clj:6223)\n    jepsen.tests.kafka$checker$reify__19272.check (kafka.clj:2080)\n    jepsen.checker$check_safe.invokeStatic (checker.clj:86)\n    jepsen.checker$check_safe.invoke (checker.clj:79)\n    jepsen.checker$compose$reify__10709$fn__10711.invoke (checker.clj:102)\n    clojure.core$pmap$fn__8552$fn__8553.invoke (core.clj:7089)\n    clojure.core$binding_conveyor_fn$fn__5823.invoke (core.clj:2047)\n    clojure.lang.AFn.call (AFn.java:18)\n    java.util.concurrent.FutureTask.run (FutureTask.java:317)\n    java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1144)\n    java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:642)\n    java.lang.Thread.run (Thread.java:1623)\n"},
 :valid? false}


Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

Suggestion: A better API for `RPC` method in Go

I used the async RPC method in the broadcast challenge. One of the issues I faced was keeping track of the status of calls.

I wanted to maintain a list of messages I had sent. If the node had sent the message successfully, I would delete it from the list. If not, I would retry.

Once I sent a message, I wanted to know whether I had received a response or the call had failed. The RPC method assigns the auto-incremented message ID but does not return it to the caller, even though the reply passed to the handler contains it (as in_reply_to). This made the calls painful to track.

Here is what I did:

  1. Generate a random ID and add this to HandlerFunc's state (the callback method)
  2. Send the message, save the ID in my local state
  3. When the callback is called, remove the ID from the state
  4. After X minutes, if my ID is still in the state, then resend the message

Sample code:

// this function gives a stateful HandlerFunc 
// which stores the msgId with it
//
// when this method is called, then it knows for which
// message it was called
rpcHandler := func(msgId string) maelstrom.HandlerFunc {
	return func(msg maelstrom.Message) error {
		// do something with the msgId
		// store.Remove(msgId)
		return nil
	}
}

// generate a msgId for this RPC call
msgId, _ := uuid.NewRandom()
// store this msgId in a map
// store.Save(msgId)
if err := n.RPC(nodId, msgBody, rpcHandler(msgId.String())); err != nil {
	return false
}

We can improve this a great deal by having the RPC method return the msgId of the message it sent. The change to the Go library is minimal, and user code would look something like the following:

// the handler function does not require to be stateful anymore
someHandlerFunc := func(msg maelstrom.Message) error {
	var body map[string]any
	if err := json.Unmarshal(msg.Body, &body); err != nil {
		return err
	}
	// body: map[in_reply_to:1 type:broadcast_ok]
	if msgId, ok := body["in_reply_to"]; ok {
		// do something with the msgId
		// store.Remove(msgId)
	}
	return nil
}


// declare msgId outside the if statement so it stays in scope below
msgId, err := n.RPC(nodId, msgBody, someHandlerFunc)
if err != nil {
	return false
}
// store this msgId in a map
// store.Save(msgId)

The change in the library:

- func (n *Node) RPC(dest string, body any, handler HandlerFunc) error { ... }
+ func (n *Node) RPC(dest string, body any, handler HandlerFunc) (int, error) { ... }

It is a breaking change, but the improvement in developer experience makes it worthwhile. Are you open to considering such a change?

Assert on dest for init_ok message

Hi, I am trying to use your workbench for a small project and started by implementing an echo service.
I get an assertion error saying that the destination is invalid:
java.lang.AssertionError: Assert failed: Invalid dest for message {:dest "c0", :id 0, :body {:type "init_ok", :msg_id 1, :in_reply_to 1}, :src "n1"}
After several attempts I am stuck. In the response I use the src from the init message as the dest.

Here is a log from the run.
Initialisation:

2021-03-10 22:12:50,547{GMT}	INFO	[jepsen test runner] jepsen.core: Test version 8c8d9286e436a5ae9c23e6e8e490520582418823 (plus uncommitted changes)
2021-03-10 22:12:50,549{GMT}	INFO	[jepsen test runner] jepsen.core: Command line:
lein run test -w echo --bin testnode --nodes n1 --time-limit 5 --log-net-send --log-net-recv --log-stderr
2021-03-10 22:12:50,761{GMT}	INFO	[jepsen test runner] jepsen.core: Running test:
{:remote #jepsen.control.SSHRemote{:session nil}
 :log-net-send true
 :node-count nil
 :max-txn-length 4
 :concurrency 1
 :db
 #object[maelstrom.db$db$reify__16747
         "0x4c683777"
         "maelstrom.db$db$reify__16747@4c683777"]
 :max-writes-per-key 16
 :leave-db-running? false
 :name "echo"
 :logging-json? false
 :net-journal #object[clojure.lang.Atom "0x28bf7bea" {:status :ready, :val []}]
 :start-time
 #object[org.joda.time.DateTime "0x44881fcc" "2021-03-10T22:12:50.000+01:00"]
 :nemesis-interval 10
 :net
 #object[maelstrom.net$jepsen_adapter$reify__15560
         "0xe7423bb"
         "maelstrom.net$jepsen_adapter$reify__15560@e7423bb"]
 :client
 #object[maelstrom.workload.echo$client$reify__17341
         "0x270663e2"
         "maelstrom.workload.echo$client$reify__17341@270663e2"]
 :barrier
 #object[java.util.concurrent.CyclicBarrier
         "0x3987bfc8"
         "java.util.concurrent.CyclicBarrier@3987bfc8"]
 :log-stderr true
 :pure-generators true
 :ssh {:dummy? true}
 :rate 5
 :checker
 #object[jepsen.checker$compose$reify__8612
         "0x31084f37"
         "jepsen.checker$compose$reify__8612@31084f37"]
 :argv
 ("test"
  "-w"
  "echo"
  "--bin"
  "testnode"
  "--nodes"
  "n1"
  "--time-limit"
  "5"
  "--log-net-send"
  "--log-net-recv"
  "--log-stderr")
 :nemesis
 (jepsen.nemesis.ReflCompose
  {:fm {:start-partition 0,
        :stop-partition 0,
        :kill 1,
        :start 1,
        :pause 1,
        :resume 1},
   :nemeses [#unprintable "jepsen.nemesis.combined$partition_nemesis$reify__17019@1ececb56"
             #unprintable "jepsen.nemesis.combined$db_nemesis$reify__17000@a43a21e"]})
 :active-histories
 #object[clojure.lang.Atom "0xebe3561" {:status :ready, :val #{}}]
 :nodes ["n1"]
 :test-count 1
 :latency {:mean 0, :dist :constant}
 :bin "testnode"
 :generator
 (jepsen.generator.TimeLimit
  {:limit 5000000000,
   :cutoff nil,
   :gen (jepsen.generator.Any
         {:gens [(jepsen.generator.OnThreads {:f #{:nemesis}, :gen nil})
                 (jepsen.generator.OnThreads
                  {:f #object[clojure.core$complement$fn__5654
                              "0x3343997b"
                              "clojure.core$complement$fn__5654@3343997b"],
                   :gen (jepsen.generator.Stagger
                         {:dt 400000000,
                          :next-time nil,
                          :gen (jepsen.generator.EachThread
                                {:fresh-gen #object[maelstrom.workload.echo$workload$fn__17360
                                                    "0x14b3976"
                                                    "maelstrom.workload.echo$workload$fn__17360@14b3976"],
                                 :gens {}})})})]})})
 :log-net-recv true
 :os
 #object[jepsen.os$reify__2490 "0x79de9f90" "jepsen.os$reify__2490@79de9f90"]
 :time-limit 5
 :workload :echo
 :consistency-models [:strict-serializable]
 :topology :grid}

Log messages:

2021-03-10 22:12:50,800{GMT}	INFO	[jepsen test runner] jepsen.db: Tearing down DB
2021-03-10 22:12:50,813{GMT}	INFO	[jepsen test runner] jepsen.db: Setting up DB
2021-03-10 22:12:50,830{GMT}	INFO	[jepsen node n1] maelstrom.service: Starting services: (lin-kv lin-tso lww-kv seq-kv)
2021-03-10 22:12:50,837{GMT}	INFO	[jepsen node n1] maelstrom.db: Setting up n1
2021-03-10 22:12:50,840{GMT}	INFO	[jepsen node n1] maelstrom.process: launching testnode nil
2021-03-10 22:12:50,877{GMT}	INFO	[jepsen node n1] maelstrom.net: :send {:dest "n1", :body {:type "init", :node_id "n1", :node_ids ["n1"], :msg_id 1}, :src "c0", :id 0}
2021-03-10 22:12:50,884{GMT}	INFO	[n1 stdin] maelstrom.net: :recv {:dest "n1", :body {:type "init", :node_id "n1", :node_ids ["n1"], :msg_id 1}, :src "c0", :id 0}
2021-03-10 22:12:50,918{GMT}	INFO	[n1 stderr] maelstrom.process: 2021-03-10T22:12:50+0100 debug error : Get JSON {"dest":"n1","body":{"type":"init","node_id":"n1","node_ids":["n1"],"msg_id":1},"src":"c0","id":0}
2021-03-10 22:12:50,918{GMT}	INFO	[n1 stderr] maelstrom.process: 2021-03-10T22:12:50+0100 debug error : Response JSON {"dest":"c0","id":0,"body":{"type":"init_ok","msg_id":1,"in_reply_to":1},"src":"n1"}
2021-03-10 22:13:00,943{GMT}	INFO	[jepsen node n1] maelstrom.db: Tearing down n1
2021-03-10 22:13:00,970{GMT}	WARN	[n1 stdout] maelstrom.process: Error!
java.lang.AssertionError: Assert failed: Invalid dest for message {:dest "c0", :id 0, :body {:type "init_ok", :msg_id 1, :in_reply_to 1}, :src "n1"}
(get queues (:dest m))

Error: Unable to access jarfile lib/maelstrom.jar

Hello, I hope you are doing well. I just started the Gossip Glomers Echo Challenge, but I am running into an error.
I have installed Maelstrom and its prerequisites, but when I run the command ./maelstrom test -w echo --bin ~/go/bin/maelstrom-echo --node-count 1 --time-limit 10, I get the following error: Error: Unable to access jarfile lib/maelstrom.jar

Trying to set this up, but getting error=13, Permission denied

I downloaded the echo.rb file and am running:

maelstrom test -w echo --bin echo.rb --time-limit 5
INFO [2023-02-23 18:21:56,703] jepsen node n1 - maelstrom.db Setting up n1
INFO [2023-02-23 18:21:56,703] jepsen node n1 - maelstrom.process launching echo.rb []
INFO [2023-02-23 18:21:57,715] jepsen node n1 - maelstrom.net Shutting down Maelstrom network
WARN [2023-02-23 18:21:57,716] jepsen test runner - jepsen.core Test crashed!
java.io.IOException: Cannot run program "/home/barry/programming/haskell/maelstromHs/echo.rb" (in directory "/tmp"): error=13, Permission denied
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)

I also tried sudo ... and executing it directly from /tmp (same error).

I don't get permission denied from, e.g.,

cd /tmp
touch example.file

Not really sure what to do here.

'Invalid dest' error encountered when implementing echo server init message reply, when newline is not appended

I'm working through chapter 2 in the Guide, implementing the echo server in Go.

I'm getting an Invalid dest error when my echo server sends a reply to the init message.

I'm using the src value from the init message as the dest value in the reply, so this is an unexpected error.

INFO [2022-09-14 10:55:14,768] jepsen node n1 - maelstrom.net Starting Maelstrom network
INFO [2022-09-14 10:55:14,769] jepsen test runner - jepsen.db Tearing down DB
INFO [2022-09-14 10:55:14,770] jepsen test runner - jepsen.db Setting up DB
INFO [2022-09-14 10:55:14,772] jepsen node n1 - maelstrom.service Starting services: (lin-kv lin-tso lww-kv seq-kv)
INFO [2022-09-14 10:55:14,774] jepsen node n1 - maelstrom.db Setting up n1
INFO [2022-09-14 10:55:14,775] jepsen node n1 - maelstrom.process launching ../chapter2-echo/cmd/echo/echo []
INFO [2022-09-14 10:55:14,794] n1 stderr - maelstrom.process Received init message from Maelstrom: {c0 n1 {1 init n1 [n1]}}
INFO [2022-09-14 10:55:14,795] n1 stderr - maelstrom.process Sending init message reply to Maelstrom: {n1 c0 {1 1 init_ok}}
INFO [2022-09-14 10:55:24,794] jepsen node n1 - maelstrom.db Tearing down n1
INFO [2022-09-14 10:55:24,795] n1 stderr - maelstrom.process Starting echo server with node id: n1 and message counter: 1
WARN [2022-09-14 10:55:24,808] n1 stdout - maelstrom.process Error!
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 1, :src "n1", :dest "c0", :body {:msg_id 1, :in_reply_to 1, :type "init_ok"}}
(get queues (:dest m))

txn-rw-register: non complete response payload validation

Hi,

When working on the txn-rw-register workload, I noticed that my responses to the txn operation didn't conform to the protocol because I was leaving out some of the transaction's operations. An example:

[2023-04-06T05:34:52.648Z]: "[recv][from:c6][type:txn][msg_id:1] {\"id\":5,\"src\":\"c6\",\"dest\":\"n0\",\"body\":{\"txn\":[[\"r\",8,null],[\"w\",8,1],[\"r\",9,null]],\"type\":\"txn\",\"msg_id\":1}}"
[2023-04-06T05:34:52.648Z]: "[send] {\"dest\":\"c6\",\"src\":\"n0\",\"body\":{\"msg_id\":1,\"type\":\"txn_ok\",\"txn\":[[\"r\",8,null]],\"in_reply_to\":1}}"

In the example above, I replied to message 1 with only the first "read" operation. However, maelstrom reported that all was good at the end of the execution. Is this a bug, or is the checker lenient?
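Until the checker's behaviour is clarified, a node can guard against this itself before sending txn_ok. The sketch below assumes that a well-formed reply carries one operation per request operation, in the same order and with the same function and key; `Op` and `checkShape` are hypothetical names, not part of Maelstrom.

```go
package main

import "fmt"

// Op mirrors one [f, key, value] triple of a txn body. Hypothetical type.
type Op struct {
	F     string
	Key   int
	Value any
}

// checkShape returns an error unless resp has exactly one op per request op,
// in the same order, with the same function and key. Values are not compared,
// because reads are filled in by the node.
func checkShape(req, resp []Op) error {
	if len(req) != len(resp) {
		return fmt.Errorf("txn_ok has %d ops, request had %d", len(resp), len(req))
	}
	for i := range req {
		if req[i].F != resp[i].F || req[i].Key != resp[i].Key {
			return fmt.Errorf("op %d: got [%s %d], want [%s %d]",
				i, resp[i].F, resp[i].Key, req[i].F, req[i].Key)
		}
	}
	return nil
}

func main() {
	// The request and the truncated reply from the log above.
	req := []Op{{"r", 8, nil}, {"w", 8, 1}, {"r", 9, nil}}
	resp := []Op{{"r", 8, nil}}
	fmt.Println(checkShape(req, resp))
}
```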

Kafka workload never finishes analyzing

Hello.

I'm attempting to vet my solution for Fly's distributed system challenge #5a but maelstrom seems stuck analyzing: it's been doing so for 45 minutes now and has produced 1.3GB of log output.

It's dumping what looks to be internal state relevant to plotting; you can see it at the tail end of the gist at https://gist.github.com/chronoslynx/6e8ac74b11159ab07d6ad7ca1304e5ef. The log was full of output like

INFO [2023-03-05 11:26:42,178] clojure-agent-send-off-pool-80 - jepsen.tests.kafka :row [:g [:title ok send by process 2
[[:send "9" [2 2]]]] [:text {:x 48, :y 14, :style background: #f4ce90;} 2]]

I've replicated the issue with:

  • maelstrom's 0.2.2 release
  • maelstrom's 0.2.3 release
  • 0.2.3

I can reliably reproduce this issue with my code at https://github.com/chronoslynx/gossip-glomers/blob/main/kafka/main.go

Not clear how to troubleshoot "No matching clause: [:ok :invoke]"

I'm working through the Maelstrom Guide. For better or worse I decided to implement each exercise in Go rather than just use the Ruby code supplied in docs. I'm currently implementing the logic from Chapter 5 'Database As Value'.

I'm invoking maelstrom with these arguments:
./maelstrom test -w txn-list-append --bin "${GOPATH}/bin/datomic" --time-limit 10 --node-count 2 --rate 100 --log-stderr

I'm running into an error in results.edn that I'm not sure how to troubleshoot:
java.lang.IllegalArgumentException: No matching clause: [:ok :invoke]

My first thought was that this indicated an :invoke without a corresponding :ok. I found one missing :ok for each node, with a :net :timeout instead. That seems to match the behavior described in the Database As Value section.

Any suggestions for how to troubleshoot this?

{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 853,
         :ok-count 851,
         :fail-count 0,
         :info-count 2,
         :by-f {:txn {:valid? true,
                      :count 853,
                      :ok-count 851,
                      :fail-count 0,
                      :info-count 2}}},
 :net {:all {:send-count 5120,
             :recv-count 5120,
             :msg-count 5120,
             :msgs-per-op 6.0023446},
       :clients {:send-count 1708, :recv-count 1708, :msg-count 1708},
       :servers {:send-count 3412,
                 :recv-count 3412,
                 :msg-count 3412,
                 :msgs-per-op 4.0},
       :valid? true},
 :workload {:valid? :unknown,
            :error "java.lang.IllegalArgumentException: No matching clause: [:ok :invoke]\n at elle.list_append$dirty_update_cases$fn__18117$fn__18122.invoke (list_append.clj:404)\n    clojure.lang.PersistentVector.reduce (PersistentVector.java:343)\n    clojure.core$reduce.invokeStatic (core.clj:6829)\n    clojure.core$reduce.invoke (core.clj:6812)\n    elle.list_append$dirty_update_cases$fn__18117.invoke (list_append.clj:397)\n    clojure.core$map$fn__5884.invoke (core.clj:2759)\n    clojure.lang.LazySeq.sval (LazySeq.java:42)\n    clojure.lang.LazySeq.seq (LazySeq.java:51)\n    clojure.lang.RT.seq (RT.java:535)\n    clojure.core$seq__5419.invokeStatic (core.clj:139)\n    clojure.core$apply.invokeStatic (core.clj:662)\n    clojure.core$mapcat.invokeStatic (core.clj:2787)\n    clojure.core$mapcat.doInvoke (core.clj:2787)\n    clojure.lang.RestFn.invoke (RestFn.java:423)\n    elle.list_append$dirty_update_cases.invokeStatic (list_append.clj:395)\n    elle.list_append$dirty_update_cases.invoke (list_append.clj:386)\n    elle.list_append$check.invokeStatic (list_append.clj:883)\n    elle.list_append$check.invoke (list_append.clj:848)\n    jepsen.tests.cycle.append$checker$reify__18396.check (append.clj:19)\n    jepsen.checker$check_safe.invokeStatic (checker.clj:81)\n    jepsen.checker$check_safe.invoke (checker.clj:74)\n    jepsen.checker$compose$reify__9656$fn__9658.invoke (checker.clj:97)\n    clojure.core$pmap$fn__8485$fn__8486.invoke (core.clj:7024)\n    clojure.core$binding_conveyor_fn$fn__5772.invoke (core.clj:2034)\n    clojure.lang.AFn.call (AFn.java:18)\n    java.util.concurrent.FutureTask.run (FutureTask.java:264)\n    java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)\n    java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)\n    java.lang.Thread.run (Thread.java:833)\n"},
 :valid? :unknown}

Running test yields `NoSuchFileException`

I'm trying to package maelstrom for myself using Nix, but the packaged version always crashes with the following backtrace:

WARN [2023-04-14 10:39:06,433] jepsen test runner - jepsen.core Test crashed!
java.nio.file.NoSuchFileException: store/echo/20230414T103906.375+0200/test.jepsen
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
        at jepsen.store.format$open.invokeStatic(format.clj:407)
        at jepsen.store.format$open.invoke(format.clj:402)
        at jepsen.core$run_BANG_.invokeStatic(core.clj:388)
        at jepsen.core$run_BANG_.invoke(core.clj:318)
        at jepsen.cli$single_test_cmd$fn__13951.invoke(cli.clj:396)
        at jepsen.cli$run_BANG_.invokeStatic(cli.clj:329)
        at jepsen.cli$run_BANG_.invoke(cli.clj:258)
        at maelstrom.core$_main.invokeStatic(core.clj:269)
        at maelstrom.core$_main.doInvoke(core.clj:267)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at maelstrom.core.main(Unknown Source)
ERROR [2023-04-14 10:39:06,434] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.nio.file.NoSuchFileException: store/echo/20230414T103906.375+0200/test.jepsen
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
        at jepsen.store.format$open.invokeStatic(format.clj:407)
        at jepsen.store.format$open.invoke(format.clj:402)
        at jepsen.core$run_BANG_.invokeStatic(core.clj:388)
        at jepsen.core$run_BANG_.invoke(core.clj:318)
        at jepsen.cli$single_test_cmd$fn__13951.invoke(cli.clj:396)
        at jepsen.cli$run_BANG_.invokeStatic(cli.clj:329)
        at jepsen.cli$run_BANG_.invoke(cli.clj:258)
        at maelstrom.core$_main.invokeStatic(core.clj:269)
        at maelstrom.core$_main.doInvoke(core.clj:267)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at maelstrom.core.main(Unknown Source)

I can't figure out which file is actually missing: I can't find a reference to the given path anywhere, and strace on a successful run doesn't mention it either.

Packaging maelstrom creates /bin/maelstrom with the following content (Nix store paths have been minimised to what matters):

#!/bin/bash

programDir="/lib"
cd "$programDir"
exec "/bin/java" -jar "$programDir/maelstrom.jar" "$@"

lib/maelstrom.jar is moved to /lib/maelstrom.jar.

Any ideas?

Compare-and-swap on seq-kv

Hi @aphyr. Someone recently posted a graph of messages against the seq-kv where two separate identity CAS operations succeed:

  • n0 for value 121
  • n1 for value 119

Is this legal under sequential consistency? My understanding from the section in Services was that updates could only interact with past states if the total order is not violated.

tl;dr just wondering if this is a bug or if I'm misunderstanding something. Thanks!


`Invalid dest for message` on teardown

On teardown, maelstrom throws an AssertionError when one of the nodes attempts to send to a node that has already been torn down.
It seems this assertion error shouldn't be raised or printed.

INFO [2023-04-13 01:55:14,033] jepsen node n2 - maelstrom.db Tearing down n2
WARN [2023-04-13 01:55:14,033] n4 stdout - maelstrom.process Error!
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 518901, :src "n4", :dest "n1", :body {:msg_id 108978, :type "broadcast", :message 36}}

Server Config Estimation For Following Spec.

Hi, I am trying to run 5K nodes:
./maelstrom test -w echo --bin maelstrom-echo --time-limit 300 --node-count 5000 --rate 5000 --concurrency 1n

I was running the plain maelstrom-echo binary and hit resource limits. Is there an estimate of the configuration required to run this many node instances?

Here are the logs and information about the system.

java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method)
        at java.base/java.lang.Thread.start(Thread.java:798)
        at dom_top.core$real_pmap_helper.invokeStatic(core.clj:186)
        at dom_top.core$real_pmap_helper.invoke(core.clj:150)
        at jepsen.util$real_pmap.invokeStatic(util.clj:76)
        at jepsen.util$real_pmap.invoke(util.clj:69)
        at jepsen.control$on_nodes.invokeStatic(control.clj:311)
        at jepsen.control$on_nodes.invoke(control.clj:299)
        at jepsen.control$on_nodes.invokeStatic(control.clj:304)
        at jepsen.control$on_nodes.invoke(control.clj:299)
        at jepsen.core$run_BANG_$fn__13071$fn__13074.invoke(core.clj:395)
        at jepsen.core$run_BANG_$fn__13071.invoke(core.clj:394)
        at jepsen.core$run_BANG_.invokeStatic(core.clj:392)
        at jepsen.core$run_BANG_.invoke(core.clj:318)
        at jepsen.cli$single_test_cmd$fn__13951.invoke(cli.clj:396)
        at jepsen.cli$run_BANG_.invokeStatic(cli.clj:329)
        at jepsen.cli$run_BANG_.invoke(cli.clj:258)
        at maelstrom.core$_main.invokeStatic(core.clj:269)
        at maelstrom.core$_main.doInvoke(core.clj:267)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at maelstrom.core.main(Unknown Source)

I have modified the maelstrom command as follows:
exec java -Xss256M -Xms54G -Xmx54G -XX:ThreadStackSize=136

System specs:
Ubuntu 20.04
RAM: 60 GB
Cores: 12
Disk space: 400 GB
/proc/sys/kernel# cat threads-max
470681

root@e2e-30-74:~# ulimit -aH
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 235340
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 235340
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Nemesis partition issue?

Hello,

First off, thanks for all your work.

I'm working on Gossip Glomers challenge 3c. The main change from 3b is that we add --nemesis partition to test network partitions. When I run it against my solution for 3b, which doesn't handle network partitions, I still get "Everything looks good!".

That led me to think that perhaps --nemesis partition isn't working correctly, since the test should have failed: my solution doesn't handle network partitions.

Chapter 5 error reply showing up in history as ":unknown"

Is this because I'm populating msg_id in the reply body?

2023-01-10 10:06:45,020{GMT}    INFO    [n0 stderr] maelstrom.process: 2023/01/10 10:06:45 Failed to process txn request {c4 n0 {txn 35 [[append 6 5] [append 11 13] [append 5 2] [r 6 <nil>]]}}: [[append 6 5] [append 11 13] [append 5 2] [r 6 [1 2 3 4 5]]], RPC txn-conflict error (due to service response code 22)

2023-01-10 10:06:45,020{GMT}    INFO    [n0 stderr] maelstrom.process: 2023/01/10 10:06:45 Wrote error reply: {"src":"n0","dest":"c4","body":{"type":"error","msg_id":99,"in_reply_to":35,"code":30,"text":"CAS failed"}}

2023-01-10 10:06:45,022{GMT}    INFO    [jepsen worker 0] jepsen.util: 0        :info   :txn    [[:append 6 5] [:append 11 13] [:append 5 2] [:r 6 nil]]        [:unknown "CAS failed"]

Update Jepsen to 0.3.4 from 0.3.1

Maelstrom currently uses Jepsen 0.3.1. Since then there have been a number of performance and other improvements, up to version 0.3.4.

I have already tried this locally, and it seems to work without any noticeable problems.

txn-rw-register workload does not detect g0 write cycles under read-uncommitted consistency

Maelstrom does not detect G0 (write cycle) for me, as in the following history:

t3:           ["r",0,1],["w",0,10],                               ["r",0,3],["w",0,11]
t2: ["w",0,1],["r",0,1],           ["r",0,10],["w",0,2],["w",0,3],["r",0,3]

So t2 reads the value 3 at operation index 5 (0-indexed), having just written 2 and 3 on top of t3's conflicting write (10). This is a G0, and Maelstrom ignored it.

$ go build && ~/.../maelstrom test -w txn-rw-register --bin ./maelstrom-txn --node-count 1 --time-limit 20 --rate 10 --concurrency 10n --consistency-models read-uncommitted --max-txn-length 20 --max-writes-per-key 10000000 --key-count 1

2023/03/29 12:11:11 Received {c3 n0 {"txn":[["w",0,1],["r",0,null],["r",0,null],["w",0,2],["w",0,3],["r",0,null],["w",0,4],["r",0,null],["r",0,null],["r",0,null],["w",0,5],["w",0,6],["r",0,null],["w",0,7],["w",0,8],["r",0,null],["w",0,9]],"type":"txn","msg_id":1}}
910670418 + 0 = 1 [#2]  - write key 0, value 1, tx 2
991888336 - 0 = 0 [#2]
2023/03/29 12:11:11 Received {c4 n0 {"txn":[["r",0,null],["w",0,10],["r",0,null],["w",0,11]],"type":"txn","msg_id":1}}
997560527 - 0 = 0 [#3]
8713188 + 0 = 10 [#3] - write key 0, value 10, tx 3 (ww-edge)
17116025 - 0 = 0 [#2]
60262526 + 0 = 2 [#2] - write key 0, value 2, tx 2 (ww-edge - cycle with 3)
77438615 + 0 = 3 [#2] - write key 0, value 3, tx 2 (ww-edge - cycle with 3)
90639108 - 0 = 0 [#2] - read key 0, value 3, tx 2 (wr-edge - cycle with 3)
...
2023/03/29 12:11:12 Sent {"src":"n0","dest":"c4","body":{"in_reply_to":1,"txn":[["r",0,1],["w",0,10],["r",0,3],["w",0,11]],"type":"txn_ok"}}
2023/03/29 12:11:12 Sent {"src":"n0","dest":"c3","body":{"in_reply_to":1,"txn":[["w",0,1],["r",0,1],["r",0,10],["w",0,2],["w",0,3],["r",0,3],["w",0,4],["r",0,11],["r",0,17],["r",0,19],["w",0,5],["w",0,6],["r",0,14],["w",0,7],["w",0,8],["r",0,34],["w",0,9]],"type":"txn_ok"}}

Here tx 2 and tx 3 form several write cycles: w2(1), w3(10), w2(2), w2(3). Maelstrom could detect the G0 from the received responses alone, given that (1) every written value is unique and (2) all writes in those two transactions are covered by reads.
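
The cycle described in this report can be illustrated with a small sketch. It assumes the per-key version order is already known (here it is taken from the node's own log), which is precisely the part a checker may not be able to infer from client responses alone. It also treats every write, including intermediate writes inside one transaction, as a version, as the report does. The function name and data layout are hypothetical and are not part of Maelstrom or Elle.

```python
from collections import defaultdict


def has_write_cycle(version_order):
    """Return True if the ww edges implied by version_order form a cycle.

    version_order: list of (value, txn_id) pairs for ONE key, in the order
    the versions were installed. This order is an assumed input.
    """
    # Add a ww edge between consecutive versions written by different txns.
    edges = defaultdict(set)
    for (_, prev_txn), (_, next_txn) in zip(version_order, version_order[1:]):
        if prev_txn != next_txn:
            edges[prev_txn].add(next_txn)

    # Depth-first search with three colours; a grey successor means a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)

    def visit(txn):
        colour[txn] = GREY
        for succ in edges[txn]:
            if colour[succ] == GREY:
                return True
            if colour[succ] == WHITE and visit(succ):
                return True
        colour[txn] = BLACK
        return False

    return any(colour[t] == WHITE and visit(t) for t in list(edges))


# Key 0 as logged by the node: t2 wrote 1, t3 wrote 10, t2 wrote 2 and 3,
# then t3 wrote 11.
observed = [(1, "t2"), (10, "t3"), (2, "t2"), (3, "t2"), (11, "t3")]
# The same writes if the two transactions had not interleaved.
serial = [(1, "t2"), (2, "t2"), (3, "t2"), (10, "t3"), (11, "t3")]
```

With the interleaved order, t2 and t3 each overwrite the other, so the graph contains the cycle t2 -> t3 -> t2; the serial order yields a single edge and no cycle.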

"Map" with random key ordering would lead to excessive failure in Datomic chapter

Specifically, this concerns https://github.com/jepsen-io/maelstrom/blob/main/doc/05-datomic/02-shared-state.md#database-as-value and I'm mostly wondering whether this should be mentioned in the tutorial (you might also encounter this issue when doing training :D).

I'm using Rust, and since the built-in HashMap type can reorder keys when it grows, it can cause a lot of failures when used blindly in CAS.

For example, more than half of my transactions failed even for very few operations:

 :stats {:valid? true,
         :count 22,
         :ok-count 7,
         :fail-count 15,

For longer runs it's even worse, so I didn't bother to finish any. The message diagram leads me to believe this is due to the ordering of HashMap keys:

[Image: message diagram]

Luckily there's also a built-in type that is sorted by its keys, and it solved the issue for me (as did yet another type), giving results like this:

 :stats {:valid? true,
         :count 2045,
         :ok-count 1924,
         :fail-count 121,

This is more in line with the results in the tutorial.
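
The effect can be reproduced outside Rust. The following is a minimal sketch in Python, assuming the database value is stored as a list of key/value pairs, so that element order matters when the compare-and-set compares the old value. `to_pairs` is a hypothetical helper, not something from the tutorial.

```python
def to_pairs(db, canonical):
    """Turn a map into a list of [key, value] pairs for use as a CAS value.

    With canonical=False the pairs follow the map's iteration order, so two
    equal maps can produce different lists. With canonical=True the pairs
    are sorted by key, so equal maps always produce the same list.
    """
    pairs = [[k, v] for k, v in db.items()]
    return sorted(pairs) if canonical else pairs


# Same contents, different iteration order (Python dicts keep insertion order,
# which stands in here for a hash map that reorders its keys).
a = {1: [1, 2], 2: [3]}
b = {2: [3], 1: [1, 2]}

unordered_match = to_pairs(a, False) == to_pairs(b, False)
canonical_match = to_pairs(a, True) == to_pairs(b, True)
```

Here `unordered_match` is False even though `a == b`, which is the situation in which a CAS on the serialized value fails spuriously; `canonical_match` is True.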

Injecting Faults - nemesis

I'm experimenting with partitions under the Raft protocol. Can anyone help me induce partial (simplex) faults, where communication is possible in only one direction?

I see the option --nemesis FAULTS #{}, a comma separated list of faults to inject.

How is this option used, and what should I put in the comma-separated list? Usually "--nemesis partition" is given to induce faults.

Python client library

Hi there! Thanks so much for making Maelstrom. I just ran a small workshop at a systems reading group that I organize at Harvard, and we all worked on the Gossip Glomers challenges together on a shared remote computer. It was a lot of fun :)

I wrote a Python client library for Maelstrom (via asyncio) for the purposes of the workshop, since I did a quick poll, and most of the members of our reading group didn't know Go or JavaScript but were still interested in learning about distributed systems. I thought it might be helpful for others who are more familiar with Python as well. Would you be open to a pull request where I add this library to the demo/ folder for others to use?

https://gist.github.com/ekzhang/65310f0260c03d4ec340e0daf0ccb9d5#file-_maelstrom-py

Would it be possible to SIGINT/SIGTERM processes instead of SIGKILLING them?

It seems that nodes are sent a SIGKILL immediately at termination (via Process.destroyForcibly). This makes it a little tricky to use things like in-process sampling profilers, because the process is killed before the profile can be written.

It would be nicer if the node had its stdin closed, or were sent a SIGINT, and was then given a few seconds to quit before being forcibly killed.

It is pretty easy for a Maelstrom implementation in any language to watch for stdin closure and break out of Run().
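
As a rough illustration of watching for stdin closure, here is a minimal sketch in Python. The `run` function and its `handle` callback are hypothetical names, not taken from any of the Maelstrom demo libraries; the loop simply ends when the input stream reaches end of file, which is the point at which a node could write out a profile before exiting.

```python
import io
import json


def run(stdin, handle):
    """Read newline-delimited JSON messages until stdin is closed.

    Returns the number of messages handled.
    """
    count = 0
    for line in stdin:  # iteration stops when the stream reaches EOF
        line = line.strip()
        if not line:
            continue  # ignore blank lines between messages
        handle(json.loads(line))
        count += 1
    # Reaching this point means stdin was closed: a good place to flush
    # profiles or logs before the process exits.
    return count


# Simulate Maelstrom closing stdin after two messages.
received = []
fake_stdin = io.StringIO('{"body": {"type": "init"}}\n\n{"body": {"type": "echo"}}\n')
handled = run(fake_stdin, received.append)
```

In a real node the first argument would be `sys.stdin`.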

Maybe it was intentionally left out in the docs but was curious

First of all,
the docs and this project are a marvel of engineering. I finished reading the docs this
weekend; I will give it some time and read them again. They read like a novel. I wish all
full-blown books on distributed systems were written like this. Good job, sir!

I was wondering, what is the main use of this? Do you expect it to be used by those who write
and test replicated state machines with this text-based protocol? Or is there a way that I can
wire up a DB client and test its advertised isolation levels?

Appreciate your hard work,
Good luck!

Java 1.8 support

The docs seem to suggest Java 1.8 is supported, but I get errors when attempting to use the pre-built binary with openjdk 1.8.0_362. This Stack Overflow post suggests that perhaps the true minimum Java version is Java 11?

$ java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-b08)
OpenJDK 64-Bit Server VM (build 25.362-b08, mixed mode)
$ ./maelstrom 
Exception in thread "main" java.lang.UnsupportedClassVersionError: ch/qos/logback/classic/spi/LogbackServiceProvider has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
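
The class file versions in this error message can be translated into Java releases with simple arithmetic: from Java 5 onward, the release number is the class file major version minus 44. A small sketch (the function name is hypothetical):

```python
def java_release(class_file_major):
    """Map a class file major version (49 or later) to its Java release."""
    if class_file_major < 49:
        raise ValueError("only class file versions from Java 5 (49) onward are handled")
    return class_file_major - 44


runtime = java_release(52)   # the versions this runtime recognizes (up to 52.0)
required = java_release(55)  # the version LogbackServiceProvider was compiled for
```

So the runtime in the report is Java 8, while the class mentioned in the error was compiled for Java 11.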

Running the demo JS script does not work on macOS

Run Command: ./maelstrom test -w echo --bin demo/js/echo.js --time-limit 5

Error:

WARN [2023-12-05 15:35:38,456] jepsen test runner - jepsen.core Test crashed!
java.io.IOException: Cannot run program "/Users/user1/Workspace/maelstrom/demo/js/echo.js" (in directory "/var/folders/x5/jmppkfcx71zd921s1dtytdyw0000gn/T"): error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
	at maelstrom.process$start_node_BANG_.invokeStatic(process.clj:199)
	at maelstrom.process$start_node_BANG_.invoke(process.clj:168)
	at maelstrom.db$db$reify__16142.setup_BANG_(db.clj:34)
	at jepsen.db$fn__8729$G__8723__8733.invoke(db.clj:12)
	at jepsen.db$fn__8729$G__8722__8738.invoke(db.clj:12)
	at clojure.core$partial$fn__5908.invoke(core.clj:2642)
	at jepsen.control$on_nodes$fn__8599.invoke(control.clj:314)
	at clojure.lang.AFn.applyToHelper(AFn.java:154)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.applyTo(RestFn.java:142)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at dom_top.core$real_pmap_helper$build_thread__211$fn__212.invoke(core.clj:163)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.invoke(RestFn.java:425)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
	... 31 common frames omitted
ERROR [2023-12-05 15:35:38,461] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.io.IOException: Cannot run program "/Users/user1/Workspace/maelstrom/demo/js/echo.js" (in directory "/var/folders/x5/jmppkfcx71zd921s1dtytdyw0000gn/T"): error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
	at maelstrom.process$start_node_BANG_.invokeStatic(process.clj:199)
	at maelstrom.process$start_node_BANG_.invoke(process.clj:168)
	at maelstrom.db$db$reify__16142.setup_BANG_(db.clj:34)
	at jepsen.db$fn__8729$G__8723__8733.invoke(db.clj:12)
	at jepsen.db$fn__8729$G__8722__8738.invoke(db.clj:12)
	at clojure.core$partial$fn__5908.invoke(core.clj:2642)
	at jepsen.control$on_nodes$fn__8599.invoke(control.clj:314)
	at clojure.lang.AFn.applyToHelper(AFn.java:154)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.applyTo(RestFn.java:142)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at dom_top.core$real_pmap_helper$build_thread__211$fn__212.invoke(core.clj:163)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.invoke(RestFn.java:425)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
	... 31 common frames omitted
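
When error=2 is reported for a script that does exist, one possible cause (not confirmed in this report) is that the interpreter path on the script's #! line does not exist on the machine. The sketch below checks for that; both function names are hypothetical, and the check only covers interpreters given as a direct path on the first line.

```python
import os


def shebang_command(path):
    """Return the words of a script's #! line, or None if it has none."""
    with open(path, "rb") as f:
        first = f.readline().decode("utf-8", errors="replace").strip()
    if not first.startswith("#!"):
        return None
    return first[2:].split()


def missing_interpreter(path):
    """Return the interpreter path if it does not exist, otherwise None."""
    words = shebang_command(path)
    if not words:
        return None
    interpreter = words[0]
    return None if os.path.exists(interpreter) else interpreter
```

For example, `missing_interpreter("demo/js/echo.js")` would return the interpreter path if the script names a location where no such program is installed.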

Wonky error when missing newlines between messages

I tried the "echo" challenge from the fly-io site. Good idea to have a simple echo thing to sort out the basic setup.

So one program I tried was the following Racket code:

#lang racket/base

(require json)
(require racket/match)
(require racket/port)

(define (main)
  (for ([h (in-producer read-json eof)])
    (match h
      [(hash-table ('dest dest)
                   ('src src)
                   ('body body)
                   (k v) ...)
       (match body
         [(hash-table ('type "echo")
                      ('msg_id msg_id)
                      (bk bv))
          (write-json
           (hash
            'src dest
            'dest src
            'body (hash 'type "echo_ok"
                        'msg_id msg_id
                        'in_reply_to msg_id
                        bk bv)))]
         [(hash-table ('type "init")
                      ('msg_id msg_id)
                      (bk bv) ...)
          (write-json
           (hash 'src dest
                 'dest src
                 'body
                 (hash 'type "init_ok"
                       'in_reply_to msg_id)))]
         [_
          #f])]
      [_
       #f])
    (flush-output)))

(module+ main
  (main))

The above code results in the following error:

java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 1, :src "n0", :dest "c0", :body {:in_reply_to 1, :type "init_ok"}}

I guessed from reading the spec that emitting a newline after each message would follow the spec more closely and might just work. And it did! Simply adding (displayln "") before (flush-output) in the above code makes it work and pass the tests.

To run the above code, assuming Racket 8.8 is installed, generate an executable with raco exe echo.rkt, then run the resulting echo binary with Maelstrom.

PS: I don't feel too strongly about this and wouldn't mind if the issue is closed as unimportant.
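
For comparison, the same fix (one JSON message per line, followed by a flush) can be sketched in Python. The `send` helper is a hypothetical name; the message contents mirror the init_ok reply shown in the error above.

```python
import io
import json


def send(out, message):
    """Write one message as a single line of JSON, then flush."""
    out.write(json.dumps(message))
    out.write("\n")  # the newline separates this message from the next one
    out.flush()


# Write two replies into an in-memory buffer standing in for stdout.
buf = io.StringIO()
send(buf, {"src": "n0", "dest": "c0", "body": {"type": "init_ok", "in_reply_to": 1}})
send(buf, {"src": "n0", "dest": "c0", "body": {"type": "echo_ok", "in_reply_to": 2, "echo": "hi"}})
lines = buf.getvalue().splitlines()
```

Each element of `lines` is then a complete JSON document that can be parsed on its own.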

Permission denied when running `--bin` binary

Hello 👋 ,

I'm checking out the Fly distributed systems challenges, and for the "Echo" demo, after compiling the Go code and trying to run the maelstrom tests, I get this error:

[...]
INFO [2023-02-25 14:58:30,921] jepsen node n0 - maelstrom.net Starting Maelstrom network
INFO [2023-02-25 14:58:30,922] jepsen test runner - jepsen.db Tearing down DB
INFO [2023-02-25 14:58:30,923] jepsen test runner - jepsen.db Setting up DB
INFO [2023-02-25 14:58:30,925] jepsen node n0 - maelstrom.service Starting services: (lin-kv lin-tso lww-kv seq-kv)
INFO [2023-02-25 14:58:30,925] jepsen node n0 - maelstrom.db Setting up n0
INFO [2023-02-25 14:58:30,926] jepsen node n0 - maelstrom.process launching /home/johndoe/path/to/fly-dist-sys/1_echo []
INFO [2023-02-25 14:58:31,930] jepsen node n0 - maelstrom.net Shutting down Maelstrom network
WARN [2023-02-25 14:58:31,933] jepsen test runner - jepsen.core Test crashed!
java.io.IOException: Cannot run program "/home/johndoe/path/to/fly-dist-sys/1_echo" (in directory "/tmp"): error=13, Permission denied
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
	at maelstrom.process$start_node_BANG_.invokeStatic(process.clj:199)
	at maelstrom.process$start_node_BANG_.invoke(process.clj:168)
	at maelstrom.db$db$reify__16142.setup_BANG_(db.clj:34)
	at jepsen.db$fn__8729$G__8723__8733.invoke(db.clj:12)
	at jepsen.db$fn__8729$G__8722__8738.invoke(db.clj:12)
	at clojure.core$partial$fn__5908.invoke(core.clj:2642)
	at jepsen.control$on_nodes$fn__8599.invoke(control.clj:314)
	at clojure.lang.AFn.applyToHelper(AFn.java:154)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.applyTo(RestFn.java:142)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at dom_top.core$real_pmap_helper$build_thread__211$fn__212.invoke(core.clj:163)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.invoke(RestFn.java:425)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.IOException: error=13, Permission denied
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
	... 31 common frames omitted
ERROR [2023-02-25 14:58:31,938] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.io.IOException: Cannot run program "/home/johndoe/path/to/fly-dist-sys/1_echo" (in directory "/tmp"): error=13, Permission denied
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
	at maelstrom.process$start_node_BANG_.invokeStatic(process.clj:199)
	at maelstrom.process$start_node_BANG_.invoke(process.clj:168)
	at maelstrom.db$db$reify__16142.setup_BANG_(db.clj:34)
	at jepsen.db$fn__8729$G__8723__8733.invoke(db.clj:12)
	at jepsen.db$fn__8729$G__8722__8738.invoke(db.clj:12)
	at clojure.core$partial$fn__5908.invoke(core.clj:2642)
	at jepsen.control$on_nodes$fn__8599.invoke(control.clj:314)
	at clojure.lang.AFn.applyToHelper(AFn.java:154)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.applyTo(RestFn.java:142)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:408)
	at dom_top.core$real_pmap_helper$build_thread__211$fn__212.invoke(core.clj:163)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.invoke(RestFn.java:425)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.IOException: error=13, Permission denied
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
	... 31 common frames omitted

The compiled Go binary, as well as maelstrom, belongs to the johndoe user and group.

I'm running maelstrom like this: ./maelstrom test -w echo --bin ~/path/to/fly-dist-sys/1_echo --node-count 1 --time-limit 10

I'm on Linux (Fedora).

Am I doing anything wrong? Should permissions be set differently?
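
A few common reasons for error=13 can be checked before involving Maelstrom at all. The sketch below is only a diagnostic aid under stated assumptions: `exec_problem` is a hypothetical helper, and it covers just three cases (the path is a directory, the file has no execute bit, or the execute bit does not apply to the current user). Other causes, such as a filesystem mounted noexec, are not detected.

```python
import os
import stat


def exec_problem(path):
    """Return a short description of why path may not be runnable, or None."""
    if os.path.isdir(path):
        return "path is a directory, not a file"
    if not os.path.exists(path):
        return "path does not exist"
    mode = os.stat(path).st_mode
    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        return "no execute bit set (try chmod +x)"
    if not os.access(path, os.X_OK):
        return "execute bit set, but not for the current user"
    return None
```

Calling `exec_problem` on the value passed to --bin shows, for instance, whether that path points at the directory containing the Go sources rather than at the compiled binary.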

Specialized nodes

I was interested in modelling Apache BookKeeper, to compare the experience with modelling it in TLA+. One slight difficulty with BK is that there are three node types: metadata_store (zk), bookie (storage node), and bk_client (the serving layer, where all the consensus logic lives). The bk_clients are the only externally visible nodes of the protocol: all requests go to the bk_clients, which in turn interact with the metadata_store and the bookies.

Currently Maelstrom treats all nodes the same, so I would need to proxy messages and use the node ID to determine the node type. When a node initializes, I could ensure that the first node is a metadata_store, the next two are bk_clients, and the rest are bookies. Metadata_store and bookie nodes would then need to proxy Maelstrom messages to bk_client nodes. But that means adding communication between nodes that does not exist in the real protocol, just so that each node can always forward messages to the right node.

An alternative is to allow node specialization, or at least to mark which nodes form the public API surface of the protocol, in order to avoid this proxying. For my purposes, simply restricting the public API calls to particular nodes would be enough. Using the node ID to determine the node type is not an issue for me.

Released JAR doesn't work with Java 11

I'm getting this error when trying to run the JAR from releases:

Exception in thread "main" java.lang.ExceptionInInitializerError
[...] here long clj stacktrace [...]
Caused by: java.lang.ClassNotFoundException: javax.xml.bind.DatatypeConverter
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:398)
        at clojure.lang.RT.classForName(RT.java:2168)
        at clojure.lang.RT.classForNameNonLoading(RT.java:2181)
        at org.httpkit.server$loading__5569__auto____6381.invoke(server.clj:1)
        at org.httpkit.server__init.load(Unknown Source)
        at org.httpkit.server__init.<clinit>(Unknown Source)
        ... 98 more

According to Stack Overflow, the problem is that this library was removed in newer Java versions: "In Java 11 they are completely removed from the JDK."

seq-kv reads never return final state for most recent write/cas

I'm doing the Gossip Glomers g-counter challenge at the moment, in which you implement a g-counter using seq-kv. Below is a spoiler of my solution, so just a warning for anyone who doesn't wish to be spoiled.

The solution I went for is that each node maintains its own sum under its own key in seq-kv. That means that if we have 3 nodes, we will have 3 keys in seq-kv, each storing the sum of its own node. On a background thread, each node repeatedly issues read requests for all 3 keys in seq-kv, and so my thinking was that after enough reads they should eventually get the latest value. This approach is also mentioned as one that should work in the other issue here.

What I am observing is that the most recent write or cas operation against any key in seq-kv is never read by the other nodes in the system. I have attached at the bottom of the issue a log of the maelstrom output with --log-net-send turned on to show this happening. If there is a better format for debugging this data that you want me to upload, let me know. In this log I ran the g-counter workload against 2 nodes with a rate of 20 and a time limit of 1. The most recent cas/write message that occurs is: {:id 90, :src "n0", :dest "seq-kv", :body {:type "cas", :msg_id 28, :key "n0", :from 10, :to 14}}. After this, n1 sends a read for key "n0" 19 times, all of which return a value of 10 despite the fact that the cas sent from n0 changed it to 14.

I know that, according to the definition of sequential consistency, this is an allowed outcome, but I would have expected that eventually it would return the latest value. Sending no-op cas operations (e.g. from the current value to the current value) from all the nodes doesn't seem to fix this either. The solution I ended up getting to work was to set up another background thread on each node that just writes a random number to an unrelated key in seq-kv, which seems to cause the reads to eventually return the latest values for each node.
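
The workaround in the last paragraph amounts to periodically sending a message like the one built below. This is a sketch under assumptions: the envelope mirrors the cas message quoted from the log, the "write" body fields (key, value) are assumed to follow the same convention, and the helper name and the "scratch" key are hypothetical.

```python
import json
import random


def scratch_write(node_id, msg_id, key="scratch"):
    """Build a write of a random number to an unrelated key in seq-kv."""
    return {
        "src": node_id,
        "dest": "seq-kv",
        "body": {
            "type": "write",  # assumed body shape; see the note above
            "msg_id": msg_id,
            "key": key,
            "value": random.randrange(1_000_000),
        },
    }


# One line of output, as a node would print it to stdout.
line = json.dumps(scratch_write("n1", 7))
```

A background thread on each node would emit such a line at some interval alongside its regular reads.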

maelstrom-seq-kv-no-progress.txt
