protocol / beyond-bitswap
Be careful when testing the bandwidth savings of using compression. The testbed currently tracks bandwidth usage through the IPFS node Reporter, but the Reporter's bandwidth counter measures usage at the stream level. In our current stack, compression comes after the stream muxer, so this metric accurately accounts for protocol-level bandwidth use, but it is not useful for tracking total bandwidth use.
https://github.com/adlrocha/beyond-bitswap/blob/feat/compression/testbed/utils/ipfs.go#L606
Solving this issue depends on the following one from libp2p: libp2p/go-libp2p#1008
In the meantime you can use tools such as https://github.com/raboof/nethogs to measure bandwidth savings.
Some tests throw the following Reprovider error:
Aug 26 14:21:44.867308 INFO 240.6045s ERROR << edb99dff7fea >> 2020-08-26T14:21:44.847Z ERROR reprovider.simple failed to reprovide: failed to find any peer in table
As seen in ipfs.go, I am using the default configuration, so reproviding should use the "all" strategy every 12 hours.
node.OnlineProviders(cfg.Experimental.StrategicProviding, cfg.Reprovider.Strategy, cfg.Reprovider.Interval)
I would recommend simplifying this to a combination of LimitReader + io.Copy into ioutil.Discard. This logic may have several hidden bugs:

Read may have only read N bytes (less than the length of the buffer), but you're writing the entire buffer, which means you may be writing more bytes than you read (sending leftovers from previous reads that remain in the buffer).

As for files.ToFile(): at first sight it seems like on every iteration you're reading from the beginning of the file. You might want to seek to an offset, or better yet, use io.Copy to just pipe through until EOF.

With #32, #33, the testbed is getting into great shape, and considering @hannahhoward and @acruikshank's amazing job visualizing Graphsync's performance compared to TCP for several files and network conditions, it'd be great to have a public IPFS file-sharing performance comparison dashboard similar to this one for NodeJS: https://benchmarking.nodejs.org/
My proposal would be:
We have to agree on the test cases to run for the benchmark. But a good start would be:
We may be able to leverage Testground as a Service to trigger the test with every new commit.
BONUS: If we can add to this benchmark an experiment that performs an exchange between two nodes at different locations through the real IPFS network (instead of using Testground), we would be able to evaluate how the status of the network may affect the file exchange (although this may not be worth the effort, considering the added value it really brings).
Add a script to automatically download a set of base datasets from https://awesome.ipfs.io/datasets/ to start using in the testbed.
The IPFS nodes being spawned in the current testbed have the DHT and MDNS enabled, so they may cause nodes to connect to each other even if they aren't explicitly connected.
Related to #18, I noticed that when we write tests that transfer a dataset (such as a folder / file) as opposed to just randomly generated bytes, we import the data into UnixFS from the file system (in normal system format). If our intent is to truly test some of the datasets on https://awesome.ipfs.io/datasets/ as they exist on IPFS, the reliable way to bring them in is to export them as CAR files and import those into the tests. The reason is that a UnixFS import from system files is not guaranteed to produce the exact same DAG or root CID. There are several variables that affect how the DAG is built -- such as chunking strategy, use of raw leaves, etc. The only reliable way to know you have the exact same DAG is to use CAR files. This might also make sense in terms of writing scripts to download datasets -- as long as IPFS is, unfortunately, not as fast as HTTP on a fast hosted site, it's going to be much more efficient if we can import into the seeds' blockstore from a CAR file on a CDN network -- plus that means we don't have to download ahead of time. We can probably just include it as part of the test, which makes things more reproducible on CI.
Anyway, curious to get your thoughts @adlrocha -- also does this make sense? It may not be obvious if you haven't worked a lot with UnixFS files and DAG structures.
If we perform a test with INPUT_DATA=files using local:exec and a DATA_DIR from our local machine, the test runs successfully regardless of the size of the dataset (tested from 1MB to 2GB).

If we try to perform the same test using local:docker, adding the DATA_DIR through Testground's extra_sources, the test case doesn't finish. It gets stuck.

Using INPUT_DATA=random apparently works for both local:exec and local:docker.
The testbed currently implements a pretty flat interaction between peers and the network. We choose the files to seed to the network, seeders Add these files to the network, and leechers flatly Get them. No complex interaction pattern is implemented. Instead of exclusively choosing the files to be used in the test, we should implement an interactionFunction which is fed to the testbed to simulate different applications and request patterns in the test runs.

defaultInteraction: flatly Add/Get files.
getAllLeavesInteraction: retrieves the files of a DAG.
databaseInteraction: performs read/write operations over DAGs implementing a database schema.

Implement your own interactionFunction and plug it into the testbed.

As part of RFC|BB|L203A of the Beyond Bitswap project, we are exploring the use of compression within Bitswap. In our initial explorations we have implemented two main strategies:
BlockCompression: compresses the RawData of blocks using GZip.
FullCompression: compresses the full Bitswap message using GZip and adds it to the CompressedPayload field of a new Bitswap message.

Thus, two additional fields have been added to Bitswap's protobuf message: CompressedPayload, which carries the compressed payload of the original Bitswap message when FullCompression is enabled, and a CompressionType flag to signal the compression strategy used.
Initial tests show roughly a 10x overhead from the use of compression compared to Bitswap without compression. We compared our GZip compression approach to existing implementations of GZip compression HTTP handlers in Golang, and the main difference is that GZip HTTP handlers pipe the compression writer to the HTTP server writer, streaming the compressed bytes directly to the client. In our case, we can't use stream compression for several reasons:

We need the CompressionType field in the protobuf so we can be backward compatible. This is not a stopper, because we could easily use a multicodec to signal that the stream is compressed.
In go-bitswap, where the libp2p protocol stream is referenced and written in message.go, there is no easy way to pipe the compressed stream to the protocol stream writer (this is probably fixable once we figure out how to avoid the required size prefix).

As a result of all of these evaluations, we want to open a discussion on the following topics:

KeepReading and EndOfStream signals in protocol streams, so we don't have to know message sizes beforehand and we can pipe streams (such as what we were looking to do for stream compression).
Keeping RawData compressed in the datastore, removing the need for "on-the-fly" compression.

Hey there.
I'm having issues running the Testground test plans. I followed the instructions in the testbed README to install everything, but I have problems running the transfer test plan and the RFC experiment test plans.

When running the single_run.sh script, the nodes hang at "waiting for network to initialize" and the test times out with "context deadline exceeded". (I modified the script to use RUNNER="local:docker" and BUILDER="docker:go", but exec also does not work for me.)
Step 28/28 : ENTRYPOINT [ "/testplan"]
---> Running in a644b3cfb5ba
---> 900b91fa9f0c
---> 900b91fa9f0c
Successfully built 900b91fa9f0c
Successfully tagged 26327f500441:latest
Jul 9 17:59:44.075299 INFO build completed {"default_tag": "26327f500441:latest", "took": "38s"}
Jul 9 17:59:44.128650 INFO got docker image id {"image_id": "900b91fa9f0c"}
Jul 9 17:59:44.181896 INFO tagging image {"image_id": "900b91fa9f0c", "tag": "tg-plan-testbed:900b91fa9f0c"}
Jul 9 17:59:44.187222 INFO build succeeded {"plan": "testbed", "groups": ["single"], "builder": "docker:go", "artifact": "900b91fa9f0c"}
Jul 9 17:59:44.187312 INFO performing healthcheck on runner
Jul 9 17:59:44.194357 INFO container isn't running; starting {"container_name": "testground-grafana"}
Jul 9 17:59:44.582836 INFO container isn't running; starting {"container_name": "testground-influxdb"}
Jul 9 17:59:44.933611 WARN some healthchecks failed, but continuing
Jul 9 17:59:44.933668 INFO starting run {"run_id": "cb4s3q3ft8ahdoftmlb0", "plan": "testbed", "case": "transfer", "runner": "local:docker", "instances": 5}
Jul 9 17:59:44.968129 INFO creating container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "name": "tg-testbed-transfer-cb4s3q3ft8ahdoftmlb0-single-0"}
Jul 9 17:59:45.085128 INFO creating container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "name": "tg-testbed-transfer-cb4s3q3ft8ahdoftmlb0-single-1"}
Jul 9 17:59:45.129046 INFO creating container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "name": "tg-testbed-transfer-cb4s3q3ft8ahdoftmlb0-single-2"}
Jul 9 17:59:45.169380 INFO creating container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "name": "tg-testbed-transfer-cb4s3q3ft8ahdoftmlb0-single-3"}
Jul 9 17:59:45.198084 INFO creating container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "name": "tg-testbed-transfer-cb4s3q3ft8ahdoftmlb0-single-4"}
Jul 9 17:59:45.229807 INFO starting containers {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "count": 5}
Jul 9 17:59:45.229876 INFO starting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "8fb6df27ae5e1ecae505ed6634837ed3ad1be118c3a83d7bbe87ca5e3c0e253e", "group": "single", "group_index": 0}
Jul 9 17:59:45.229873 INFO starting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "965d0784d7ae662642d72f95fc560746531ebde38f74821fa65ad547b239887f", "group": "single", "group_index": 4}
Jul 9 17:59:45.229873 INFO starting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "4f0e47e5a6af633e5c265866dde91b3828e0922509b8a5d105f85bdc1f58bd02", "group": "single", "group_index": 3}
Jul 9 17:59:45.229878 INFO starting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "b86bc557c07dd42f91e21c45a179d894db0a2b57f20bfaa3881012337328c22c", "group": "single", "group_index": 1}
Jul 9 17:59:45.229923 INFO starting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "93a8aeb43e8ef9293dac1465e8b261bfbfe591263d4c91cd3516c845957cfa89", "group": "single", "group_index": 2}
Jul 9 17:59:45.954264 INFO 0.7244s OTHER << sidecar >> Jul 9 17:59:45.953718 INFO resolved route to host {"host": "testground-redis", "ip": "192.18.0.3"}
Jul 9 17:59:45.954327 INFO 0.7245s OTHER << sidecar >> Jul 9 17:59:45.953920 INFO resolved route to host {"host": "testground-sync-service", "ip": "192.18.0.4"}
Jul 9 17:59:45.954344 INFO 0.7245s OTHER << sidecar >> Jul 9 17:59:45.954105 INFO resolved route to host {"host": "testground-influxdb", "ip": "192.18.0.6"}
Jul 9 17:59:45.954711 INFO 0.7245s MESSAGE << sidecar >> InfluxDB unavailable; no metrics will be dispatched: no InfluxDB URL in $INFLUXDB_URL env var
Jul 9 17:59:45.958716 INFO 0.7288s OTHER << sidecar >> Jul 9 17:59:45.958406 INFO successfully resolved route to host {"container_id": "8fb6df27ae5e1ecae505ed6634837ed3ad1be118c3a83d7bbe87ca5e3c0e253e"}
Jul 9 17:59:45.968329 INFO 0.7371s MESSAGE << single[000] (8fb6df) >> registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)
Jul 9 17:59:45.968576 INFO 0.7371s START << single[000] (8fb6df) >> {"plan":"testbed","case":"transfer","run":"cb4s3q3ft8ahdoftmlb0","outputs_path":"/outputs","instances":5,"params":{"bandwidth_mb":"150","bstore_delay_ms":"5","data_dir":"false","dialer":"default","disk_store":"false","enable_dht":"false","enable_providing":"false","enable_tcp":"100","exchange_interface":"bitswap","file_size":"15728640","input_data":"files","jitter_pct":"10","latency_ms":"10","leech_count":"3","long_lasting":"false","max_connection_rate":"0","node_type":"ipfs","number_waves":"1","parallel_gen_mb":"100","passive_count":"","request_stagger":"0","run_count":"2","run_timeout_secs":"90000","seeder_rate":"100","timeout_secs":"400000"},"group":"single","group_instances":5,"network":"16.12.0.0/16","start_time":"0001-01-01T00:00:00Z"}
Jul 9 17:59:45.973722 INFO 0.7438s OTHER << sidecar >> Jul 9 17:59:45.973444 INFO external routing disabled
Jul 9 17:59:45.973806 INFO 0.7439s OTHER << sidecar >> Jul 9 17:59:45.973498 INFO waiting for all networks to be ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:45.973959 INFO 0.7438s MESSAGE << single[000] (8fb6df) >> waiting for network to initialize
Jul 9 17:59:46.296012 INFO 1.0659s MESSAGE << sidecar >> InfluxDB unavailable; no metrics will be dispatched: no InfluxDB URL in $INFLUXDB_URL env var
Jul 9 17:59:46.299944 INFO 1.0698s MESSAGE << single[001] (b86bc5) >> registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)
Jul 9 17:59:46.300083 INFO 1.0698s START << single[001] (b86bc5) >> {"plan":"testbed","case":"transfer","run":"cb4s3q3ft8ahdoftmlb0","outputs_path":"/outputs","instances":5,"params":{"bandwidth_mb":"150","bstore_delay_ms":"5","data_dir":"false","dialer":"default","disk_store":"false","enable_dht":"false","enable_providing":"false","enable_tcp":"100","exchange_interface":"bitswap","file_size":"15728640","input_data":"files","jitter_pct":"10","latency_ms":"10","leech_count":"3","long_lasting":"false","max_connection_rate":"0","node_type":"ipfs","number_waves":"1","parallel_gen_mb":"100","passive_count":"","request_stagger":"0","run_count":"2","run_timeout_secs":"90000","seeder_rate":"100","timeout_secs":"400000"},"group":"single","group_instances":5,"network":"16.12.0.0/16","start_time":"0001-01-01T00:00:00Z"}
Jul 9 17:59:46.301369 INFO 1.0710s MESSAGE << single[001] (b86bc5) >> waiting for network to initialize
Jul 9 17:59:46.446754 INFO 1.2168s OTHER << sidecar >> Jul 9 17:59:46.445748 INFO successfully resolved route to host {"container_id": "b86bc557c07dd42f91e21c45a179d894db0a2b57f20bfaa3881012337328c22c"}
Jul 9 17:59:46.491011 INFO 1.2610s MESSAGE << sidecar >> InfluxDB unavailable; no metrics will be dispatched: no InfluxDB URL in $INFLUXDB_URL env var
Jul 9 17:59:46.496605 INFO 1.2664s MESSAGE << single[002] (93a8ae) >> registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)
Jul 9 17:59:46.496757 INFO 1.2664s START << single[002] (93a8ae) >> {"plan":"testbed","case":"transfer","run":"cb4s3q3ft8ahdoftmlb0","outputs_path":"/outputs","instances":5,"params":{"bandwidth_mb":"150","bstore_delay_ms":"5","data_dir":"false","dialer":"default","disk_store":"false","enable_dht":"false","enable_providing":"false","enable_tcp":"100","exchange_interface":"bitswap","file_size":"15728640","input_data":"files","jitter_pct":"10","latency_ms":"10","leech_count":"3","long_lasting":"false","max_connection_rate":"0","node_type":"ipfs","number_waves":"1","parallel_gen_mb":"100","passive_count":"","request_stagger":"0","run_count":"2","run_timeout_secs":"90000","seeder_rate":"100","timeout_secs":"400000"},"group":"single","group_instances":5,"network":"16.12.0.0/16","start_time":"0001-01-01T00:00:00Z"}
Jul 9 17:59:46.497784 INFO 1.2677s MESSAGE << single[002] (93a8ae) >> waiting for network to initialize
Jul 9 17:59:46.508462 INFO 1.2784s MESSAGE << sidecar >> InfluxDB unavailable; no metrics will be dispatched: no InfluxDB URL in $INFLUXDB_URL env var
Jul 9 17:59:46.512022 INFO 1.2819s MESSAGE << single[003] (4f0e47) >> registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)
Jul 9 17:59:46.512133 INFO 1.2819s START << single[003] (4f0e47) >> {"plan":"testbed","case":"transfer","run":"cb4s3q3ft8ahdoftmlb0","outputs_path":"/outputs","instances":5,"params":{"bandwidth_mb":"150","bstore_delay_ms":"5","data_dir":"false","dialer":"default","disk_store":"false","enable_dht":"false","enable_providing":"false","enable_tcp":"100","exchange_interface":"bitswap","file_size":"15728640","input_data":"files","jitter_pct":"10","latency_ms":"10","leech_count":"3","long_lasting":"false","max_connection_rate":"0","node_type":"ipfs","number_waves":"1","parallel_gen_mb":"100","passive_count":"","request_stagger":"0","run_count":"2","run_timeout_secs":"90000","seeder_rate":"100","timeout_secs":"400000"},"group":"single","group_instances":5,"network":"16.12.0.0/16","start_time":"0001-01-01T00:00:00Z"}
Jul 9 17:59:46.512874 INFO 1.2829s MESSAGE << single[003] (4f0e47) >> waiting for network to initialize
Jul 9 17:59:46.515609 INFO 1.2857s OTHER << sidecar >> Jul 9 17:59:46.515389 INFO successfully resolved route to host {"container_id": "93a8aeb43e8ef9293dac1465e8b261bfbfe591263d4c91cd3516c845957cfa89"}
Jul 9 17:59:46.515665 INFO 1.2858s OTHER << sidecar >> Jul 9 17:59:46.515429 INFO external routing disabled
Jul 9 17:59:46.521781 INFO 1.2919s OTHER << sidecar >> Jul 9 17:59:46.515445 INFO successfully resolved route to host {"container_id": "4f0e47e5a6af633e5c265866dde91b3828e0922509b8a5d105f85bdc1f58bd02"}
Jul 9 17:59:46.521840 INFO 1.2920s OTHER << sidecar >> Jul 9 17:59:46.515653 INFO waiting for all networks to be ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.521861 INFO 1.2920s OTHER << sidecar >> Jul 9 17:59:46.515979 INFO external routing disabled
Jul 9 17:59:46.521872 INFO 1.2920s OTHER << sidecar >> Jul 9 17:59:46.516001 INFO external routing disabled
Jul 9 17:59:46.521881 INFO 1.2920s OTHER << sidecar >> Jul 9 17:59:46.516056 INFO waiting for all networks to be ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.521890 INFO 1.2920s OTHER << sidecar >> Jul 9 17:59:46.516057 INFO waiting for all networks to be ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.542666 INFO started containers {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "count": 5}
Jul 9 17:59:46.544171 INFO 1.3140s MESSAGE << sidecar >> InfluxDB unavailable; no metrics will be dispatched: no InfluxDB URL in $INFLUXDB_URL env var
Jul 9 17:59:46.544949 INFO 1.3151s OTHER << sidecar >> Jul 9 17:59:46.544804 INFO successfully resolved route to host {"container_id": "965d0784d7ae662642d72f95fc560746531ebde38f74821fa65ad547b239887f"}
Jul 9 17:59:46.545183 INFO 1.3153s OTHER << sidecar >> Jul 9 17:59:46.544991 INFO external routing disabled
Jul 9 17:59:46.545233 INFO 1.3154s OTHER << sidecar >> Jul 9 17:59:46.545031 INFO waiting for all networks to be ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.545571 INFO 1.3157s OTHER << sidecar >> Jul 9 17:59:46.545424 INFO all networks ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.545762 INFO 1.3159s OTHER << sidecar >> Jul 9 17:59:46.545439 INFO all networks ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.545791 INFO 1.3159s OTHER << sidecar >> Jul 9 17:59:46.545455 INFO all networks ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.545801 INFO 1.3159s OTHER << sidecar >> Jul 9 17:59:46.545455 INFO all networks ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.545809 INFO 1.3160s OTHER << sidecar >> Jul 9 17:59:46.545639 INFO all networks ready {"sidecar": true, "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 17:59:46.549816 INFO 1.3196s MESSAGE << single[004] (965d07) >> registering default http handler at: http://[::]:6060/ (pprof: http://[::]:6060/debug/pprof/)
Jul 9 17:59:46.549959 INFO 1.3196s START << single[004] (965d07) >> {"plan":"testbed","case":"transfer","run":"cb4s3q3ft8ahdoftmlb0","outputs_path":"/outputs","instances":5,"params":{"bandwidth_mb":"150","bstore_delay_ms":"5","data_dir":"false","dialer":"default","disk_store":"false","enable_dht":"false","enable_providing":"false","enable_tcp":"100","exchange_interface":"bitswap","file_size":"15728640","input_data":"files","jitter_pct":"10","latency_ms":"10","leech_count":"3","long_lasting":"false","max_connection_rate":"0","node_type":"ipfs","number_waves":"1","parallel_gen_mb":"100","passive_count":"","request_stagger":"0","run_count":"2","run_timeout_secs":"90000","seeder_rate":"100","timeout_secs":"400000"},"group":"single","group_instances":5,"network":"16.12.0.0/16","start_time":"0001-01-01T00:00:00Z"}
Jul 9 17:59:46.551114 INFO 1.3210s MESSAGE << single[004] (965d07) >> waiting for network to initialize
Jul 9 18:09:05.237973 ERROR context.Background.WithDeadline(2022-07-09 20:09:05.236917778 +0200 CEST m=+615.193319498 [-1.052737ms]) {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0"}
Jul 9 18:09:05.237992 INFO 560.0081s INCOMPLETE << single[001] (b86bc5) >>
Jul 9 18:09:05.238023 INFO 560.0082s ERROR << single[002] (93a8ae) >> context deadline exceeded
Jul 9 18:09:05.238005 INFO 560.0081s ERROR << single[001] (b86bc5) >> context deadline exceeded
Jul 9 18:09:05.238031 INFO 560.0082s INCOMPLETE << single[002] (93a8ae) >>
Jul 9 18:09:05.237974 INFO deleting containers {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "ids": ["8fb6df27ae5e1ecae505ed6634837ed3ad1be118c3a83d7bbe87ca5e3c0e253e", "b86bc557c07dd42f91e21c45a179d894db0a2b57f20bfaa3881012337328c22c", "93a8aeb43e8ef9293dac1465e8b261bfbfe591263d4c91cd3516c845957cfa89", "4f0e47e5a6af633e5c265866dde91b3828e0922509b8a5d105f85bdc1f58bd02", "965d0784d7ae662642d72f95fc560746531ebde38f74821fa65ad547b239887f"]}
Jul 9 18:09:05.238038 INFO 560.0082s ERROR << single[000] (8fb6df) >> context deadline exceeded
Jul 9 18:09:05.238048 INFO 560.0082s INCOMPLETE << sidecar >>
Jul 9 18:09:05.238055 INFO 560.0082s ERROR << single[004] (965d07) >> context deadline exceeded
Jul 9 18:09:05.238054 INFO 560.0082s INCOMPLETE << single[000] (8fb6df) >>
Jul 9 18:09:05.238062 INFO 560.0082s INCOMPLETE << single[004] (965d07) >>
Jul 9 18:09:05.238067 INFO 560.0082s INCOMPLETE << single[003] (4f0e47) >>
Jul 9 18:09:05.238082 INFO deleting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "965d0784d7ae662642d72f95fc560746531ebde38f74821fa65ad547b239887f"}
Jul 9 18:09:05.238082 INFO deleting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "8fb6df27ae5e1ecae505ed6634837ed3ad1be118c3a83d7bbe87ca5e3c0e253e"}
Jul 9 18:09:05.238082 INFO deleting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "b86bc557c07dd42f91e21c45a179d894db0a2b57f20bfaa3881012337328c22c"}
Jul 9 18:09:05.238089 INFO deleting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "4f0e47e5a6af633e5c265866dde91b3828e0922509b8a5d105f85bdc1f58bd02"}
Jul 9 18:09:05.238090 INFO deleting container {"runner": "local:docker", "run_id": "cb4s3q3ft8ahdoftmlb0", "id": "93a8aeb43e8ef9293dac1465e8b261bfbfe591263d4c91cd3516c845957cfa89"}
Jul 9 18:09:05.238055 INFO 560.0082s ERROR << single[003] (4f0e47) >> context deadline exceeded
Jul 9 18:09:06.951796 WARN run finished in error {"run_id": "cb4s3q3ft8ahdoftmlb0", "plan": "testbed", "case": "transfer", "runner": "local:docker", "instances": 5, "error": "context deadline exceeded"}
Jul 9 18:09:06.951987 ERROR doRun returned err {"err": "context deadline exceeded"}
Jul 9 18:09:06.959021 INFO worker completed task {"worker_id": 0, "task_id": "cb4s3q3ft8ahdoftmlb0"}
When running an RFC experiment with ./run_experiment.sh rfcBBL102, I get the following error:
Step 19/26 : RUN cp /tmp/go.mod /tmp/go.sum ${PLAN_DIR}/
---> Running in 101d095bf46b
---> 57f8c2ec9daf
Step 20/26 : RUN cd ${PLAN_DIR} && go env -w GOPROXY="${GO_PROXY}" && CGO_ENABLED=${CgoEnabled} GOOS=linux GOARCH=amd64 go build -o ${PLAN_DIR}/testplan.bin ${BUILD_TAGS} ${TESTPLAN_EXEC_PKG}
---> Running in 4a7bffce0cbd
utils/bitswap.go:9:2: missing go.sum entry for module providing package github.com/ipfs/go-bitswap (imported by github.com/protocol/beyond-bitswap/testbed/testbed/utils); to add:
go get github.com/protocol/beyond-bitswap/testbed/testbed/utils
utils/bitswap.go:10:2: missing go.sum entry for module providing package github.com/ipfs/go-bitswap/network (imported by github.com/protocol/beyond-bitswap/testbed/testbed/utils); to add:
go get github.com/protocol/beyond-bitswap/testbed/testbed/utils
Jul 6 17:20:04.942631 INFO build failed {"plan": "testbed", "groups": ["nodes"], "builder": "docker:go", "error": "docker build failed: The command '/bin/sh -c cd ${PLAN_DIR} && go env -w GOPROXY=\"${GO_PROXY}\" && CGO_ENABLED=${CgoEnabled} GOOS=linux GOARCH=amd64 go build -o ${PLAN_DIR}/testplan.bin ${BUILD_TAGS} ${TESTPLAN_EXEC_PKG}' returned a non-zero code: 1"}
Jul 6 17:20:04.942677 ERROR doRun returned err {"err": "docker build failed: The command '/bin/sh -c cd ${PLAN_DIR} && go env -w GOPROXY=\"${GO_PROXY}\" && CGO_ENABLED=${CgoEnabled} GOOS=linux GOARCH=amd64 go build -o ${PLAN_DIR}/testplan.bin ${BUILD_TAGS} ${TESTPLAN_EXEC_PKG}' returned a non-zero code: 1"}
I tried to fix the issues with go mod tidy, but it didn't help. I'm locally running Go 1.18.3, but since I'm using the Docker runners, I guess this shouldn't matter (?).
I would highly appreciate help and tips pointing me in the right direction.
This function is a nightmare and needs to be cleaned up!
func (n *IPFSNode) EmitMetrics(runenv *runtime.RunEnv, runNum int, seq int64, grpseq int64,
latency time.Duration, bandwidthMB int, fileSize int, nodetp NodeType, tpindex int, timeToFetch int64, tcpFetch int64, leechFails int64) error
Create a NodeMetrics struct to track all the metrics for a node. I already tested the approach, but I chose to leave it for when the full code is refactored and working.
type NodeMetrics struct {
TimeToFetch int64
TCPFetch int64
LeechFails int64
}
Since this Testground commit (testground/testground@92ff5c6), where the supervisor was added, the execution of Testground tests is asynchronous, breaking the following features of the testbed:

local:exec doesn't work.

This RFC proposes to add a protocol extension to GraphSync that would cause a GraphSync responder to send only metadata in a GraphSync response -- but no blocks. We would use the list of CIDs contained in the metadata to fetch all the blocks in a DAG with Bitswap, nearly in parallel.
When fetching a DAG with Bitswap, we must first request and receive blocks at level N of the DAG in order to request blocks at level N+1. This slows a DAG traversal significantly because we can't parallelize requests for the entire DAG.
GraphSync attempts to solve this problem by requesting a whole DAG from another peer, expressed through a selector. However, requesting at the DAG level makes spreading requests across peers more difficult -- splitting a DAG at the selector level is significantly more complex than simply splitting up block requests.
The idea here is to essentially use GraphSync to assemble a lightweight manifest for arbitrary DAG data. We use GraphSync to get a list of the CIDs in the DAG we want, and then we use Bitswap to fetch those CIDs. In theory, this might offer a "best of both worlds" solution for fetching large DAGs that multiple peers have possession of. It requires minimal protocol changes -- and implementing the protocol extension in go-graphsync on the responder side is trivial. Moreover, storing a list of CIDs in a DAG is sufficiently cheap that GraphSync peers might begin caching frequently requested CID lists. The system might also pair nicely with RFCBBL1201 -- we could make a "WANT" request for a DAG to determine who has it, then a CID-only query with GraphSync, then fetch the CIDs with Bitswap.
graphsync/do-not-send-blocks extension: a responder receiving a request with this extension will perform a selector query and include metadata in responses, but omit all blocks.

The approach here shares several similarities with the normal operation of a GraphSync requestor:
The difference is:
We often want to traverse a whole DAG all the way to its leaves, but remote peers will limit the depth of recursion for GraphSync queries, meaning we may need to make multiple GraphSync queries to get the whole DAG. One challenge, however, is that we don't currently know in GraphSync whether CIDs are truly leaves or the point at which the selector traversal hit maximum recursion depth. And in this scenario, we won't know until we fetch them. This may be an additional extension we need to look at adding to the GraphSync protocol.
We don't want to fetch large sets of blocks that turn out to be wrong. At the same time, there is an inherent tension between wanting to fetch far enough ahead not to be limited by Bitswap's round-trip bottleneck and not wanting to fetch too many blocks that turn out to be wrong. One factor we might use to "estimate" how far out we can safely go is agreement among multiple remote peers about what is contained in the CID list for a selector query.
Build an interactive UI similar to this one for Bitswap: https://www.youtube.com/watch?time_continue=116&v=clzivvkpKpc&feature=emb_logo
This tool should read logs from a bitswap execution and show:
This would help the analysis of benchmarking results and debugging.
NOTE: I left the first few paragraphs as background, but the UPDATE section below describes the core issue more directly.
I'm running a custom Bitswap data exchange test with 3 nodes in a triangle topology that I'm calling trade, adapted from the transfer test in this repo. Each node uploads a unique file, then downloads each of their peers' files (so each user shares one file with each of their peers and downloads a file from each peer). This means a total of 6 file exchanges happen throughout a given run.

One difference from the transfer test is that, when a user downloads a file from one of their peers, I don't have the user save that file on disk. Once all of the files are exchanged, each user drops the files they fetched from their peers and disconnects. At the start of each run, the peers (re)connect to one another.
When trying to set the run count > 1 for this test, a run after the first (usually the second, at least with large enough files being exchanged) will time out. This usually presents as 5 of the 6 files being exchanged, while the 6th hangs and never finishes. The pair of peers that fails seems to vary. The strange thing about this is that the file had to have already been sent from the user hosting it to one of the peers, since 5/6 succeed, but for some reason it never finishes sending to the other peer. I set the long_lasting test parameter to true to get more of a sense of what might be happening. Every peer seems to remain alive, but eventually no data is flowing between them.
I tried making it so that the users all clean out the file they are responsible for sharing at the end of each run and re-add that file to the Bitswap datastore (the peers then re-exchange file CIDs from there), but that doesn't seem to have an effect. I can get more than 2 runs to succeed if I make the file size small enough. Sometimes it'll fail at a later run. For my purposes, sharing somewhat large files is useful, so I'd rather figure out a better solution if possible.
I've come at this from various angles -- ensuring that all of the nodes are syncing at the right steps, clearing out nearly all of the relevant data at the end of each run and reinitializing, making sure the nodes are alive and that they all dial each other before they attempt to exchange data, making the pubsub channel that each node uses to publish its file CIDs unique per run (which is sort of moot because the CIDs don't change, but I did this so I could also re-add the file to Bitswap, since that happens in the same test utility function), etc. This also happens whether I run the tests locally or in Docker.
One thing worth noting (that I just considered while writing this): I am attempting to fetch these files in parallel on each node. Maybe I'm hitting some sort of deadlock with that? The call to `transferNode.Fetch` is the one that hangs, and each node calls that function from one goroutine per file it wants to fetch, so perhaps it's something about trying to fetch the same file concurrently. I haven't attempted to replicate this issue in the `transfer` test, which doesn't fetch files concurrently, so perhaps that would be a good route to go.
UPDATE: That seems to be the issue -- the `transfer` test works fine with 50 runs in a row. However, my test fails, and I believe calling `transferNode.Fetch` to download multiple files in parallel is the issue. I'm guessing that I'm hitting some sort of deadlock in that function, or a weird race condition that only sometimes arises, and for some reason never happens in the first run (at least as far as I've seen).
So my question (sorry for the noise; I wanted to leave everything for documentation, and in case my understanding is incorrect): Is there a way to easily fetch multiple files at once with a Bitswap node in a test? Should I modify the Bitswap test node's implementation of the `Fetch()` function? For convenience, the `Fetch()` function is currently implemented here.
Pinging everyone who seems to have contributed to the transfer test (@hannahhoward @adlrocha @daviddias @aarshkshah1992), as well as @Stebalien
For the sake of development speed and rapid exploration of results, there is a lot of redundant code in the test cases. This needs to be fixed before it becomes technical debt.
Bitswap is a simple file exchange protocol. Given that it was not designed to be a content routing protocol, its feature set was kept to a minimum, which is probably the right thing to do. I believe there is a lot that can be done to improve its performance, and this project is certainly headed in the right direction, but I think we should carefully categorise the nature of the changes we're proposing in order to evaluate the significance of the results we get out of the study.
I'm presenting some thoughts on the RFCs currently being worked on. The current set of RFCs can be split into two categories: i) those that improve content fetching, and ii) those that improve content discovery.
Content Fetching RFCs
RFC|BB|L2-03A: Use of compression and adjustable block size
RFC|BB|L2-07: Request minimum piece size and content protocol extension
RFC|BB|L12-01: Bitswap/Graphsync exchange messages extension and transmission choice
I suggest that the first two RFCs above be integrated under the umbrella of the third one, which is the most sophisticated IMO. The reasoning is that it is difficult to set global standards/values for optimisation, e.g., that all content transferred by Bitswap is compressed, or that all applications use the same block/chunk size. These decisions depend largely on the application, and our approach should be to provide applications with all the knobs they need to optimise for their own case.
Content Discovery RFCs
RFC|BB|L1-04: Track WANT Message for future queries
RFC|BB|L1-02: TTLs for rebroadcasting WANT messages
These RFCs go towards the content routing approach. I suggest that we do implement and evaluate their performance, but we should target situations where Bitswap is used within the remit of some specific application, i.e., not globally across, say, the entire IPFS network. My fear is that doing content discovery with a simple protocol won't scale as the network size increases, and we'll end up re-inventing the wheel by optimising towards directions that DHTs or pubsub protocols handle out of the box.
Consider the following: assume each node has ~1k connections. With a TTL of 1, if every node forwards a WANT message to all of its connections, we effectively flood the network at its current size -- the message will be propagated to 1M nodes (more than the entire IPFS network right now). It might actually be OK to do flooding at the current network size in order to improve content discovery performance (this would be something cool to evaluate and propose), but it won't take us far as the network grows (i.e., too much overhead).
Furthermore, the evaluation here needs to take into account the popularity of content. For instance, we should not try to optimise for content that has only one copy in the entire network. There is a tradeoff here between the request span we impose and the success rate this returns. In other words, given we now span across the connections of many peers, what we could do is limit the number of peers each node is reaching out to. We could envision a scheme where the requestor node is sending the request to 75% of its peers, then those nodes are sending to 50% of their peers and the final hop nodes send to 25% of their own peers. Again, these percentages can be tunable by the application. The applications are incentivised to set the values right in order to avoid overloading their own users.
On the "symmetric vs asymmetric" routing options of RFC|BB|L1-02, I would suggest that we evaluate both, as they can serve different purposes. For instance, "symmetric" routing can improve privacy out of the box (it "emulates" an onion-routing-like approach), which we should not disregard. There is the IPFS principle of not storing content into nodes that have not requested it, which we should not forget. However, we can work around it in several ways, e.g., as opt-in. We should also not forget that Bitswap is a protocol of its own, so it can possibly deviate slightly, without removing the options that IPFS actually requires.
In `process.py`, when the files used in the test-case runs are not ordered by ascending size, the figures plot the files in the order they were run. For the sake of clarity, it would be better to show file sizes in ascending order.
Running test cases using `local:exec` with over 20 instances makes the test cases freeze in the middle of the execution. Sometimes they freeze deterministically (always at the same step), and other times they freeze randomly at different stages of the test. With `local:docker` they always freeze at the beginning of the test. The test fails with a `context deadline` error. It may be a synchronization issue, but then why does it work with fewer instances?
I have installed the testbed and have been able to successfully execute `single_run.sh`. However, when I run the RFC experiments I get an error.
Below is the output of the testground daemon when executing `./run_experiment.sh rfcBBL104` (I get the same output with `rfcBBL102`) on the `master` branch:
Mar 17 00:22:52.735307 INFO handle request {"req_id": "87d7be15", "command": "run"}
Mar 17 00:23:07.579386 INFO request handled {"req_id": "87d7be15", "command": "run"}
Mar 17 00:23:07.897632 INFO worker processing task {"worker_id": 1, "task_id": "c18knqquibq6d3jphfc0"}
Mar 17 00:23:07.897825 INFO performing build for groups {"plan": "testbed", "groups": ["single"], "builder": "docker:go"}
Mar 17 00:23:07.899146 INFO found existing volume {"volume_name": "testground-goproxy-vol"}
Mar 17 00:23:07.900739 INFO container is already running {"container_name": "testground-goproxy"}
Step 1/21 : ARG BUILD_BASE_IMAGE
Step 2/21 : ARG RUNTIME_IMAGE=busybox:1.31.1-glibc
Step 3/21 : FROM ${BUILD_BASE_IMAGE} AS builder
---> 00d970a31ef2
Step 4/21 : ENV PLAN_DIR /plan
---> Using cache
---> 263fc3f92476
Step 5/21 : ENV SDK_DIR /sdk
---> Using cache
---> 352d2416dc7a
Step 6/21 : RUN rm -rf ${PLAN_DIR} ${SDK_DIR} /testground_dep_list
---> Using cache
---> 663ccc7cc614
Step 7/21 : ARG TESTPLAN_EXEC_PKG="."
---> Using cache
---> f3c150085fce
Step 8/21 : ARG GO_PROXY=direct
---> Using cache
---> fa3b8c779c2b
Step 9/21 : ARG BUILD_TAGS
---> Using cache
---> 8ca082c0ed4b
Step 10/21 : ENV TESTPLAN_EXEC_PKG ${TESTPLAN_EXEC_PKG}
---> Using cache
---> b20e2b52e338
Step 11/21 : ENV GOCACHE /go/cache
---> Using cache
---> cf47ae2f80fb
Step 12/21 : COPY /plan/go.mod ${PLAN_DIR}/go.mod
---> Using cache
---> 607110f75dc2
Step 13/21 : RUN echo "Using go proxy: ${GO_PROXY}" && cd ${PLAN_DIR} && go env -w GOPROXY="${GO_PROXY}" && go mod download
---> Running in 838dc490f5ee
Using go proxy: http://testground-goproxy:8081
go: github.com/ipfs/[email protected] requires
bazil.org/[email protected]: reading http://testground-goproxy:8081/bazil.org/fuse/@v/v0.0.0-20200117225306-7b5117fecadc.mod: 404 Not Found
server response:
go mod download -json bazil.org/[email protected]:
{
"Path": "bazil.org/fuse",
"Version": "v0.0.0-20200117225306-7b5117fecadc",
"Error": "bazil.org/[email protected]: unrecognized import path \"bazil.org/fuse\": https fetch: Get \"https://bazil.org/fuse?go-get=1\": dial tcp: lookup bazil.org: Try again"
}
Mar 17 00:23:22.658948 INFO build failed {"plan": "testbed", "groups": ["single"], "builder": "docker:go", "error": "docker build failed: The command '/bin/sh -c echo \"Using go proxy: ${GO_PROXY}\" && cd ${PLAN_DIR} && go env -w GOPROXY=\"${GO_PROXY}\" && go mod download' returned a non-zero code: 1"}
Mar 17 00:23:22.659049 ERROR doRun returned err {"err": "docker build failed: The command '/bin/sh -c echo \"Using go proxy: ${GO_PROXY}\" && cd ${PLAN_DIR} && go env -w GOPROXY=\"${GO_PROXY}\" && go mod download' returned a non-zero code: 1"}
Mar 17 00:23:22.660652 INFO worker completed task {"worker_id": 1, "task_id": "c18knqquibq6d3jphfc0"}
Mar 17 00:23:39.465684 WARN collect outputs error {"req_id": "af618db0", "err": "run ID c18knqquibq6d3jphfc0 not found with runner local:docker"}
Based on the output, I am assuming the error comes from the proxy that testground uses to speed up the fetching of Go modules. But if that is the case, the rest of you should also get this error when running the RFC experiments, right?
Until this error is fixed, I will run the experiments with the `exec:go` builder. I would appreciate it if you could confirm whether this happens to more people.
I'm running `bitswap-transfer` tests using the `master` branch of `go-bitswap` as my baseline. However, I'm seeing an unexpected jump in data exchanged after the actual file appears to have been successfully transferred.
I'm using a star topology, 1 seeder connected to each of 3 leechers and the leechers are not connected to one another. The seeder simply uploads to all of the leechers, and while that's happening I record the Bitswap ledger of each leecher with this goroutine. Then I plot the data sent to each leecher from the seeder's perspective vs. time. The plot for one run looks like:
The horizontal green line is the size of the randomly generated file that the seeder sends to each leecher. It looks like the seeder successfully uploads to all of the leechers, but then a substantial amount of data (> 50% of the original file size) is sent to all of the leechers 15-20 seconds after the file has already been sent.
Does anyone have insight into what this might be, or if I'm perhaps misunderstanding the test/missing some other detail?
I'm using the `exec:go` builder and the `local:exec` runner, and the file size in this run was a little over 1 GB. Aside from my background goroutine and setting up the star topology, my test should be about the same as the `bitswap-transfer` test on `master` in this repo.
Let me know if I can provide more info! In case they might know something here, pinging @Stebalien @whyrusleeping @daviddias @adlrocha @yiannisbot
Codifying some thoughts I conveyed to @adlrocha in our conversation about RFCBBL1201.
At a base level, I feel this is the right approach. Having a discovery layer independent of Bitswap and Graphsync is key to mixing and matching the protocols effectively. To split a request between the two protocols effectively, you have to know something about who has what.
Some things to know about selectors, especially as it relates to WANTS:
a. Absent cached information, determining whether I can serve a CID + a selector carries a non-trivial CPU + disk cost. For example, to determine whether I have CID QMxxx + a selector for the whole DAG beneath it, we must perform the traversal of the whole DAG locally. Depending on the size of the DAG, this might carry a huge cost. As a baseline, we would probably never respond to an unbounded recursive selector in a WANT list, purely for DoS protection.
b. This then raises the question: how do we efficiently determine whether we can serve a selector in a WANT? One obvious solution is to cache which selectors we've already served successfully. That probably isn't enough, and there are some additional questions we might look at:
i. trying to develop logic, absent of data, about which selectors are contained in other selectors. For example, if we have already served a selector for a whole DAG, we can safely say we can serve a selector for a subset of that DAG. That example sounds obvious, but determining the exact heuristic for this may actually be quite complicated and involve some set theory :)
ii. Perhaps we allow some kind of budgeting system for asking for selectors in WANTs + possibly a tit-for-tat system
Once we get back information about a WANT for a selector, a big topic for exploration is how to split Graphsync queries among multiple peers when multiple peers have the selector. We may be able to do this by "splitting the selector" -- i.e., traversing different branches of the DAG tree. However, it may also be more efficient to split at the block level. At the moment, selector traversal is deterministic. So if we have 3 peers, we might tell one peer that, given a selector traversal that results in blocks indexed 0...n, it should send the blocks where index modulo 3 = 0, tell another to send the blocks where index modulo 3 = 1, and the third to send the blocks where index modulo 3 = 2. This may be more efficient for a broader set of DAG structures (as opposed to tree splitting, which works best on a balanced, wide tree). There are lots of areas to explore here -- just flagging that this is a non-trivial implementation detail that does not yet exist.
In order to have all the common test parameters required to perform some of the basic tests designed in the Beyond Bitswap Evaluation Plan, the following need to be included in the testbed:
CHURN_RATE
MAX_CONNECTION
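If helpful, these parameters could be declared in the plan's manifest.toml so test cases can read them at runtime. This is only an illustrative sketch; the parameter names, types, and defaults below are assumptions, not taken from the repo.

```toml
# Hypothetical parameter declarations for the testbed's manifest.toml.
# Names, types, and defaults are illustrative only.
[testcases.params]
churn_rate     = { type = "float", desc = "fraction of instances that churn per interval", default = "0.0" }
max_connection = { type = "int", desc = "maximum simultaneous connections per node", default = "100" }
```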
The in-process IPFS nodes currently spawned in the testbed are configured to use an in-memory datastore. This is preventing us from being able to run experiments with large datasets (https://github.com/adlrocha/beyond-bitswap/blob/bcbbdfd7bdf20ea65e8d7a256bcc95272ee2059a/testbed/utils/ipfs.go#L172).
This can be fixed by using, for instance, a flat-fs datastore. I performed a simple test using a flatfs repo (without dependency injection) and it fixes the problem. Fixing this for the configurable in-process node with dependency injection needs a bit more work, so I am not going to address it right now.
The `files.ToFile()` function in `tcp.go` only works when the input is a file, not a directory. We need to include additional logic so that, when a directory is detected, we regenerate it from the received `fileNode`. The current `files` handling works only when the directory contains files directly, with no nested directories.
https://github.com/adlrocha/beyond-bitswap/blob/develop/testbed/utils/tcp.go#L104
For the option to use `random` files in the testbed, the local machine chokes when trying to generate files over 100MB. This issue may carry over to cloud environments because, in the current implementation, a single seed generates the full file. We should devise a way in which all seeds collaborate on generating non-overlapping pieces of the file so that no single seed chokes.