Comments (3)
Current progress:
- After a couple of days of learning, I was able to launch mirroring for the only existing setup, at height 74020501. My branch works with background fetching disabled (Grafana)
- With background fetching enabled, nodes crash with the error below; I will investigate it
thread 'actix-rt|system:0|arbiter:2' panicked at 'Cannot update flat head to GTrgwbrks8pjEwd6vdWvKUBkZZPVEFDyRM9cJuCGt4Qj: StorageInternalError("delta does not exist for block GTrgwbrks8pjEwd6vdWvKUBkZZPVEFDyRM9cJuCGt4Qj")', chain/chain/src/chain.rs:2294:25
stack backtrace:
0: rust_begin_unwind
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
1: core::panicking::panic_fmt
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
2: near_chain::chain::Chain::update_flat_storage_for_block
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
- Asked @marcelo-gonzalez to create a setup for height 94194482. We are discussing the possibility of merging the most recent mirror code to master.
from nearcore.
As for the best way to produce load that maximizes storage pressure, I have been able to consistently observe slight undercharging in the following way:
- Use large Sweat batches (should be in master soon: #9385)
- Reduce the shard cache size in config.json to 1MB (to simulate when the state doesn't fit in the cache)
With that, using gcloud persistent SSDs, I saw slight undercharging, but not more than compute costs account for.
You can try to make the test more extreme by adding more accounts to the Sweat contract but that will take more time.
You can also try to run one Sweat contract on every shard; let me know if you need help arranging that.
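As a rough illustration of the cache-size step above, the config.json edit could be scripted as follows. This is only a sketch: the key path `store.trie_cache.default_max_bytes` is an assumption about where the shard cache limit lives and should be checked against the nearcore version in use, and the helper names `set_nested` and `shrink_shard_cache` are made up for this example.

```python
import json
from pathlib import Path

# ASSUMPTION: location of the shard (trie) cache limit inside config.json.
# Verify this key path against your nearcore version before relying on it.
ASSUMED_KEY_PATH = ("store", "trie_cache", "default_max_bytes")

# "1MB" read as 10^6 bytes; use 1 << 20 instead if MiB is intended.
ONE_MB = 1_000_000


def set_nested(config: dict, key_path: tuple, value) -> dict:
    """Set config[k1][k2]...[kn] = value, creating intermediate dicts."""
    node = config
    for key in key_path[:-1]:
        node = node.setdefault(key, {})
    node[key_path[-1]] = value
    return config


def shrink_shard_cache(config_file: Path, max_bytes: int = ONE_MB) -> None:
    """Rewrite config.json in place with a smaller shard cache limit."""
    config = json.loads(config_file.read_text())
    set_nested(config, ASSUMED_KEY_PATH, max_bytes)
    config_file.write_text(json.dumps(config, indent=2) + "\n")
```

Calling `shrink_shard_cache(Path("config.json"))` leaves all other settings untouched; the node has to be restarted for the new limit to apply.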
Grafana board that shows constant undercharging in shard 3 for almost a week: https://nearinc.grafana.net/d/1dZGhpJ4k/blockchain-utilization?orgId=1&refresh=30m&var-chain_id=testnet-experimental&var-role=All&var-node_type=All&var-node_id=jakmeier-benchmarking-tps-validator0&var-shard_id=3&from=now-7d&to=now
I have been giving updates in the Zulip thread over the last week. Current summary:
Too small State latency
- I was testing the idea on mirrored mainnet traffic starting from height 74020501. Though peak BP time reaches 10s many times for the current protocol, avg State latency is below 100 us. More granular debug metrics show that Main Thread state reads don't reach 500 ms, which again is far below 10s. My target is 600 us, which was reached during the incidents. Consistently 6x lower latency means that storage is not a bottleneck for that test, so the improvement from background fetching was not visible.
- I tried the same experiment on the locust workload, and it has the same issue: during high load, State latency is too small.
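For reference, the 6x figure is just the ratio of the 600 us target to the roughly 100 us average observed. A minimal sketch of that check, with hypothetical function names that do not come from any nearcore tooling:

```python
def latency_headroom(target_us: float, observed_us: float) -> float:
    """How many times lower the observed State latency is than the target."""
    if observed_us <= 0:
        raise ValueError("observed latency must be positive")
    return target_us / observed_us


def storage_is_stressed(target_us: float, observed_us: float) -> bool:
    """True only if the test reaches the latency seen during the incidents."""
    return observed_us >= target_us


# Numbers from the comment: 600 us target, avg State latency of about 100 us.
headroom = latency_headroom(600, 100)
stressed = storage_is_stressed(600, 100)
```

Since the measured average is below 100 us, the real headroom is at least 6x, so the test does not reproduce the storage pressure of the incidents.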
Finally, an improvement
However, using the same debug metrics, we can clearly see that the idea reduces time spent on Main Thread State reads: without fetching, with fetching. But there is no confirmation yet that it resolves the incident.
Next steps ideas
I'll cautiously say that background fetching works. Still, we need to show that it resolves the bottleneck we had during high load. What we can do:
- I would try to understand where the actual bottleneck in the mirrored traffic is, since that is still unclear. But that's not the priority.
- Try to increase State latency for mirroring. One discrepancy with the real setup is that we start from DBCol::State storing only the latest State. In reality, State stores data for 5 epochs, which should take ~2x more space.
- Focus on measuring traffic from 94194482, which corresponds to the latest incident.
- Increase State latency for the locust setup. @jakmeier's setup shows that it is quite possible (grafana)