Comments (17)

cheatfate commented on July 18, 2024

This issue should be fixed in unstable.

from nimbus-eth2.

cheatfate commented on July 18, 2024

Sure, thanks, any explanation for what was actually happening? I see nothing in the PR or the commit.

The issue was a small race condition that exits the accept loop, so no more connections are accepted or processed once the bug occurs. It has existed for more than 7 months, but it looks like it is only revealed when the host is very busy.
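
A minimal sketch of that failure mode, in Python rather than Nim and not taken from the actual fix. The thread does not describe the race itself, so `FakeListener`, `TransientAcceptError`, and both loop functions are hypothetical names that only illustrate the symptom: once an accept loop exits on an error it should have survived, later connections are never picked up.

```python
from collections import deque


class TransientAcceptError(Exception):
    """Hypothetical stand-in for a recoverable failure while accepting."""


class FakeListener:
    """Hypothetical listener that replays a fixed sequence of events.

    Each event is either a connection label or an exception to raise.
    accept() returns None once the sequence is exhausted (listener closed).
    """

    def __init__(self, events):
        self._events = deque(events)

    def accept(self):
        if not self._events:
            return None
        event = self._events.popleft()
        if isinstance(event, Exception):
            raise event
        return event


def fragile_accept_loop(listener):
    """Leaves the loop on the first transient error (the buggy shape)."""
    handled = []
    while True:
        try:
            conn = listener.accept()
        except TransientAcceptError:
            break  # bug: nothing accepts connections after this point
        if conn is None:
            break
        handled.append(conn)
    return handled


def resilient_accept_loop(listener):
    """Skips the failed accept and keeps serving until the listener closes."""
    handled = []
    while True:
        try:
            conn = listener.accept()
        except TransientAcceptError:
            continue  # drop this attempt, keep accepting
        if conn is None:
            break
        handled.append(conn)
    return handled
```

In the fragile version a single failure ends the loop while the listening socket stays open, so clients would likely see timeouts rather than refused connections, which is consistent with the curl output reported in this thread.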

jakubgs commented on July 18, 2024

We might be seeing some kind of issue with freeing sockets:

image

https://grafana.infra.status.im/d/QCTZ8-Vmk/single-host-dashboard?orgId=1&refresh=1m&var-host=linux-01.ih-eu-mda1.nimbus.prater&from=1706382141252&to=1706554941252

jakubgs commented on July 18, 2024

It appears the unstable node on linux-01.ih-eu-mda1.nimbus.mainnet for Mainnet has the same issue:

image

Though the socket graph looks a bit different:

image

jakubgs commented on July 18, 2024

Same behavior again:

[email protected]:~ % curl --max-time 5 0:9302/eth/v1/node/version                    
curl: (28) Operation timed out after 5000 milliseconds with 0 bytes received
[email protected]:~ % curl --max-time 10 0:9302/eth/v1/node/version
{"data":{"version":"Nimbus/v24.1.1-0e63f8-stateofus"}}%  

Sometimes it does work tho.

jakubgs commented on July 18, 2024

Not sure if this socket usage growth is at fault, possibly:

image

jakubgs commented on July 18, 2024

It's not the open files limit, since the number currently in use is low:

[email protected]:~ % sudo ls /proc/$(systemctl show --property MainPID --value beacon-node-mainnet-unstable-01)/fd/ | wc -l
385

[email protected]:~ % s cat beacon-node-mainnet-unstable-01 | grep LimitNOFILE                                              
LimitNOFILE=16384
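
The same comparison can be sketched in Python, assuming a Linux /proc filesystem. The helper names are made up for illustration, and reading another process's /proc entries typically needs root, as with the sudo call above.

```python
import os


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (Linux only, needs read permission)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_nofile_soft_limit(limits_text):
    """Extract the soft 'Max open files' value from /proc/<pid>/limits text.

    Returns None when the limit is 'unlimited' or the line is missing.
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            fields = line[len(prefix):].split()
            if not fields or fields[0] == "unlimited":
                return None
            return int(fields[0])
    return None


def fd_usage_ratio(open_fds, soft_limit):
    """Fraction of the descriptor limit currently in use."""
    return open_fds / soft_limit
```

With the figures reported here, `fd_usage_ratio(385, 16384)` is a little over 2%, so descriptor exhaustion is an unlikely cause.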

jakubgs commented on July 18, 2024

It doesn't seem to be a firewall issue because I can see the packet attempting to open the connection pass through correctly:

PACKET: 2 42017d99 IN=lo LOOPBACK SRC=127.0.0.1 DST=127.0.0.1 LEN=60 TOS=0x0 TTL=64 ID=17113DF SPORT=33346 DPORT=9304 SYN 
 TRACE: 2 42017d99 raw:PREROUTING:rule:0x2:CONTINUE  -4 -t raw -A PREROUTING -p tcp -m tcp --dport 9304 -j TRACE
 TRACE: 2 42017d99 raw:PREROUTING:return:
 TRACE: 2 42017d99 raw:PREROUTING:policy:ACCEPT 
PACKET: 2 42017d99 IN=lo LOOPBACK SRC=127.0.0.1 DST=127.0.0.1 LEN=60 TOS=0x0 TTL=64 ID=17113DF SPORT=33346 DPORT=9304 SYN 
 TRACE: 2 42017d99 filter:INPUT:rule:0xc:ACCEPT  -4 -t filter -A INPUT -i lo -m comment --comment "loopback interface" -j ACCEPT

jakubgs commented on July 18, 2024

I stopped all the other nodes on linux-01 on Mainnet to see if freeing up CPU and RAM would make a difference, but it didn't:

[email protected]:~ % c --max-time 10 0:9304/eth/v1/node/version
curl: (28) Operation timed out after 10001 milliseconds with 0 bytes received

So I don't think it's related to high system load or low memory.

jakubgs commented on July 18, 2024

We identified that the Prater node doesn't have the --rest-allow-origin="*" flag, so we've disabled it on mainnet.

Doubt it will do anything though.

jakubgs commented on July 18, 2024

I cannot see any public REST API endpoint issues currently. It is possible the removal of --rest-allow-origin="*" did fix the issue for the publicly exposed nodes. But that does not actually address the underlying issue with the REST API endpoint.

We will continue to monitor for API problems like this, but this particular issue is not resolved, only mitigated on the infra side.

jakubgs commented on July 18, 2024

Actually, scratch that. The testing node on geth-02.ih-eu-mda1.nimbus.holesky is now showing the same symptoms:

[email protected]:~ % c --max-time 5 0:9302/eth/v1/node/syncing
curl: (28) Operation timed out after 5001 milliseconds with 0 bytes received
[email protected]:~ % c --max-time 10 0:9302/eth/v1/node/syncing
curl: (28) Operation timed out after 10000 milliseconds with 0 bytes received
[email protected]:~ % c --max-time 30 0:9302/eth/v1/node/syncing
curl: (28) Operation timed out after 30001 milliseconds with 0 bytes received

And it is exposed publicly:

  'geth-02.ih-eu-mda1.nimbus.holesky': # 1 each
    - { branch: 'stable',   start:     0, end:     1,  el: 'geth',      vc: true  }
    - { branch: 'testing',  start:     1, end:     2,  el: 'geth',      vc: false, public_api: true }
    - { branch: 'unstable', start:     2, end:     3,  el: 'geth',      vc: false }
    - { branch: 'libp2p',   start:     3, end:     4,  el: 'geth',      vc: false }

https://github.com/status-im/infra-nimbus/blob/f446bd309ff4a4e004fb3c42c6e0bc9cbf0b0a0b/ansible/vars/layout/holesky.yml#L11-L15

And it does not have the --rest-allow-origin flag:

[email protected]:~ % s cat beacon-node-holesky-testing | grep rest-
    --rest-address=0.0.0.0 \
    --rest-port=9302 \
    --rest-max-body-size=16384 \
    --rest-max-headers-size=128 \

Which suggests that the --rest-allow-origin flag is a red herring.

jakubgs commented on July 18, 2024

Another instance of the issue on linux-02.ih-eu-mda1.nimbus.mainnet node beacon-node-mainnet-testing-01:

[email protected]:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/version | jq -c; done
{"data":{"version":"Nimbus/v24.2.0-742f15-stateofus"}}
{"data":{"version":"Nimbus/v24.2.0-742f15-stateofus"}}
curl: (28) Connection timed out after 5001 milliseconds
{"data":{"version":"Nimbus/v24.2.0-742f15-stateofus"}}
{"data":{"version":"Nimbus/v24.2.0-81b849-stateofus"}}
{"data":{"version":"Nimbus/v24.2.0-81b849-stateofus"}}
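
For repeated checks, the loop above could be written as a small Python probe. This is a sketch, not part of the Nimbus tooling: `probe_ports` and `fetch_body` are hypothetical helpers, and the default host and 5-second timeout are assumptions based on the curl commands in this thread.

```python
import urllib.request


def fetch_body(url, timeout):
    """Return the response body as text; raises OSError subclasses on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode()


def probe_ports(ports, timeout=5.0, host="127.0.0.1", fetch=fetch_body):
    """Query /eth/v1/node/version on each port and record the outcome.

    Returns {port: ("ok", body)} or {port: ("failed", error_text)}.
    The fetch parameter is injectable so the logic can run without a node.
    """
    results = {}
    for port in ports:
        url = f"http://{host}:{port}/eth/v1/node/version"
        try:
            results[port] = ("ok", fetch(url, timeout))
        except OSError as exc:
            # URLError, connection errors, and socket timeouts are all
            # OSError subclasses, so a hung node is reported, not raised.
            results[port] = ("failed", repr(exc))
    return results
```

Calling `probe_ports(range(9300, 9306))` on the host would cover the same six ports as the shell loop, and any port whose entry starts with "failed" is a candidate for the stuck accept loop.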

jakubgs commented on July 18, 2024

Sure, thanks, any explanation for what was actually happening? I see nothing in the PR or the commit.

apentori commented on July 18, 2024

Almost all beacon nodes on the Holesky network built from the branch nim-libp2p-auto-bump-unstable showed this issue this weekend:

Get "http://localhost:9304/eth/v1/node/version": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

cheatfate commented on July 18, 2024

Almost all beacon nodes on the Holesky network built from the branch nim-libp2p-auto-bump-unstable showed this issue this weekend:

Get "http://localhost:9304/eth/v1/node/version": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Could you please provide a commit hash or something similar? The branch is moving all the time, and right now it's impossible to verify whether it had the fixes from unstable or not.

jakubgs commented on July 18, 2024

Thanks for the explanation and more importantly the fix.

I reviewed some of our alerts and I do think this has been resolved. If need be we can always reopen.
