Comments (5)
Here are logs from another test run: nsmgr-2021-07-02T16.36.45+07.00.zip (20 MB)
This time the issue occurred after the first run of a test with 50 requests.
from deployments-k8s.
I think I have found the root cause of the issue.
Let's look at our heal:
When the heal server is asked to heal, it does so on a heal context.
This context doesn't have any timeout, only cancel.
So heal either succeeds, is stopped manually, or runs indefinitely.
Usually, the client sends Close, heal receives it, stops healing, and doesn't accept new heal requests for this connection.
However, in my tests the following sometimes happened:
- Heal starts and blocks the serialize element for this connection
- nsmgr receives Close, but it is blocked by serialize
- The heal attempt ends, Close starts propagating, but the heal server doesn't receive it, so it doesn't know it has to stop.
The culprit here is the discover server: it tries to find alive endpoints. Usually, right after we delete the endpoint, the registry still holds its record, and everything works.
In my tests I sometimes got a delay long enough for the endpoint record in the registry to expire. In that case discover doesn't find the NSE in the registry and simply doesn't propagate Close.
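The failure mode can be sketched with a toy chain. This is a simplified model, not the sdk API: `closer`, `runChain`, `discover` and `recorder` are hypothetical names, and the registry is reduced to a map keyed by connection ID.

```go
package main

import "errors"

// closer is a hypothetical, simplified chain element: it handles Close for
// a connection ID and decides whether to call next to continue the chain.
type closer func(connID string, next func(string) error) error

// runChain calls the elements in order; each element chooses whether the
// rest of the chain runs by calling (or not calling) next.
func runChain(connID string, elems ...closer) error {
	if len(elems) == 0 {
		return nil
	}
	return elems[0](connID, func(id string) error {
		return runChain(id, elems[1:]...)
	})
}

var errNotFound = errors.New("endpoint not found in registry")

// discover models the behaviour described above: if the registry record
// has expired, it returns early and Close never reaches later elements.
func discover(registry map[string]bool) closer {
	return func(connID string, next func(string) error) error {
		if !registry[connID] {
			return errNotFound // Close stops here.
		}
		return next(connID)
	}
}

// recorder stands in for any element placed after discover (assumed here
// to be where heal would learn about Close); it notes what it has seen.
func recorder(seen map[string]bool) closer {
	return func(connID string, next func(string) error) error {
		seen[connID] = true
		return next(connID)
	}
}
```

Running `runChain("expired", discover(registry), recorder(seen))` with no `"expired"` entry in `registry` returns `errNotFound` and leaves `seen` untouched, which mirrors Close being dropped before it reaches the element that needs it.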
Here are logs of this happening: logs.zip
There are a bunch of connections in these logs. You can use grep 2197eb15-19ce-49e8-ad98-d43db51896c2 to get the part of the logs that shows the issue.
This only happened to me when I created ~50 connections (or more) and nsmgr was CPU limited. Also, when I modified the logs in some way, it suddenly stopped happening (I wasn't able to figure out exactly which of my modifications affected it).
Looks like this issue is reproducible only when resources are scarce. Even then, if the client fails to close the connection before its context is canceled, heal will be stopped by the local NSMgr timeout.
There are 2 cases in which we can hit this issue:
1. Close doesn't reach NSMgr before the NSC Close context times out - #2085.
2. Close reaches NSMgr, but it fails to reach all hops in the chain before the NSC Close context times out.
In case [1] there is nothing to do: NSMgr doesn't know anything about this Close, so the timeout is the only way out.
In case [2] we can create an additional chain element to capture the Close and restart it with the NSMgr context.
@denis-tingaikin
What do you think about [2]? Do we need it, or should we treat both cases [1, 2] the same way?
I investigated how NSM behaves under high load (networkservicemesh/sdk#1031), so here are 2 additional issues for case [2]: