
Comments (33)

ZackButcher avatar ZackButcher commented on June 26, 2024

Unfortunately I don't have permissions to add labels or update assignees. cc @rshriram @kyessenov

kyessenov avatar kyessenov commented on June 26, 2024

Something to keep in mind is that the cache has to keep track of old versions to be able to stage updates across xDS. We should remove stale versions only when no proxy is utilizing it.
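
A minimal Go sketch of that bookkeeping, with hypothetical types (this is not Pilot's actual cache): each config version is reference-counted by the proxies currently using it, and only unreferenced versions are evicted.

```go
package cache

import "sync"

// VersionedCache is a hypothetical versioned config cache, not Pilot's real
// implementation. Each xDS config version is kept until no proxy references it.
type VersionedCache struct {
	mu      sync.Mutex
	configs map[string][]byte // version -> serialized xDS config
	refs    map[string]int    // version -> number of proxies still on it
}

func NewVersionedCache() *VersionedCache {
	return &VersionedCache{configs: map[string][]byte{}, refs: map[string]int{}}
}

// Put stores a new config version without evicting older ones, so updates can
// be staged across xDS while some proxies are still on a previous version.
func (c *VersionedCache) Put(version string, cfg []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.configs[version] = cfg
}

// Acquire marks a proxy as using the given version and returns its config.
func (c *VersionedCache) Acquire(version string) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refs[version]++
	return c.configs[version]
}

// Release is called once a proxy has fully moved off a version; the stale
// version is evicted only when no proxy references it anymore.
func (c *VersionedCache) Release(version string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refs[version]--
	if c.refs[version] <= 0 {
		delete(c.refs, version)
		delete(c.configs, version)
	}
}
```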

andraxylia avatar andraxylia commented on June 26, 2024

Before we make this optimization, we have to check whether it is needed: determine the breaking point, the number of pods, or the conditions that would justify it.

ZackButcher avatar ZackButcher commented on June 26, 2024

To be clear, the optimization is in the caching. Regardless of caching/performance, we need to invert config generation so that it is eager, so we're ready for the push-based world of xDS v2.
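
To illustrate the inversion, a rough Go sketch under assumed placeholder names (Proxy, Config, and OnRegistryChange are hypothetical, not Pilot's real APIs): config is regenerated eagerly when the registry changes, so a push (or the next poll) serves a precomputed snapshot instead of building config lazily per request.

```go
package eager

import "sync"

// Hypothetical types; the registry hook and generate callback are placeholders.
type Proxy struct{ ID string }

type Config struct{ Listeners, Clusters, Routes []byte }

type Cache struct {
	mu      sync.RWMutex
	configs map[string]Config // proxy ID -> pre-generated config
}

// OnRegistryChange would be invoked from a registry watch (e.g. an informer).
// Instead of computing config on each Envoy poll, it eagerly regenerates every
// proxy's config so a v2-style push has something ready to send.
func (c *Cache) OnRegistryChange(proxies []Proxy, generate func(Proxy) Config) {
	fresh := make(map[string]Config, len(proxies))
	for _, p := range proxies {
		fresh[p.ID] = generate(p)
	}
	c.mu.Lock()
	c.configs = fresh
	c.mu.Unlock()
	// A push loop would now stream the new config to connected proxies.
}

// Get serves a poll (or a push worker) from the pre-generated snapshot.
func (c *Cache) Get(proxyID string) (Config, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cfg, ok := c.configs[proxyID]
	return cfg, ok
}
```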

kyessenov avatar kyessenov commented on June 26, 2024

To @andraxylia's point, we're stuck with v1 until we can prove v2 performs better. The work will have to happen in a separate Envoy adapter module (proxy/envoy2), since this is a large effort.

ZackButcher avatar ZackButcher commented on June 26, 2024

How are we sequencing things? I thought we had to move to v2 APIs to address the 404 issues, which means we need to land updated config handling in 0.3 as well.

andraxylia avatar andraxylia commented on June 26, 2024

My point was about avoiding premature optimization, specifically around the caching/invalidation of the config. I would defer this to when/if it is needed. We still need to prepare the config for each Envoy instance.

We are not stuck with v1: we want to move to v2, and we do not need to prove that v2 performs better than v1, just that the pilot implementation of v2 is production ready. We should also have a way to switch between v1 and v2 and revert if needed, so @kyessenov's suggestion is relevant.

Though ADS will solve the 404 issues, they can also be handled separately, in case v2 is not ready at production-grade quality for 0.3.

Sequencing-wise, I thought #1257 would cover everything, but if we need multiple issues and multiple owners, let's use an epic.

kyessenov avatar kyessenov commented on June 26, 2024

The work has moved to a new envoy org project, as there is a lot of complexity in the push/streaming interface. It's definitely far more than a single issue/epic.
I can't vouch that we can reach parity with v1, since there are some serious semantic differences (e.g. the deprecation of bind_to_port) that need to be reconciled.

rshriram avatar rshriram commented on June 26, 2024

@andraxylia how do you envision the 404s being solved with v1? I would be interested in exploring that solution.

costinm avatar costinm commented on June 26, 2024

There are several ways to solve the 404. The most direct is to fix Envoy so it returns 503 as expected. AFAIK there is no technical reason to return a 404 if a route exists but the cluster is not yet available.

The solution I am proposing is to change the proxy agent to download the full config before starting Envoy and act as a local caching proxy. The primary goal of this is to handle cases where Pilot becomes unavailable, in particular for hybrid/on-prem deployments.

Obviously, such a cache could directly expose a v2 API - converting from v1 to v2 is apparently what Envoy does internally as well, if I understood it correctly. And the code can be extremely simple - it only depends on gRPC, HTTP/JSON, and the interfaces defined by Envoy.

It is also OK to start with a v1 cache in the agent - the only special behavior is to not return routes or listeners until all the clusters/routes they depend on have been downloaded. On top of this, it may allow some local overrides and/or templates.
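
A rough Go sketch of that gating behavior, with hypothetical names (this is not the real Istio agent): the cache hands routes to Envoy only once every cluster they reference has been downloaded.

```go
package agentcache

import "sync"

// LocalCache is a hypothetical agent-side v1 cache. Its one special behavior
// is to hold back routes until every cluster they reference is cached.
type LocalCache struct {
	mu        sync.RWMutex
	clusters  map[string][]byte // cluster name -> cached CDS entry
	routes    []byte            // cached RDS response
	routeRefs []string          // cluster names referenced by the routes
}

// SetClusters is filled in by the poller that downloads the full config from
// Pilot before Envoy is started.
func (c *LocalCache) SetClusters(cds map[string][]byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.clusters = cds
}

func (c *LocalCache) SetRoutes(rds []byte, referencedClusters []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.routes = rds
	c.routeRefs = referencedClusters
}

// Routes returns the cached RDS response only once every referenced cluster is
// present, so Envoy never sees a route pointing at a cluster it doesn't have.
func (c *LocalCache) Routes() ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, name := range c.routeRefs {
		if _, ok := c.clusters[name]; !ok {
			return nil, false // dependent cluster not downloaded yet; hold back
		}
	}
	return c.routes, true
}
```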

Even if Envoy is fixed to not return 404 on a config miss, we will likely still need to deal properly with cases where Pilot becomes unavailable; cascading failures can be pretty bad.

As a side note, the local cache + proxy could also greatly simplify push and reduce the complexity of Pilot.

kyessenov avatar kyessenov commented on June 26, 2024

I think using the agent to serve xDS is an interesting solution for stickiness of the updates. The agent can sequence the polled updates so that Envoy always has an internally consistent view of the config. The logic to sequence updates is required in the v2 implementation anyway, so this just moves it closer to the proxy.

The downside is the increased complexity in the agent. We've been fighting to simplify the agent, since it's attached to every single pod.

Another tricky bit is the lack of ACKs/NACKs in v1: in the v1 API, the server never receives confirmation that an update has been applied.
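
For reference, a simplified Go mirror of the v2 fields that provide that confirmation (adapted from envoy/api/v2/discovery.proto; the error detail is flattened to a plain string here): the version_info/response_nonce pair is what lets the server treat a request as an ACK or a NACK.

```go
package xdsv2

// DiscoveryRequest mirrors only the ACK/NACK-relevant fields of the xDS v2
// DiscoveryRequest; the real definition lives in envoy/api/v2/discovery.proto.
type DiscoveryRequest struct {
	VersionInfo   string   // version of the last config the proxy accepted
	ResourceNames []string // resources the proxy is subscribed to
	TypeURL       string   // e.g. "type.googleapis.com/envoy.api.v2.Cluster"
	ResponseNonce string   // nonce of the response this request acknowledges
	ErrorDetail   string   // non-empty when the proxy NACKs (rejects) the update
}

// Accepted reports whether a request ACKs the response the server last sent:
// a matching nonce with no error means the update was applied. This is exactly
// the confirmation the v1 REST polling API never gives the server.
func Accepted(req DiscoveryRequest, sentNonce string) bool {
	return req.ResponseNonce == sentNonce && req.ErrorDetail == ""
}
```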

rshriram avatar rshriram commented on June 26, 2024

It also takes us back to the state we were in 8 months ago, when we did this in the agent. I don't see any advantage to double caching, in both the proxy agent and Envoy. Envoy automatically falls back to the cached version if Pilot is unavailable; that has been our design since day 1. So the argument about helping in hybrid/on-prem cases doesn't seem convincing. I also don't see how it would reduce complexity in Pilot (if any), because we don't do any sequencing in Pilot. Implementing the v2 API in the agent seems suboptimal; put differently, I see no use for serving the v2 APIs to Envoy this way in large-scale installations.

Finally, not everyone will use the agent, because they could be running Envoy in environments (Cloud Foundry, for example) where there is no accompanying agent. This could be a temporary solution but not a permanent one.

@mattklein123 / @htuch do you think @costinm's concerns are real - about Envoys failing upon failure of the management plane (like Pilot, the one at Lyft, or what you are building at Google)? Do you have double caching, with a local agent and a central one? I don't find it necessary.

htuch avatar htuch commented on June 26, 2024

@rshriram no need for this local cache. As you correctly state, if a management update fails, the last valid config remains latched and in effect.

mattklein123 avatar mattklein123 commented on June 26, 2024

Agreed, there is no reason to cache anything on the node.

re: 404 vs. 503, there are legitimate reasons why someone may want to return 404 in this scenario. Having an option to return 503 instead when the cluster is not found is trivial; someone just needs to do the work (envoyproxy/envoy#1820).

costinm avatar costinm commented on June 26, 2024

One extra note on the subject: in a hosted environment with a very large number of domains (like 'custom domains' in the case of a hosted app), keeping all of the routing/listeners in Pilot's memory becomes impossible, and it's not really needed, since Envoy already caches the stable config (and the agent combined with some memcache could also do it at a local level).

So long term I don't think the current design of keeping the entire mesh config cached in Pilot's memory is viable; it will need to be sharded and locally cached/distributed, and Envoy will probably need to support more 'on demand' loading of config and certs.

The use case is something like an app that allows users to bring their own domains (something like hosted Gmail, for example): a single backend, but with 1M+ routes and certs.

rshriram avatar rshriram commented on June 26, 2024

While I think Envoy returning 503 plus the 'liveness' check at startup is the right solution, I don't think it takes care of the 503s or 404s fully. Every time you change a route rule you will hit this, and the chance of hitting it depends on your polling interval. You can solve this by having a temporary local cache, but that's a short-term solution until the v2 API lands, at which point the agent becomes useless.

The Envoy crash happens if we restart it too frequently, which is caused by auth certificates changing rapidly. I think it's a question for the auth folks, and that too will become moot with the v2 API. My two cents. We could also totally rearchitect Pilot (in my mind this solution is basically what we had in Amalgam8 a year ago), finish it up, and call it production ready by the 0.3 time frame. Whether this would be more stable than the v2 implementation is up for debate.

andraxylia avatar andraxylia commented on June 26, 2024

@mattklein123 would you agree that adding support in Envoy for snapshotting/checkpointing the current running config is the right thing to do, to allow the data plane to function correctly if the control plane is partitioned and Envoy needs to restart or hot-restart on the same host? There would be no control plane to provide clusters and listeners, so traffic would be lost.

mattklein123 avatar mattklein123 commented on June 26, 2024

> @mattklein123 would you agree that adding support in Envoy for snapshotting/checkpointing the current running config is the right thing to do, to allow the data plane to function correctly if the control plane is partitioned and Envoy needs to restart or hot-restart on the same host? There would be no control plane to provide clusters and listeners, so traffic would be lost.

No. I think this is complexity that is not actually needed and is more likely to lead to an outage than the situation it is trying to cover for.

htuch avatar htuch commented on June 26, 2024

Also, Envoy may be deployed in a diskless scenario (or with read-only storage), so there inherently needs to be a story for configuration recovery that doesn't rely on this.

I think a useful first step in this discussion is to get a clear picture of (and consensus on) which failure scenarios Istio considers important. Various ones have been suggested; can we get a doc written?

costinm avatar costinm commented on June 26, 2024

Will move the discussion on how to 'fail open', HA, and avoiding cascading failures to a separate document; it's a different topic.

On the 404: I think it is clear that 503-if-cluster-missing and liveness checks will mitigate some of the problems, and both will be needed even with v2. While v2 may return a full and consistent config, it doesn't avoid the interval between startup and getting the initial config, nor the case where a cluster is simply missing due to replication delays.

The main remaining issue is the case Shriram describes, which I don't understand and was not able to reproduce if the service is created in advance. Can you provide more details/an example? And why would the 404 happen on 'weighted routing' if all the services and clusters are created and propagated before the routing rule is applied? The bookinfo samples are targeted at demos and typically don't wait for service/cluster propagation before applying route rules, but in a production environment it is quite reasonable to not route traffic to a new service immediately.

rshriram avatar rshriram commented on June 26, 2024

> While v2 may return a full and consistent config, it doesn't avoid the interval between startup and getting the initial config,

That needs to be solved with health checks.

> nor the case where a cluster is simply missing due to replication delays.

That's the whole point of ADS, where we get to sequence updates to Envoy over a single connection.

Getting a 404 for a newly created service is acceptable. After all, the configuration has to propagate to all callers, and if outside clients start accessing it before that's done, a 404 is what a server sends. In other words, end users are actually used to this scenario. What they are not used to is getting a 404 when they split an existing service into multiple clusters for the purpose of version routing. And this is the bigger issue we need to solve. We would not be undertaking such a big effort just for the health check or initial startup thing :).

#926 is the first occurrence of this issue (this user is trying to deploy Istio to production :) ). The problem only compounds when we increase the refresh interval from 1s to 60s. While a partial solution is to have a high refresh rate, it burns a lot of CPU and tanks performance (evidence: #1237, where we found "First tested with Istio 0.1.6. Big performance jump when refresh interval was increased from 10ms to 60s.").

costinm avatar costinm commented on June 26, 2024

I see the comment that "Yes, a route rule can introduce a new cluster (for a special version, for example)". I'm not familiar enough with the weighted routing rules; do you have any pointers on where the new cluster gets introduced? If we can decouple the creation of clusters/services from route changes, we would have a pretty good fix for this.

It seems we agree that health checks / liveness checks are needed for both v1 and v2, and that some 503s when a new service is just added are OK.

A short refresh rate is not a solution: everything should work with 60s or more, with the understanding that all route changes must wait for the refresh interval to avoid 503s.

rshriram avatar rshriram commented on June 26, 2024

New clusters are created because of route changes (version split). An existing service's cluster is split into multiple newer clusters, and traffic is split by weights. This involves changing the route config to use weighted clusters and pushing those new clusters to Envoy. This is where the 404s/503s are occurring, and that's the whole reason for the v2 push API.
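
A hedged Go sketch of the ordering such a push would enforce (pushCDS, awaitACK, and pushRDS are placeholders, not real Pilot or Envoy APIs): the new version clusters go out and are acknowledged before the weighted route that references them.

```go
package ordering

import (
	"context"
	"fmt"
)

// Split describes a hypothetical version split for one service.
type Split struct {
	Service  string
	Versions map[string]int // version label -> traffic weight percentage
}

// ApplyVersionSplit pushes the new per-version clusters before the weighted
// route that references them, so Envoy never routes to a cluster it does not
// yet know about (the source of the 404s/503s under v1 polling).
func ApplyVersionSplit(
	ctx context.Context,
	s Split,
	pushCDS func(ctx context.Context, clusters []string) error,
	awaitACK func(ctx context.Context) error,
	pushRDS func(ctx context.Context, service string, weights map[string]int) error,
) error {
	clusters := make([]string, 0, len(s.Versions))
	for v := range s.Versions {
		clusters = append(clusters, fmt.Sprintf("%s|%s", s.Service, v))
	}
	// 1. New clusters first, so Envoy knows about them before any route uses them.
	if err := pushCDS(ctx, clusters); err != nil {
		return err
	}
	// 2. Wait for the ACK (the signal v2 provides and v1 polling does not).
	if err := awaitACK(ctx); err != nil {
		return err
	}
	// 3. Only now push the weighted route that splits traffic across the versions.
	return pushRDS(ctx, s.Service, s.Versions)
}
```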

ldemailly avatar ldemailly commented on June 26, 2024

We should probably consolidate the various 404/503 open issues and move this to istio/istio:

istio/istio#1038
istio/istio#1041
#1422

kyessenov avatar kyessenov commented on June 26, 2024

Why would you not have labels on VMs? k8s, Consul, Eureka, and whatnot all have labels on workloads. An API change is a drastic measure to fix this problem. The actual solution will happen in the v2 library under the envoy org, once we work out the ADS protocol. That is the solution that will scale, be backward/future compatible, and fix all the other "inconsistency" issues that are possible with v1. Once it's ready, we'll replace the Pilot core with it, and we'll only need to translate Istio config into data plane config to complete the picture.

kyessenov avatar kyessenov commented on June 26, 2024

Issue moved to istio/istio #1370 via ZenHub
