Comments (19)

davemoore- commented on June 12, 2024

@austince

What's the best way to look into the potential changes on the ES plugin handling side?

I'm going to approach this from a few angles:

  • Creating a sample plugin whose only function is to create an index using the given NodeClient, ruling out other variables that zentity might be introducing, and seeing whether the behavior still appears in the stress test.
  • Reviewing how official plugins use the NodeClient, to see if and how zentity implements the client differently.
  • Possibly introducing debug logging to help isolate the point of failure in zentity. This may end up being something I merge into the main plugin regardless, as a byproduct of the troubleshooting.
  • Profiling the behavior of the plugin (e.g. using VisualVM) as you have been doing.

Creating the sample plugin would be a good thing to try before reaching out to discuss.elastic.co or the elasticsearch repo, because it could help us isolate the issue further and frame a more actionable question. It would also give them something they can use to reproduce the issue easily, if needed.
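
To make that first angle concrete, here's a rough sketch of what that sample handler could look like (hypothetical class name and route; the Plugin class that registers the handler is omitted). Its only job is to create an index through the given NodeClient, resolving the response with a blocking call to mirror how zentity uses the client:

```java
// Hypothetical sample handler: creates one index via the NodeClient and nothing else,
// so any instability it triggers can't be blamed on the rest of zentity's logic.
import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.client.node.NodeClient;
import org.elasticsearch.rest.BaseRestHandler;
import org.elasticsearch.rest.BytesRestResponse;
import org.elasticsearch.rest.RestRequest;
import org.elasticsearch.rest.RestStatus;

import java.util.List;

public class SampleCreateIndexHandler extends BaseRestHandler {

    @Override
    public String getName() {
        return "sample_create_index_handler";
    }

    @Override
    public List<Route> routes() {
        return List.of(new Route(RestRequest.Method.POST, "/_sample_plugin/_create"));
    }

    @Override
    protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) {
        // Resolve the response inline (blocking), mirroring zentity's usage of the client.
        CreateIndexResponse response = client.admin().indices()
                .create(new CreateIndexRequest("sample-plugin-index")).actionGet();
        return channel -> channel.sendResponse(new BytesRestResponse(RestStatus.OK,
                "application/json", "{\"acknowledged\":" + response.isAcknowledged() + "}"));
    }
}
```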

davemoore- commented on June 12, 2024

Pushed a commit to the feature-async-actions branch (diff), which replaces blocking search calls with async search calls in the Job class.

Things are looking promising (async pun?):

  • Integration tests are passing
  • Manual tests are passing with assertions enabled in the Elasticsearch JVM
  • Performance of resolution jobs is at least as good as before

The solution was much simpler than I anticipated. Thanks to this recent blog post for showing a pattern that uses CompletableFuture in Elasticsearch. That saved me from refactoring the entire Job class. I almost want to rework my prior commits (the REST handlers) to use this pattern. Not a big priority, though. They work and we need to ship out this fix soon.
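
For anyone following along, the general shape of that pattern (a hedged sketch with invented names, not the exact code from the commit or the blog post) is to adapt the client's callback API into a CompletableFuture, so dependent searches can be chained without blocking a transport thread:

```java
import java.util.concurrent.CompletableFuture;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.node.NodeClient;

class AsyncSearchHelper {

    // Adapt the callback-style client API into a CompletableFuture.
    static CompletableFuture<SearchResponse> search(NodeClient client, SearchRequest request) {
        CompletableFuture<SearchResponse> future = new CompletableFuture<>();
        client.search(request, ActionListener.wrap(future::complete, future::completeExceptionally));
        return future;
    }

    // Dependent queries can then be composed without blocking.
    static CompletableFuture<SearchResponse> twoHops(NodeClient client,
                                                     SearchRequest first,
                                                     SearchRequest second) {
        // In the real Job, the follow-up query would be derived from firstResponse.
        return search(client, first).thenCompose(firstResponse -> search(client, second));
    }
}
```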

austince commented on June 12, 2024

I had some time to read through and leave some notes on the commits. Let me know what you think!

davemoore- commented on June 12, 2024

++ Sounds good. It's ready at PR #67.

austince commented on June 12, 2024

[Image: visual-vm-heap-growth]

Here's the local Docker cluster profiled by VisualVM after around 20 alternating setup and delete-index requests. I'm not very experienced in profiling Java apps, but it looks like each request allocates quite a bit on the heap and leaves it behind, and then perhaps the "hanging" happens during garbage collection. Something about going from ~100MB to ~350MB for those requests feels funny, like something is hanging around.

davemoore- commented on June 12, 2024

@austince I haven't seen this error message in Elasticsearch before:

after [...] publication of cluster state version [...] is still waiting for ...

I found that the message originates from a method that Elasticsearch nodes use to publish cluster state. Here's the documented behavior for cluster state publication. The 10s and 30s durations reported by your log messages are really long, especially for a smaller cluster, and so my initial speculation is that this is caused by either a cluster misconfiguration or a bottleneck on the network or disk. I'd need to reproduce the issue and dig a bit to be sure.

The first thing I would try is to rule out the plugins. If you remove the zentity plugin, and then repeatedly create/remove indices or perform other similar operations that zentity performs, do you still see the behavior?

austince commented on June 12, 2024

I was not able to reproduce this when testing on a local Docker cluster without the plugin installed, nor on a cluster with the plugin installed but not used. I tried to reproduce the issue by sending repeated create/delete index requests.

I created a small "stress test" that can reproduce the issue. It's in the attached zip (stress-test-cluster.sh), along with the docker-compose.yml used for testing, which may itself be a source of cluster misconfiguration. I don't think the issue lies in the local network or disk, but I could be wrong there.

zentity-stress-test.zip
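
For readers who don't want to open the zip, here's a rough Java analogue of the loop in stress-test-cluster.sh (assuming the default _zentity/_setup endpoint, the .zentity-models index, and a cluster on localhost:9200; the iteration count is arbitrary, and the real script is the one in the zip):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ZentityStressLoop {

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String base = "http://localhost:9200";
        for (int i = 0; i < 20; i++) {
            // Create the entity model index via the plugin's setup endpoint ...
            send(http, HttpRequest.newBuilder(URI.create(base + "/_zentity/_setup"))
                    .POST(HttpRequest.BodyPublishers.noBody()).build());
            // ... then delete it, so every iteration forces a cluster state change.
            send(http, HttpRequest.newBuilder(URI.create(base + "/.zentity-models"))
                    .DELETE().build());
        }
    }

    private static void send(HttpClient http, HttpRequest request) throws Exception {
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(request.method() + " " + request.uri() + " -> " + response.statusCode());
    }
}
```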

Here's the profiling when running the stress test:

[Image: zentity-stress-test]

The first spike is during non-zentity requests, and the more drawn-out spikes are from repeated zentity setup (and then delete-the-zentity-model-index) requests. The observed delays seem to roughly line up with the 30s timeout.

It also looks like the issue gums up the TransportService, which I think controls the threads that the plugin handlers run on? I'm not really sure, though; I'm just going by some logged thread names. If that is the service responsible for publishing cluster state changes, it would tie into the timeouts we're seeing in that area.

elasticsearch    | {"type": "server", "timestamp": "2021-01-22T21:43:36,620Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "Received response for a request that has timed out, sent [19614ms] ago, timed out [4604ms] ago, action [cluster:monitor/nodes/stats[n]], node [{es-data-2}{Xjwq8qUrReyh5VUi21l3aQ}{6HLd44y9Qriva3OcIXukKg}{172.22.0.2}{172.22.0.2:9300}{dir}], id [4037]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }

davemoore- commented on June 12, 2024

Thanks @austince, this is good to know about. Looking at the contents of your linked stress-test-cluster.sh, the actions that produce this behavior are ones that change cluster state (e.g. adding and removing indices), which lines up with the log messages about cluster state publication timeouts. An action that shouldn't affect cluster state is a search (e.g. resolution), so it will be good to confirm that the issue doesn't appear for resolution jobs and model GETs.

I'll have to dig in and see what the issue might be. Thanks for sharing the stress test package!

davemoore- commented on June 12, 2024

I narrowed down a couple things after using your stress test scripts:

  • The issue only occurs on Elasticsearch >= 7.9.0.
  • The issue occurs for any _zentity endpoint, including resolution requests which are read-only operations.
  • The issue occurs after only a half-dozen or so requests to any endpoint.

I don't see anything in the Elasticsearch 7.9.0 release notes that explains the change in behavior. The notes list a known memory leak ("manifests in Elasticsearch when a single document is updated repeatedly with a forced refresh"), but the issue we observe occurs during any operation from the plugin. I suspect something changed in the behavior of the Elasticsearch Java APIs that zentity uses (e.g. BaseRestHandler, RestChannelConsumer, or NodeClient), or in how the Elasticsearch server handles plugins.

austince commented on June 12, 2024

Awesome, thanks for digging in Dave. We'll test out ES 7.9.0 in Kubernetes on our end as well -- that's a good enough workaround for now. What's the best way to look into the potential changes on the ES plugin handling side? Should I open up a discussion on discuss.elastic.co?

davemoore- commented on June 12, 2024

@austince I was able to reproduce the issue with a simpler plugin.

I've submitted a bug report to the elasticsearch team with a modified version of your stress test. I labeled it as a "bug" because our usage of the NodeClient has worked with plugins since at least Elasticsearch 5.x, and the expected behavior stopped working after 7.9.0 without an explanation in the release notes. Let's see what the team responds with.

davemoore- commented on June 12, 2024

Update from the bug report:

The issue appears to be rooted in the use of blocking calls in zentity's handlers (e.g. whenever a request made through the NodeClient is resolved by calling the get() method). Apparently this has always been an anti-pattern for plugins, and it would cause an integration test to fail (even in versions prior to 7.9) when assertions are enabled on the Elasticsearch cluster (i.e. setting -ea in the JVM options). I didn't know that this assertions capability existed, so I never saw the issue. See more details in the bug report response.

So the next question is, what's the right way for a plugin to await the response of requests to Elasticsearch before returning a single response to the user? I submitted this question on the discussion board to see if other plugin authors can offer guidance.
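
To illustrate the distinction (a simplified sketch, and not necessarily the guidance that will come back): the blocking shape resolves the future on the calling thread, while the non-blocking shape registers a listener and returns immediately, leaving whoever holds the REST channel to send the single response to the user once the result arrives.

```java
// Simplified contrast; illustrative only, not zentity's actual code.
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.node.NodeClient;

class BlockingVsNonBlocking {

    // Anti-pattern: get() parks the calling thread (often a transport thread) until the
    // search finishes; with -ea set on the JVM, Elasticsearch asserts and fails instead.
    SearchResponse blocking(NodeClient client, SearchRequest request) throws Exception {
        return client.search(request).get();
    }

    // Non-blocking: register a callback and return immediately; the listener fires later.
    void nonBlocking(NodeClient client, SearchRequest request, ActionListener<SearchResponse> listener) {
        client.search(request, listener);
    }
}
```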

davemoore- commented on June 12, 2024

The question on the discussion board now has an answer with some guidance.

This issue will continue to be my priority until it's resolved and released in the next patch version 1.6.2 (release plan).

I've updated the website to include a warning label for this known issue to help people avoid choosing a version that will lead to cluster instability.

austince commented on June 12, 2024

Ah, that makes a lot of sense. That seems like a very reasonable solution as well. Thanks so much for these findings.

davemoore- commented on June 12, 2024

I pushed my first commit to the new feature-async-actions branch (compare to master).

  • Manual tests of SetupAction and ModelsAction show promise. I can set up zentity and manage entity models on Elasticsearch >= 7.9.0, which is progress.
  • There's some duplicate code in ResolutionAction. There's probably a better way to implement async there.
  • Not sure how I feel about the prolific use of anonymous functions, but it's working so far.

I haven't yet attempted to implement async actions in Job, which I believe has the only other blocking call in the plugin. The blocking method is defined here, and it's called many times here. This is going to take some careful thought. @austince maybe you could take a look, too?
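
For a sense of why it needs careful thought: a fully listener-based refactor of a loop of dependent searches tends to turn into a recursion, roughly like this sketch (all names invented, not the actual Job code):

```java
// Hypothetical shape of the refactor: each response schedules the next search
// instead of blocking inside a loop.
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.node.NodeClient;

class HopRunner {

    void run(NodeClient client, SearchRequest first, ActionListener<List<SearchResponse>> done) {
        hop(client, first, new ArrayList<>(), done);
    }

    private void hop(NodeClient client, SearchRequest request, List<SearchResponse> collected,
                     ActionListener<List<SearchResponse>> done) {
        client.search(request, ActionListener.wrap(response -> {
            collected.add(response);
            SearchRequest next = nextRequest(response);  // hypothetical: derive the follow-up query
            if (next == null) {
                done.onResponse(collected);              // finished; hand everything back at once
            } else {
                hop(client, next, collected, done);      // continue asynchronously instead of looping
            }
        }, done::onFailure));
    }

    private SearchRequest nextRequest(SearchResponse response) {
        return null;  // placeholder; the real logic would inspect hits and build the next query
    }
}
```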

austince commented on June 12, 2024

Looks great! I'll take a read through tomorrow and see if I can think of any useful notes.

Update: sorry I haven't had time to look into this yet, busy end of the week but hope to have time tomorrow.

davemoore- commented on June 12, 2024

Pushed a second commit to the feature-async-actions branch (diff). Refactored the async code to achieve proper exception handling, which I verified through manual tests. Also started to implement debug logging.

Now that I'm more familiar with the async APIs in Elasticsearch, and satisfied with the current state of async in the REST handlers, I'll move on to the more challenging async implementation in the Job class.
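
Roughly, the exception-handling shape now looks like this sketch (invented names, not the literal commit): every async step logs failures at debug level and propagates them to the outer listener, so an error response still reaches the caller.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.node.NodeClient;

class StepWithErrorHandling {

    private static final Logger logger = LogManager.getLogger(StepWithErrorHandling.class);

    void step(NodeClient client, SearchRequest request, ActionListener<SearchResponse> outer) {
        client.search(request, ActionListener.wrap(response -> {
            logger.debug("search step completed");
            outer.onResponse(response);              // hand the result to whatever continues the chain
        }, e -> {
            logger.debug("search step failed", e);   // keep the stack trace in the debug log
            outer.onFailure(e);                      // propagate so an error response is returned
        }));
    }
}
```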

austince commented on June 12, 2024

Maybe opening a draft PR for that branch would let us consolidate the conversation about the implementation?

davemoore- commented on June 12, 2024

Merged to master
