Comments (19)
> What's the best way to look into the potential changes on the ES plugin handling side?
I'm going to approach this from a few angles:
- Creating a sample plugin whose only function is to create an index using the given NodeClient (ruling out other variables that zentity might be introducing), and seeing if the behavior still appears in the stress test.
- Reviewing how official plugins use the NodeClient, to see if and how zentity implements the client differently.
- Possibly introducing debug logging to help isolate the point of failure in zentity. This may be an outcome of the troubleshooting process that I merge into the main plugin either way.
- Profiling the behavior of the plugin (e.g. using VisualVM) as you have been.
Creating the sample plugin would be a good thing to try before reaching out to discuss.elastic.co or the elasticsearch repo, because it could help us further isolate the issue and frame a more actionable question if needed. It would also give the maintainers something they can use to reproduce the issue easily.
from zentity.
Pushed a commit to the `feature-async-actions` branch (diff), which replaces blocking search calls with async search calls in the Job class.
Things are looking promising (async pun?):
- Integration tests are passing
- Manual tests are passing with assertions enabled in the Elasticsearch JVM
- Performance of resolution jobs is at least as good as before
The solution was much simpler than I anticipated. Thanks to this recent blog post for showing a pattern that uses CompletableFuture in Elasticsearch. That saved me from refactoring the entire Job class. I almost want to rework my prior commits (the REST handlers) to use this pattern. Not a big priority, though. They work and we need to ship out this fix soon.
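For readers following along, the shape of that pattern can be sketched in plain Java. This is a minimal sketch with a stand-in `ActionListener` interface (the real Elasticsearch class has the same two callbacks but isn't reproduced here): a callback-style async call is adapted into a `CompletableFuture` that the rest of the code can compose on.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Stand-in for Elasticsearch's ActionListener; the real interface has
// the same two callbacks. Illustrative only.
interface ActionListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

class FutureAdapter {
    // Bridge a callback-style async call (e.g. client.search(request, listener))
    // into a CompletableFuture that downstream code can compose on.
    static <T> CompletableFuture<T> toFuture(Consumer<ActionListener<T>> asyncCall) {
        CompletableFuture<T> future = new CompletableFuture<>();
        asyncCall.accept(new ActionListener<T>() {
            @Override public void onResponse(T response) { future.complete(response); }
            @Override public void onFailure(Exception e) { future.completeExceptionally(e); }
        });
        return future;
    }

    public static void main(String[] args) {
        // Simulated async "search" that succeeds immediately.
        CompletableFuture<String> result = toFuture(l -> l.onResponse("3 hits"));
        System.out.println(result.join()); // prints "3 hits"
    }
}
```

The appeal of this adapter is that existing code structured around sequential results can keep its shape while the underlying calls become non-blocking, as long as nothing on an Elasticsearch thread ever calls a blocking method like `join()` on the resulting future.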
I had some time to read through and give some notes on the commits, let me know what you think!
++ Sounds good. It's ready at PR #67.
Here's the local docker cluster profiled by VisualVM after around 20 alternating setup-then-delete-index requests. I'm not very experienced in profiling Java apps, but it looks like each request allocates quite a bit on the heap and leaves it behind, and perhaps the "hanging" happens during garbage collection. Something about going from ~100MB to ~350MB for those requests feels funny, like something is hanging around.
@austince I haven't seen this error message in Elasticsearch before:
`after [...] publication of cluster state version [...] is still waiting for ...`
I found that the message originates from a method that Elasticsearch nodes use to publish cluster state. Here's the documented behavior for cluster state publication. The 10s and 30s durations reported by your log messages are really long, especially for a smaller cluster, so my initial speculation is that this is caused by either a cluster misconfiguration or a bottleneck on the network or disk. I'd need to reproduce the issue and dig a bit to be sure.
The first thing I would try is to rule out the plugins. If you remove the zentity plugin, and then repeatedly create/remove indices or perform other similar operations that zentity performs, do you still see the behavior?
I was not able to reproduce this when testing on a local docker cluster without the plugin installed, nor on a cluster with the plugin installed but not used. I tried to reproduce the issue by sending repeated create/delete index requests.
I created a small "stress test" that can reproduce the issue, which is in the attached zip (`stress-test-cluster.sh`) along with the `docker-compose.yml` file used for testing, which may be the source of any cluster misconfiguration. I don't think the issue lies in the local network or disk, but I could be wrong there.
Here's the profiling when running that:
The first spike is during non-zentity requests, and the more drawn-out spikes are from repeated zentity setup (and then delete-zentity-model-index) requests. The hangs seem to line up roughly with the 30s timeout.
It looks like the issue gums up the TransportService as well, which I think controls the threads that the plugin handlers run on? I'm not really sure, though; I'm just going by some logged thread names. If that is the service responsible for publishing cluster state changes, it would tie in to the timeouts we're seeing in that area.
elasticsearch | {"type": "server", "timestamp": "2021-01-22T21:43:36,620Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "Received response for a request that has timed out, sent [19614ms] ago, timed out [4604ms] ago, action [cluster:monitor/nodes/stats[n]], node [{es-data-2}{Xjwq8qUrReyh5VUi21l3aQ}{6HLd44y9Qriva3OcIXukKg}{172.22.0.2}{172.22.0.2:9300}{dir}], id [4037]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ" }
Thanks @austince, this is good to know about. Looking at the contents of your linked `stress-test-cluster.sh`, the actions that produce this behavior are ones that change cluster state (e.g. adding and removing indices), which lines up with the log messages about cluster state publication timeouts. Searches (e.g. resolution) shouldn't affect cluster state, so it will be good to confirm that the issue doesn't appear for resolution jobs and model GETs.
I'll have to dig in and see what the issue might be. Thanks for sharing the stress test package!
I narrowed down a couple things after using your stress test scripts:
- The issue only occurs on Elasticsearch >= 7.9.0.
- The issue occurs for any _zentity endpoint, including resolution requests which are read-only operations.
- The issue occurs after only a half-dozen or so requests to any endpoint.
I don't see anything in the Elasticsearch 7.9.0 release notes that explains the change in behavior. It lists a known memory leak ("manifests in Elasticsearch when a single document is updated repeatedly with a forced refresh"), but the issue we observe occurs during any operation from a plugin. I suspect something changed in the behavior of the Elasticsearch Java APIs that zentity uses (e.g. BaseRestHandler, RestChannelConsumer, or NodeClient), or in the behavior of the Elasticsearch server and its handling of plugins.
Awesome, thanks for digging in Dave. We'll test out ES 7.9.0 in Kubernetes on our end as well -- that's a good enough workaround for now. What's the best way to look into the potential changes on the ES plugin handling side? Should I open up a discussion on discuss.elastic.co?
@austince I was able to reproduce the issue on a simpler plugin.
I've submitted a bug report to the elasticsearch team with a modified version of your stress test. I labeled it as a "bug" because our usage of the NodeClient has worked with plugins since at least Elasticsearch 5.x, and the expected behavior stopped working after 7.9.0 without an explanation in the release notes. Let's see what the team responds with.
Update from the bug report:
The issue appears to be rooted in the use of blocking calls in zentity's handlers (e.g. whenever a response from the `NodeClient` is awaited with a blocking `get()` call). Apparently this has always been an anti-pattern for plugins, and it would cause an integration test to fail (even in versions prior to 7.9) when assertions are enabled on the Elasticsearch cluster (i.e. setting `-ea` in the JVM options). I didn't know that this assertions capability existed, so I never saw the issue. See more details in the bug report response.
So the next question is, what's the right way for a plugin to await the response of requests to Elasticsearch before returning a single response to the user? I submitted this question on the discussion board to see if other plugin authors can offer guidance.
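To make the distinction concrete, here is a minimal sketch using stand-in types only (a real plugin would use the NodeClient, ActionListener, and the REST channel): the blocking variant parks the calling thread until the response arrives, which is the pattern the `-ea` assertions flag, while the callback variant returns immediately and sends the response from the listener.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Stand-in for a callback-style client method such as
// client.search(request, listener). For demonstration, the "response"
// arrives synchronously.
class StandInClient {
    static void search(Consumer<String> listener) {
        listener.accept("2 hits");
    }
}

class HandlerSketch {
    // Anti-pattern: the handler thread parks until the response arrives.
    // Inside Elasticsearch this is the blocking get() on a client future.
    static String blockingHandler() {
        CompletableFuture<String> f = new CompletableFuture<>();
        StandInClient.search(f::complete);
        return f.join(); // blocks the calling thread
    }

    // Preferred: hand the continuation to the client and return at once;
    // the REST response is sent from the callback instead.
    static void asyncHandler(Consumer<String> sendRestResponse) {
        StandInClient.search(response ->
            sendRestResponse.accept("{\"hits\":\"" + response + "\"}"));
    }

    public static void main(String[] args) {
        System.out.println(blockingHandler());
        asyncHandler(System.out::println);
    }
}
```

The two methods produce the same data; the difference is which thread waits. In a plugin, the blocking version can tie up threads that Elasticsearch needs for other work, such as cluster state publication.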
The question on the discussion board now has an answer with some guidance.
This issue will continue to be my priority until it's resolved and released in the next patch version 1.6.2 (release plan).
I've updated the website to include a warning label for this known issue to help people avoid choosing a version that will lead to cluster instability.
Ah, that makes a lot of sense. That seems like a very reasonable solution as well. Thanks so much for these findings.
I pushed my first commit to the new `feature-async-actions` branch (compare to master).
- Manual tests of SetupAction and ModelsAction show promise. I can set up zentity and manage entity models on Elasticsearch >= 7.9.0, which is progress.
- There's some duplicate code in ResolutionAction. There's probably a better way to implement async there.
- Not sure how I feel about the prolific use of anonymous functions, but it's working so far.
I haven't yet attempted to implement async actions in Job, which I believe has the only other blocking call in the plugin. The blocking method is defined here, and it's called many times here. This is going to take some careful thought. @austince maybe you could take a look, too?
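One possible shape for that refactor, sketched with illustrative names (`queryHop` stands in for one round of the job's iterative search; the real code would complete each future from a NodeClient listener): dependent calls chain with `thenCompose`, so each round starts only after the previous result arrives, without any thread blocking in between.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of an iterative resolution job rewritten as a
// chain of dependent futures. Names are illustrative, not zentity's
// actual code.
class AsyncJobSketch {
    // One round of querying, simulated as an already-completed future.
    // Real code would complete this from an async client callback.
    static CompletableFuture<String> queryHop(String state) {
        return CompletableFuture.completedFuture(state + "+hop");
    }

    // Run a fixed number of dependent hops without blocking any thread:
    // each hop composes on the result of the previous one.
    static CompletableFuture<String> runJob(String seed, int hops) {
        CompletableFuture<String> chain = CompletableFuture.completedFuture(seed);
        for (int i = 0; i < hops; i++) {
            chain = chain.thenCompose(AsyncJobSketch::queryHop);
        }
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(runJob("seed", 3).join()); // prints "seed+hop+hop+hop"
    }
}
```

A loop of blocking calls maps fairly directly onto this composition, which is why the eventual fix may not require restructuring the whole class; the tricky part is threading intermediate state through the chain.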
Looks great! I'll take a read through tomorrow and see if I can think of any useful notes.
Update: sorry I haven't had time to look into this yet, busy end of the week but hope to have time tomorrow.
Pushed a second commit to the `feature-async-actions` branch (diff). Refactored the async code to achieve proper exception handling, which I verified through manual tests. Also started to implement debug logging.
Now that I'm more familiar with the async APIs in Elasticsearch, and satisfied with the current state of async in the REST handlers, I'll move on to the more challenging async implementation in the Job class.
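For reference, the exception-handling shape can be sketched in plain Java (illustrative names, not zentity's actual code): instead of a try/catch around a blocking `get()`, failures travel through the future chain and are handled at the end.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Sketch of failure propagation through an async chain. A failed call
// skips the normal stages and is handled once, at the end of the chain.
class AsyncErrors {
    // Simulated async call that fails immediately.
    static CompletableFuture<String> failingCall() {
        CompletableFuture<String> f = new CompletableFuture<>();
        f.completeExceptionally(new IllegalStateException("index missing"));
        return f;
    }

    // Dependent stages wrap the original failure in CompletionException,
    // so unwrap before reading the message.
    static String handle(Throwable e) {
        Throwable cause = (e instanceof CompletionException && e.getCause() != null)
                ? e.getCause() : e;
        return "error: " + cause.getMessage();
    }

    public static void main(String[] args) {
        String out = failingCall()
                .thenApply(r -> "ok: " + r)        // skipped when the call failed
                .exceptionally(AsyncErrors::handle) // single recovery point
                .join();
        System.out.println(out); // prints "error: index missing"
    }
}
```

In a REST handler the recovery point would send an error response on the channel, which keeps error handling in one place rather than scattered across each call site.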
Maybe opening a draft PR for that branch would let us consolidate the conversation about the implementation?
Merged to master