Comments (21)
> If every hook has its own connection and each connection ...
Hooks do not manage connections. That's the responsibility of the endpoint.Manager. There's only one manager for the entire server instance.
From the hook's perspective, an endpoint is just a URL string (or multiple strings if the hook uses failover).
When a hook must send an event, that event, plus the endpoint string, is sent to the Manager.
(see tile38/internal/server/hooks.go, lines 696 to 707 @ a08c55b)
The Manager will then check whether there's a usable connection already mapped to that endpoint string. That's the map you are talking about. If the connection is usable, the event is sent; otherwise a new connection is opened, assigned to the endpoint string, and then the event is sent.
Those connections open on demand, and are automatically closed after being idle for a number of seconds, usually 30.
(see tile38/internal/endpoint/kafka.go, line 19 @ a08c55b)
There will only ever be one connection per unique endpoint.
So if you have 100k hooks using the same exact endpoint URL string, then all events for all hooks will pipeline through that one connection, which is the case in @Mukund2900's example above.
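For illustration, here is a minimal Go sketch of that pattern; the names and structure below are mine, not Tile38's actual internals:

```go
// Illustrative sketch only, not Tile38's actual internals: one manager owns a
// map of endpoint-URL -> connection, dials on demand, and reaps idle entries.
package endpoint

import (
	"net"
	"sync"
	"time"
)

type managedConn struct {
	conn     net.Conn
	lastUsed time.Time
}

type connManager struct {
	mu    sync.Mutex
	conns map[string]*managedConn // keyed by the endpoint URL string
}

func newConnManager() *connManager {
	return &connManager{conns: make(map[string]*managedConn)}
}

// sendEvent reuses the connection mapped to the endpoint, dialing if needed.
// A real implementation would parse the scheme/host out of the endpoint URL.
func (m *connManager) sendEvent(endpoint string, event []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	mc, ok := m.conns[endpoint]
	if !ok {
		conn, err := net.Dial("tcp", endpoint)
		if err != nil {
			return err
		}
		mc = &managedConn{conn: conn}
		m.conns[endpoint] = mc
	}
	mc.lastUsed = time.Now()
	_, err := mc.conn.Write(event)
	return err
}

// reapIdle closes and removes connections unused for longer than idle (~30s).
func (m *connManager) reapIdle(idle time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for ep, mc := range m.conns {
		if time.Since(mc.lastUsed) > idle {
			mc.conn.Close()
			delete(m.conns, ep)
		}
	}
}
```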
---
- 1000 hooks
- 1000 objects where each is "moving" through space and triggers one or more of the hooks
This is 5 minutes after I stop moving the vehicles, closing the connection to Kafka and the clients.
And this is a production leader with approx. 100k hooks and 300k IoTs.
Slowly but steadily it leaks; a restart frees the memory up again.
For comparison, this is another Tile38 leader without Kafka hooks, showing steady memory usage.
---
Which kind of hook are you using? Also, your collection and hooks are long-lived objects in memory, so it will never go down entirely.
However, I can attest to the leak to some degree. If I have a long-lived Tile38 leader that runs for weeks without a restart and a couple hundred thousand Kafka hooks, at some point it will run OOM. Only a restart will free up memory again.
For comparison, I have another Tile38 leader on the same version that is not using hooks at all. The latter does not have the issue at all.
I can imagine that the issue becomes more apparent the more you're storing and the more hooks you have.
I think we had multiple threads on Slack about that already, and some issues here on GitHub. While some culprits have been found, either on my side or in the underlying Kafka library sarama (remember Prometheus leaking memory), it never was truly fixed.
---
I am using Kafka hooks, and the expiry of each hook in my system is currently around 1 year. I am ready to change that if it helps with memory management, but I don't think it will: even with around 100k hooks I am facing this issue.
Here is an example of a hook I am saving:
{"ok":true,"hooks":[{"name":"263:TEST1:1","key":"vehicles","ttl":31535996,"endpoints":["kafka://127.0.0.1:9092/geofence-hook-callbacks"],"command":["INTERSECTS","vehicles","WHEREIN","buid","1","263","WHEREEVAL","return ( FIELDS.timeOfTheDay >= 0540 and FIELDS.timeOfTheDay <= 2400 ) or ( FIELDS.timeOfTheDay >= 0 and FIELDS.timeOfTheDay <= 0308 ) ","0","FENCE","DETECT","enter,exit","GET","263","TEST1:1"],"meta":{"data":"{}","element":"{\"centerLng\":78.01018375615021,\"centerLat\":35.273456908142805,\"circleRadius\":45,\"type\":\"circle\"}","endTimeOfTheDay":"0308","startTimeOfTheDay":"0540"}}],"elapsed":"119.166µs"}
Is there any approach that can help solve this memory issue?
---
Based on the steps above I cannot reproduce the same issue on my side.
> Heap size is never coming down, even after running manual GC it is not releasing the heap.

The heap should be going up and down all the time. The GC command only forces an immediate garbage collection; automatic garbage collection also happens continuously in the background.
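For what it's worth, here is a tiny standalone Go example of the difference. I'm assuming the server's GC command boils down to something like runtime.GC (and possibly debug.FreeOSMemory); that is a guess on my part, not a statement about the actual implementation:

```go
// Standalone example: a forced collection vs. the continuous background GC.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Println("heap before:", m.HeapAlloc)

	runtime.GC()         // run one collection right now
	debug.FreeOSMemory() // also return freed pages to the OS
	runtime.ReadMemStats(&m)
	fmt.Println("heap after:", m.HeapAlloc)
}
```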
What I tried:
I opened a terminal and started a fresh tile38-server instance:
rm -rf data && tile38-server
Then I opened up another terminal and polled for heap_size every 1 second.
while true; do tile38-cli server | jq .stats.heap_size ; sleep 1; done
# example output
3724152
3804120
3952664
4100920
4249160
3423384
3572968
...
You will see that with an idle system the heap_size will continuously grow, but then suddenly shrink, then grow and shrink again.
This is normal for a stable system. Tile38 has stuff going on in the background even when there is no data.
Then I opened a third terminal and issued the following commands:
tile38-cli SET fleet truck1 POINT 33 -112
tile38-cli FLUSHDB
tile38-cli AOFSHRINK
tile38-cli GC
Those commands have very little effect on the overall heap_size. And the system still seems stable.
Now from a fourth terminal I insert about 100k random objects using the tile38-benchmark tool.
./tile38-benchmark -q -t SET
And the heap_size suddenly jumps up significantly, as expected:
3584064
3734600
31831976 # <-- benchmark started
43412232
92726424
75022664
82000280
Now I reissue the FLUSHDB, AOFSHRINK, and GC, and it goes down again.
96183288
96331576
96553400
96702344
3994632 # <-- GC
4146840
If it's related to the Kafka plugin or something else, then I would absolutely like to find a way to reproduce the issue and plug the leak.
---
@tidwall Initially I thought this was happening with all objects, but after @iwpnd suggested that it could be something specific to hooks, I checked, and it is.
To reproduce this you can add 100k random hooks:
```java
// 100k hooks with random centers (lat 8..37, lng 68..98)
Random random = new Random();
List<CommandArgs<String, String>> commandArgsList = new ArrayList<>();
for (int i = 1; i <= 100000; i++) {
    double lat = 8 + (37 - 8) * random.nextDouble();
    double lng = 68 + (98 - 68) * random.nextDouble();
    commandArgsList.add(new CommandArgs<>(StringCodec.UTF8)
            .add(i + "testhook")
            .add("kafka://127.0.0.1:9092/bbb")
            .add("INTERSECTS")
            .add("vehicles")
            .add("WHEREIN")
            .add("id")
            .add("1").add(i + "test")
            .add("FENCE")
            .add("DETECT")
            .add("enter,exit")
            .add("CIRCLE")
            .add(lat)   // CIRCLE takes lat, lon, meters
            .add(lng)
            .add(i));   // radius in meters
}
template.executeBatchedCommands(commandArgsList, BatchedCommandType.SETHOOK);
```
As you can see, I have added 100k hooks:
127.0.0.1:9851> SERVER
{"ok":true,"stats":{"aof_size":47957940,"avg_item_size":32050752,"cpus":10,"heap_released":136577024,"heap_size":288456768,"http_transport":true,"id":"1bea22bb0e8946896b1ff9f0024a9133","in_memory_size":153,"max_heap_size":0,"mem_alloc":288456768,"num_collections":2,"num_hooks":100000,"num_objects":2,"num_points":9,"num_strings":0,"pending_events":0,"pid":39868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.30.2"},"elapsed":"292.291µs"} --after i added 100k hooks
127.0.0.1:9851> FLUSHDB
{"ok":true,"elapsed":"3.042µs"}
127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"1.125µs"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"41ns"}
127.0.0.1:9851> SERVER
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":256589824,"heap_size":249559376,"http_transport":true,"id":"1bea22bb0e8946896b1ff9f0024a9133","in_memory_size":0,"max_heap_size":0,"mem_alloc":249559376,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":39868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.30.2"},"elapsed":"123.542µs"}. -- heap_size not coming down
I did the same with just normal objects using SET, and after the flush commands the heap size came down.
If this code snippet is not enough, I can create a small Spring Boot server with the required code and push it to GitHub so you can check properly.
---
(see tile38/internal/endpoint/endpoint.go, line 148 @ a08c55b)
If every hook has its own connection and each connection is an entry in that map, then adding new hooks over the course of a Tile38 leader's lifetime will eventually leak memory, because those entries are never freed up; at least I don't see where they would be. Upon deleting a hook, the item is removed from the collection, but the connection remains in that map. Does it not?
---
I pushed a change to the master branch that specifically addresses the issue from @Mukund2900's example above.
It appears that FLUSHDB did not clean up all the memory referenced by a hook. Now it does.
---
@tidwall I hope this will also get triggered when specific hooks expire, or when I run GC or AOFSHRINK, because in a real-world scenario that is how I expect it to work, i.e. memory is freed on hook expiry. To showcase the issue I used FLUSHDB.
Thanks a lot for quick support and response.
---
Ah, gotcha! It's unique connections, not a duplicate per hook.
---
@tidwall With the changes you have made, I still see memory not being released completely.
See the commands below.
After I add the hooks:
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":203617916,"avg_item_size":0,"cpus":10,"heap_released":141246464,"heap_size":1234877480,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":1234877480,"num_collections":0,"num_hooks":100000,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"199.042µs"}
After the hooks expire:
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":207877862,"avg_item_size":0,"cpus":10,"heap_released":135028736,"heap_size":1290472848,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":1290472848,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"208.084µs"}
Running GC & AOFSHRINK:
127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"833ns"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":135028736,"heap_size":1290485840,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":1290485840,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"36.583µs"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"417ns"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":700637184,"heap_size":899132608,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":899132608,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"231.791µs"}
Some memory is released, but about 70% still is not.
To check whether this works with FLUSHDB, I did the following:
127.0.0.1:9851> FLUSHDB
{"ok":true,"elapsed":"4.75µs"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":17,"avg_item_size":0,"cpus":10,"heap_released":700563456,"heap_size":899221200,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":899221200,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"189.25µs"}
127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"833ns"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":700563456,"heap_size":899234552,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":899234552,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"127.75µs"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"292ns"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":700760064,"heap_size":899203424,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":899203424,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"165.333µs"}
As you can see from the responses, memory is not released completely after all of this.
---
Uploading the memory profiling data in case that helps.
I see the problem now: I think it is because I am using WHEREEVAL conditions with the hooks.
I ran the same test, but this time I saved hooks without any WHERE/WHEREEVAL conditions:
127.0.0.1:2000> server  # <-- hooks added
{"ok":true,"stats":{"aof_size":188547809,"avg_item_size":151,"cpus":10,"heap_released":6561792,"heap_size":499829776,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":54277790,"max_heap_size":0,"mem_alloc":499829776,"num_collections":600,"num_hooks":100000,"num_objects":100000,"num_points":3300000,"num_strings":0,"pending_events":0,"pid":63526,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"146.125µs"}
127.0.0.1:2000> server  # <-- hooks removed
{"ok":true,"stats":{"aof_size":193396238,"avg_item_size":0,"cpus":10,"heap_released":4390912,"heap_size":552040016,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":552040016,"num_collections":0,"num_hooks":88062,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":63526,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"41.083µs"}
127.0.0.1:2000> AOFSHRINK
{"ok":true,"elapsed":"500ns"}
127.0.0.1:2000> GC
{"ok":true,"elapsed":"291ns"}
127.0.0.1:2000> server  # <-- heap_size comes back to normal
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":800399360,"heap_size":49556104,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":49556104,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":63526,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"192.958µs"}
While building the geofences I was adding one of the following command fragments; the WHEREEVAL variant is the root cause:
```java
// variant 1: a plain WHERE on a field range
commandArgs
        .add("WHERE")
        .add("timeOfTheDay").add(startTimeOfTheDay).add(endTimeOfTheDay)
        .add("FENCE")
        .add("DETECT")
        .add("enter,exit")
        .add("GET").add(buid).add(geofenceMemberId);

// variant 2: WHEREEVAL with a Lua expression (the problematic one)
String expression = "return ( FIELDS.timeOfTheDay >= " + repetitiveGeofence.getStartTime()
        + " and FIELDS.timeOfTheDay <= 2400 ) or ( FIELDS.timeOfTheDay >= 0 and FIELDS.timeOfTheDay <= "
        + repetitiveGeofence.getEndTime() + " ) ";
commandArgs.add("WHEREEVAL")
        .add(expression)
        .add("0")
        .add("FENCE")
        .add("DETECT")
        .add("enter,exit")
        .add("GET").add(buid).add(geofenceMemberId);
```
---
Adding the following call while flushing all data does the job:
s.luascripts.Flush()
But this only helps with FLUSHDB. There needs to be a trigger that deletes the script when the hook corresponding to it is removed.
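For illustration, one shape such a trigger could take is a hypothetical reference-counted cache around the compiled scripts; the names and types below are mine, not Tile38's:

```go
// Hypothetical sketch, not Tile38 code: reference-count each compiled script
// by its SHA-1 and drop it from the cache when the last hook using it goes.
package server

import "sync"

type scriptCache struct {
	mu   sync.Mutex
	refs map[string]int         // sha1 -> number of hooks referencing it
	fns  map[string]interface{} // sha1 -> compiled Lua function (placeholder type)
}

func newScriptCache() *scriptCache {
	return &scriptCache{refs: map[string]int{}, fns: map[string]interface{}{}}
}

// retain would be called when a hook with a WHEREEVAL script is created.
func (c *scriptCache) retain(sha1 string, fn interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.fns[sha1]; !ok {
		c.fns[sha1] = fn
	}
	c.refs[sha1]++
}

// release would be called when a hook is deleted or expires; the compiled
// script is freed once no remaining hook references it.
func (c *scriptCache) release(sha1 string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.refs[sha1]--; c.refs[sha1] <= 0 {
		delete(c.refs, sha1)
		delete(c.fns, sha1)
	}
}
```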
---
What's happening is that WHEREEVAL compiles the script, generates a SHA-1, and stores it in the luascripts cache map, with the SHA-1 as the key and the Lua function as the value.
(see tile38/internal/server/token.go, line 429 @ 5642fc4)
Then, the next time a WHEREEVAL with the same script/SHA-1 is encountered, the existing Lua script is used instead of compiling a duplicate.
This is great for performance when there are many search queries with the same WHEREEVAL, but those scripts are not removed until SCRIPT FLUSH is called, which in turn calls s.luascripts.Flush() under the hood. This leaves all those scripts sitting unused in memory.
WHEREEVAL is effectively the same as calling SCRIPT LOAD, which is intended more for running scripts via EVALSHA rather than for geofencing and general search queries.
The easy solution is to just not cache WHEREEVAL scripts by removing the line above, s.luascripts.Put(shaSum, fn.Proto), but that would likely degrade performance for folks that often run basic search queries such as INTERSECTS fleet WHEREEVAL script ... with the same script over and over again.
I think a more robust solution is to cache the WHEREEVAL scripts in their own LRU cache, instead of sharing the SCRIPT LOAD cache.
I just pushed a change that does that.
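For illustration, here is a minimal sketch of such a bounded LRU built on the standard library's container/list; the names, capacity handling, and placeholder function type are my assumptions, not the actual change that was pushed:

```go
// Illustrative sketch of a bounded LRU for compiled WHEREEVAL scripts.
package server

import "container/list"

type lruEntry struct {
	key string
	fn  interface{} // compiled Lua function (placeholder type)
}

type lruCache struct {
	cap   int
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // sha1 -> list element
}

func newLRUCache(cap int) *lruCache {
	return &lruCache{cap: cap, ll: list.New(), items: map[string]*list.Element{}}
}

// get returns a cached script and marks it as recently used.
func (c *lruCache) get(sha1 string) (interface{}, bool) {
	if el, ok := c.items[sha1]; ok {
		c.ll.MoveToFront(el)
		return el.Value.(*lruEntry).fn, true
	}
	return nil, false
}

// put inserts a compiled script, evicting the least recently used entry when
// full, so rarely-used WHEREEVAL scripts no longer pile up forever.
func (c *lruCache) put(sha1 string, fn interface{}) {
	if el, ok := c.items[sha1]; ok {
		c.ll.MoveToFront(el)
		el.Value.(*lruEntry).fn = fn
		return
	}
	c.items[sha1] = c.ll.PushFront(&lruEntry{key: sha1, fn: fn})
	if c.ll.Len() > c.cap {
		last := c.ll.Back()
		c.ll.Remove(last)
		delete(c.items, last.Value.(*lruEntry).key)
	}
}
```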
---
@tidwall can you please release a version with these changes?
---
@iwpnd
The graphs in your last comment have a fair amount of information that I will need to wade through. Any additional context would be helpful. To be clear, are you saying there's still a leak due to having Kafka hooks? If so, are there more specific steps I can take to reproduce the issue? Are you using a test environment or production? Is this running the latest stable version, or edge?
Most important is reproducing it on my side.
If the graphs you provided are based on some mocked-up test code, perhaps I can use that to diagnose the issue?
---
> can you please release a version with these changes?
Are you referring to the changes in the master branch that I pushed yesterday?
---
@tidwall yes
---
I'm sorry, that was really not helpful. I was dumping information and was interrupted while providing additional context.
> To be clear, you are saying there's still a leak due to having Kafka hooks?
Yes. It is the only difference between the two microservices.
> Are you using a test environment or production?
This is production data, with 100k Kafka hooks and 280k IoTs that are SET approximately every 10 seconds.
> Is this running the latest stable version, or edge?
This is on 1.30.2 stable.
> If the graphs you provided are based on some mocked-up test code, perhaps I can use that to diagnose the issue?
Unfortunately it's production code that I cannot share 1:1, but:
- you could set up something similar locally with this package using docker-compose; that would get you a local Kafka cluster with SASL
- now you could add hooks
- in a while loop, SET vehicles
- with ids ranging 1-200k
- with a TTL and
- a random field called type with values between 1-3
- throw in an AOFSHRINK every x minutes
- (optionally) throw in a follower or two if you suspect that would affect anything
That's the setup we're running in a nutshell.
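For reference, a rough sketch of such a load loop against Tile38; the gomodule/redigo client, address, counts, TTL, and field values below are my assumptions, not the production code:

```go
// Rough load-loop sketch of the setup described above.
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	conn, err := redis.Dial("tcp", "127.0.0.1:9851") // Tile38 speaks RESP
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for {
		for id := 1; id <= 200000; id++ {
			lat := 8 + 29*rand.Float64()  // roughly the lat range used earlier
			lng := 68 + 30*rand.Float64() // roughly the lng range used earlier
			// SET with a TTL and a random "type" field between 1 and 3.
			if _, err := conn.Do("SET", "vehicles", fmt.Sprintf("v%d", id),
				"FIELD", "type", rand.Intn(3)+1,
				"EX", 30,
				"POINT", lat, lng); err != nil {
				panic(err)
			}
		}
		time.Sleep(10 * time.Second) // each vehicle is SET roughly every 10s
	}
}
```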
---
No worries. The extra context is helpful. I'll poke around and see what happens.
---
@tidwall With scale over time I am facing the above-referenced issue, where memory is still not coming down after the hooks expire.
Earlier, my workaround was to run AOFSHRINK and GC at a fixed interval (every 3 hrs) to clear expired hooks, but with scale I want expired hooks to be removed even faster, as I am running out of memory.
Should Tile38 not free up the space internally, without depending on manual execution of commands?
Here is an example:
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":2794045828,"avg_item_size":293,"caught_up":true,"caught_up_once":true,"cpus":2,"following":"10.0.11.12:9851","heap_released":308051968,"heap_size":87162256,"http_transport":true,"id":"8044339e8accc6a7b19b9cf12e2516d8","in_memory_size":5076387,"max_heap_size":2147483648,"mem_alloc":87162256,"num_collections":15,"num_hooks":8899,"num_objects":12174,"num_points":296942,"num_strings":0,"pending_events":0,"pid":3300140,"pointer_size":8,"read_only":false,"threads":9,"version":"0.0.0"},"elapsed":"121.6µs"}
This is just 2 minutes after running AOFSHRINK and GC.
And now when I run them again, here is the data:
127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"320ns"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"4.045µs"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":46475824,"avg_item_size":249,"caught_up":true,"caught_up_once":true,"cpus":2,"following":"10.0.11.12:9851","heap_released":303972352,"heap_size":77025336,"http_transport":true,"id":"8044339e8accc6a7b19b9cf12e2516d8","in_memory_size":5271387,"max_heap_size":2147483648,"mem_alloc":77025336,"num_collections":15,"num_hooks":9254,"num_objects":12546,"num_points":308674,"num_strings":0,"pending_events":0,"pid":3300140,"pointer_size":8,"read_only":false,"threads":9,"version":"0.0.0"},"elapsed":"107.094µs"}
---