Comments (21)
> If every hook has its own connection and each connection ...
Hooks do not manage connections. That's the responsibility of the endpoint.Manager. There's only one manager for the entire server instance.
From the hook's perspective, an endpoint is just a URL string (or multiple strings if the hook uses failover).
When a hook must send an event, that event, plus the endpoint string, is sent to the Manager.
(see tile38/internal/server/hooks.go, lines 696 to 707 @ a08c55b)
The Manager will then check whether there's a usable connection already mapped to that endpoint string. That's the map you are talking about. If the connection is usable, the event is sent; otherwise a new connection is opened, assigned to the endpoint string, and then the event is sent.
Those connections open on demand, and are automatically closed after being idle for a number of seconds, usually 30.
(see tile38/internal/endpoint/kafka.go, line 19 @ a08c55b)
There will only ever be one connection per unique endpoint.
So if you have 100k hooks using the same exact endpoint URL string, then all events for all hooks will pipeline through that one connection, which is the case in @Mukund2900's example above.
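For illustration, here is a minimal Go sketch of that pattern; the names and structure below are mine, not Tile38's actual internals:

```go
// Illustrative sketch only, not Tile38's actual internals: one manager owns a
// map of endpoint-URL -> connection, dials on demand, and reaps idle entries.
package endpoint

import (
	"net"
	"sync"
	"time"
)

type managedConn struct {
	conn     net.Conn
	lastUsed time.Time
}

type connManager struct {
	mu    sync.Mutex
	conns map[string]*managedConn // keyed by the endpoint URL string
}

func newConnManager() *connManager {
	return &connManager{conns: make(map[string]*managedConn)}
}

// sendEvent reuses the connection mapped to the endpoint, dialing if needed.
// A real implementation would parse the scheme/host out of the endpoint URL.
func (m *connManager) sendEvent(endpoint string, event []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	mc, ok := m.conns[endpoint]
	if !ok {
		conn, err := net.Dial("tcp", endpoint)
		if err != nil {
			return err
		}
		mc = &managedConn{conn: conn}
		m.conns[endpoint] = mc
	}
	mc.lastUsed = time.Now()
	_, err := mc.conn.Write(event)
	return err
}

// reapIdle closes and removes connections unused for longer than idle (~30s).
func (m *connManager) reapIdle(idle time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for ep, mc := range m.conns {
		if time.Since(mc.lastUsed) > idle {
			mc.conn.Close()
			delete(m.conns, ep)
		}
	}
}
```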
---
- 1000 hooks
- 1000 objects where each is "moving" through space and triggers one or more of the hooks
This is 5 minutes after I stop moving the vehicles, closing the connection to Kafka and the clients.
And this is a production leader with approx. 100k hooks and 300k IoTs.
Slowly but steadily it leaks; a restart frees the memory up again.
For comparison, this is another Tile38 leader without Kafka hooks, showing steady memory usage.
---
Which kind of hook are you using? Also, your collection and hooks are long-lived objects in memory, so it will never go down entirely.
However, I can attest to the leak to some degree. If I have a long-lived Tile38 leader that runs for weeks without a restart and a couple hundred thousand Kafka hooks, at some point it will run OOM. Only a restart will free up memory again.
For comparison, I have another Tile38 leader on the same version that is not using hooks at all. The latter does not have the issue at all.
I can imagine that the issue becomes more apparent the more you're storing and the more hooks you have.
I think we had multiple threads on Slack about that already, and some issues here on GitHub. While some culprits have been found, either on my side or in the underlying Kafka library sarama (remember Prometheus leaking memory), it never was truly fixed.
---
I am using Kafka hooks, and the expiry of each hook in my system is currently around 1 year. I am ready to change that if it helps with memory management, but I don't think it will: even with around 100k hooks I am facing this issue.
Here is an example of a hook I am saving:
{"ok":true,"hooks":[{"name":"263:TEST1:1","key":"vehicles","ttl":31535996,"endpoints":["kafka://127.0.0.1:9092/geofence-hook-callbacks"],"command":["INTERSECTS","vehicles","WHEREIN","buid","1","263","WHEREEVAL","return ( FIELDS.timeOfTheDay >= 0540 and FIELDS.timeOfTheDay <= 2400 ) or ( FIELDS.timeOfTheDay >= 0 and FIELDS.timeOfTheDay <= 0308 ) ","0","FENCE","DETECT","enter,exit","GET","263","TEST1:1"],"meta":{"data":"{}","element":"{\"centerLng\":78.01018375615021,\"centerLat\":35.273456908142805,\"circleRadius\":45,\"type\":\"circle\"}","endTimeOfTheDay":"0308","startTimeOfTheDay":"0540"}}],"elapsed":"119.166µs"}
Is there any approach that can help solve this memory issue?
---
Based on the steps above I cannot reproduce the same issue on my side.
> Heap size is never coming down, even after running manual GC it is not releasing the heap.

The heap should be going up and down all the time. The GC command only forces an immediate garbage collection; automatic garbage collection also happens continuously in the background.
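For what it's worth, here is a tiny standalone Go example of the difference. I'm assuming the server's GC command boils down to something like runtime.GC (and possibly debug.FreeOSMemory); that is a guess on my part, not a statement about the actual implementation:

```go
// Standalone example: a forced collection vs. the continuous background GC.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Println("heap before:", m.HeapAlloc)

	runtime.GC()         // run one collection right now
	debug.FreeOSMemory() // also return freed pages to the OS
	runtime.ReadMemStats(&m)
	fmt.Println("heap after:", m.HeapAlloc)
}
```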
What I tried:
I opened a terminal and started a fresh tile38-server instance:
rm -rf data && tile38-server
Then I opened up another terminal and polled for heap_size every 1 second.
while true; do tile38-cli server | jq .stats.heap_size ; sleep 1; done
# example output
3724152
3804120
3952664
4100920
4249160
3423384
3572968
...
You will see that with an idle system the heap_size will continuously grow, but then suddenly shrink, then grow and shrink again.
This is normal for a stable system. Tile38 has stuff going on in the background even when there is no data.
Then I opened a third terminal and issued the following commands:
tile38-cli SET fleet truck1 POINT 33 -112
tile38-cli FLUSHDB
tile38-cli AOFSHRINK
tile38-cli GC
Those commands have very little effect on the overall heap_size. And the system still seems stable.
Now from a fourth terminal I insert about 100k random objects using the tile38-benchmark tool.
./tile38-benchmark -q -t SET
And the heap_size suddenly jumps up significantly, as expected:
3584064
3734600
31831976 # <-- benchmark started
43412232
92726424
75022664
82000280
Now I reissue the FLUSHDB, AOFSHRINK, and GC, and it goes down again.
96183288
96331576
96553400
96702344
3994632 # <-- GC
4146840
If it's related to the Kafka plugin or something else, then I would absolutely like to find a way to reproduce the issue and plug the leak.
---
@tidwall Initially I thought this was happening with all objects, but after @iwpnd suggested that it could be something specific to hooks, I checked, and it is.
To reproduce this you can add 100k random hooks:
```java
// 100k hooks with random centers (lat 8..37, lng 68..98)
Random random = new Random();
List<CommandArgs<String, String>> commandArgsList = new ArrayList<>();
for (int i = 1; i <= 100000; i++) {
    double lat = 8 + (37 - 8) * random.nextDouble();
    double lng = 68 + (98 - 68) * random.nextDouble();
    commandArgsList.add(new CommandArgs<>(StringCodec.UTF8)
            .add(i + "testhook")
            .add("kafka://127.0.0.1:9092/bbb")
            .add("INTERSECTS")
            .add("vehicles")
            .add("WHEREIN")
            .add("id")
            .add("1").add(i + "test")
            .add("FENCE")
            .add("DETECT")
            .add("enter,exit")
            .add("CIRCLE")
            .add(lat)   // CIRCLE takes lat, lon, meters
            .add(lng)
            .add(i));   // radius in meters
}
template.executeBatchedCommands(commandArgsList, BatchedCommandType.SETHOOK);
```
As you can see, I have added 100k hooks:
127.0.0.1:9851> SERVER
{"ok":true,"stats":{"aof_size":47957940,"avg_item_size":32050752,"cpus":10,"heap_released":136577024,"heap_size":288456768,"http_transport":true,"id":"1bea22bb0e8946896b1ff9f0024a9133","in_memory_size":153,"max_heap_size":0,"mem_alloc":288456768,"num_collections":2,"num_hooks":100000,"num_objects":2,"num_points":9,"num_strings":0,"pending_events":0,"pid":39868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.30.2"},"elapsed":"292.291µs"} --after i added 100k hooks
127.0.0.1:9851> FLUSHDB
{"ok":true,"elapsed":"3.042µs"}
127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"1.125µs"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"41ns"}
127.0.0.1:9851> SERVER
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":256589824,"heap_size":249559376,"http_transport":true,"id":"1bea22bb0e8946896b1ff9f0024a9133","in_memory_size":0,"max_heap_size":0,"mem_alloc":249559376,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":39868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.30.2"},"elapsed":"123.542µs"}. -- heap_size not coming down
I did the same with just normal objects using SET, and after the flush commands the heap size came down.
If this code snippet is not enough, I can create a small Spring Boot server with the required code and push it to GitHub so you can check properly.
---
(see tile38/internal/endpoint/endpoint.go, line 148 @ a08c55b)
If every hook has its own connection and each connection is an entry in that map, then adding new hooks over the course of a Tile38 leader's lifetime will eventually leak memory, because those entries are never freed up; at least I don't see where they would be. Upon deleting a hook, the item is removed from the collection, but the connection remains in that map. Does it not?
---
I pushed a change to the master branch that specifically addresses the issue from @Mukund2900's example above.
It appears that FLUSHDB did not clean up all the memory referenced by a hook. Now it does.
---
@tidwall I hope this will also get triggered when specific hooks expire, or when I run GC or AOFSHRINK, because in a real-world scenario that is how I expect it to work, i.e. memory is freed on hook expiry. To showcase the issue I used FLUSHDB.
Thanks a lot for quick support and response.
---
Ah, gotcha! It's unique connections, not a duplicate per hook.
---
@tidwall With the changes you have made, I still see memory not being released completely.
See the commands below.
After I add the hooks:
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":203617916,"avg_item_size":0,"cpus":10,"heap_released":141246464,"heap_size":1234877480,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":1234877480,"num_collections":0,"num_hooks":100000,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"199.042µs"}
After the hooks expire:
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":207877862,"avg_item_size":0,"cpus":10,"heap_released":135028736,"heap_size":1290472848,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":1290472848,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"208.084µs"}
Running GC & AOFSHRINK:
127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"833ns"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":135028736,"heap_size":1290485840,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":1290485840,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"36.583µs"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"417ns"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":700637184,"heap_size":899132608,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":899132608,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"231.791µs"}
Some memory is released, but about 70% still is not.
To check whether this works with FLUSHDB, I did the following:
127.0.0.1:9851> FLUSHDB
{"ok":true,"elapsed":"4.75µs"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":17,"avg_item_size":0,"cpus":10,"heap_released":700563456,"heap_size":899221200,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":899221200,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"189.25µs"}
127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"833ns"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":700563456,"heap_size":899234552,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":899234552,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"127.75µs"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"292ns"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":700760064,"heap_size":899203424,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":899203424,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":44868,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"165.333µs"}
As you can see from the responses, memory is not released completely after all of this.
---
Uploading the memory profiling data in case that helps.
I see the problem now: I think it is because I am using WHEREEVAL conditions with the hooks.
I ran the same test, but this time I saved hooks without any WHERE/WHEREEVAL conditions:
127.0.0.1:2000> server  # <-- hooks added
{"ok":true,"stats":{"aof_size":188547809,"avg_item_size":151,"cpus":10,"heap_released":6561792,"heap_size":499829776,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":54277790,"max_heap_size":0,"mem_alloc":499829776,"num_collections":600,"num_hooks":100000,"num_objects":100000,"num_points":3300000,"num_strings":0,"pending_events":0,"pid":63526,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"146.125µs"}
127.0.0.1:2000> server  # <-- hooks removed
{"ok":true,"stats":{"aof_size":193396238,"avg_item_size":0,"cpus":10,"heap_released":4390912,"heap_size":552040016,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":552040016,"num_collections":0,"num_hooks":88062,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":63526,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"41.083µs"}
127.0.0.1:2000> AOFSHRINK
{"ok":true,"elapsed":"500ns"}
127.0.0.1:2000> GC
{"ok":true,"elapsed":"291ns"}
127.0.0.1:2000> server  # <-- heap_size comes back to normal
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":800399360,"heap_size":49556104,"http_transport":true,"id":"6fedf8dcd7b2910a7dd5b68fa04d43ec","in_memory_size":0,"max_heap_size":0,"mem_alloc":49556104,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":63526,"pointer_size":8,"read_only":false,"threads":16,"version":"1.31.0"},"elapsed":"192.958µs"}
While building the geofences I was adding one of the following command fragments; the WHEREEVAL variant is the root cause:
```java
// variant 1: a plain WHERE on a field range
commandArgs
        .add("WHERE")
        .add("timeOfTheDay").add(startTimeOfTheDay).add(endTimeOfTheDay)
        .add("FENCE")
        .add("DETECT")
        .add("enter,exit")
        .add("GET").add(buid).add(geofenceMemberId);

// variant 2: WHEREEVAL with a Lua expression (the problematic one)
String expression = "return ( FIELDS.timeOfTheDay >= " + repetitiveGeofence.getStartTime()
        + " and FIELDS.timeOfTheDay <= 2400 ) or ( FIELDS.timeOfTheDay >= 0 and FIELDS.timeOfTheDay <= "
        + repetitiveGeofence.getEndTime() + " ) ";
commandArgs.add("WHEREEVAL")
        .add(expression)
        .add("0")
        .add("FENCE")
        .add("DETECT")
        .add("enter,exit")
        .add("GET").add(buid).add(geofenceMemberId);
```
---
Adding the following call while flushing all data does the job:
s.luascripts.Flush()
But this only helps with FLUSHDB. There needs to be a trigger that deletes the script when the hook corresponding to it is removed.
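For illustration, one shape such a trigger could take is a hypothetical reference-counted cache around the compiled scripts; the names and types below are mine, not Tile38's:

```go
// Hypothetical sketch, not Tile38 code: reference-count each compiled script
// by its SHA-1 and drop it from the cache when the last hook using it goes.
package server

import "sync"

type scriptCache struct {
	mu   sync.Mutex
	refs map[string]int         // sha1 -> number of hooks referencing it
	fns  map[string]interface{} // sha1 -> compiled Lua function (placeholder type)
}

func newScriptCache() *scriptCache {
	return &scriptCache{refs: map[string]int{}, fns: map[string]interface{}{}}
}

// retain would be called when a hook with a WHEREEVAL script is created.
func (c *scriptCache) retain(sha1 string, fn interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.fns[sha1]; !ok {
		c.fns[sha1] = fn
	}
	c.refs[sha1]++
}

// release would be called when a hook is deleted or expires; the compiled
// script is freed once no remaining hook references it.
func (c *scriptCache) release(sha1 string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.refs[sha1]--; c.refs[sha1] <= 0 {
		delete(c.refs, sha1)
		delete(c.fns, sha1)
	}
}
```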
---
What's happening is that WHEREEVAL compiles the script, generates a SHA-1, and stores it in the luascripts cache map, with the SHA-1 as the key and the Lua function as the value.
(see tile38/internal/server/token.go, line 429 @ 5642fc4)
Then, the next time a WHEREEVAL with the same script/SHA-1 is encountered, the existing Lua script is used instead of compiling a duplicate.
This is great for performance when there are many search queries with the same WHEREEVAL, but those scripts are not removed until SCRIPT FLUSH is called, which in turn calls s.luascripts.Flush() under the hood. This leaves all those scripts sitting unused in memory.
WHEREEVAL is effectively the same as calling SCRIPT LOAD, which is intended more for running scripts via EVALSHA rather than for geofencing and general search queries.
The easy solution is to just not cache WHEREEVAL scripts by removing the line above, s.luascripts.Put(shaSum, fn.Proto), but that would likely degrade performance for folks that often run basic search queries such as INTERSECTS fleet WHEREEVAL script ... with the same script over and over again.
I think a more robust solution is to cache the WHEREEVAL scripts in their own LRU cache, instead of sharing the SCRIPT LOAD cache.
I just pushed a change that does that.
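For illustration, here is a minimal sketch of such a bounded LRU built on the standard library's container/list; the names, capacity handling, and placeholder function type are my assumptions, not the actual change that was pushed:

```go
// Illustrative sketch of a bounded LRU for compiled WHEREEVAL scripts.
package server

import "container/list"

type lruEntry struct {
	key string
	fn  interface{} // compiled Lua function (placeholder type)
}

type lruCache struct {
	cap   int
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // sha1 -> list element
}

func newLRUCache(cap int) *lruCache {
	return &lruCache{cap: cap, ll: list.New(), items: map[string]*list.Element{}}
}

// get returns a cached script and marks it as recently used.
func (c *lruCache) get(sha1 string) (interface{}, bool) {
	if el, ok := c.items[sha1]; ok {
		c.ll.MoveToFront(el)
		return el.Value.(*lruEntry).fn, true
	}
	return nil, false
}

// put inserts a compiled script, evicting the least recently used entry when
// full, so rarely-used WHEREEVAL scripts no longer pile up forever.
func (c *lruCache) put(sha1 string, fn interface{}) {
	if el, ok := c.items[sha1]; ok {
		c.ll.MoveToFront(el)
		el.Value.(*lruEntry).fn = fn
		return
	}
	c.items[sha1] = c.ll.PushFront(&lruEntry{key: sha1, fn: fn})
	if c.ll.Len() > c.cap {
		last := c.ll.Back()
		c.ll.Remove(last)
		delete(c.items, last.Value.(*lruEntry).key)
	}
}
```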
---
@tidwall can you please release a version with these changes?
---
@iwpnd
The graphs in your last comment have a fair amount of information that I will need to wade through. Any additional context would be helpful. To be clear, are you saying there's still a leak due to having Kafka hooks? If so, are there more specific steps I can take to reproduce the issue? Are you using a test environment or production? Is this running the latest stable version, or edge?
Most important is reproducing it on my side.
If the graphs you provided are based on some mocked-up test code, perhaps I can use that to diagnose the issue?
---
> can you please release a version with these changes?
Are you referring to the changes in the master branch that I pushed yesterday?
---
@tidwall yes
---
I'm sorry, that was really not helpful. I was dumping information and was interrupted while providing additional context.
> To be clear, you are saying there's still a leak due to having Kafka hooks?
Yes. It is the only difference between the two microservices.
> Are you using a test environment or production?
This is production data, with 100k Kafka hooks and 280k IoTs that are SET approximately every 10 seconds.
> Is this running the latest stable version, or edge?
This is on 1.30.2 stable.
> If the graphs you provided are based on some mocked-up test code, perhaps I can use that to diagnose the issue?
Unfortunately it's production code that I cannot share 1:1, but:
- you could set up something similar locally with this package using docker-compose; that would get you a local Kafka cluster with SASL
- now you could add hooks
- in a while loop, SET vehicles
- with ids ranging 1-200k
- with a TTL and
- a random field called type with values between 1-3
- throw in an AOFSHRINK every x minutes
- (optionally) throw in a follower or two if you suspect that would affect anything
That's the setup we're running in a nutshell.
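For reference, a rough sketch of such a load loop against Tile38; the gomodule/redigo client, address, counts, TTL, and field values below are my assumptions, not the production code:

```go
// Rough load-loop sketch of the setup described above.
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	conn, err := redis.Dial("tcp", "127.0.0.1:9851") // Tile38 speaks RESP
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for {
		for id := 1; id <= 200000; id++ {
			lat := 8 + 29*rand.Float64()  // roughly the lat range used earlier
			lng := 68 + 30*rand.Float64() // roughly the lng range used earlier
			// SET with a TTL and a random "type" field between 1 and 3.
			if _, err := conn.Do("SET", "vehicles", fmt.Sprintf("v%d", id),
				"FIELD", "type", rand.Intn(3)+1,
				"EX", 30,
				"POINT", lat, lng); err != nil {
				panic(err)
			}
		}
		time.Sleep(10 * time.Second) // each vehicle is SET roughly every 10s
	}
}
```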
---
No worries. The extra context is helpful. I'll poke around and see what happens.
---
@tidwall With scale over time I am facing the above-referenced issue, where memory is still not coming down after the hooks expire.
Earlier, my workaround was to run AOFSHRINK and GC at a fixed interval (every 3 hrs) to clear expired hooks, but with scale I want expired hooks to be removed even faster, as I am running out of memory.
Should Tile38 not free up the space internally, without depending on manual execution of commands?
Here is an example:
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":2794045828,"avg_item_size":293,"caught_up":true,"caught_up_once":true,"cpus":2,"following":"10.0.11.12:9851","heap_released":308051968,"heap_size":87162256,"http_transport":true,"id":"8044339e8accc6a7b19b9cf12e2516d8","in_memory_size":5076387,"max_heap_size":2147483648,"mem_alloc":87162256,"num_collections":15,"num_hooks":8899,"num_objects":12174,"num_points":296942,"num_strings":0,"pending_events":0,"pid":3300140,"pointer_size":8,"read_only":false,"threads":9,"version":"0.0.0"},"elapsed":"121.6µs"}
This is just 2 minutes after running AOFSHRINK and GC.
And now when I run them again, here is the data:
127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"320ns"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"4.045µs"}
127.0.0.1:9851> server
{"ok":true,"stats":{"aof_size":46475824,"avg_item_size":249,"caught_up":true,"caught_up_once":true,"cpus":2,"following":"10.0.11.12:9851","heap_released":303972352,"heap_size":77025336,"http_transport":true,"id":"8044339e8accc6a7b19b9cf12e2516d8","in_memory_size":5271387,"max_heap_size":2147483648,"mem_alloc":77025336,"num_collections":15,"num_hooks":9254,"num_objects":12546,"num_points":308674,"num_strings":0,"pending_events":0,"pid":3300140,"pointer_size":8,"read_only":false,"threads":9,"version":"0.0.0"},"elapsed":"107.094µs"}
---