Describe the bug Starting with version 1.4.0 my database instance

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Bug: High idle cpu usage and log flooded with "Delr page was empty" messages since 1.4.0,about surrealdb/surrealdb

Comments (27)

iteroji commented on May 20, 2024 2

I've been observing the CPU usage and it seems that it keeps going upwards every day. From 6-8% 4 days ago it went up to around 10-12%. I'm really afraid to deploy it in production, I see it as a huge risk. I have to revert back to 1.3.1.

If anyone figures out a solution to this, please, share it with us

from surrealdb.

micisse commented on May 20, 2024 2

@phughk

1. In my case and since I've been using SurrealDB, no.
2. In my current case and for this app, no.
3. Difficult to assess, but in my case, I can say well before version 1.0.0, i.e. v0.x.x, so since October/November 2023.
4. 0, I've had no use for it.
5. I uploaded this video as soon as SurrealDB was launched with the --trace flag #3906 (comment). The logs are still the same, just like the ticket author's (for docker). If you need anything else, give us instructions on how to get exactly the logs you're interested in.
6. In my case, I've been running SurrealDB since I started using it without logs (--trace) because everything was working normally, there was nothing special to observe. I only used the logs to make the video above. In dev mode, here's the command i use surrealdb start --allow-all -u root -p root file:///...
7. In my case, no no, not necessary for the application in its current state.

The novelty of this bug is that when I CTRL+C to stop the server, I get a Failed to send shutdown signal to task: sending on a closed channel error as in the video, which I didn't get before. Another observation is that this bug is variable: one day everything's fine, and another day we see excessive memory consumption.

from surrealdb.

phughk commented on May 20, 2024 2

That's really useful, thank you! Particularly about the channel - I saw something like that in development and believed it was resolved; Will have a look as you shouldn't see the channel closed

from surrealdb.

micisse commented on May 20, 2024 2

@phughk

number of namespaces: 6 created by me
number of databases: each namespace contains one database
number of tables: each database contains 4 tables (SCHEMALESS)
number of records :
- 2 tables with 300 records (which I've limited to 300, otherwise there are many more - Select (frequently), Delete and Create (Not very frequent), when a new entry is added, the oldest is automatically deleted (1 to 2 changes per day depending on interface). no Update)
- 2 tables with currently 7 records (creation, modification, deletion - Not very frequent - 1 to 2 changes or not per day)
do you use events: no
approximate number of open database connections: Only one in dev mode
nature of queries (number of selections, updates, creations and others) : C.R.U.D and especially Select. Creations, deletions and modifications are rare. Data is known and pre-loaded, transformed once from CSV into RUST..

I took the opportunity to run a test with and without a backup created with older versions of surrealdb.

Running surrealdb v1.4.x without a backup, starting from 0 in a new folder with no load ramp-up, high memory consumption (just launch, no select, delete or update...).

without.backup.mp4

Running surrealdb v1.4.x with a backup created from previous versions and then modified (export/import) as surrealdb is updated, with an immediate increase in load and memory consumption (just launch, no selection, deletion or update...).

with.backup.mp4

from surrealdb.

phughk commented on May 20, 2024 2

1.5.0-beta1 is out, please try it if you have time. It should resolve the issue.
https://deploy-preview-564--surrealdb-web.netlify.app/releases/#v1-5-0-beta-1

from surrealdb.

phughk commented on May 20, 2024 2

Closing as this is resolved in 1.5.0-beta1, but can be mitigated on 1.4.x with --tick-interval 10s

from surrealdb.

iteroji commented on May 20, 2024 1

I just tried removing my old database file (the one I created with v1.0.0) and started 1.4.0. The cpu usage is back to the 0-1% range. It seems like the problem is with something in the database file itself.

from surrealdb.

iteroji commented on May 20, 2024 1

I noticed after about a week of normal operation the process started idling at 6-8%. I updated to 1.4.2 a few days ago, but not sure if it has anything to do with it.

from surrealdb.

Ragavendaran commented on May 20, 2024 1

Hi @phughk

I'm not using LiveQueries, ChangeFeeds

(3) I've exported and imported my data in a new Instance of SurrealDB 1.4.0, the CPU usage was high after 12 hours (overnight) when I checked.
(6) Running without logs has no effect in CPU usage, it remains high either way.

(5) Sure, I did a backup after the clean import, and another backup after 12 hours (docker volumes)
No matter which machine I use (arm/x86), the new one has 0% CPU usage, and the aged one has 16% constant CPU usage, the logs seem the same.

I've attached both logs, also included the tree structure of the database folder at both snapshots in the top.

High-CPU-Usage-(12H-aged).txt
No-CPU-Usage-(New-Instance).txt
Hope this helps.

I did not see Failed to send shutdown signal to task: sending on a closed channel in my instance when stopping.

I suspected if Events were causing this, testing by inserting some data in tables with events set up, but to no avail.

from surrealdb.

Micnubinub commented on May 20, 2024 1

@phughk

Yes, I am using Live Queries
A few months, exported only about 2MB... (High usage is usually after a restart)
Not sure, I'm using JS beta5 and the lib always logs this (haven't used them since the beta got released):

from surrealdb.

Ragavendaran commented on May 20, 2024 1

Hi @phughk

I ran the instance only for a minute, the CPU usage is high in a few seconds (the problem won't go away with restarts once started)

The first time to spot the issue was after 12 hours in a clean imported instance.

I've got the flamegraphs (for linux-amd64 bare-metal) highlighting the CPU usage from both versions 1.4.0 and 1.4.2
flamegraphs.tar.gz (Download XML)

I did see the Failed to send shutdown signal to task: sending on a closed channel message this time.

I'll be happy to provide more details if needed.

Edit: Also confirmed this issue with v1.4.0.beta.1 release.

from surrealdb.

Ragavendaran commented on May 20, 2024 1

Hi @phughk

1 Namespace
6 Databases (2 databases have Events).
28 to 30 Tables.
500K+ records.
I do use events.
max 2 to 3 connections for a short time (then no connections most of the time, this issue started probably when there was no connections)
60% select, 30% create, 6% update, 4% delete (CPU usage will be high even without any connections / running queries)

Was able to get a flamegraph from a debug build of 1.4.0-beta-1, but limited samples for 10 seconds (CPU usage was high the entire time)
flamegraph-debug-10sec.tar.gz (Download XML)

I can try getting a full minute flamegraph from debug build with even fewer samples, if the above one is insufficient.

I've created another clean instance (1.4.0-beta-1) with a single database including events, if the issue crops up there, I'll have full logs to share.

from surrealdb.

Ragavendaran commented on May 20, 2024 1

As expected, I was able to recreate this time, after letting the instance (1.4.0-beta.1 arm64-docker) run for about 12 hours, I saw 2% constant CPU usage, without any connections / queries.

I backed up this instance (entire database file) at 4 points in time.

Immediately after creation (0 min)
After importing 1 database with events (15 min)
After performing all operations in that database (30 min and 200+ queries)
After 12 hours of idling (12 hours and 20 queries)

I used these backups to run with a debug build (1.4.0-beta.1 amd64-local) for each snapshots (each 1 minute runtime).

I'm attaching only the perfs this time, (flamegraph takes really long time to generate .svg from debug perfs)
debug-perfs.tar.gz

Thank you for introducing me to flamegraphs, I've found yet another amazing tool Hotspot

perfs after 12 hours (snapshot 4)

perfs before 12 hours (snapshot 3)

perfs before and after 12h, despite very little activity in between the snapshots, CPU usage became high.

from surrealdb.

phughk commented on May 20, 2024 1

Hey @micisse , thanks. Those are just error messages, it isn't actually a failure. When a channel is closed on one side, the other side failed to send and stopped as well. We will remove the error message, but they are not a major concern.

The delr is not a concern and will also be removed from trace logs.

from surrealdb.

Ragavendaran commented on May 20, 2024

My environment is same, surrealdb in docker.

Firstly, I tried exporting (in 1.4.0 itself) and importing all my databases manually (to a new instance), this did lower the CPU usage back to near 0%, but still had the logs as before.

This new instance got to same 25% CPU usage after a day or so.

I just created another new instance with only one database imported. Seems fine for more than a day. The logs were present nevertheless.

This log surrealdb_core::kvs::tx: Delr page was empty, seems to be printed for the number of databases in the instance.
In addition to the that.

I could see some other failed attempts to connect to port 4317, not sure if this relates to CPU usage.

surrealdb  | 2024-04-18T12:32:41.355797Z TRACE tonic::transport::service::reconnect: poll_ready; connecting
surrealdb  | 2024-04-18T12:32:41.355811Z TRACE hyper::client::connect::http: connect error for 127.0.0.1:4317: ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })
surrealdb  | 2024-04-18T12:32:41.355821Z TRACE tonic::transport::service::reconnect: poll_ready; error
surrealdb  | 2024-04-18T12:32:41.355824Z DEBUG tonic::transport::service::reconnect: reconnect::poll_ready: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }))
surrealdb  | 2024-04-18T12:32:41.355830Z DEBUG tower::buffer::worker: service.ready=true processing request
surrealdb  | 2024-04-18T12:32:41.355836Z TRACE tonic::transport::service::reconnect: Reconnect::call
surrealdb  | 2024-04-18T12:32:41.355839Z DEBUG tonic::transport::service::reconnect: error: error trying to connect: tcp connect error: Connection refused (os error 111)

from surrealdb.

MerryOscar commented on May 20, 2024

also seeing this

from surrealdb.

Micnubinub commented on May 20, 2024

Same here 1.4.0 and 1.4.2, sitting between 75 and 100% CPU usage. RAM keeps cycling between 220 and 320MB... No queries running

Screen.Recording.2024-04-21.at.10.45.21.mov

from surrealdb.

micisse commented on May 20, 2024

It's not just related to docker, just running surrealdb start ... causes this problem and hogs all the memory, when I stop it the machine goes back to normal. It seems to launch several instances in a single run.

sample.mp4

Linux
SurrealDB 1.4.0 for linux on x86_64 & SurrealDB 1.4.2 for linux on x86_64

from surrealdb.

KristjanVall commented on May 20, 2024

from surrealdb.

micisse commented on May 20, 2024

Don't dare deploy it in production, it's an existential problem that has appeared since certain versions. I've never had this problem before. We tried to downgrade to earlier versions, but there were compatibility errors with our backups. New versions work with backups created with older versions, but not older versions, which display compatibility errors. This problem also exists on Windows, apparently describing the same problem on the same releases. Since opening the ticket, the team has not yet reacted, so you'll have to wait if you don't have a database alternative.

Bug: Massive performance degradation on v1.4.2 #3935

from surrealdb.

phughk commented on May 20, 2024

Hi @iteroji, @Ragavendaran, @MerryOscar, @Micnubinub, @micisse, @KristjanVall

We have some ideas, but don't have access to your environments - could you please share some information on the above to help us debug?

Are you using live queries
Are you using change feeds
How long have you been running the DBMS (SurrealDB) (creation data, approximate size)
How long have you been running change feeds and the retention size
Could you please attach the output logs at trace logging level
Could you try running without trace logs to see if that reduces usage please
Are you using Live queries v2?

We have some ideas what it could be, but we would like to get some confirmation so any help will be useful on the above questions.

Thanks for the help in finding this and helping resolve,

Hugh

from surrealdb.

phughk commented on May 20, 2024

Hi @Ragavendaran thank you for providing those logs. I am a bit surprised that the timestamps don't indicate that the db was running for a long time. For example the first log message is

 47 surrealdb  | 2024-05-02T11:45:17.700722Z TRACE tower::buffer::worker: worker polling for next message

and the last is

1072 surrealdb  | 2024-05-02T11:45:59.903767Z TRACE surrealdb_core::kvs::tx: Delr page was empty

As far as I can see, that indicates about a single minute of logging.

We are going to try and reproduce this internally, but were wondering if you may have better log samples. Also if you are able to get a DTrace from the instance, that would be fantastic. A tool I have been using for that is flamegraph https://github.com/flamegraph-rs/flamegraph?tab=readme-ov-file

from surrealdb.

phughk commented on May 20, 2024

Hi everyone thank you for sharing all the information, it is very useful.

The flamegraphs indicate that there are 3 main parts to this, and the largest part seems to be accessing rocksdb (libc.so.6). Because you are all running release builds, we don't have debug information in the flamegraphs and cannot decipher easily exactly what the above functions are. However we can speculate that since we have polling functions present, this is to do with that. And the Delr message may be part of the issue.

Can you please confirm @iteroji, @Ragavendaran, @MerryOscar, @Micnubinub, @micisse, @KristjanVall

number of namespaces
number of databases
number of tables
number of records
are you using events
approximately how many connections you have open to the database
nature of requests (number of select vs update vs create vs other)

We are continuing trying to reproduce this locally but can't immediately see how this is happening for you. We can confirm #3952 and have a fix in the works for that.

from surrealdb.

phughk commented on May 20, 2024

Hey @Ragavendaran 2% passive usage is tolerable, as the database is doing liveness checks. I can see you mentioned that after 12h the CPU was high, but how high? 2% or higher? and was this constant or simply spikes? Thank you for providing this information and captures, looking at them now. The perfdata files are missing the corresponding stack files - do you have these by the chance?

from surrealdb.

Ragavendaran commented on May 20, 2024

Hi @phughk, there were no changes to the database before and after 12hours of idle, there was no CPU usage in the beginning (snapshot 3), and 2 to 4 % after the 12hour idle (snapshot 4). It spikes every 1 second and seems constant in htop.
Yes, 2 to 4 % CPU usage in this test instance, lower than the ~25% CPU usage in the original instance with 6 databases.

I believe, the perf command records stack frames
perf record --call-graph dwarf ...

I'm not sure about the stack files. If there is a specific perf command to use, and some directions to get the stack files, I'll be able to get them.
After your suggestion, I'll try getting the perfs again for this test instance and the original instance with 25% CPU usage.

from surrealdb.

phughk commented on May 20, 2024

Yeah, this is the polling rate for change feeds and live queries. We have changed it back to the 10s polling rate that it was in 1.3, as 1.4 changed it to 1s; 1.5 will have the fix for this. Thanks all for your cooperation in resolving this
@iteroji @Ragavendaran @MerryOscar @Micnubinub @micisse @KristjanVall

Will close this ticket after 1.5 is tested by you all

from surrealdb.

micisse commented on May 20, 2024

Hi @phughk, i've been using v1.5.0-beta.1 since this Saturday afternoon and I haven't noticed any load increase. I've only been using it for a day and a half and so far there's nothing to report about the problem that caused the ticket.

However, I'm still seeing these messages and/or errors, as in the video below (view in full):

...TRACE surrealdb_core::kvs::tx: Delr page was empty, but this time with a certain delay and still in bursts.
Server shutdown error (CTRL+C) ...ERROR surreal::cli::start: Failed to send shutdown signal to task: sending on a closed channel

perhaps planned in another fix

Surrealdb.v1.5.0-beta.1.test.mp4

Have a nice day

from surrealdb.

Bug: High idle cpu usage and log flooded with "Delr page was empty" messages since 1.4.0 about surrealdb HOT 27 CLOSED

Comments (27)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent