Giter Club home page Giter Club logo

Comments (14)

ohsayan avatar ohsayan commented on May 18, 2024

Can confirm persistence is not broken on 0.5.1. This is a regression into release channel

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

This needs bisection

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

Bisection (to be updated ..)

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

This is the offending commit: ce466eb. Bisection complete

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

Is this related to the cloned file descriptor?

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

Updates:

  • flock::FileLock::reacquire doesn't cause problems

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

This patch fixes it: bisect-bad-persistence-0.5.2.patch

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

Yes, the cloned descriptor is the source of errors although it isn't immediately clear why

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

This is the best patch we can have with a reacquire:

diff --git a/server/src/coredb/mod.rs b/server/src/coredb/mod.rs
index 76f1fe4..2887bbd 100644
--- a/server/src/coredb/mod.rs
+++ b/server/src/coredb/mod.rs
@@ -286,7 +286,7 @@ impl CoreDB {
         bgsave: BGSave,
         snapshot_cfg: SnapshotConfig,
         restore_file: Option<PathBuf>,
-    ) -> TResult<(Self, Option<flock::FileLock>, flock::FileLock)> {
+    ) -> TResult<(Self, Option<flock::FileLock>)> {
         let coretable = diskstore::get_saved(restore_file)?;
         let mut background_tasks: usize = 0;
         if !bgsave.is_disabled() {
@@ -324,13 +324,12 @@ impl CoreDB {
         ));
         let lock = flock::FileLock::lock("data.bin")
             .map_err(|e| format!("Failed to acquire lock on data file with error '{}'", e))?;
-        let cloned_descriptor = lock.try_clone()?;
         if bgsave.is_disabled() {
-            Ok((db, Some(lock), cloned_descriptor))
+            Ok((db, Some(lock)))
         } else {
             // Spawn the BGSAVE service in a separate task
             tokio::spawn(diskstore::bgsave_scheduler(db.clone(), bgsave, lock));
-            Ok((db, None, cloned_descriptor))
+            Ok((db, None))
         }
     }
     /// Create an empty in-memory table
diff --git a/server/src/dbnet/mod.rs b/server/src/dbnet/mod.rs
index 598142f..38516b4 100644
--- a/server/src/dbnet/mod.rs
+++ b/server/src/dbnet/mod.rs
@@ -44,8 +44,7 @@ use crate::config::PortConfig;
 use crate::config::SnapshotConfig;
 use crate::config::SslOpts;
 use crate::dbnet::tcp::Listener;
-use crate::diskstore::{self, snapshot::DIR_REMOTE_SNAPSHOT};
-use diskstore::flock;
+use crate::diskstore::snapshot::DIR_REMOTE_SNAPSHOT;
 mod tcp;
 use crate::CoreDB;
 use libsky::TResult;
@@ -316,17 +315,16 @@ pub async fn run(
     snapshot_cfg: SnapshotConfig,
     sig: impl Future,
     restore_filepath: Option<PathBuf>,
-) -> (CoreDB, flock::FileLock) {
+) -> CoreDB {
     let (signal, _) = broadcast::channel(1);
     let (terminate_tx, terminate_rx) = mpsc::channel(1);
-    let (db, lock, cloned_descriptor) =
-        match CoreDB::new(bgsave_cfg, snapshot_cfg, restore_filepath) {
-            Ok((db, lock, cloned_descriptor)) => (db, lock, cloned_descriptor),
-            Err(e) => {
-                log::error!("ERROR: {}", e);
-                process::exit(0x100);
-            }
-        };
+    let (db, lock) = match CoreDB::new(bgsave_cfg, snapshot_cfg, restore_filepath) {
+        Ok((db, lock)) => (db, lock),
+        Err(e) => {
+            log::error!("ERROR: {}", e);
+            process::exit(0x100);
+        }
+    };
     match fs::create_dir_all(&*DIR_REMOTE_SNAPSHOT) {
         Ok(_) => (),
         Err(e) => match e.kind() {
@@ -388,12 +386,14 @@ pub async fn run(
             log::info!("Signalling all workers to shut down");
         }
     }
-    server.finish_with_termsig().await;
+    {
+        server.finish_with_termsig().await;
+    }
     if let Some(Err(e)) = lock.map(|mut val| val.unlock()) {
         log::error!("Failed to release lock on data file with '{}'", e);
         process::exit(0x100);
     }
-    (db, cloned_descriptor)
+    db
 }
 
 /// This is a **test only** function
diff --git a/server/src/diskstore/flock.rs b/server/src/diskstore/flock.rs
index 9f6a313..9f37756 100644
--- a/server/src/diskstore/flock.rs
+++ b/server/src/diskstore/flock.rs
@@ -97,18 +97,6 @@ impl FileLock {
         // Now write to the file
         self.file.write_all(bytes)
     }
-    pub fn try_clone(&self) -> Result<Self> {
-        Ok(Self {
-            file: self.file.try_clone()?,
-            unlocked: false,
-        })
-    }
-    /// Reacquire a lock
-    pub fn reacquire(&mut self) -> Result<()> {
-        Self::_lock(&self.file)?;
-        self.unlocked = false;
-        Ok(())
-    }
 }
 
 impl Drop for FileLock {
diff --git a/server/src/main.rs b/server/src/main.rs
index b0b0074..2d461c3 100644
--- a/server/src/main.rs
+++ b/server/src/main.rs
@@ -77,10 +77,10 @@ fn main() {
         .enable_all()
         .build()
         .unwrap();
-    let (db, mut descriptor) = runtime.block_on(async {
+    let db = runtime.block_on(async {
         let (tcplistener, bgsave_config, snapshot_config, restore_filepath) =
             check_args_and_get_cfg().await;
-        let (db, descriptor) = run(
+        let db = run(
             tcplistener,
             bgsave_config,
             snapshot_config,
@@ -88,7 +88,7 @@ fn main() {
             restore_filepath,
         )
         .await;
-        (db, descriptor)
+        db
     });
     // Make sure all background workers terminate
     drop(runtime);
@@ -97,19 +97,21 @@ fn main() {
         1,
         "Maybe the compiler reordered the drop causing more than one instance of CoreDB to live at this point"
     );
-    // Try to acquire lock almost immediately
-    if let Err(e) = descriptor.reacquire() {
-        log::error!("Failed to reacquire lock on data file with error: '{}'", e);
-        panic!("FATAL: data file relocking failure");
-    }
-    if let Err(e) = flush_db!(db, descriptor) {
+    let mut lock = match diskstore::flock::FileLock::lock("data.bin") {
+        Ok(lck) => lck,
+        Err(e) => {
+            log::error!("Failed to reacquire lock on data file with '{}'", e);
+            std::process::exit(0x100);
+        }
+    };
+    if let Err(e) = flush_db!(db, lock) {
         log::error!("Failed to flush data to disk with '{}'", e);
         loop {
             // Keep looping until we successfully write the in-memory table to disk
             log::warn!("Press enter to try again...");
             io::stdout().flush().unwrap();
             io::stdin().read(&mut [0]).unwrap();
-            if let Ok(_) = flush_db!(db, descriptor) {
+            if let Ok(_) = flush_db!(db, lock) {
                 log::info!("Successfully saved data to disk");
                 break;
             } else {

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

Patch updated to remove reacquire

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

We should backport this fix and rollout to 0.5.3

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

Fix backported: https://github.com/skytable/skytable/tree/0.5.2/backports/fix-storage

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

This should be tagged into 0.5.3

from skytable.

ohsayan avatar ohsayan commented on May 18, 2024

Backport tagged and pushed into release channel: https://github.com/skytable/skytable/releases/tag/v0.5.3

from skytable.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.