Comments (5)

rescrv avatar rescrv commented on July 19, 2024
  1. Does hyperdex persist data to disk?

Data is flushed to disk in the background.

  1. Does hyperdex persist spaces greater than memory onto disk?

Yes.

  1. How does hyperdex handle requests for items currently stored on disk?

Currently, it relies upon memory-backed files. We are very careful to keep all data needed for a particular request in one location so that it will be faulted in together.
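
As a rough illustration of the idea, and assuming "memory-backed files" here means memory-mapped files, the sketch below reads one record through a read-only mapping. The file layout, `read_record`, and `demo` are hypothetical and are not taken from HyperDex; the point is only that a record stored contiguously is served by touching a small, adjacent set of pages.

```python
import mmap
import os
import tempfile


def read_record(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a read-only memory mapping.

    Assumption: the record is stored contiguously, so only the pages that
    cover it need to be faulted in by the operating system.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing copies the bytes out; the OS loads the backing pages
            # from disk on first access if they are not already cached.
            return m[offset:offset + length]


def demo() -> bytes:
    """Write a tiny hypothetical data file and read its second record."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"key1=value1;key2=value2;")
        # The second record starts at byte 12 and is 11 bytes long.
        return read_record(path, 12, 11)
    finally:
        os.remove(path)
```

Calling `demo()` returns the second record without the program issuing an explicit read for the rest of the file.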

  1. What format does hyperdex persist data to disk?

A custom binary format. It's not set in stone yet, but the general structure will be similar to what is in hyperdisk/ now.

  1. Is data persisted to disk synchronously or asynchronously flushed?

We persist data to disk asynchronously, and rely upon the higher-level replication of HyperDex to ensure that up to $f$ node failures will not impact durability.

  1. Is there a mechanism to re-start a cluster, without data-loss, from a catastrophic full-cluster failure?

In order to truly protect against data loss from a full-cluster failure, we'd need to persistently log every message. This cost is unacceptably high. Although we do not handle full-cluster failure right now, we intend to guard against such failures through periodic snapshots, and a cluster-wide recovery mechanism.

  1. Is there any possibility for the disk storage to become corrupted? Are there mechanisms in place to recover from corrupted storage?

There is always this possibility. Making this robust is among our short-to-medium-term priorities, both through adequate checksumming and by enabling fetches from secondary replicas.
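
A minimal sketch of what checksumming plus a fallback to another replica could look like. The record framing (length and CRC32 header) and the names `encode_record`, `decode_record`, and `read_with_fallback` are assumptions made for illustration; the thread does not describe HyperDex's actual on-disk format.

```python
import struct
import zlib

# Hypothetical header: payload length and CRC32 of the payload,
# both as little-endian unsigned 32-bit integers.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_record(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the record is damaged."""
    length, expected = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != expected:
        raise ValueError("corrupt record")
    return payload


def read_with_fallback(replicas):
    """Try each replica's copy in order and return the first intact one."""
    for blob in replicas:
        try:
            return decode_record(blob)
        except (ValueError, struct.error):
            continue  # this copy is damaged; try the next replica
    raise ValueError("all replicas corrupt")
```

For example, if the first copy has a flipped byte, `read_with_fallback([damaged, intact])` skips it and returns the payload from the second copy.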

I'll consider this comment a good start on documenting this aspect of HyperDex.

from hyperdex.

roja avatar roja commented on July 19, 2024

Fantastic response :)

fclairamb avatar fclairamb commented on July 19, 2024

Is there any news on this?

In order to truly protect against data loss from a full-cluster failure, we'd need to persistently log every message.

Your website publishes a performance comparison page that starts with:

Picking a NoSQL data store is a difficult task. To help with this process, we provide the results of a comparison between HyperDex and other popular NoSQL systems.

On this page, HyperDex clearly beats Cassandra and MongoDB. But Cassandra does persistently store every change made.

This changes a lot, even for read performance, since Cassandra might have to search for data among different SSTables (there is an interesting discussion about it).

Maybe you should add something about it in your benchmark so that it actually helps us pick the right NoSQL data store.


I'm very interested in HyperDex for two main reasons:

  • Its search capabilities (it's great; it opens up so many new ways to organize data on NoSQL clusters)
  • Low memory footprint (especially compared to Cassandra or MongoDB)

But full-cluster failures can happen (power outage, cascading failure, a simple human mistake applied across all nodes, etc.), and I don't know many people who would accept this kind of trade-off.

==> Is there any plan to improve this?
Periodic snapshots aren't really a solution unless they are paired with change logs.
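
To make the snapshot-plus-change-log point concrete, here is a minimal sketch of recovery that loads the latest snapshot and replays only the log entries recorded after it. The snapshot layout, the log tuple format, and the `recover` function are hypothetical; this is not a description of any mechanism HyperDex ships.

```python
def recover(snapshot: dict, log: list) -> dict:
    """Rebuild state from a snapshot plus the change log written after it.

    Assumed formats (hypothetical):
      snapshot = {"seq": <last sequence number included>, "data": {...}}
      log      = [(seq, op, key, value), ...] with op in {"put", "del"}
    """
    state = dict(snapshot["data"])
    for seq, op, key, value in sorted(log, key=lambda entry: entry[0]):
        if seq <= snapshot["seq"]:
            continue  # already reflected in the snapshot
        if op == "put":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state
```

Without the log, everything written after the snapshot's sequence number would be lost in a full-cluster failure, which is the gap the comment above points at.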

rescrv avatar rescrv commented on July 19, 2024

It's worth noting that in the year since this bug report was opened, HyperDex has undergone significant change, and much of what was written in the bug report undersells HyperDex's capabilities. Right now we use LevelDB, which writes directly to disk on each write rather than asynchronously.

It's also worth noting that Cassandra's default settings, and more importantly the settings we use in the benchmark, do not persistently store every change upon acknowledgement. Data is only written to disk periodically. It's unclear from the documentation whether it's even passed to a "write" call when the operation completes. In HyperDex, the data is indeed passed to the OS with "write" when you receive an acknowledgement.
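
The distinction being drawn can be sketched as follows. Data handed to the OS with a write call survives a crash of the process, but it may still sit in the OS page cache until it is flushed, which an explicit `fsync` forces. The `append` and `demo` helpers are hypothetical and only illustrate the two acknowledgement points; they are not HyperDex or Cassandra code.

```python
import os
import tempfile


def append(fd: int, data: bytes, durable: bool) -> None:
    """Append `data` to `fd`; optionally force it to stable storage."""
    written = 0
    while written < len(data):
        # os.write may accept fewer bytes than requested, so loop.
        written += os.write(fd, data[written:])
    # At this point the OS has the data: it survives a process crash,
    # but not necessarily a power loss.
    if durable:
        os.fsync(fd)  # ask the OS to flush the file to the storage device


def demo() -> bytes:
    """Write one record each way and return the file contents."""
    fd, path = tempfile.mkstemp()
    try:
        append(fd, b"acked-after-write\n", durable=False)
        append(fd, b"acked-after-fsync\n", durable=True)
    finally:
        os.close(fd)
    try:
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

Both records are readable afterwards; the difference only shows up if the machine loses power between the write and the flush.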

Because HyperDex uses LevelDB, it too has to search for data among different sstables. This alone is not a reason for slow read operations. The organization of Cassandra's SSTs means they must look in more tables for a get, possibly many more. LevelDB maintains invariants that put a hard upper bound on the number of tables to look in.
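
A simplified model of why a leveled layout bounds the number of tables a get must inspect. It assumes the usual LevelDB arrangement: level-0 tables may overlap each other, while tables within each deeper level cover disjoint key ranges, so at most one table per deeper level can hold a given key. The `Table` class and `tables_to_check` are hypothetical names for this sketch, not LevelDB APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    smallest: str  # smallest key stored in the table
    largest: str   # largest key stored in the table

    def may_contain(self, key: str) -> bool:
        return self.smallest <= key <= self.largest


def tables_to_check(levels, key):
    """Return the tables a lookup for `key` has to inspect.

    levels[0] may hold overlapping tables; levels[1:] are assumed to hold
    tables with disjoint key ranges.
    """
    candidates = []
    for depth, tables in enumerate(levels):
        hits = [t for t in tables if t.may_contain(key)]
        if depth > 0 and len(hits) > 1:
            raise ValueError("overlapping tables below level 0")
        candidates.extend(hits)
    return candidates
```

Under this model the worst case is the number of level-0 tables plus one table for each deeper level, regardless of how many tables exist in total.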

We're working on full-cluster backup solutions usable for intra-datacenter, or inter-datacenter failover, allowing you to keep multiple hot spare clusters in different datacenters, and then bring failed clusters up-to-date.

If you'd like to continue this discussion, I'm happy to do so, but ask that we either take it to the HyperDex-discuss mailing list, or open another bug report as the topic has drifted from the original questions. Unless you're actually reporting a bug, the discussion list is where I'd prefer to take it.

fclairamb avatar fclairamb commented on July 19, 2024

My point was simply that Cassandra doesn't leave dirty data that it can't handle afterwards. Indeed, it doesn't flush changes to its logs instantly.

You're right, I should continue this on the hyperdex-discuss group. Thank you for your very complete answer.
