Comments (5)
- Does hyperdex persist data to disk?
Data is flushed to disk in the background.
- Does hyperdex persist spaces greater than memory onto disk?
Yes.
- How does hyperdex handle requests for items currently stored on disk?
Currently, it relies upon memory-backed files. We are very careful to keep all data needed for a particular request in one location so that it will be faulted in together.
- In what format does hyperdex persist data to disk?
A custom binary format. It's not set in stone yet, but the general structure will be similar to what is in hyperdisk/ now.
- Is data persisted to disk synchronously or asynchronously flushed?
We persist data to disk asynchronously, and rely upon the higher-level replication of HyperDex to ensure that up to
- Is there a mechanism to re-start a cluster, without data-loss, from a catastrophic full-cluster failure?
In order to truly protect against data loss from a full-cluster failure, we'd need to persistently log every message. This cost is unacceptably high. Although we do not handle full-cluster failure right now, we intend to guard against such failures through periodic snapshots, and a cluster-wide recovery mechanism.
- Is there any possibility for the disk storage to become corrupted? Are there mechanisms in place to recover from corrupted storage?
There is always this possibility. Making this robust is among our short-to-medium-term priorities, both through adequate checksumming and by enabling fetches from secondary replicas.
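To make the checksum-plus-replica idea from the last answer concrete, here is a minimal sketch. It is not HyperDex code: the record layout (a 4-byte CRC32 prefix) and the names `seal`, `unseal`, and `read_with_fallback` are assumptions made up for illustration.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 (hypothetical on-disk layout)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unseal(record: bytes):
    """Return the payload if its checksum matches, otherwise None."""
    if len(record) < 4:
        return None
    stored = int.from_bytes(record[:4], "big")
    payload = record[4:]
    return payload if zlib.crc32(payload) == stored else None


def read_with_fallback(replica_records):
    """Try each replica's copy in order and return the first that verifies."""
    for record in replica_records:
        payload = unseal(record)
        if payload is not None:
            return payload
    raise IOError("all replicas failed checksum verification")
```

A reader would pass the primary's copy first and the secondaries after it, so a corrupted primary record is skipped instead of being returned to the client.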
I'll consider this comment a good start on documenting this aspect of HyperDex.
from hyperdex.
Fantastic response :)
Is there any news on this?
In order to truly protect against data loss from a full-cluster failure, we'd need to persistently log every message.
Your website publishes a performance comparison page that starts with:
Picking a NoSQL data store is a difficult task. To help with this process, we provide the results of a comparison between HyperDex and other popular NoSQL systems.
On this page, HyperDex clearly beats Cassandra and MongoDB. But Cassandra persistently stores every change made.
This changes a lot of things, even for read performance, as Cassandra might have to search for data among different sstables (an interesting discussion about it).
Maybe you should add something about this to your benchmark so that it actually helps us pick the right NoSQL data store.
I'm very interested in HyperDex for two main reasons:
- Its search capabilities (it's great, there are so many new ways to organize data on NoSQL clusters with it)
- Low memory footprint (especially compared to Cassandra or MongoDB)
But these full-cluster failures can happen (power outage, cascading failure, a simple human mistake applied across the cluster, etc.) and I don't know many people who would accept this kind of trade-off.
==> Is there any plan to improve this?
Periodic snapshots aren't really a solution unless they are paired with some kind of change log.
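A rough sketch of what "snapshot plus change log" means in practice: recovery loads the last snapshot and replays only the log entries written after it. Everything here is hypothetical (the `recover` name, the `(seq, key, value)` entry shape, and `None` as a delete marker); it illustrates the general technique, not anything HyperDex implements.

```python
def recover(snapshot_state, snapshot_seq, change_log):
    """Rebuild state from a snapshot plus the log entries that follow it.

    snapshot_state: dict captured at sequence number snapshot_seq.
    change_log: iterable of (seq, key, value) in increasing seq order;
                value None is treated as a delete (an assumption).
    """
    state = dict(snapshot_state)
    for seq, key, value in change_log:
        if seq <= snapshot_seq:
            continue  # already reflected in the snapshot
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

Without the log, everything written after the snapshot is lost; with it, the window of loss shrinks to whatever part of the log had not reached disk.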
It's worth noting that in the year since this bug report was opened, HyperDex has undergone significant change and much of what was written in the bug report under-sells HyperDex's capabilities. Right now we use LevelDB which does write directly to disk on each write, not asynchronously.
It's also worth noting that Cassandra's default settings and, more importantly, the settings we use in the benchmark do not persistently store every change upon acknowledgement. Data is only written to disk periodically. It's unclear from the documentation whether it's even passed to a "write" call when the operation completes. In HyperDex, the data is indeed passed to the OS with "write" when you receive an acknowledgement.
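For readers unfamiliar with the distinction being drawn here, the sketch below shows the difference between handing data to the OS with `write` and additionally forcing it to stable storage with `fsync`. It is a generic POSIX-style illustration, not HyperDex or Cassandra code; `append_record` and its `durable` flag are hypothetical names.

```python
import os


def append_record(path, payload, durable):
    """Append payload to a file and return the number of bytes written.

    After os.write returns, the data is in the kernel's page cache: it
    survives a crash of this process but not necessarily a power loss.
    os.fsync asks the OS to push it to the storage device before returning.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        written = os.write(fd, payload)
        if durable:
            os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

Acknowledging after `write` alone is much cheaper than acknowledging after `fsync`, which is why the point at which a store acknowledges matters when comparing benchmark numbers.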
Because HyperDex uses LevelDB, it too has to search for data among different sstables. This alone is not a reason for slow read operations. The organization of Cassandra's SSTs means they must look in more tables for a get, possibly many more. LevelDB maintains invariants that put a hard upper bound on the number of tables to look in.
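The bound can be illustrated with a toy model. It assumes a leveled layout in which level 0 holds a small number of tables with possibly overlapping key ranges, and every deeper level holds tables with disjoint, sorted key ranges. That is my reading of the leveled design, not code taken from LevelDB or Cassandra, and the function names are hypothetical.

```python
from bisect import bisect_left


def tables_to_check_leveled(level0, deeper_levels, key):
    """Return the tables a get for `key` must consult in a leveled layout.

    level0: list of (lo, hi) ranges that may overlap each other.
    deeper_levels: list of levels, each a sorted list of disjoint (lo, hi).
    """
    candidates = [r for r in level0 if r[0] <= key <= r[1]]
    for level in deeper_levels:
        upper_bounds = [hi for _, hi in level]
        i = bisect_left(upper_bounds, key)  # first table whose hi >= key
        if i < len(level) and level[i][0] <= key:
            candidates.append(level[i])  # at most one table per level
    return candidates


def tables_to_check_overlapping(tables, key):
    """With arbitrarily overlapping tables, any of them may hold the key."""
    return [r for r in tables if r[0] <= key <= r[1]]
```

In the leveled case the result can never exceed `len(level0) + len(deeper_levels)`, whereas in the overlapping case it can grow with the total number of tables.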
We're working on full-cluster backup solutions usable for intra-datacenter, or inter-datacenter failover, allowing you to keep multiple hot spare clusters in different datacenters, and then bring failed clusters up-to-date.
If you'd like to continue this discussion, I'm happy to do so, but ask that we either take it to the HyperDex-discuss mailing list, or open another bug report as the topic has drifted from the original questions. Unless you're actually reporting a bug, the discussion list is where I'd prefer to take it.
My point was simply that Cassandra doesn't leave dirty data that it can't handle afterwards. It indeed doesn't flush the changes to logs instantly.
You're right, I should continue this on the hyperdex-discuss group. Thank you for your very complete answer.