Comments (3)
From your comments on other posts, I gather that the amount of RAM needed depends on the compression ratio and the size of the data.
Let's say I have terabytes of data and only several machines with 32GB of RAM each. Is there any way to speed up exploration of the data? Would LocustDB fit here, or are there other alternatives I should consider?
from locustdb.
Ah, you beat me to it :). Right now LocustDB does not support querying data across multiple nodes. In principle it is architected to make this possible, but it's still a lot of work to actually implement. If you are looking for a system you can use in production right now, you may want to look at e.g. ClickHouse.
Answer to original question:
Depends on compression. LocustDB actually doesn't really apply much compression yet beyond dictionary encoding strings and choosing the appropriate byte-width for integer columns.
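To make those two techniques concrete, here is a minimal Python sketch of dictionary encoding a string column and then picking the narrowest integer width for the resulting codes. It is an illustration only, not LocustDB's actual implementation (LocustDB is written in Rust); the function names and the `payment_types` sample column are hypothetical, and the width selection assumes non-negative integers.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with an index into a table of distinct values."""
    dictionary: List[str] = []
    index_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index_of:
            # First time we see this string: assign it the next free code.
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def smallest_byte_width(values: List[int]) -> int:
    """Smallest of 1, 2, 4 or 8 bytes that holds every value unsigned.

    Assumption: all values are non-negative.
    """
    if any(v < 0 for v in values):
        raise ValueError("this sketch only handles non-negative integers")
    largest = max(values, default=0)
    for width in (1, 2, 4, 8):
        if largest < 2 ** (8 * width):
            return width
    raise ValueError("value does not fit in 8 bytes")


# Hypothetical sample column with only two distinct values.
payment_types = ["cash", "card", "card", "cash", "card"]
dictionary, codes = dictionary_encode(payment_types)
width = smallest_byte_width(codes)
```

With only two distinct strings, each row is stored as a one-byte code plus a single shared copy of each string, which is where most of the saving on low-cardinality columns comes from.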
For the taxi dataset, the uncompressed CSV takes up about 600GB; the 120GB number is after applying gzip. I also dropped about 70% of the columns when loading the dataset into LocustDB; without that you would need maybe 150GB. So in this case, compression is similar to gzip, at around 4x over uncompressed CSV. I think those numbers are somewhat typical, but they will depend on the actual dataset.
If we implement additional compression passes, it should be possible to get compression ratios in excess of 10x (but sacrificing some query speed). Adding support for storing data on disk would also help reduce memory requirements, but performance will suffer quite significantly if the part of the dataset that you query doesn't fit completely into memory.
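As a rough back-of-the-envelope check, the arithmetic behind those numbers can be sketched in Python. The 600GB, 120GB, and 150GB figures and the 10x ratio come from the answer above; the 2000GB raw dataset size is a hypothetical stand-in for "TBs of data", and the helper names are made up for illustration.

```python
def compression_ratio(uncompressed_gb: float, stored_gb: float) -> float:
    """How many times smaller the stored form is than the raw data."""
    return uncompressed_gb / stored_gb


def memory_needed_gb(raw_gb: float, ratio: float) -> float:
    """Estimated in-memory size of raw_gb of data at a given compression ratio."""
    return raw_gb / ratio


# Figures quoted for the taxi dataset.
csv_gb = 600.0
gzip_gb = 120.0
locustdb_all_columns_gb = 150.0

gzip_ratio = compression_ratio(csv_gb, gzip_gb)
locustdb_ratio = compression_ratio(csv_gb, locustdb_all_columns_gb)

# Hypothetical workload: 2000GB of raw data, one machine with 32GB of RAM.
raw_gb = 2000.0
ram_gb = 32.0

needed_today_gb = memory_needed_gb(raw_gb, locustdb_ratio)
needed_at_10x_gb = memory_needed_gb(raw_gb, 10.0)
fits_today = needed_today_gb <= ram_gb
fits_at_10x = needed_at_10x_gb <= ram_gb
```

Under these assumptions the estimate is 500GB of memory at the current ratio and 200GB at 10x, so neither fits on a single 32GB machine, which is consistent with the suggestion to look at a system that can query across multiple nodes.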
Thanks a lot for the detailed answer. I guess I can just close this issue. Will do some other research.