Comments (16)
I think this can be closed in light of the new Feather.jl package, which is essentially a cross-language binary storage format for dataframes (shared with R, pandas, etc.). It's fast and efficient because the files are mmapped directly and each column is wrapped via unsafe_wrap(Array) into a NullableArray, and those columns form the DataFrame. Obviously, NullableArray support is not quite 100% with DataFrames, but it should be soon.
from dataframes.jl.
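A minimal sketch of the zero-copy idea described above, assuming only the Mmap standard library (this is illustrative, not Feather.jl's actual internals; the real format adds metadata and null bitmaps):

```julia
using Mmap

# Write a column of Float64s as raw bytes; this stands in for one
# column of a Feather-style file (the real format adds metadata).
path = tempname()
col = collect(1.0:5.0)
write(path, col)

# mmap the file: the OS pages the data in lazily, nothing is copied.
io = open(path)
bytes = Mmap.mmap(io, Vector{UInt8}, length(col) * sizeof(Float64))

# Wrap the mapped memory as a Float64 vector, still zero-copy.
# NOTE: `bytes` must stay alive for as long as `mapped` is in use.
mapped = unsafe_wrap(Array, Ptr{Float64}(pointer(bytes)), length(col))
@assert mapped == col
close(io)
```

Because the wrapped array aliases the mapped pages, reading a multi-gigabyte column costs no allocation beyond the wrapper itself.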
This would be really cool and useful. Cc: @tanmaykm
I've been thinking about trying to create a binary data format that could store any Julia data structure but mmaps Arrays/BitArrays. Doing this right could take some effort, but it would give you mmapped DataFrames for free.
That would be great.
Sounds cool. Any thoughts to share? Possibly related ideas: I've sometimes considered starting a language-agnostic Sane Formats Project™ for reasonable data formats. HDF5 seems like it started out with a reasonable core and then sprouted some insanity once a committee got its grubby mitts on the standard. Likewise, neither CSV nor TSV is actually standardized, but both are very handy because of their simplicity. Another thought that @JeffBezanson and I were discussing last week is the idea of making in-memory data structures position-independent so that you can serialize them trivially. Might be relevant.
I think I'm going to start by trying to mmap contiguous datasets in HDF5/JLD files, which doesn't look too hard and would avoid the standards problem. I'm not sure it's possible to create a standard that's as flexible, expressive, and language-agnostic as HDF5 without importing most of its complexity. It's only worth creating a new format if performance can be appreciably improved or complexity appreciably reduced. The HDF5 API is kind of insane, but @timholy has done a great job turning it into something more Julian. I'm also interested in how http://symas.com/mdb/ approaches mmapped storage, although the goals of a database are very different from those of a file format.
Position-independent in-memory data structures seem very useful for parallel usage, but I'm not sure of the wisdom of using them for a data format. I don't see how it would be possible to mmap data structures that aren't bits types directly from disk: if they're mmapped read-only, an attempt to write would crash Julia, and if they're mmapped read/write, it would be possible to accidentally introduce pointers to data structures that aren't stored on disk. Reading into memory would require fewer cycles, but the CPU shouldn't be the bottleneck anyway. Changes in Julia's in-memory data format could also break forward/backward compatibility; as it stands, this is one of the primary reasons to use JLD instead of serialize()/deserialize().
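Mmapping contiguous HDF5 datasets, as proposed above, is what HDF5.jl's `readmmap` now does; a sketch, assuming the HDF5.jl package is installed (this is the package API, not Base):

```julia
using HDF5

# A contiguous (non-chunked, uncompressed) layout is required for mmapping;
# plain assignment writes datasets contiguously by default.
path = tempname() * ".h5"
h5open(path, "w") do f
    f["x"] = collect(1.0:10_000)
end

f = h5open(path, "r")
# readmmap returns an Array backed by the mapped file pages: no copy is made,
# and the OS only faults in the pages that are actually touched.
x = HDF5.readmmap(f["x"])
@assert x[1] == 1.0 && length(x) == 10_000
close(f)
```

This sidesteps the standards problem entirely: the file is ordinary HDF5, readable from any language, while Julia gets the zero-copy path.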
Given that Base's readcsv is now using mmap, I have a much clearer sense of how to do this. I may get to work on it soon, although I don't think it's the most pressing issue.
Agreed, this is not a pressing issue for the moment. Also, streaming would perhaps be a better way to handle this, since you need to do much more with a large data frame later than just read it. Even with readcsv in Base, you cannot really read large files and work with them, because the data will not fit in memory. However, mmap loads the data faster due to zero-copy, and that could certainly be useful here too. @tanmaykm should correct me if I am wrong.
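The streaming alternative mentioned above can be sketched in a few lines: process the file one record at a time, so memory use stays constant regardless of file size (plain Base Julia, illustrative only):

```julia
# Build a small delimited file standing in for one too big for memory.
path = tempname()
open(path, "w") do io
    for i in 1:1000
        println(io, i, ",", 2i)
    end
end

# Streaming pass: only one line is ever held in memory at a time,
# so this works identically on a 10 GB file.
total = 0
open(path) do io
    for line in eachline(io)
        a, b = split(line, ",")
        total += parse(Int, b)
    end
end
@assert total == 2 * sum(1:1000)   # 1_001_000
```

The trade-off versus mmap is exactly the one described in the comment: streaming handles data larger than memory, while mmap gives the faster zero-copy load when the file fits.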
Next week I'm going to do a major pass through our streaming architecture. For data sets that are purely numeric, I think we can do exactly one memory allocation step at the start and then reuse memory very efficiently. If we can fit GLMs to 10-100 GB data sets in Julia with a nice API, I think we'll have produced the killer Julia statistics app.
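The "allocate once, then reuse" idea for purely numeric data can be sketched like this (illustrative only, not the actual DataStream code): a fixed buffer is refilled repeatedly from a binary stream while a running statistic is updated, so the whole pass performs a single up-front allocation.

```julia
# Write 1000 Float64s as a stand-in for a huge numeric file.
path = tempname()
open(path, "w") do io
    write(io, ones(1000))
end

buf = Vector{Float64}(undef, 100)   # the single allocation, done up front
total = 0.0
open(path) do io
    while !eof(io)
        read!(io, buf)              # refill the same buffer: no new allocation
        total += sum(buf)           # update the running statistic in place
    end
end
@assert total == 1000.0
```

A streaming GLM fit has the same shape: each refilled chunk updates sufficient statistics (e.g. X'X and X'y) rather than a simple sum.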
@johnmyleswhite and @tanmaykm, your dramatic performance improvements to I/O are quite amazing!
I should also add that HDF5 lets you read and write chunks of huge arrays very easily (just using array subregion syntax, like with mmap). Great to see similar capabilities being implemented for other file formats where it's not done "for us" by an outside library.
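The subregion syntax mentioned above looks like ordinary indexing on the dataset handle; a sketch assuming HDF5.jl (the library addresses only the requested hyperslab, never the whole array):

```julia
using HDF5

path = tempname() * ".h5"
h5open(path, "w") do f
    # Declare a large on-disk array without materializing it in memory.
    d = create_dataset(f, "A", Float64, (1000, 1000))
    d[1:10, 1:10] = ones(10, 10)    # write just one corner
end

# Read back only that subregion; the rest of the array is never touched.
block = h5open(path, "r") do f
    f["A"][1:10, 1:10]
end
@assert block == ones(10, 10)
```

This is the same chunked read/write capability the comment describes getting "for us" from an outside library.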
Thanks, @timholy!
We should try to extract some of the techniques we're using to provide a more general framework for doing this kind of array extraction.
I think it would be useful to have a new issue for the DataStream overhaul that you mention.
Given that we have hooked up HDFS (not HDF5!) to julia, it should be pretty straightforward to add DataFrames-like capabilities to datasets that are stored in large distributed filesystems. For example, HDFS provides an API to consume large files in chunks over a distributed cluster. It would be nice to expose this through a DataStream like API to the user, except that this is really a DistributedDataStream, but many of the operations one will want to do will be similar.
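One way to expose HDFS-style chunked consumption is as an iterator of record batches. Everything below is hypothetical (no DistributedDataStream type exists); it only sketches the shape such an API could take, using Julia's iteration protocol:

```julia
# Hypothetical sketch of a chunk-oriented stream over one file split.
struct ChunkStream
    data::Vector{Int}   # stands in for a file split on one HDFS node
    chunksize::Int
end

Base.length(s::ChunkStream) = cld(length(s.data), s.chunksize)

function Base.iterate(s::ChunkStream, offset = 0)
    offset >= length(s.data) && return nothing
    hi = min(offset + s.chunksize, length(s.data))
    return view(s.data, offset+1:hi), hi   # zero-copy view of the chunk
end

# A consumer sees fixed-size record batches, never the whole dataset.
s = ChunkStream(collect(1:10), 4)
@assert [length(c) for c in s] == [4, 4, 2]
@assert sum(sum, s) == sum(1:10)
```

A distributed version would hand each worker a ChunkStream over its local split, which is exactly the access pattern the HDFS chunk API provides.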
Viral, let's set up some time next week to strategize how we'll implement a simple type of DistributedDataStream.
I agree; we need a platform able to work transparently with data of any size, at least bigger than memory, and ideally distributed across several computers.
Solutions such as Spark are not complete; they only offer the basic functionality on which to build something else.
We don't just need to be able to get some summaries, as we do with databases; we need to be able to do all operations on big data, such as multiplying two big matrices, fitting mixed-effects models, running MCMC, etc.
bigmemory or ff let you do some simple things, but they cannot be used by other packages like lme4.
Closing. Can be reopened in the context of Feather.jl.