Comments (16)

quinnj commented on May 16, 2024

I think this can be closed in light of the new Feather.jl package; Feather is essentially a cross-language binary storage format for dataframes (R, pandas, etc.). It's fast and efficient because the files are mmapped directly and each column is wrapped with unsafe_wrap(Array) into NullableArrays => DataFrame. Obviously, NullableArray support is not quite 100% with DataFrames, but it should be soon.
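A minimal sketch of the zero-copy idea, written in Python rather than Julia. The file layout here (raw native-endian float64 values, no header) and the names `write_column` and `open_column` are made up for illustration; this is not the Feather format or the Feather.jl API.

```python
import mmap
from array import array

def write_column(path, values):
    """Write floats as raw native-endian float64 with no header (hypothetical layout)."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)

def open_column(path):
    """Map the file and expose it as a float64 view without copying the data."""
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # cast() reinterprets the mapped bytes in place; pages are loaded lazily by the OS.
    view = memoryview(mm).cast("d")
    return f, mm, view
```

The caller indexes or slices `view` like an array, then releases it before closing the mapping: `view.release(); mm.close(); f.close()`.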

from dataframes.jl.

ViralBShah commented on May 16, 2024

This would be really cool and useful. Cc: @tanmaykm

simonster commented on May 16, 2024

I've been thinking about trying to create a binary data format that could store any Julia data structure but mmaps Arrays/BitArrays. Doing this right could take some effort, but it would give you mmapped DataFrames for free.

johnmyleswhite commented on May 16, 2024

That would be great.

StefanKarpinski commented on May 16, 2024

Sounds cool. Any thoughts to share? Possibly related ideas: I've sometimes considered starting a language-agnostic Sane Formats Project™ for reasonable data formats. HDF5 seems like it started out with a reasonable core and then sprouted some insanity when some committee got its grubby mitts on the standard. Likewise, neither CSV nor TSV is actually standardized, but both are very handy because of their simplicity. Another thought that @JeffBezanson and I were discussing last week is the idea of making in-memory data structures position-independent so that you can serialize them trivially. Might be relevant.

simonster commented on May 16, 2024

I think I'm going to start by trying to mmap contiguous datasets in HDF5/JLD files, which doesn't look too hard and would avoid the standards problem. I'm not sure it's possible to create a standard that's as flexible, expressive, and language-agnostic as HDF5 without importing most of its complexity. It's only worth creating a new format if performance can be appreciably improved or complexity appreciably reduced. The HDF5 API is kind of insane, but @timholy has done a great job turning it into something more Julian. I'm also interested in how http://symas.com/mdb/ approaches mmapped storage, although the goals of a database are very different from those of a file format.

Position-independent in-memory data structures seem very useful for parallel usage, but I'm not sure of the wisdom of using them for a data format. I don't see how it would be possible to mmap data structures that aren't bits types directly from disk. If they're mmapped read-only, an attempt to write would crash Julia, and if they're mmapped read/write, it would be possible to accidentally introduce pointers to data structures that aren't stored on disk. Reading into memory would require fewer cycles, but the CPU shouldn't be the bottleneck anyway. Changes in Julia's in-memory data format could also break forward/backward compatibility; as it stands this is one of the primary reasons to use JLD instead of serialize()/deserialize().
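The read-only versus read/write trade-off can be illustrated with a small sketch in Python, offered only as an analogy. Python raises an exception on a write to a read-only mapping, whereas the comment above expects a crash in Julia. The helper name `try_write` is hypothetical.

```python
import mmap

def try_write(path, access):
    """Try to overwrite the first byte through a mapping; report what happened."""
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0, access=access)
        try:
            mm[0] = 0xFF
            return "written"
        except TypeError:
            # Python refuses to modify a mapping opened with ACCESS_READ.
            return "rejected"
        finally:
            mm.close()
```

With `mmap.ACCESS_READ` the write is rejected. With `mmap.ACCESS_COPY` it succeeds but stays private to the process and never reaches the file. With `mmap.ACCESS_WRITE` it changes the file itself.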

johnmyleswhite commented on May 16, 2024

Given that Base's readcsv is now using mmap, I have a much clearer sense of how to do this now. I may get to work on it soon, although I don't think it's the most pressing issue.

ViralBShah commented on May 16, 2024

Agree - this is not a pressing issue for the moment. Also, streaming would perhaps be a better way to handle this, since you need to do much more later with the large data frame than just read it. Even in the readcsv in Base, you cannot really read large files and work with them, because the data will not fit in memory. However, the mmap loads the data faster due to zero-copy, and that could certainly be useful here too. @tanmaykm should correct me if I am wrong.

johnmyleswhite commented on May 16, 2024

Next week I'm going to do a major pass through our streaming architecture. For data sets that are purely numeric, I think we can basically do exactly one memory allocation step at the start and then reuse memory very efficiently. If we can fit GLMs to 10-100 GB data sets in Julia with a nice API, I think we'll have produced the killer Julia statistics app.
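A rough sketch of the allocate-once, reuse-many-times pattern, in Python rather than Julia and not taken from the DataFrames streaming code. It assumes a file of raw native-endian float64 values; the function name `streaming_mean` is hypothetical.

```python
def streaming_mean(path, chunk_items=1024):
    """Mean of a raw float64 file, read through a single reusable buffer."""
    buf = bytearray(chunk_items * 8)   # the one data buffer, allocated up front
    raw = memoryview(buf)
    total, count = 0.0, 0
    with open(path, "rb") as f:
        while True:
            n = f.readinto(raw)        # overwrite the same buffer on every pass
            if not n:
                break
            if n % 8:
                raise ValueError("file length is not a multiple of 8 bytes")
            items = raw[:n].cast("d")  # view onto the buffer, no data copy
            total += sum(items)
            count += len(items)
    return total / count if count else float("nan")
```

Memory use stays bounded by `chunk_items` however large the file is. A GLM fit would accumulate sufficient statistics per chunk in the same way instead of a running sum.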

timholy commented on May 16, 2024

@johnmyleswhite and @tanmaykm, your dramatic performance improvements to I/O are quite amazing!

timholy commented on May 16, 2024

I should also add that HDF5 lets you read and write chunks of huge arrays very easily (just using array subregion syntax, like with mmap). Great to see similar capabilities being implemented for other file formats where it's not done "for us" by an outside library.

johnmyleswhite commented on May 16, 2024

Thanks, @timholy!

We should try to extract some of the techniques we're using to provide a more general framework for doing this kind of array extraction.

ViralBShah commented on May 16, 2024

I think it would be useful to have a new issue for the DataStream overhaul that you mention.

Given that we have hooked up HDFS (not HDF5!) to Julia, it should be pretty straightforward to add DataFrames-like capabilities to datasets that are stored in large distributed filesystems. For example, HDFS provides an API to consume large files in chunks over a distributed cluster. It would be nice to expose this to the user through a DataStream-like API. Strictly speaking this would be a DistributedDataStream, but many of the operations one will want to do will be similar.
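To make the chunked-consumption idea concrete, here is a small single-machine sketch in Python. A local CSV file stands in for a distributed store, and `iter_row_chunks` is a hypothetical name; this is neither the HDFS API nor the DataStream type from DataFrames.

```python
import csv
from itertools import islice

def iter_row_chunks(path, chunk_rows):
    """Yield lists of at most chunk_rows rows (as dicts), never the whole file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                return
            yield chunk
```

A consumer loops with `for chunk in iter_row_chunks(path, 10_000): ...` and updates its running result per chunk. A distributed variant would hand different chunks to different workers while keeping the same per-chunk operations.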

johnmyleswhite commented on May 16, 2024

Viral, let's set up some time next week to strategize how we'll implement a simple type of DistributedDataStream.

skanskan commented on May 16, 2024

I agree, we need a platform able to work transparently with data of any size, at least bigger than memory, and ideally distributed on several computers.

Solutions such as Spark are not complete; they only offer the basic functionality needed to build something else.

We don't just need to be able to get some summaries, as we do with databases. We need to be able to do all operations with big data, such as multiplying two big matrices, fitting mixed-effects models, MCMC, etc.

bigmemory or ff allow you to do some simple things, but they cannot be used by other packages like lme4.

ViralBShah commented on May 16, 2024

Closing. Can be reopened in the context of Feather.jl.
