Comments (16)
I think this can be closed in light of the new Feather.jl package, which is essentially a cross-language binary storage format for dataframes (shared with R, pandas, etc.). It's fast and efficient because the files are mmapped directly and each column is wrapped via unsafe_wrap(Array) into a NullableArray, and those columns form the DataFrame. Obviously, NullableArray support is not quite 100% with DataFrames, but it should be soon.
from dataframes.jl.
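A minimal sketch of the zero-copy idea described above, assuming only the Mmap standard library (this is illustrative, not Feather.jl's actual internals; the real format adds metadata and null bitmaps):

```julia
using Mmap

# Write a column of Float64s as raw bytes; this stands in for one
# column of a Feather-style file (the real format adds metadata).
path = tempname()
col = collect(1.0:5.0)
write(path, col)

# mmap the file: the OS pages the data in lazily, nothing is copied.
io = open(path)
bytes = Mmap.mmap(io, Vector{UInt8}, length(col) * sizeof(Float64))

# Wrap the mapped memory as a Float64 vector, still zero-copy.
# NOTE: `bytes` must stay alive for as long as `mapped` is in use.
mapped = unsafe_wrap(Array, Ptr{Float64}(pointer(bytes)), length(col))
@assert mapped == col
close(io)
```

Because the wrapped array aliases the mapped pages, reading a multi-gigabyte column costs no allocation beyond the wrapper itself.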
This would be really cool and useful. Cc: @tanmaykm
I've been thinking about trying to create a binary data format that could store any Julia data structure but mmaps Arrays/BitArrays. Doing this right could take some effort, but it would give you mmapped DataFrames for free.
That would be great.
Sounds cool. Any thoughts to share? Possibly related ideas: I've sometimes considered starting a language-agnostic Sane Formats Project™ for reasonable data formats. HDF5 seems like it started out with a reasonable core and then sprouted some insanity once a committee got its grubby mitts on the standard. Likewise, neither CSV nor TSV is actually standardized, but both are very handy because of their simplicity. Another thought that @JeffBezanson and I were discussing last week is the idea of making in-memory data structures position-independent so that you can serialize them trivially. Might be relevant.
I think I'm going to start by trying to mmap contiguous datasets in HDF5/JLD files, which doesn't look too hard and would avoid the standards problem. I'm not sure it's possible to create a standard that's as flexible, expressive, and language-agnostic as HDF5 without importing most of its complexity. It's only worth creating a new format if performance can be appreciably improved or complexity appreciably reduced. The HDF5 API is kind of insane, but @timholy has done a great job turning it into something more Julian. I'm also interested in how http://symas.com/mdb/ approaches mmapped storage, although the goals of a database are very different from those of a file format.
Position-independent in-memory data structures seem very useful for parallel usage, but I'm not sure of the wisdom of using them for a data format. I don't see how it would be possible to mmap data structures that aren't bits types directly from disk: if they're mmapped read-only, an attempt to write would crash Julia, and if they're mmapped read/write, it would be possible to accidentally introduce pointers to data structures that aren't stored on disk. Reading into memory would require fewer cycles, but the CPU shouldn't be the bottleneck anyway. Changes in Julia's in-memory data format could also break forward/backward compatibility; as it stands, this is one of the primary reasons to use JLD instead of serialize()/deserialize().
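Mmapping contiguous HDF5 datasets, as proposed above, is what HDF5.jl's `readmmap` now does; a sketch, assuming the HDF5.jl package is installed (this is the package API, not Base):

```julia
using HDF5

# A contiguous (non-chunked, uncompressed) layout is required for mmapping;
# plain assignment writes datasets contiguously by default.
path = tempname() * ".h5"
h5open(path, "w") do f
    f["x"] = collect(1.0:10_000)
end

f = h5open(path, "r")
# readmmap returns an Array backed by the mapped file pages: no copy is made,
# and the OS only faults in the pages that are actually touched.
x = HDF5.readmmap(f["x"])
@assert x[1] == 1.0 && length(x) == 10_000
close(f)
```

This sidesteps the standards problem entirely: the file is ordinary HDF5, readable from any language, while Julia gets the zero-copy path.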
Given that Base's readcsv is now using mmap, I have a much clearer sense of how to do this. I may get to work on it soon, although I don't think it's the most pressing issue.
Agreed, this is not a pressing issue for the moment. Also, streaming would perhaps be a better way to handle this, since you need to do much more with a large data frame later than just read it. Even with readcsv in Base, you cannot really read large files and work with them, because the data will not fit in memory. However, mmap loads the data faster due to zero-copy, and that could certainly be useful here too. @tanmaykm should correct me if I am wrong.
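The streaming alternative mentioned above can be sketched in a few lines: process the file one record at a time, so memory use stays constant regardless of file size (plain Base Julia, illustrative only):

```julia
# Build a small delimited file standing in for one too big for memory.
path = tempname()
open(path, "w") do io
    for i in 1:1000
        println(io, i, ",", 2i)
    end
end

# Streaming pass: only one line is ever held in memory at a time,
# so this works identically on a 10 GB file.
total = 0
open(path) do io
    for line in eachline(io)
        a, b = split(line, ",")
        total += parse(Int, b)
    end
end
@assert total == 2 * sum(1:1000)   # 1_001_000
```

The trade-off versus mmap is exactly the one described in the comment: streaming handles data larger than memory, while mmap gives the faster zero-copy load when the file fits.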
Next week I'm going to do a major pass through our streaming architecture. For data sets that are purely numeric, I think we can do exactly one memory allocation step at the start and then reuse memory very efficiently. If we can fit GLMs to 10-100 GB data sets in Julia with a nice API, I think we'll have produced the killer Julia statistics app.
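The "allocate once, then reuse" idea for purely numeric data can be sketched like this (illustrative only, not the actual DataStream code): a fixed buffer is refilled repeatedly from a binary stream while a running statistic is updated, so the whole pass performs a single up-front allocation.

```julia
# Write 1000 Float64s as a stand-in for a huge numeric file.
path = tempname()
open(path, "w") do io
    write(io, ones(1000))
end

buf = Vector{Float64}(undef, 100)   # the single allocation, done up front
total = 0.0
open(path) do io
    while !eof(io)
        read!(io, buf)              # refill the same buffer: no new allocation
        total += sum(buf)           # update the running statistic in place
    end
end
@assert total == 1000.0
```

A streaming GLM fit has the same shape: each refilled chunk updates sufficient statistics (e.g. X'X and X'y) rather than a simple sum.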
@johnmyleswhite and @tanmaykm, your dramatic performance improvements to I/O are quite amazing!
I should also add that HDF5 lets you read and write chunks of huge arrays very easily (just using array subregion syntax, like with mmap). Great to see similar capabilities being implemented for other file formats where it's not done "for us" by an outside library.
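The subregion syntax mentioned above looks like ordinary indexing on the dataset handle; a sketch assuming HDF5.jl (the library addresses only the requested hyperslab, never the whole array):

```julia
using HDF5

path = tempname() * ".h5"
h5open(path, "w") do f
    # Declare a large on-disk array without materializing it in memory.
    d = create_dataset(f, "A", Float64, (1000, 1000))
    d[1:10, 1:10] = ones(10, 10)    # write just one corner
end

# Read back only that subregion; the rest of the array is never touched.
block = h5open(path, "r") do f
    f["A"][1:10, 1:10]
end
@assert block == ones(10, 10)
```

This is the same chunked read/write capability the comment describes getting "for us" from an outside library.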
Thanks, @timholy!
We should try to extract some of the techniques we're using to provide a more general framework for doing this kind of array extraction.
I think it would be useful to have a new issue for the DataStream overhaul that you mention.
Given that we have hooked up HDFS (not HDF5!) to julia, it should be pretty straightforward to add DataFrames-like capabilities to datasets that are stored in large distributed filesystems. For example, HDFS provides an API to consume large files in chunks over a distributed cluster. It would be nice to expose this through a DataStream like API to the user, except that this is really a DistributedDataStream, but many of the operations one will want to do will be similar.
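One way to expose HDFS-style chunked consumption is as an iterator of record batches. Everything below is hypothetical (no DistributedDataStream type exists); it only sketches the shape such an API could take, using Julia's iteration protocol:

```julia
# Hypothetical sketch of a chunk-oriented stream over one file split.
struct ChunkStream
    data::Vector{Int}   # stands in for a file split on one HDFS node
    chunksize::Int
end

Base.length(s::ChunkStream) = cld(length(s.data), s.chunksize)

function Base.iterate(s::ChunkStream, offset = 0)
    offset >= length(s.data) && return nothing
    hi = min(offset + s.chunksize, length(s.data))
    return view(s.data, offset+1:hi), hi   # zero-copy view of the chunk
end

# A consumer sees fixed-size record batches, never the whole dataset.
s = ChunkStream(collect(1:10), 4)
@assert [length(c) for c in s] == [4, 4, 2]
@assert sum(sum, s) == sum(1:10)
```

A distributed version would hand each worker a ChunkStream over its local split, which is exactly the access pattern the HDFS chunk API provides.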
Viral, let's set up some time next week to strategize how we'll implement a simple type of DistributedDataStream.
I agree; we need a platform able to work transparently with data of any size, at least bigger than memory, and ideally distributed across several computers.
Solutions such as Spark are not complete; they only offer the basic functionality on which to build something else.
We don't just need to be able to get some summaries, as we do with databases; we need to be able to do all operations on big data, such as multiplying two big matrices, fitting mixed-effects models, running MCMC, etc.
bigmemory or ff let you do some simple things, but they cannot be used by other packages like lme4.
Closing. Can be reopened in the context of Feather.jl.