juliadb.jl's Introduction

JuliaDB is unmaintained. Please consider using DataFrames.jl or DTables.jl.

JuliaDB is a package for working with large persistent data sets.

We recognized the need for an all-Julia, end-to-end tool that can:

  • Load multi-dimensional datasets quickly and incrementally.
  • Index the data and perform filter, aggregate, sort and join operations.
  • Save results and load them efficiently later.
  • Use Julia's built-in parallelism to fully utilize any machine or cluster.

We built JuliaDB to fill this void.
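
As a taste of that workflow, here is a minimal sketch using JuliaDB's documented entry points (the data directory and column names are made up):

using JuliaDB

t = loadtable("data/")                                # load a directory of CSVs in parallel
big = filter(r -> r.volume > 10_000, t)               # filter rows
daily = groupreduce(+, big, :date; select = :volume)  # aggregate by an index column
save(daily, "daily.jdb")                              # save results in binary format
daily2 = load("daily.jdb")                            # load them efficiently later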

Get Started

Introduce yourself to JuliaDB's features at juliadb.org or jump into the documentation here!

juliadb.jl's People

Contributors

andreasnoack, andygreenwell, appleparan, asinghvi17, dilumaluthge, ericphanson, femtocleaner[bot], fud, github-actions[bot], iblislin, jeffbezanson, joshday, jpsamaroo, keno, mdpradeep, mikeinnes, piever, ppalmes, quinnj, racinmat, scottpjones, shashi, simonbyrne, tanmaykm, tkelman, zlatanvasovic

juliadb.jl's Issues

Register in METADATA.jl

Could you register this in METADATA? I have a very rough (=functional but not fast) integration with IterableTables and Query ready, but I can only merge (and test) that if JuliaDB is registered in METADATA.

DTable non-time-series operations

Tracking operations we need to build the next layer of functionality here.

  • getindex
  • select
  • convertdim/aggregate
  • merge
  • merge!
  • join
  • reducedim
  • reducedim_vec
  • permutedims
  • aggregate_vec
  • setindex!
  • update! ingest!
  • aggregate! aggregate
  • broadcast
  • mapslices

Won't implement:

  • where, pairs (because getindex already returns a kind of iterator)

Accumulator initialization in reduce

For non-standard reducers, the current version fails, e.g. if you would like to compute the extreme values (min and max) in a single pass.

(screenshot in the original issue: the failing reduce call, 2017-09-12)

This also fails for Base's reduce but it works if you provide an initial value, e.g.

julia> reduce((t,s) -> (min(t[1],s), max(t[2],s)), (0.0,0.0), randn(10))
(-1.9300544528466752, 2.919740428612819)
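
The same trick generalizes: seeding the accumulator from the data itself (rather than a fixed (0.0, 0.0), which would bias the result for all-positive or all-negative data) means the reducer always sees a (min, max) tuple, never a bare number. A minimal sketch, using the same 0.6-era positional-init form:

julia> xs = randn(10);

julia> reduce((t, s) -> (min(t[1], s), max(t[2], s)), (xs[1], xs[1]), xs)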

GUI

Capturing an excellent idea tossed by Jeff:

A GUI for JuliaDB should:

  • List available datasets
  • On the click of a button load the data into distributed memory
  • Optionally provide interfaces to do simple dataviz on the data

We could use a combination of WebIO + Escher + SomePlottingPackage for this. The same codebase should also allow us to run this on a webserver or in a native-looking app with Blink.

Unify loading functions into one `loadtable`

First, start by saving chunks (IndexedTables) as .itbl files so that we can distinguish them from other files (currently we don't use an extension).

loadtable([array of filenames]; chunks=nworkers())

loads the specified files in n chunks.

loadtable("dir"; chunks=nworkers()) # load a directory full of .csv files, or .itbl files (savetable)

This is equivalent to the current loadfiles and load functions.

loadtable("file.csv"; chunks=nworkers())  = loadfiles(["file.csv"], chunks=chunks)

juliadb.org intro: call --> ask

julia> # loadfiles can load the files in parallel
       fxdata = loadfiles(files, header_exists=false,
                          colnames=["pair", "timestamp", "bid", "call"],
                          indexcols=[1, 2])

# either "call" above should be "ask[ed]" or there should be an explanation

Corruption when calling ingest more than once

julia> for i in 1:2
         open("test$(i).txt", "w") do f
           write(f, "a\tb\n")
           write(f, "$(i)\t$(2i)\n")
         end
       end

julia> using JuliaDB

julia> ingest(["test1.txt"], "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 8 bytes in 1 batches...
DTable with 1 rows in 1 chunks:

  │ a  b
──┼─────
1 │ 1  2

julia> ingest(["test1.txt", "test2.txt"], "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 2 files can be loaded from cache.
Reading 2 csv files totalling 16 bytes in 1 batches...
DTable with 2 rows in 1 chunks:

  │ a                     b
──┼──────────────────────────
1 │ -4106979394849065894  512

If I then restart the REPL:

julia> using JuliaDB

julia> ingest(joinpath.("/Users/andreasnoack/Desktop", ["test1.txt", "test2.txt"][1:2]), "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 2 files can be loaded from cache.
Reading 2 csv files totalling 16 bytes in 1 batches...
DTable with 2 rows in 1 chunks:

  │ a  b
──┼─────
1 │ 1  2
2 │ 2  4

interpolation

In the TrueFX data there are currency pair and timestamp index columns. We would like to aggregate the data over 5-minute time windows, take out 6 currency pairs separately, and then match them up side by side so as to create a table with a timestamp index.

We may not have values available for every 5-minute time step for all currency pairs. It would be good to have a pluggable interpolation mechanism to fill in the blanks.

Note that it's also possible to deal with this situation by just doing an intersection join, which would keep only the time steps for which data for all 6 currency pairs are available.
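
A self-contained sketch of what one pluggable strategy could look like; fillgaps is a hypothetical helper (not part of JuliaDB), with linear interpolation as the default and any other interp function pluggable in its place:

using Base.Dates  # on Julia 1.0+ this is simply `using Dates`

function fillgaps(times::Vector{DateTime}, vals::Vector{Float64};
                  step = Minute(5), interp = (a, b, w) -> a + w * (b - a))
    grid = collect(times[1]:step:times[end])  # regular 5-minute grid
    out = similar(grid, Float64)
    j = 1
    for (i, t) in enumerate(grid)
        while j < length(times) && times[j + 1] <= t
            j += 1                            # advance to the last observation <= t
        end
        if times[j] == t || j == length(times)
            out[i] = vals[j]                  # exact hit, or carry the last value forward
        else
            w = (t - times[j]) / (times[j + 1] - times[j])  # fraction of the gap
            out[i] = interp(vals[j], vals[j + 1], w)
        end
    end
    grid, out
end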

cc @ViralBShah

Implement fit!(Series, DTable)

Right now we can't update a series. We can use merge, but that is not good for histograms, where the two histograms might not have the same bins.

installation issue

I'm trying to install JuliaDB and getting the following errors:

julia> Pkg.clone("https://github.com/JuliaComputing/JuliaDB.jl.git")
INFO: Cloning JuliaDB from https://github.com/JuliaComputing/JuliaDB.jl.git
INFO: Computing changes...
INFO: No packages to install, update or remove
INFO: Package database updated

julia> using JuliaDB
INFO: Precompiling module JuliaDB.
WARNING: could not import OnlineStatsBase.Series into JuliaDB
ERROR: LoadError: LoadError: ArgumentError: invalid type for argument series in method definition for aggregate_stats at /Users/ianmarshall/.julia/v0.6/JuliaDB/src/online-stats.jl:114
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:569
 [2] include(::String) at ./sysimg.jl:14
 [3] include_from_node1(::String) at ./loading.jl:569
 [4] include(::String) at ./sysimg.jl:14
 [5] anonymous at ./<missing>:2
while loading /Users/ianmarshall/.julia/v0.6/JuliaDB/src/online-stats.jl, in expression starting on line 521
while loading /Users/ianmarshall/.julia/v0.6/JuliaDB/src/JuliaDB.jl, in expression starting on line 25
ERROR: Failed to precompile JuliaDB to /Users/ianmarshall/.julia/lib/v0.6/JuliaDB.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:703
 [2] _require(::Symbol) at ./loading.jl:490
 [3] require(::Symbol) at ./loading.jl:398

This error seems to happen on Julia versions 0.6 and 0.6.1, and I've tried it both on Linux (Ubuntu 16) and Mac OS X 10.11.6. I've also tried to install via Pkg.add and encountered the same problem. Any ideas for how to get it to work would be appreciated. Thanks!

Redistribution

Use the psort algorithm from Dagger.

This requires (or could force) that the input is already computed.

Possible APIs: User specifies number of blocks or maximum size of block to output.

Simplify index datastructure

Using IndexedTable for the index has not been as helpful as I had hoped. It has led to lots of construction/deconstruction code that obfuscates the logic.

The current structure is this:

The datastructure is an IndexedTable with:

Index:

  • N index columns where each one is an Interval corresponding to the interval in the chunk

Data:

  • boundingrect: a column containing tuples of intervals to represent a bounding box
  • chunk: a column containing Chunk or Thunk objects
  • length: a column containing Nullable{Int} lengths of the chunks

I think this could be simplified to:

An IndexedTable mapping IndexSpace to Union{Chunk,Thunk}, where IndexSpace is ordered by its interval field (this should give the same ordering as the current one).
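
A rough sketch of that shape (type and field names are illustrative):

struct IndexSpace{I, B}
    interval::I          # interval covered by the chunk; defines the ordering
    boundingrect::B      # tuple of per-column intervals (the bounding box)
    nrows::Nullable{Int} # chunk length, if known
end

Base.isless(a::IndexSpace, b::IndexSpace) = isless(a.interval, b.interval)

# the index itself: an IndexedTable from IndexSpace keys to Union{Chunk, Thunk} values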

Any other suggestions are welcome!

Don't try to save which options were used to load the data

loadfiles hashes the options passed to it to make sure we don't use invalid cached data. We should instead just warn about this when using the cache, and allow the user to pass overwritecache=true to invalidate it. Otherwise, always load data from the cache.
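
In other words (overwritecache is the keyword proposed above, not an existing option):

loadfiles(files)                          # always reuse cached data when it exists
loadfiles(files, overwritecache = true)   # re-read the files and rebuild the cache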

Speed up rechunk

For datasets with thousands of chunks, finding the splitters can take a long time. The process is currently O(p log N), where p is the number of chunks and N is the number of elements in the biggest chunk.

Some ideas for improvements:

  • Treat subsets of sorted chunks as a single big sorted chunk (this will reduce p). (Not of much help in permutedims)
  • Fetch more than one candidate median in each round of communication. For example, fetching 32 candidates per round turns the binary search into a base-32 search, cutting the number of rounds 5-fold (since log₃₂ N = log₂ N / 5), assuming the cost of looking up the 32 index rows is negligible in comparison. See the round-count sketch below.
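
A back-of-the-envelope round count for the second idea:

# rounds of communication needed to search for a splitter when k candidate
# medians are fetched per round (k = 2 corresponds to the current scheme)
rounds(N, k) = ceil(Int, log(N) / log(k))

rounds(10^9, 2)   # => 30 rounds with one median per round
rounds(10^9, 32)  # => 6 rounds with 32 candidates per round: the 5-fold cut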

TypeError: non-boolean (Nullable{Bool}) used in boolean context

I think this is an issue in Dagger.jl.
So I actually checked out Dagger, IndexedTables, NamedTuples, but the problem remained:

TypeError: non-boolean (Nullable{Bool}) used in boolean context
refine_perm!(::Array{Int64,1}, ::NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}, ::Int64, ::NullableArrays.NullableArray{Int64,1}, ::NullableArrays.NullableArray{Int64,1}, ::Int64, ::Int64) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:101
refine_perm!(::Array{Int64,1}, ::NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}, ::Int64, ::Array{Int64,1}, ::NullableArrays.NullableArray{Int64,1}, ::Int64, ::Int64) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:109
sortperm(::IndexedTables.Columns{Tuple{Int64,Nullable{Int64},Nullable{Int64},Nullable{Int64},String,String,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Float64},Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Int64},String,Int64,Int64,String,Float64,String,Nullable{Float64}},NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}}) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:86
#IndexedTable#42(::Void, ::Bool, ::Bool, ::Type{T} where T, ::IndexedTables.Columns{NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Int64,Nullable{Int64},Nullable{Int64},Nullable{Int64},String,String,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Float64},Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Int64},String,Int64,Int64,String,Float64,String,Nullable{Float64}},NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Ar...

Deserializing DTable fails on Julia 0.6

The error says that an _NT_... type is not defined.

This is because DTable (and all the chunks) are parameterized by the NDSparse type of the chunks. Data of this type itself is not kept inside DTable.

I guess one possible approach is to serialize with just the abstract DTable type tag, look through the chunks to find all the _NT_... types, and store a list of their field names.

While deserializing, we read this list and then recreate those types using NamedTuples.create_tuple.
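
A very rough shape of the idea; the field names below are made up, and the create_tuple signature is assumed from the mention above:

using NamedTuples

# on save: walk the chunks and keep only each _NT_ type's field names
fields = [:time, :bid, :ask]          # e.g. collected from one chunk's eltype

# on load: recreate the _NT_time_bid_ask type before deserializing the chunks
T = NamedTuples.create_tuple(fields)  # exact signature assumed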

Merge vs Union

Is the merge function intended to enable SQL-style UNION operations?
If so, how should it handle empty arguments?

sliding window

We don't have this feature yet. It will have to be implemented in IndexedTables first.

loadfiles fails with more workers than files

Metadata for 0 / 25 files can be loaded from cache.
Reading 25 csv files totalling 478.168 MiB in 40 batches...
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
On worker 33:
AssertionError: !(isempty(files))
#csvread#32 at /home/andreasnoack/.julia/v0.6/TextParse/src/csv.jl:99

Add consistency of agg, _vec and vecagg API

I was looking for the thing that does the reducedim_vec equivalent for convertdim. convertdim has a vecagg keyword argument; it took me a while to remember that. Here is the current state of the aggregation API:

  1. reducedim(agg, ...)
  2. reducedim_vec(vecagg, ...)
  3. aggregate(agg, ...)
  4. aggregate_vec(vecagg, ...)
  5. convertdim(...; agg, vecagg)
  6. select(...; agg) but has no vecagg argument

We need to make this more consistent. The options I see are:

  1. convertdim(agg, t, d, xlat) and convertdim_vec(vecagg, t, d, xlat), and select(agg, t, where...) and select_vec(agg, t, where...). But having something named "select_vec" is very confusing. I'm also not crazy about the agg functions as the first arguments to these functions since their main intent is a "query" as opposed to reduction/aggregation.
  2. Add a vecagg keyword argument to select (sketched below), and probably to reducedim as well.
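
At the call site, option 2 would look something like this (the vecagg keyword on select is hypothetical):

select(t, 1, 2; agg = +)        # scalar reducer for duplicate index values (exists today)
select(t, 1, 2; vecagg = mean)  # hypothetical: whole-vector reducer, mirroring convertdim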

support other chunk types

We should be able to support AxisArrays as chunks instead of IndexedTables. Ideally, they should be able to implement the same interface but with dense vs. sparse storage.

Reducedim issue

The reducedim function errors out with a StackOverflowError when the aggregation function is executed more than 99 times.

Example:

t = distribute(IndexedTable([fill(1,100000); fill(2,100000); fill(3,100000);fill(4,100000);fill(5,100000)],
                            rand(Float64,500000), fill(1,500000),
                            MVector{64,Float64}[rand(Float64, 64) for n in 1:500000]), 2)

compute(reducedim((x,y) -> MVector(x+y), t, 1)) # works

compute(reducedim((x,y) -> MVector(x+y), t, 2)) # StackOverflowError; complains the function executed > 99 times

Distributed Table

We need to be able to use multi-processing and GPUs to get speedups. The easiest route is by using Dagger. This also gives us things like out-of-core and lazy joins for free.

Work items here are:

  • Define Dagger's distribution interface for IndexedTables
  • Use efficient interval lookup (IntervalTrees.jl or a hand-rolled cache-friendly implementation) to look up parts containing an index or index range; alternatively, use an NDSparse of intervals as the index
  • Allow creation of a distributed table from a list of CSV files (providing API to cache metadata for future reads) -> will work in combination with Glob.glob for directories
  • Implement all the IndexedTable operations on DistributedTable

Progress bar

It will be especially nice for operations that involve rechunk.

Make sure cached data is reused every time

Right now, for a DTable whose parts are already computed, we can't be sure that the scheduler picks the same process to perform the next operation.

We should track and use some kind of affinity information for each chunk.

use column names from TextParse

TextParse returns raw data and column names. Right now JuliaDB ignores the column names and creates the output table without named columns.
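
A minimal sketch of wiring the names through, assuming csvread's (columns, names) return value and the Columns names keyword of that era; the index/data split below is illustrative:

using TextParse, IndexedTables

cols, colnames = csvread("file.csv")  # tuple of column vectors plus header names
syms = map(Symbol, colnames)

# first column as the index, the rest as named data columns
t = IndexedTable(Columns(cols[1]),
                 Columns(cols[2:end]..., names = syms[2:end]))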

BoundsError: attempt to access 0-element Array{Any,1} at index [0]

Another Dagger.jl-related error.

julia> sampledata = loadfiles(path, indexcols=["date", "ticker"])
Metadata for 6 / 7 files can be loaded from cache.
Reading 1 csv files totalling 63 bytes...
ERROR: BoundsError: attempt to access 0-element Array{Any,1} at index [0]
#_csvread#28(::Char, ::Char, ::Array{Any,1}, ::Array{Any,1}, ::Bool, ::Int64, ::Bool, ::Array{String,1}, ::Array{Any,1}, ::Int64, ::TextParse.#_csvread, ::WeakRefString{UInt8}, ::Char) at /home/jn/.julia/v0.7/TextParse/src/csv.jl:104
#csvread#27(::Array{Any,1}, ::Function, ::IOStream, ::Char) at /home/jn/.julia/v0.7/TextParse/src/csv.jl:64
#25 at /home/jn/.julia/v0.7/TextParse/src/csv.jl:58 [inlined]
open(::TextParse.##25#26{Array{Any,1},Char}, ::String, ::String) at ./iostream.jl:152
#load_table#7(::Array{String,1}, ::Void, ::Void, ::Bool, ::Bool, ::TextParse.#csvread, ::Array{Any,1}, ::Function, ::String, ::Char) at /home/jn/.julia/v0.7/JuliaDB/src/util.jl:152
(::JuliaDB.#kw##load_table)(::Array{Any,1}, ::JuliaDB.#load_table, ::String, ::Char) at ./<missing>:0
_gather(::Dagger.Context, ::JuliaDB.CSVChunk) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:185
#makecsvchunk#112(::Bool, ::Array{Any,1}, ::Function, ::String, ::Char) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:210
(::JuliaDB.#kw##makecsvchunk)(::Array{Any,1}, ::JuliaDB.#makecsvchunk, ::String, ::Char) at ./<missing>:0
(::JuliaDB.#load_f#92{Array{Any,1},Char})(::String) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:107
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{String}, ::Bool, ::Bool) at /home/jn/.julia/v0.7/Dagger/src/compute.jl:449
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /home/jn/.julia/v0.7/Dagger/src/compute.jl:462 [inlined]
(::Dagger.##122#123{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Channel{Any},Bool,Bool})() at ./event.jl:73

Calling getdatacol not working when table has only one data column

The following example uses a set of foreign exchange data in JuliaDB.

When attempting to access the data from a table that has only one data column, the getdatacol function currently throws an error. The equivalent (and desired) operation, using gather and accessing the data field, is shown below.

julia> bids_5m["EUR/USD",t_5m]
DTable with 1 chunks:

───────────────────────────────┬────────
"EUR/USD"  2016-06-15T00:00:001.12111
"EUR/USD"  2016-06-15T00:05:001.12114
"EUR/USD"  2016-06-15T00:10:001.12116
"EUR/USD"  2016-06-15T00:15:001.12125
"EUR/USD"  2016-06-15T00:20:001.12107
...

julia> getdatacol(bids["EUR/USD",t_5m],1)
ERROR: type Array has no field columns
#145 at /Users/andy/.julia/v0.6/JuliaDB/src/dcolumns.jl:24
do_task at /Users/andy/.julia/v0.6/Dagger/src/compute.jl:449
#106 at ./distributed/process_messages.jl:268 [inlined]
run_work_thunk at ./distributed/process_messages.jl:56
macro expansion at ./distributed/process_messages.jl:268 [inlined]
#105 at ./event.jl:73

julia> gather(bids["EUR/USD",t_5m]).data
28-element Array{Float32,1}:
 1.12105
 1.11953
 1.12215
 1.12903
 1.12903
 1.12543
 1.1285 
 1.13747
 1.13382
 1.1334 
 1.13165
 1.13246
 1.12548
 1.12915
 1.13079
 1.13789
 1.11447
 1.10834
 1.10832
 1.1046 
 1.10559
 1.10646
 1.10589
 1.10787
 1.10855
 1.10996
 1.11318
 1.1102 

pick example from the docs doesn't work

This example is from the map and reduce section of the docs.

julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/andrew/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> path = Pkg.dir("JuliaDB", "test", "sample")
"/home/andrew/.julia/v0.6/JuliaDB/test/sample"

julia> sampledata = loadfiles(path, indexcols=["date", "ticker"])
Metadata for 6 / 6 files can be loaded from cache.
DTable with 288 rows in 6 chunks:

date        ticker  │ open     high     low     close   volume
────────────────────┼────────────────────────────────────────────
2010-01-01  "GOOGL" │ 626.95   629.51   540.99  626.75  1.78022e8
2010-01-01  "GS"    │ 170.05   178.75   154.88  173.08  2.81862e8
2010-01-01  "KO"    │ 57.16    57.4301  54.94   57.04   1.92693e8
2010-01-01  "XRX"   │ 8.54     9.48     8.91    8.63    3.00838e8
2010-02-01  "GOOGL" │ 534.602  547.5    531.75  533.02  1.03964e8
...

julia> volumes = map(pick(:volume), sampledata)
Error showing value of type JuliaDB.DTable{Tuple{Date,String},NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64}}:
ERROR: MethodError: no method matching (::IndexedTables.Proj{:volume})(::IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}})
(::JuliaDB.##41#42{IndexedTables.Proj{:volume}})(::IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}}) at /home/andrew/.julia/v0.6/JuliaDB/src/dtable.jl:137
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}}, ::Bool, ::Bool) at /home/andrew/.julia/v0.6/Dagger/src/compute.jl:295
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /home/andrew/.julia/v0.6/Dagger/src/compute.jl:308 [inlined]
(::Dagger.##69#70{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Channel{Any},Bool,Bool})() at ./event.jl:73

The example for reduce(+, map(pick(:volume), sampledata)) doesn't work either, but the error message is already printed in the docs.

Storage format

The consensus seems to be to use Parquet.jl.

The package can currently only read Parquet files; we will need to implement writing.

Note that we can use Julia serialization for this in the meantime (sketch below).
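
The stopgap is just Base's serializer (the format is Julia-version specific, so this suits caching rather than archival storage); tbl below stands for any previously constructed table:

# on Julia 0.7+, run `using Serialization` first
open(io -> serialize(io, tbl), "table.jls", "w")  # write
tbl2 = open(deserialize, "table.jls")             # read back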

cc @tanmaykm

Unable to change colparsers after first call to loadfiles

julia> open("tmp.csv", "w") do io
         write(io, "1, 2\n")
       end;

julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/andreasnoack/.julia/v0.6/NullableArrays/src/operators.jl:99.

julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Int], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:

  │ 1  2
──┼─────
1 │ 1  2

julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Float64], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:

  │ 1  2
──┼─────
1 │ 1  2

After restarting Julia in a fresh session:
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-apple-darwin16.7.0

julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/andreasnoack/.julia/v0.6/NullableArrays/src/operators.jl:99.

julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Float64], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:

  │ 1    2
──┼───────
1 │ 1.0  2

Advanced docs

  • implicit indexing
  • IndexedTable construction and distribute
  • getdatacol, getindexcol
  • lazy views
  • rechunk

Parsing error when using loadfiles

Sorry I don't have more of a MWE, but this is the best I could do.
Version 0.6.0 (2017-06-19 13:05 UTC)

julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/bromberger1/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> sampledata=loadfiles(["knownbad.csv"])
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 52.301 KiB...
Could not determine which type to promote column to.
ERROR: Parse error at line 2 at char 117:
..., and People's Region",,,Awasa,,Africa/Addis_Ababa...
____________________________________________________^
CSV column 2 is expected to be: TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}(<string>, true, true, false, false, Nullable{Char}(','))
parsefill!(::WeakRefString{UInt8}, ::TextParse.LocalOpts, ::TextParse.Record{Tuple{TextParse.Field{Nullable{Int64},TextParse.NAToken{Nullable{Int64},TextParse.Numeric{Int64}}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.Quoted{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{Nullable{Union{}},TextParse.NAToken{Nullable{Union{}},TextParse.Unknown}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}},Tuple{Nullable{Int64},TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,Nullable{Union{}},TextParse.StrRange}}, ::Int64, ::Tuple{NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},NullableArrays.NullableArray{Void,1},Array{String,1}}, ::Int64, ::Int64, ::Int64, ::Int64) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:323
#_csvread#28(::Char, ::Char, ::Array{Any,1}, ::Array{Any,1}, ::Bool, ::Int64, ::Bool, ::Array{String,1}, ::Array{Any,1}, ::Int64, ::TextParse.#_csvread, ::WeakRefString{UInt8}, ::Char) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:119
#csvread#27(::Array{Any,1}, ::Function, ::IOStream, ::Char) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:64
#25 at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:58 [inlined]
open(::TextParse.##25#26{Array{Any,1},Char}, ::String, ::String) at ./iostream.jl:152
#_load_table#10(::Array{Any,1}, ::Void, ::Void, ::Bool, ::Bool, ::TextParse.#csvread, ::Array{Any,1}, ::Function, ::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/util.jl:150
_load_table(::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/util.jl:150
_collect(::Dagger.Context, ::JuliaDB.CSVChunk) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:185
#makecsvchunk#145(::Bool, ::Array{Any,1}, ::Function, ::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:210
(::JuliaDB.#load_f#125{Array{Any,1},Char})(::String) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:107
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{String}, ::Bool, ::Bool) at /Users/bromberger1/.julia/v0.6/Dagger/src/compute.jl:295
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /Users/bromberger1/.julia/v0.6/Dagger/src/compute.jl:308 [inlined]
(::Dagger.##69#70{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Channel{Any},Bool,Bool})() at ./event.jl:73

The CSV file is attached.
knownbad.csv.zip

Edit: readcsv doesn't have a problem with the file.

RFC: Major API changes

After watching people use JuliaDB in the wild, and noting the common roadblocks to mastering the framework, it has become clear that we need to make the API more relatable for someone with a relational database background (which describes most people who want to use the package).

IndexedTable vs N-d sparse structure

The IndexedTable type as it stands conflates a table structure with that of an N-d sparse array. These two structures can be separated:

  1. A table in which rows are sorted by the first few columns; it is iterated over by rows, and operations make use of sortedness for speed.
  2. An N-d sparse data cube, iterated over its non-indexed values.

One could argue that a 1:n-indexed 2) acts as a table in the traditional sense. This is true, but it does not capture sorting as in 1). 2) is just a simple wrapper of 1) which defines getindex to work on the indexed values, and with that as a basis defines array operations like map, reduce, broadcast, reducedim and so on. 1) may not enforce that indexed values are unique, while 2) must. By default, relational join operations on 1) join non-indexed columns based on indexed columns (as is the case now), but this should be configurable (which is being implemented in JuliaData/IndexedTables.jl#79). It seems fine to let 2) also be used in relational operations by just forwarding them to the wrapped 1) object.

My opinion is that we should call 2) (which is most similar to the current IndexedTable) NDSparse, and name 1) IndexedTable, with a deprecation step in which we first deprecate IndexedTable to NDSparse and then bring back IndexedTable as 1).

Credit for some of this thinking goes to @andyferris. An advantage of these changes is that they give things proper names, making them easier to explain.

AxisArrays vs N-d sparse data

AxisArrays could act as dense versions of 2) above. For an uncomplicated implementation and mental model, we need to figure out one main issue:

AxisArrays have the lowest stride in the first dimension, while the current IndexedTable (the future N-d sparse) has it the other way around.

An interesting fact is that AxisArrays-based 2) can also be thought of as a relational table indexed by some columns (i.e. 1)) and relational operations can thus be implemented.

The proposed changes:

  • Deprecate the current IndexedTable name in favor of NDSparse. This is a reversion to the old, and arguably more accurate, name.
  • Bring back IndexedTable as a row-iterated table, indexed by integers, where indexing with an integer gives a row as a named tuple. Some of the columns may be indexed; below is a rough sketch of what this could look like:
struct Perm{X} # sortperm of sorting by a subset of columns
    columns::Vector{Int} # which columns (integer indices) are indexed
    perm::X # sortperm. OneTo(n) denotes that the table is already sorted by these columns
end

struct IndexedTable{C<:Columns}
    columns::C
    perms::Vector{Perm} # any number of these
    cardinality::Vector{Nullable{Float64}} # cached cardinality for each column, to be used in optimizations

    ## temporary buffer
    columns_buffer::C # merged into columns when flush! is called, favors first OneTo ordered perm in insertion.
end
  • Implement relational operations on this type. This would involve reusing existing implementations but exploring rows in different permutation orders, as in JuliaData/IndexedTables.jl#79
  • Rename current distributed table type DTable to DNDSparse.
  • Bring back DTable as the distributed version of the new IndexedTable

cc @JeffBezanson @StefanKarpinski @ViralBShah @andyferris @andreasnoack @aviks @simonbyrne
