juliadb.jl's Introduction

JuliaDB is unmaintained. Please consider using DataFrames.jl or DTables.jl.

JuliaDB is a package for working with large persistent data sets.

We recognized the need for an all-Julia, end-to-end tool that can:

  • Load multi-dimensional datasets quickly and incrementally.
  • Index the data and perform filter, aggregate, sort and join operations.
  • Save results and load them efficiently later.
  • Use Julia's built-in parallelism to fully utilize any machine or cluster.

We built JuliaDB to fill this void.
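
As a taste of that workflow, here is a minimal sketch using JuliaDB's documented entry points (the data directory and column names are made up):

using JuliaDB

t = loadtable("data/")                                # load a directory of CSVs in parallel
big = filter(r -> r.volume > 10_000, t)               # filter rows
daily = groupreduce(+, big, :date; select = :volume)  # aggregate by an index column
save(daily, "daily.jdb")                              # save results in binary format
daily2 = load("daily.jdb")                            # load them efficiently later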

Get Started

Introduce yourself to JuliaDB's features at juliadb.org or jump into the documentation here!

juliadb.jl's People

Contributors

andreasnoack, andygreenwell, appleparan, asinghvi17, dilumaluthge, ericphanson, femtocleaner[bot], fud, github-actions[bot], iblislin, jeffbezanson, joshday, jpsamaroo, keno, mdpradeep, mikeinnes, piever, ppalmes, quinnj, racinmat, scottpjones, shashi, simonbyrne, tanmaykm, tkelman, zlatanvasovic

juliadb.jl's Issues

Register in METADATA.jl

Could you register this in METADATA? I have a very rough (=functional but not fast) integration with IterableTables and Query ready, but I can only merge (and test) that if JuliaDB is registered in METADATA.

DTable non-time-series operations

Tracking operations we need to build the next layer of functionality here.

  • getindex
  • select
  • convertdim/aggregate
  • merge
  • merge!
  • join
  • reducedim
  • reducedim_vec
  • permutedims
  • aggregate_vec
  • setindex!
  • update! ingest!
  • aggregate! aggregate
  • broadcast
  • mapslices

Won't implement:

  • where, pairs (because getindex already returns a kind of iterator)

Accumulator initialization in reduce

For non-standard reducers, the current version fails, e.g. if you would like to compute the extreme values (min and max) in a single pass.

(screenshot in the original issue: the failing reduce call, 2017-09-12)

This also fails for Base's reduce but it works if you provide an initial value, e.g.

julia> reduce((t,s) -> (min(t[1],s), max(t[2],s)), (0.0,0.0), randn(10))
(-1.9300544528466752, 2.919740428612819)
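
The same trick generalizes: seeding the accumulator from the data itself (rather than a fixed (0.0, 0.0), which would bias the result for all-positive or all-negative data) means the reducer always sees a (min, max) tuple, never a bare number. A minimal sketch, using the same 0.6-era positional-init form:

julia> xs = randn(10);

julia> reduce((t, s) -> (min(t[1], s), max(t[2], s)), (xs[1], xs[1]), xs)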

GUI

Capturing an excellent idea tossed by Jeff:

A GUI for JuliaDB should:

  • List available datasets
  • On the click of a button load the data into distributed memory
  • Optionally provide interfaces to do simple dataviz on the data

We could use a combination of WebIO + Escher + SomePlottingPackage for this. The same codebase should also allow us to run this on a webserver or in a native-looking app with Blink.

Unify loading functions into one `loadtable`

First, start by saving chunks (IndexedTables) as .itbl files so that we can distinguish them from other files (currently we don't use an extension).

loadtable([array of filenames]; chunks=nworkers())

loads the specified files in n chunks.

loadtable("dir"; chunks=nworkers()) # load a directory full of .csv files, or .itbl files (savetable)

This is equivalent to the current loadfiles and load functions.

loadtable("file.csv"; chunks=nworkers())  = loadfiles(["file.csv"], chunks=chunks)

juliadb.org intro: call --> ask

julia> # loadfiles can load the files in parallel
       fxdata = loadfiles(files, header_exists=false,
                          colnames=["pair", "timestamp", "bid", "call"],
                          indexcols=[1, 2])

# either "call" above should be "ask[ed]" or there should be an explanation

Corruption when calling ingest more than once

julia> for i in 1:2
         open("test$(i).txt", "w") do f
           write(f, "a\tb\n")
           write(f, "$(i)\t$(2i)\n")
         end
       end

julia> using JuliaDB

julia> ingest(["test1.txt"], "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 8 bytes in 1 batches...
DTable with 1 rows in 1 chunks:

  │ a  b
──┼─────
1 │ 1  2

julia> ingest(["test1.txt", "test2.txt"], "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 2 files can be loaded from cache.
Reading 2 csv files totalling 16 bytes in 1 batches...
DTable with 2 rows in 1 chunks:

  │ a                     b
──┼──────────────────────────
1 │ -4106979394849065894  512

If I then restart the REPL:

julia> using JuliaDB

julia> ingest(joinpath.("/Users/andreasnoack/Desktop", ["test1.txt", "test2.txt"][1:2]), "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 2 files can be loaded from cache.
Reading 2 csv files totalling 16 bytes in 1 batches...
DTable with 2 rows in 1 chunks:

  │ a  b
──┼─────
1 │ 1  2
2 │ 2  4

interpolation

In the TrueFX data there are currency pair and timestamp index columns. We would like to aggregate the data over 5-minute time windows, take out 6 currency pairs separately, and then match them up side by side so as to create a table with a timestamp index.

We may not have values available for every 5-minute time step for all currency pairs. It would be good to have a pluggable interpolation mechanism to fill in the blanks.

Note that it's also possible to deal with this situation by just doing an intersection join, which would keep only the time steps for which data for all 6 currency pairs are available.
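
A self-contained sketch of what one pluggable strategy could look like; fillgaps is a hypothetical helper (not part of JuliaDB), with linear interpolation as the default and any other interp function pluggable in its place:

using Base.Dates  # on Julia 1.0+ this is simply `using Dates`

function fillgaps(times::Vector{DateTime}, vals::Vector{Float64};
                  step = Minute(5), interp = (a, b, w) -> a + w * (b - a))
    grid = collect(times[1]:step:times[end])  # regular 5-minute grid
    out = similar(grid, Float64)
    j = 1
    for (i, t) in enumerate(grid)
        while j < length(times) && times[j + 1] <= t
            j += 1                            # advance to the last observation <= t
        end
        if times[j] == t || j == length(times)
            out[i] = vals[j]                  # exact hit, or carry the last value forward
        else
            w = (t - times[j]) / (times[j + 1] - times[j])  # fraction of the gap
            out[i] = interp(vals[j], vals[j + 1], w)
        end
    end
    grid, out
end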

cc @ViralBShah

Implement fit!(Series, DTable)

Right now we can't update a series. We can use merge, but that is not good for histograms, where the two histograms might not have the same bins.

installation issue

I'm trying to install JuliaDB and getting the following errors:

julia> Pkg.clone("https://github.com/JuliaComputing/JuliaDB.jl.git")
INFO: Cloning JuliaDB from https://github.com/JuliaComputing/JuliaDB.jl.git
INFO: Computing changes...
INFO: No packages to install, update or remove
INFO: Package database updated

julia> using JuliaDB
INFO: Precompiling module JuliaDB.
WARNING: could not import OnlineStatsBase.Series into JuliaDB
ERROR: LoadError: LoadError: ArgumentError: invalid type for argument series in method definition for aggregate_stats at /Users/ianmarshall/.julia/v0.6/JuliaDB/src/online-stats.jl:114
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:569
 [2] include(::String) at ./sysimg.jl:14
 [3] include_from_node1(::String) at ./loading.jl:569
 [4] include(::String) at ./sysimg.jl:14
 [5] anonymous at ./<missing>:2
while loading /Users/ianmarshall/.julia/v0.6/JuliaDB/src/online-stats.jl, in expression starting on line 521
while loading /Users/ianmarshall/.julia/v0.6/JuliaDB/src/JuliaDB.jl, in expression starting on line 25
ERROR: Failed to precompile JuliaDB to /Users/ianmarshall/.julia/lib/v0.6/JuliaDB.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:703
 [2] _require(::Symbol) at ./loading.jl:490
 [3] require(::Symbol) at ./loading.jl:398

This error seems to happen on Julia versions 0.6 and 0.6.1, and I've tried it both on Linux (Ubuntu 16) and Mac OS X 10.11.6. I've also tried to install via Pkg.add and encountered the same problem. Any ideas for how to get it to work would be appreciated. Thanks!

Redistribution

Use the psort algorithm from Dagger.

This requires (or could force) that the input is already computed.

Possible APIs: User specifies number of blocks or maximum size of block to output.

Simplify index datastructure

Using IndexedTable for the index has not been as helpful as I had hoped. It has led to lots of construction/deconstruction code that obfuscates the logic.

The current structure is this:

The datastructure is an IndexedTable with:

Index:

  • N index columns where each one is an Interval corresponding to the interval in the chunk

Data:

  • boundingrect: a column containing tuples of intervals to represent a bounding box
  • chunk: a column containing Chunk or Thunk objects
  • length: a column containing Nullable{Int} lengths of the chunks

I think this could be simplified to:

An IndexedTable mapping IndexSpace to Union{Chunk,Thunk}, where IndexSpace is ordered by its interval field (this should give the same ordering as the current one).
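
A rough sketch of that shape (type and field names are illustrative):

struct IndexSpace{I, B}
    interval::I          # interval covered by the chunk; defines the ordering
    boundingrect::B      # tuple of per-column intervals (the bounding box)
    nrows::Nullable{Int} # chunk length, if known
end

Base.isless(a::IndexSpace, b::IndexSpace) = isless(a.interval, b.interval)

# the index itself: an IndexedTable from IndexSpace keys to Union{Chunk, Thunk} values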

Any other suggestions are welcome!

Don't try to save which options were used to load the data

loadfiles hashes the options passed to it to make sure we don't use invalid cached data. We should instead just warn about this when using the cache, and allow the user to pass overwritecache=true to invalidate it. Otherwise, always load data from the cache.
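
In other words (overwritecache is the keyword proposed above, not an existing option):

loadfiles(files)                          # always reuse cached data when it exists
loadfiles(files, overwritecache = true)   # re-read the files and rebuild the cache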

Speed up rechunk

For datasets with thousands of chunks, finding the splitters can take a long time. The process is currently O(p log N), where p is the number of chunks and N is the number of elements in the biggest chunk.

Some ideas for improvements:

  • Treat subsets of sorted chunks as a single big sorted chunk (this will reduce p). (Not of much help in permutedims)
  • Fetch more than one candidate median in each round of communication. For example, fetching 32 candidates per round turns the binary search into a base-32 search, cutting the number of rounds 5-fold (since log₃₂ N = log₂ N / 5), assuming the cost of looking up the 32 index rows is negligible in comparison. See the round-count sketch below.
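
A back-of-the-envelope round count for the second idea:

# rounds of communication needed to search for a splitter when k candidate
# medians are fetched per round (k = 2 corresponds to the current scheme)
rounds(N, k) = ceil(Int, log(N) / log(k))

rounds(10^9, 2)   # => 30 rounds with one median per round
rounds(10^9, 32)  # => 6 rounds with 32 candidates per round: the 5-fold cut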

TypeError: non-boolean (Nullable{Bool}) used in boolean context

I think this is an issue in Dagger.jl.
So I actually checked out Dagger, IndexedTables, NamedTuples, but the problem remained:

TypeError: non-boolean (Nullable{Bool}) used in boolean context
refine_perm!(::Array{Int64,1}, ::NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}, ::Int64, ::NullableArrays.NullableArray{Int64,1}, ::NullableArrays.NullableArray{Int64,1}, ::Int64, ::Int64) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:101
refine_perm!(::Array{Int64,1}, ::NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}, ::Int64, ::Array{Int64,1}, ::NullableArrays.NullableArray{Int64,1}, ::Int64, ::Int64) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:109
sortperm(::IndexedTables.Columns{Tuple{Int64,Nullable{Int64},Nullable{Int64},Nullable{Int64},String,String,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Float64},Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Int64},String,Int64,Int64,String,Float64,String,Nullable{Float64}},NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}}) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:86
#IndexedTable#42(::Void, ::Bool, ::Bool, ::Type{T} where T, ::IndexedTables.Columns{NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Int64,Nullable{Int64},Nullable{Int64},Nullable{Int64},String,String,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Float64},Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Int64},String,Int64,Int64,String,Float64,String,Nullable{Float64}},NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Ar...

Deserializing DTable fails on Julia 0.6

The error says that an _NT_... type is not defined.

This is because DTable (and all the chunks) are parameterized by the NDSparse type of the chunks. Data of this type itself is not kept inside DTable.

I guess one possible approach is to serialize with just the abstract DTable type tag, look through the chunks to find all the _NT_... types, and store a list of their field names.

While deserializing, we read this list and then recreate those types using NamedTuples.create_tuple.
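
A very rough shape of the idea; the field names below are made up, and the create_tuple signature is assumed from the mention above:

using NamedTuples

# on save: walk the chunks and keep only each _NT_ type's field names
fields = [:time, :bid, :ask]          # e.g. collected from one chunk's eltype

# on load: recreate the _NT_time_bid_ask type before deserializing the chunks
T = NamedTuples.create_tuple(fields)  # exact signature assumed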

Merge vs Union

Is the merge function intended to enable SQL-style UNION operations?
If so, how should it handle empty arguments?

sliding window

We don't have this feature yet. It will have to be implemented in IndexedTables first.

loadfiles fails with more workers than files

Metadata for 0 / 25 files can be loaded from cache.
Reading 25 csv files totalling 478.168 MiB in 40 batches...
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
Error reading file String[]
On worker 33:
AssertionError: !(isempty(files))
#csvread#32 at /home/andreasnoack/.julia/v0.6/TextParse/src/csv.jl:99

Add consistency of agg, _vec and vecagg API

I was looking for the thing that does the reducedim_vec equivalent for convertdim. convertdim has a vecagg keyword argument; it took me a while to remember that. Here is the current state of the aggregation API:

  1. reducedim(agg, ...)
  2. reducedim_vec(vecagg, ...)
  3. aggregate(agg, ...)
  4. aggregate_vec(vecagg, ...)
  5. convertdim(...; agg, vecagg)
  6. select(...; agg) but has no vecagg argument

We need to make this more consistent. The options I see are:

  1. convertdim(agg, t, d, xlat) and convertdim_vec(vecagg, t, d, xlat), and select(agg, t, where...) and select_vec(agg, t, where...). But having something named "select_vec" is very confusing. I'm also not crazy about the agg functions as the first arguments to these functions since their main intent is a "query" as opposed to reduction/aggregation.
  2. Add a vecagg keyword argument to select (sketched below), and probably to reducedim as well.
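
At the call site, option 2 would look something like this (the vecagg keyword on select is hypothetical):

select(t, 1, 2; agg = +)        # scalar reducer for duplicate index values (exists today)
select(t, 1, 2; vecagg = mean)  # hypothetical: whole-vector reducer, mirroring convertdim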

support other chunk types

We should be able to support AxisArrays as chunks instead of IndexedTables. Ideally, they should be able to implement the same interface but with dense vs. sparse storage.

Reducedim issue

The reducedim function errors out with a StackOverflowError when the aggregation function is executed more than 99 times.

Example:

t = distribute(IndexedTable([fill(1,100000); fill(2,100000); fill(3,100000);fill(4,100000);fill(5,100000)],
                            rand(Float64,500000), fill(1,500000),
                            MVector{64,Float64}[rand(Float64, 64) for n in 1:500000]), 2)

compute(reducedim((x,y) -> MVector(x+y), t, 1)) # works

compute(reducedim((x,y) -> MVector(x+y), t, 2)) # StackOverflowError; complains the function executed > 99 times

Distributed Table

We need to be able to use multi-processing and GPUs to get speedups. The easiest route is by using Dagger. This also gives us things like out-of-core and lazy joins for free.

Work items here are:

  • Define Dagger's distribution interface for IndexedTables
  • Use efficient interval lookup (IntervalTrees.jl or a hand-rolled cache-friendly implementation) to look up parts containing an index or index range; alternatively, use an NDSparse of intervals as the index
  • Allow creation of a distributed table from a list of CSV files (providing API to cache metadata for future reads) -> will work in combination with Glob.glob for directories
  • Implement all the IndexedTable operations on DistributedTable

Progress bar

It will be especially nice for operations that involve rechunk.

Make sure cached data is reused every time

Right now, for a DTable whose parts are already computed, we can't be sure that the scheduler picks the same process to perform the next operation.

We should track and use some kind of affinity information for each chunk.

use column names from TextParse

TextParse returns raw data and column names. Right now JuliaDB ignores the column names and creates the output table without named columns.
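
A minimal sketch of wiring the names through, assuming csvread's (columns, names) return value and the Columns names keyword of that era; the index/data split below is illustrative:

using TextParse, IndexedTables

cols, colnames = csvread("file.csv")  # tuple of column vectors plus header names
syms = map(Symbol, colnames)

# first column as the index, the rest as named data columns
t = IndexedTable(Columns(cols[1]),
                 Columns(cols[2:end]..., names = syms[2:end]))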

BoundsError: attempt to access 0-element Array{Any,1} at index [0]

Another Dagger.jl-related error.

julia> sampledata = loadfiles(path, indexcols=["date", "ticker"])
Metadata for 6 / 7 files can be loaded from cache.
Reading 1 csv files totalling 63 bytes...
ERROR: BoundsError: attempt to access 0-element Array{Any,1} at index [0]
#_csvread#28(::Char, ::Char, ::Array{Any,1}, ::Array{Any,1}, ::Bool, ::Int64, ::Bool, ::Array{String,1}, ::Array{Any,1}, ::Int64, ::TextParse.#_csvread, ::WeakRefString{UInt8}, ::Char) at /home/jn/.julia/v0.7/TextParse/src/csv.jl:104
#csvread#27(::Array{Any,1}, ::Function, ::IOStream, ::Char) at /home/jn/.julia/v0.7/TextParse/src/csv.jl:64
#25 at /home/jn/.julia/v0.7/TextParse/src/csv.jl:58 [inlined]
open(::TextParse.##25#26{Array{Any,1},Char}, ::String, ::String) at ./iostream.jl:152
#load_table#7(::Array{String,1}, ::Void, ::Void, ::Bool, ::Bool, ::TextParse.#csvread, ::Array{Any,1}, ::Function, ::String, ::Char) at /home/jn/.julia/v0.7/JuliaDB/src/util.jl:152
(::JuliaDB.#kw##load_table)(::Array{Any,1}, ::JuliaDB.#load_table, ::String, ::Char) at ./<missing>:0
_gather(::Dagger.Context, ::JuliaDB.CSVChunk) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:185
#makecsvchunk#112(::Bool, ::Array{Any,1}, ::Function, ::String, ::Char) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:210
(::JuliaDB.#kw##makecsvchunk)(::Array{Any,1}, ::JuliaDB.#makecsvchunk, ::String, ::Char) at ./<missing>:0
(::JuliaDB.#load_f#92{Array{Any,1},Char})(::String) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:107
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{String}, ::Bool, ::Bool) at /home/jn/.julia/v0.7/Dagger/src/compute.jl:449
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /home/jn/.julia/v0.7/Dagger/src/compute.jl:462 [inlined]
(::Dagger.##122#123{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Channel{Any},Bool,Bool})() at ./event.jl:73

Calling getdatacol not working when table has only one data column

The following example uses a set of foreign exchange data in JuliaDB.

When attempting to access the data from a table that has only one data column, the getdatacol function currently throws an error. The equivalent (and desired) operation, using gather and accessing the data field, is shown below.

julia> bids_5m["EUR/USD",t_5m]
DTable with 1 chunks:

───────────────────────────────┬────────
"EUR/USD"  2016-06-15T00:00:001.12111
"EUR/USD"  2016-06-15T00:05:001.12114
"EUR/USD"  2016-06-15T00:10:001.12116
"EUR/USD"  2016-06-15T00:15:001.12125
"EUR/USD"  2016-06-15T00:20:001.12107
...

julia> getdatacol(bids["EUR/USD",t_5m],1)
ERROR: type Array has no field columns
#145 at /Users/andy/.julia/v0.6/JuliaDB/src/dcolumns.jl:24
do_task at /Users/andy/.julia/v0.6/Dagger/src/compute.jl:449
#106 at ./distributed/process_messages.jl:268 [inlined]
run_work_thunk at ./distributed/process_messages.jl:56
macro expansion at ./distributed/process_messages.jl:268 [inlined]
#105 at ./event.jl:73

julia> gather(bids["EUR/USD",t_5m]).data
28-element Array{Float32,1}:
 1.12105
 1.11953
 1.12215
 1.12903
 1.12903
 1.12543
 1.1285 
 1.13747
 1.13382
 1.1334 
 1.13165
 1.13246
 1.12548
 1.12915
 1.13079
 1.13789
 1.11447
 1.10834
 1.10832
 1.1046 
 1.10559
 1.10646
 1.10589
 1.10787
 1.10855
 1.10996
 1.11318
 1.1102 

pick example from the docs doesn't work

This example is from the map and reduce section of the docs.

julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/andrew/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> path = Pkg.dir("JuliaDB", "test", "sample")
"/home/andrew/.julia/v0.6/JuliaDB/test/sample"

julia> sampledata = loadfiles(path, indexcols=["date", "ticker"])
Metadata for 6 / 6 files can be loaded from cache.
DTable with 288 rows in 6 chunks:

date        ticker  │ open     high     low     close   volume
────────────────────┼────────────────────────────────────────────
2010-01-01  "GOOGL" │ 626.95   629.51   540.99  626.75  1.78022e8
2010-01-01  "GS"    │ 170.05   178.75   154.88  173.08  2.81862e8
2010-01-01  "KO"    │ 57.16    57.4301  54.94   57.04   1.92693e8
2010-01-01  "XRX"   │ 8.54     9.48     8.91    8.63    3.00838e8
2010-02-01  "GOOGL" │ 534.602  547.5    531.75  533.02  1.03964e8
...

julia> volumes = map(pick(:volume), sampledata)
Error showing value of type JuliaDB.DTable{Tuple{Date,String},NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64}}:
ERROR: MethodError: no method matching (::IndexedTables.Proj{:volume})(::IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}})
(::JuliaDB.##41#42{IndexedTables.Proj{:volume}})(::IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}}) at /home/andrew/.julia/v0.6/JuliaDB/src/dtable.jl:137
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}}, ::Bool, ::Bool) at /home/andrew/.julia/v0.6/Dagger/src/compute.jl:295
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /home/andrew/.julia/v0.6/Dagger/src/compute.jl:308 [inlined]
(::Dagger.##69#70{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Channel{Any},Bool,Bool})() at ./event.jl:73

The example for reduce(+, map(pick(:volume), sampledata)) doesn't work either, but the error message is already printed in the docs.

Storage format

The consensus seems to be to use Parquet.jl.

The package can currently only read Parquet files; we will need to implement writing.

Note that we can use Julia serialization for this in the meantime (sketch below).
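
The stopgap is just Base's serializer (the format is Julia-version specific, so this suits caching rather than archival storage); tbl below stands for any previously constructed table:

# on Julia 0.7+, run `using Serialization` first
open(io -> serialize(io, tbl), "table.jls", "w")  # write
tbl2 = open(deserialize, "table.jls")             # read back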

cc @tanmaykm

Unable to change colparsers after first call to loadfiles

julia> open("tmp.csv", "w") do io
         write(io, "1, 2\n")
       end;

julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/andreasnoack/.julia/v0.6/NullableArrays/src/operators.jl:99.

julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Int], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:

  │ 1  2
──┼─────
1 │ 1  2

julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Float64], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:

  │ 1  2
──┼─────
1 │ 1  2

After restarting Julia in a fresh session:
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-apple-darwin16.7.0

julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/andreasnoack/.julia/v0.6/NullableArrays/src/operators.jl:99.

julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Float64], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:

  │ 1    2
──┼───────
1 │ 1.0  2

Advanced docs

  • implicit indexing
  • IndexedTable construction and distribute
  • getdatacol, getindexcol
  • lazy views
  • rechunk

Parsing error when using loadfiles

Sorry I don't have more of a MWE, but this is the best I could do.
Version 0.6.0 (2017-06-19 13:05 UTC)

julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/bromberger1/.julia/v0.6/NullableArrays/src/operators.jl:128.

julia> sampledata=loadfiles(["knownbad.csv"])
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 52.301 KiB...
Could not determine which type to promote column to.
ERROR: Parse error at line 2 at char 117:
..., and People's Region",,,Awasa,,Africa/Addis_Ababa...
____________________________________________________^
CSV column 2 is expected to be: TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}(<string>, true, true, false, false, Nullable{Char}(','))
parsefill!(::WeakRefString{UInt8}, ::TextParse.LocalOpts, ::TextParse.Record{Tuple{TextParse.Field{Nullable{Int64},TextParse.NAToken{Nullable{Int64},TextParse.Numeric{Int64}}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.Quoted{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{Nullable{Union{}},TextParse.NAToken{Nullable{Union{}},TextParse.Unknown}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}},Tuple{Nullable{Int64},TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,Nullable{Union{}},TextParse.StrRange}}, ::Int64, ::Tuple{NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},NullableArrays.NullableArray{Void,1},Array{String,1}}, ::Int64, ::Int64, ::Int64, ::Int64) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:323
#_csvread#28(::Char, ::Char, ::Array{Any,1}, ::Array{Any,1}, ::Bool, ::Int64, ::Bool, ::Array{String,1}, ::Array{Any,1}, ::Int64, ::TextParse.#_csvread, ::WeakRefString{UInt8}, ::Char) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:119
#csvread#27(::Array{Any,1}, ::Function, ::IOStream, ::Char) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:64
#25 at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:58 [inlined]
open(::TextParse.##25#26{Array{Any,1},Char}, ::String, ::String) at ./iostream.jl:152
#_load_table#10(::Array{Any,1}, ::Void, ::Void, ::Bool, ::Bool, ::TextParse.#csvread, ::Array{Any,1}, ::Function, ::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/util.jl:150
_load_table(::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/util.jl:150
_collect(::Dagger.Context, ::JuliaDB.CSVChunk) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:185
#makecsvchunk#145(::Bool, ::Array{Any,1}, ::Function, ::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:210
(::JuliaDB.#load_f#125{Array{Any,1},Char})(::String) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:107
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{String}, ::Bool, ::Bool) at /Users/bromberger1/.julia/v0.6/Dagger/src/compute.jl:295
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /Users/bromberger1/.julia/v0.6/Dagger/src/compute.jl:308 [inlined]
(::Dagger.##69#70{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Channel{Any},Bool,Bool})() at ./event.jl:73

The CSV file is attached.
knownbad.csv.zip

Edit: readcsv doesn't have a problem with the file.

RFC: Major API changes

After watching people use JuliaDB in the wild, and noting the common roadblocks to mastering the framework, it has become clear that we need to make the API more relatable for someone with a relational database background (which describes most people who want to use the package).

IndexedTable vs N-d sparse structure

The IndexedTable type as it stands conflates a table structure with that of an N-d sparse array. These two structures can be separated:

  1. A table in which rows are sorted by the first few columns; it is iterated over by rows, and operations make use of sortedness for speed.
  2. An N-d sparse data cube, iterated over its non-indexed values.

One could argue that a 1:n-indexed 2) acts as a table in the traditional sense. This is true, but it does not capture sorting as in 1). 2) is just a simple wrapper of 1) which defines getindex to work on the indexed values, and with that as a basis defines array operations like map, reduce, broadcast, reducedim and so on. 1) may not enforce that indexed values are unique, while 2) must. By default, relational join operations on 1) join non-indexed columns based on indexed columns (as is the case now), but this should be configurable (which is being implemented in JuliaData/IndexedTables.jl#79). It seems fine to let 2) also be used in relational operations by just forwarding them to the wrapped 1) object.

My opinion is that we should call 2) (which is most similar to the current IndexedTable) NDSparse, and name 1) IndexedTable, with a deprecation step in which we first deprecate IndexedTable to NDSparse and then bring back IndexedTable as 1).

Credit for some of this thinking goes to @andyferris. An advantage of these changes is that they give things proper names, making them easier to explain.

AxisArrays vs N-d sparse data

AxisArrays could act as dense versions of 2) above. For an uncomplicated implementation and mental model, we need to figure out one main issue:

AxisArrays have the lowest stride in the first dimension, while the current IndexedTable (the future N-d sparse) has it the other way around.

An interesting fact is that AxisArrays-based 2) can also be thought of as a relational table indexed by some columns (i.e. 1)) and relational operations can thus be implemented.

The proposed changes:

  • Deprecate the current IndexedTable name in favor of NDSparse. This is a reversion to the old, and arguably more accurate, name.
  • Bring back IndexedTable as a row-iterated table, indexed by integers, where indexing with an integer gives a row as a named tuple. Some of the columns may be indexed; below is a rough sketch of what this could look like:
struct Perm{X} # sortperm of sorting by a subset of columns
    columns::Vector{Int} # which columns (integer indices) are indexed
    perm::X # sortperm. OneTo(n) denotes that the table is already sorted by these columns
end

struct IndexedTable{C<:Columns}
    columns::C
    perms::Vector{Perm} # any number of these
    cardinality::Vector{Nullable{Float64}} # cached cardinality for each column, to be used in optimizations

    ## temporary buffer
    columns_buffer::C # merged into columns when flush! is called, favors first OneTo ordered perm in insertion.
end
  • Implement relational operations on this type. This would involve reusing existing implementations but exploring rows in different permutation orders, as in JuliaData/IndexedTables.jl#79
  • Rename current distributed table type DTable to DNDSparse.
  • Bring back DTable as the distributed version of the new IndexedTable

cc @JeffBezanson @StefanKarpinski @ViralBShah @andyferris @andreasnoack @aviks @simonbyrne
