juliadata / juliadb.jl
Parallel analytical database in pure Julia
Home Page: http://juliadb.org/
License: Other
The consensus seems to be to use Parquet.jl
The package can only read Parquet files now; we will need to implement writing.
Note that we can use Julia serialization for this in the meantime.
cc @tanmaykm
Metadata for 0 / 25 files can be loaded from cache.
Reading 25 csv files totalling 478.168 MiB in 40 batches...
Error reading file String[]
(repeated 13 times)
On worker 33:
AssertionError: !(isempty(files))
#csvread#32 at /home/andreasnoack/.julia/v0.6/TextParse/src/csv.jl:99
First, start by saving chunks (IndexedTables) as .itbl files so that we can distinguish them from other files. (Currently we don't use an extension.)
loadtable([array of filenames]; chunks=nworkers())
loads the specified files in n chunks.
loadtable("dir"; chunks=nworkers()) # load a directory full of .csv files, or .itbl files (savetable)
This is equivalent to the current loadfiles and load functions.
loadtable("file.csv"; chunks=nworkers()) = loadfiles(["file.csv"], chunks=chunks)
julia> # loadfiles can load the files in parallel
fxdata = loadfiles(files, header_exists=false,
                   colnames=["pair", "timestamp", "bid", "ask"],
                   indexcols=[1, 2])
After looking at how people use JuliaDB in the wild, and noting the common roadblocks to mastery of the framework, it's clear that we need to make the API more relatable for someone with a relational database background (which describes most people who want to use the package).
The IndexedTable type as it stands conflates a table structure with that of an N-d sparse array. These two structures can be separated:
One could argue that a 1:n-indexed 2) acts as a table in the traditional sense. This is true, but it does not capture sorting as in 1). 2) is just a simple wrapper of 1) which defines getindex to work on the indexed values, and with that as a basis defines array operations like map, reduce, broadcast, reducedim and so on. 1) may not enforce that indexed values are unique, while 2) must. By default, relational join operations on 1) join non-indexed columns based on indexed columns (as is the case now) but should be configurable (which is being implemented in JuliaData/IndexedTables.jl#79). It seems fine to let 2) also be used in relational operations by simply forwarding them to the wrapped 1) object.
My opinion is that we should call 2) (which is most similar to the current IndexedTable) NDSparse, and name 1) IndexedTable, with a deprecation step where we first deprecate IndexedTable to NDSparse and then bring back IndexedTable as 1).
Credit for some of this thinking goes to @andyferris. An advantage of these changes is that it gives things proper names, making them easier to explain.
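To make the 1)/2) split concrete, here is a minimal pure-Julia sketch under assumed semantics (the type names SortedTable and NDSparseWrapper are illustrative, not JuliaDB's types): 2) is a thin wrapper over 1) that enforces unique keys and defines getindex on the indexed values.

```julia
struct SortedTable                      # stand-in for 1): columns sorted by a subset
    index::Vector{Tuple{Int,Int}}       # sorted key columns (here: two Int columns)
    data::Vector{Float64}               # a non-indexed column
end

struct NDSparseWrapper                  # stand-in for 2): unique keys required
    table::SortedTable
    function NDSparseWrapper(t::SortedTable)
        allunique(t.index) || error("index must be unique for N-d sparse semantics")
        new(t)
    end
end

# getindex on the indexed values; the basis for map/reduce/broadcast etc.
function Base.getindex(s::NDSparseWrapper, key::Tuple{Int,Int})
    r = searchsorted(s.table.index, key)
    isempty(r) && throw(KeyError(key))
    s.table.data[first(r)]
end

t = SortedTable([(1, 1), (1, 2), (2, 1)], [10.0, 20.0, 30.0])
nd = NDSparseWrapper(t)
nd[(1, 2)]   # 20.0
```

Relational operations on the wrapper could then simply forward to the wrapped table, as suggested above.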
AxisArrays could act as dense versions of 2) above. For an uncomplicated implementation and mental model, we need to figure out one main issue:
AxisArrays have lowest stride in the first dimension while current IndexedTable (future n-d sparse) has it the other way around.
An interesting fact is that AxisArrays-based 2) can also be thought of as a relational table indexed by some columns (i.e. 1)) and relational operations can thus be implemented.
Use the IndexedTable name for NDSparse - this is a reversion to the old name, and an arguably more accurate one.

struct Perm{X} # sortperm of sorting by a subset of columns
columns::Vector{Int} # which columns (integer indices) are indexed
perm::X # sortperm. OneTo(n) denotes that the table is already sorted by these columns
end
struct IndexedTable{C<:Columns}
columns::C
perms::Vector{Perm} # any number of these
cardinality::Vector{Nullable{Float64}} # cached cardinality for each column, to be used in optimizations
## temporary buffer
columns_buffer::C # merged into columns when flush! is called, favors first OneTo ordered perm in insertion.
end
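As an illustration of how the cached perms might be consulted (assumed semantics, using a reduced version of the Perm struct above rather than the actual implementation): when asked to sort by some columns, reuse a stored sortperm instead of re-sorting.

```julia
struct Perm{X}
    columns::Vector{Int}   # which columns (integer indices) are indexed
    perm::X                # sortperm; Base.OneTo(n) means already sorted by these columns
end

cols = ([3, 1, 2], ["c", "a", "b"])     # two parallel columns
p1 = Perm([1], sortperm(cols[1]))       # cached sortperm by column 1

# Look up a cached permutation for the requested columns, else compute one.
function sorted_by(cols, perms::Vector, by::Vector{Int})
    i = findfirst(p -> p.columns == by, perms)
    perm = i === nothing ? sortperm(collect(zip((cols[j] for j in by)...))) : perms[i].perm
    map(c -> c[perm], cols)
end

sorted_by(cols, [p1], [1])   # ([1, 2, 3], ["a", "b", "c"])
```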
Rename DTable to DNDSparse, and use DTable as the distributed version of the new IndexedTable.
cc @JeffBezanson @StefanKarpinski @ViralBShah @andyferris @andreasnoack @aviks @simonbyrne
loadfiles hashes the options passed to it to make sure we don't use invalid cached data. We should instead just warn about this when we are using the cache, and allow the user to pass overwritecache=true to invalidate it. Always load data from the cache.
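A minimal sketch of the proposed behaviour (the names load_cached and loader are illustrative, not JuliaDB's API): hash the file and its load options into the cache key, warn when reusing a cached entry, and let overwritecache=true force a reload.

```julia
cache = Dict{UInt,Any}()

function load_cached(file::String, loader; overwritecache::Bool=false, opts...)
    key = hash((file, opts))           # options are part of the cache key
    if haskey(cache, key) && !overwritecache
        @warn "using cached data for $file; pass overwritecache=true to reload"
        return cache[key]
    end
    cache[key] = loader(file; opts...) # (re)load and store
end
```

Different options produce a different key, so stale cache entries are never silently reused, and the user keeps an explicit escape hatch.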
I was looking for the thing that does the reducedim_vec equivalent for convertdim. convertdim has a vecagg keyword argument - it took me a while to remember that it does. Here is the current state of the aggregation API:
reducedim(agg, ...)
reducedim_vec(vecagg, ...)
aggregate(agg, ...)
aggregate_vec(vecagg, ...)
convertdim(...; agg, vecagg)
select(...; agg) - but this has no vecagg argument

We need to make this more consistent. Options I see are:
1) convertdim(agg, t, d, xlat) and convertdim_vec(vecagg, t, d, xlat), and select(agg, t, where...) and select_vec(agg, t, where...). But having something named "select_vec" is very confusing. I'm also not crazy about the agg functions being the first arguments to these functions, since their main intent is a "query" as opposed to a reduction/aggregation.
2) Add a vecagg kwarg to select, and probably to reducedim.
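A pure-Julia illustration of the two aggregation flavours being discussed (conceptual sketch only; the helper names are made up): agg is a pairwise reducer applied with reduce, while vecagg receives the whole vector of grouped values at once.

```julia
groups = Dict("a" => [1, 2, 3], "b" => [4, 5])   # values already grouped by key

# agg: combine values two at a time, e.g. +, min, max
aggregate_pairwise(agg, groups) = Dict(k => reduce(agg, v) for (k, v) in groups)

# vecagg: gets the whole vector, e.g. length, median, quantile
aggregate_vec(vecagg, groups) = Dict(k => vecagg(v) for (k, v) in groups)

aggregate_pairwise(+, groups)   # Dict("a" => 6, "b" => 9)
aggregate_vec(length, groups)   # Dict("a" => 3, "b" => 2)
```

Functions like length or a median cannot be expressed as a pairwise reducer, which is why the API needs both flavours in some form.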
Right now, if one file fails to load we need to restart the whole load process. It would be nice to avoid this.
References
https://msdn.microsoft.com/en-us/library/dd997372(v=vs.110).aspx
https://docs.julialang.org/en/stable/stdlib/io-network/#memory-mapped-i-o
Passing the byte array representation of the mapped segment as source
I will work on this after supporting I/O streams.
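A sketch of the byte-array approach using the stdlib Mmap module (documented at the second link above): map the file into memory and hand the resulting Vector{UInt8} to the parser as its source. The temp-file setup here is just for a self-contained demo.

```julia
using Mmap

path, io = mktemp()                 # demo input file
write(io, "a,b\n1,2\n"); close(io)

# Map the file's bytes; the OS pages data in lazily as the parser reads it.
bytes = open(path) do f
    Mmap.mmap(f, Vector{UInt8}, filesize(path))
end

String(copy(bytes))   # "a,b\n1,2\n" -- copy so the mapped array is left intact
```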
Is the merge function intended to enable performing SQL-style union operations? If so, how should it handle empty arguments?
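One possible convention for the empty-argument question, sketched on plain vectors (this is an illustration, not JuliaDB's actual behaviour): a union-style merge over any number of inputs, where zero arguments is an error and empty inputs act as neutral elements.

```julia
# Zero arguments: no element type can be inferred, so refuse loudly.
mergetables() = throw(ArgumentError("merge requires at least one table"))

# One or more inputs: concatenate and re-sort (union-all semantics).
mergetables(tables::Vector{T}...) where {T} = sort(reduce(vcat, tables))

mergetables([3, 1], Int[], [2])   # [1, 2, 3] -- empty input is a no-op
```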
We should be able to support AxisArrays as chunks instead of IndexedTables. Ideally, they should be able to implement the same interface but with dense vs. sparse storage.
Right now we have two different implementations of this very similar functionality.
see #9
Right now, for a DTable whose parts are already computed, we can't be sure that the scheduler picks the same process to perform a next operation.
We should track and use some kind of affinity information for each chunk.
Hello! I'd like to let you know that as of tomorrow I'll be working on building a PR for this. Any preferred references are welcome. For the moment I'm studying Libz.jl.
- IndexedTable construction and distribute
- getdatacol, getindexcol
- rechunk
So that the next time we can "load" a lazy DTable really fast.
Use the psort algorithm from Dagger.
This requires (or could force) that the input is already computed.
Possible APIs: User specifies number of blocks or maximum size of block to output.
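The two proposed APIs can be sketched as a boundary computation (an illustrative helper, not the JuliaDB implementation): given the total number of rows, produce the chunk ranges either from a desired number of blocks or from a maximum block size.

```julia
function chunk_ranges(n::Int; nblocks::Int=0, maxsize::Int=0)
    # nblocks wins if given; otherwise fall back to maxsize
    sz = nblocks > 0 ? cld(n, nblocks) : maxsize
    sz > 0 || throw(ArgumentError("give nblocks or maxsize"))
    [i:min(i + sz - 1, n) for i in 1:sz:n]
end

chunk_ranges(10; nblocks=3)   # [1:4, 5:8, 9:10]
chunk_ranges(10; maxsize=4)   # [1:4, 5:8, 9:10]
```

Note that nblocks=3 over a sorted input still needs the psort step above to find where those boundaries fall in value space; this helper only divides positions.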
Using IndexedTable for the index has not been as helpful as I had hoped. It has led to lots of construction/deconstruction code that obfuscates the logic.
The current structure is this:
The datastructure is an IndexedTable with:
Index:
Data:
Chunk or Thunk objects

I think this could be simplified to: an IndexedTable from IndexSpace to Union{Chunk,Thunk}, where IndexSpace is ordered by its interval field (this should give the same ordering as the current one).
Any other suggestions are welcome!
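A minimal sketch of the ordering part of the proposal (field names assumed from the text; the real IndexSpace carries more information): define isless on the interval field so that a plain sort of the index entries gives the chunk ordering.

```julia
struct IndexSpace
    interval::Tuple{Int,Int}   # (first, last) index covered by the chunk
end

# Order index spaces by their interval, as proposed above.
Base.isless(a::IndexSpace, b::IndexSpace) = a.interval < b.interval

spaces = [IndexSpace((5, 8)), IndexSpace((1, 4)), IndexSpace((9, 12))]
sort(spaces)[1].interval   # (1, 4)
```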
I'm trying to install JuliaDB and getting the following errors:
julia> Pkg.clone("https://github.com/JuliaComputing/JuliaDB.jl.git")
INFO: Cloning JuliaDB from https://github.com/JuliaComputing/JuliaDB.jl.git
INFO: Computing changes...
INFO: No packages to install, update or remove
INFO: Package database updated
julia> using JuliaDB
INFO: Precompiling module JuliaDB.
WARNING: could not import OnlineStatsBase.Series into JuliaDB
ERROR: LoadError: LoadError: ArgumentError: invalid type for argument series in method definition for aggregate_stats at /Users/ianmarshall/.julia/v0.6/JuliaDB/src/online-stats.jl:114
Stacktrace:
[1] include_from_node1(::String) at ./loading.jl:569
[2] include(::String) at ./sysimg.jl:14
[3] include_from_node1(::String) at ./loading.jl:569
[4] include(::String) at ./sysimg.jl:14
[5] anonymous at ./<missing>:2
while loading /Users/ianmarshall/.julia/v0.6/JuliaDB/src/online-stats.jl, in expression starting on line 521
while loading /Users/ianmarshall/.julia/v0.6/JuliaDB/src/JuliaDB.jl, in expression starting on line 25
ERROR: Failed to precompile JuliaDB to /Users/ianmarshall/.julia/lib/v0.6/JuliaDB.ji.
Stacktrace:
[1] compilecache(::String) at ./loading.jl:703
[2] _require(::Symbol) at ./loading.jl:490
[3] require(::Symbol) at ./loading.jl:398
This error seems to happen on Julia versions 0.6 and 0.6.1, and I've tried it both on Linux (Ubuntu 16) and Mac OS X 10.11.6. I've also tried to install via Pkg.add
and encountered the same problem. Any ideas for how to get it to work would be appreciated. Thanks!
Capturing an excellent idea tossed by Jeff:
A GUI for JuliaDB should:
We could use a combination of WebIO + Escher + SomePlottingPackage for this. The same codebase should also allow us to run this on a webserver or in a native-looking app with Blink.
Complete API for CSV parsing speedups, make a PR
julia> for i in 1:2
open("test$(i).txt", "w") do f
write(f, "a\tb\n")
write(f, "$(i)\t$(2i)\n")
end
end
julia> using JuliaDB
julia> ingest(["test1.txt"], "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 8 bytes in 1 batches...
DTable with 1 rows in 1 chunks:
│ a b
──┼─────
1 │ 1 2
julia> ingest(["test1.txt", "test2.txt"], "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 2 files can be loaded from cache.
Reading 2 csv files totalling 16 bytes in 1 batches...
DTable with 2 rows in 1 chunks:
│ a b
──┼──────────────────────────
1 │ -4106979394849065894 512
If I then restart the REPL:
julia> using JuliaDB
julia> ingest(joinpath.("/Users/andreasnoack/Desktop", ["test1.txt", "test2.txt"][1:2]), "tmpcache", delim = '\t', usecache = false)
Metadata for 0 / 2 files can be loaded from cache.
Reading 2 csv files totalling 16 bytes in 1 batches...
DTable with 2 rows in 1 chunks:
│ a b
──┼─────
1 │ 1 2
2 │ 2 4
We don't have this feature yet. Will have to be implemented in IndexedTables first.
The error says the _NT_... type is not defined.
This is because DTable (and all the chunks) are parameterized by the NDSparse type of the chunks. Data of this type itself is not kept inside DTable.
I guess one possible approach is to serialize with just the DTable abstract type tag, look through the chunks to find all _NT_... types, and store a list of their fieldnames.
While deserializing, we read this list and then create those types using NamedTuples.create_tuple.
This example is from the map and reduce section of the docs.
julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/andrew/.julia/v0.6/NullableArrays/src/operators.jl:128.
julia> path = Pkg.dir("JuliaDB", "test", "sample")
"/home/andrew/.julia/v0.6/JuliaDB/test/sample"
julia> sampledata = loadfiles(path, indexcols=["date", "ticker"])
Metadata for 6 / 6 files can be loaded from cache.
DTable with 288 rows in 6 chunks:
date ticker │ open high low close volume
────────────────────┼────────────────────────────────────────────
2010-01-01 "GOOGL" │ 626.95 629.51 540.99 626.75 1.78022e8
2010-01-01 "GS" │ 170.05 178.75 154.88 173.08 2.81862e8
2010-01-01 "KO" │ 57.16 57.4301 54.94 57.04 1.92693e8
2010-01-01 "XRX" │ 8.54 9.48 8.91 8.63 3.00838e8
2010-02-01 "GOOGL" │ 534.602 547.5 531.75 533.02 1.03964e8
...
julia> volumes = map(pick(:volume), sampledata)
Error showing value of type JuliaDB.DTable{Tuple{Date,String},NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64}}:
ERROR: MethodError: no method matching (::IndexedTables.Proj{:volume})(::IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}})
(::JuliaDB.##41#42{IndexedTables.Proj{:volume}})(::IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}}) at /home/andrew/.julia/v0.6/JuliaDB/src/dtable.jl:137
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}}, ::Bool, ::Bool) at /home/andrew/.julia/v0.6/Dagger/src/compute.jl:295
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /home/andrew/.julia/v0.6/Dagger/src/compute.jl:308 [inlined]
(::Dagger.##69#70{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.##41#42{IndexedTables.Proj{:volume}},Tuple{Dagger.Chunk{IndexedTables.IndexedTable{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},Tuple{Date,String},NamedTuples._NT_date_ticker{Array{Date,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}}},IndexedTables.Columns{NamedTuples._NT_open_high_low_close_volume{Float64,Float64,Float64,Float64,Float64},NamedTuples._NT_open_high_low_close_volume{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}}},JuliaDB.CSVChunk}},Channel{Any},Bool,Bool})() at ./event.jl:73
The example for reduce(+, map(pick(:volume), sampledata))
doesn't work either, but the error message is already printed in the docs.
Will be especially nice in operations that involve rechunk
.
Needs a special case which adds 1:n
as the index.
If a function is defined without @everywhere and is then used in a DTable operation, it must throw an error and abort the computation.
Just a heads up: I noticed today that the NamedTuples.create_tuple method has been refactored/renamed, but JuliaDB still contains a reference to it in the Base.deserialize method in utils.jl, line 359.
Could you register this in METADATA? I have a very rough (=functional but not fast) integration with IterableTables and Query ready, but I can only merge (and test) that if JuliaDB is registered in METADATA.
#12, but without the MmappableArray wrapper type. Fall back to the default (de)serialize.
Sorry I don't have more of a MWE, but this is the best I could do.
Version 0.6.0 (2017-06-19 13:05 UTC)
julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/bromberger1/.julia/v0.6/NullableArrays/src/operators.jl:128.
julia> sampledata=loadfiles(["knownbad.csv"])
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 52.301 KiB...
Could not determine which type to promote column to.
ERROR: Parse error at line 2 at char 117:
..., and People's Region",,,Awasa,,Africa/Addis_Ababa...
____________________________________________________^
CSV column 2 is expected to be: TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}(<string>, true, true, false, false, Nullable{Char}(','))
parsefill!(::WeakRefString{UInt8}, ::TextParse.LocalOpts, ::TextParse.Record{Tuple{TextParse.Field{Nullable{Int64},TextParse.NAToken{Nullable{Int64},TextParse.Numeric{Int64}}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{TextParse.StrRange,TextParse.Quoted{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}},TextParse.Field{Nullable{Union{}},TextParse.NAToken{Nullable{Union{}},TextParse.Unknown}},TextParse.Field{TextParse.StrRange,TextParse.StringToken{TextParse.StrRange}}},Tuple{Nullable{Int64},TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,TextParse.StrRange,Nullable{Union{}},TextParse.StrRange}}, ::Int64, ::Tuple{NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},Array{String,1},NullableArrays.NullableArray{Void,1},Array{String,1}}, ::Int64, ::Int64, ::Int64, ::Int64) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:323
#_csvread#28(::Char, ::Char, ::Array{Any,1}, ::Array{Any,1}, ::Bool, ::Int64, ::Bool, ::Array{String,1}, ::Array{Any,1}, ::Int64, ::TextParse.#_csvread, ::WeakRefString{UInt8}, ::Char) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:119
#csvread#27(::Array{Any,1}, ::Function, ::IOStream, ::Char) at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:64
#25 at /Users/bromberger1/.julia/v0.6/TextParse/src/csv.jl:58 [inlined]
open(::TextParse.##25#26{Array{Any,1},Char}, ::String, ::String) at ./iostream.jl:152
#_load_table#10(::Array{Any,1}, ::Void, ::Void, ::Bool, ::Bool, ::TextParse.#csvread, ::Array{Any,1}, ::Function, ::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/util.jl:150
_load_table(::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/util.jl:150
_collect(::Dagger.Context, ::JuliaDB.CSVChunk) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:185
#makecsvchunk#145(::Bool, ::Array{Any,1}, ::Function, ::String, ::Char) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:210
(::JuliaDB.#load_f#125{Array{Any,1},Char})(::String) at /Users/bromberger1/.julia/v0.6/JuliaDB/src/loadfiles.jl:107
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{String}, ::Bool, ::Bool) at /Users/bromberger1/.julia/v0.6/Dagger/src/compute.jl:295
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /Users/bromberger1/.julia/v0.6/Dagger/src/compute.jl:308 [inlined]
(::Dagger.##69#70{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#125{Array{Any,1},Char},Tuple{String},Channel{Any},Bool,Bool})() at ./event.jl:73
The CSV file is attached.
knownbad.csv.zip
Edit: readcsv
doesn't have a problem with the file.
For non-standard reducers, the current version fails, e.g. if you would like to compute the extreme values in a single pass.
This also fails for Base's reduce, but it works if you provide an initial value, e.g.
julia> reduce((t,s) -> (min(t[1],s), max(t[2],s)), (0.0,0.0), randn(10))
(-1.9300544528466752, 2.919740428612819)
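The same single-pass min/max can also be written with mapreduce, which sidesteps the need for an initial value on non-empty input by seeding each element as its own (min, max) pair:

```julia
xs = randn(10)

# Map each element to a (min, max) pair, then combine pairs elementwise.
lo, hi = mapreduce(x -> (x, x),
                   (t, s) -> (min(t[1], s[1]), max(t[2], s[2])),
                   xs)

(lo, hi) == extrema(xs)   # true
```

Requiring reducers in this pair-combining form (rather than element-accumulator form) is one way a distributed reduce can merge per-chunk results without an init value.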
It looks like the csv.cached_on
field is saved in the metadata. This shouldn't be used in a different run since in a new environment it might not be true any more.
The reducedim function errors out with a StackOverflowError when the aggregation function is executed more than 99 times.
Example:
t = distribute(IndexedTable([fill(1,100000); fill(2,100000); fill(3,100000);fill(4,100000);fill(5,100000)],
rand(Float64,500000), fill(1,500000),
MVector{64,Float64}[rand(Float64, 64) for n in 1:500000]), 2)
compute(reducedim((x,y)->MVector(x+y), t, 1)) - works
compute(reducedim((x,y)->MVector(x+y), t, 2)) - StackOverflowError. Complains that the function executed more than 99 times.
I think this is an issue in Dagger.jl.
So I actually checked out Dagger, IndexedTables, NamedTuples, but the problem remained:
TypeError: non-boolean (Nullable{Bool}) used in boolean context
refine_perm!(::Array{Int64,1}, ::NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}, ::Int64, ::NullableArrays.NullableArray{Int64,1}, ::NullableArrays.NullableArray{Int64,1}, ::Int64, ::Int64) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:101
refine_perm!(::Array{Int64,1}, ::NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}, ::Int64, ::Array{Int64,1}, ::NullableArrays.NullableArray{Int64,1}, ::Int64, ::Int64) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:109
sortperm(::IndexedTables.Columns{Tuple{Int64,Nullable{Int64},Nullable{Int64},Nullable{Int64},String,String,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Float64},Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Int64},String,Int64,Int64,String,Float64,String,Nullable{Float64}},NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Int64,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Int64,1},Array{Int64,1},Array{String,1},Array{Float64,1},Array{String,1},NullableArrays.NullableArray{Float64,1}}}) at /home/s/.julia/v0.6/IndexedTables/src/columns.jl:86
#IndexedTable#42(::Void, ::Bool, ::Bool, ::Type{T} where T, ::IndexedTables.Columns{NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Int64,Nullable{Int64},Nullable{Int64},Nullable{Int64},String,String,String,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Float64},Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,Nullable{Int64},String,Int64,Int64,String,Float64,String,Nullable{Float64}},NamedTuples._NT_id_hip_hd_hr_gl_bf_proper_ra_dec_dist_pmra_pmdec_rv_mag_absmag_spect_ci_x_y_z_vx_vy_vz_rarad_decrad_pmrarad_pmdecrad_bayer_flam_con_comp_comp__primary_base_lum_var_var__min{Array{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},NullableArrays.NullableArray{Int64,1},Array{String,1},Array{String,1},PooledArrays.PooledArray{String,UInt8,1,Array{UInt8,1}},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Ar...
Another Dagger.jl-related error.
julia> sampledata = loadfiles(path, indexcols=["date", "ticker"])
Metadata for 6 / 7 files can be loaded from cache.
Reading 1 csv files totalling 63 bytes...
ERROR: BoundsError: attempt to access 0-element Array{Any,1} at index [0]
#_csvread#28(::Char, ::Char, ::Array{Any,1}, ::Array{Any,1}, ::Bool, ::Int64, ::Bool, ::Array{String,1}, ::Array{Any,1}, ::Int64, ::TextParse.#_csvread, ::WeakRefString{UInt8}, ::Char) at /home/jn/.julia/v0.7/TextParse/src/csv.jl:104
#csvread#27(::Array{Any,1}, ::Function, ::IOStream, ::Char) at /home/jn/.julia/v0.7/TextParse/src/csv.jl:64
#25 at /home/jn/.julia/v0.7/TextParse/src/csv.jl:58 [inlined]
open(::TextParse.##25#26{Array{Any,1},Char}, ::String, ::String) at ./iostream.jl:152
#load_table#7(::Array{String,1}, ::Void, ::Void, ::Bool, ::Bool, ::TextParse.#csvread, ::Array{Any,1}, ::Function, ::String, ::Char) at /home/jn/.julia/v0.7/JuliaDB/src/util.jl:152
(::JuliaDB.#kw##load_table)(::Array{Any,1}, ::JuliaDB.#load_table, ::String, ::Char) at ./<missing>:0
_gather(::Dagger.Context, ::JuliaDB.CSVChunk) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:185
#makecsvchunk#112(::Bool, ::Array{Any,1}, ::Function, ::String, ::Char) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:210
(::JuliaDB.#kw##makecsvchunk)(::Array{Any,1}, ::JuliaDB.#makecsvchunk, ::String, ::Char) at ./<missing>:0
(::JuliaDB.#load_f#92{Array{Any,1},Char})(::String) at /home/jn/.julia/v0.7/JuliaDB/src/loadfiles.jl:107
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{String}, ::Bool, ::Bool) at /home/jn/.julia/v0.7/Dagger/src/compute.jl:449
(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Dagger.#do_task,Tuple{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Bool,Bool},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
macro expansion at /home/jn/.julia/v0.7/Dagger/src/compute.jl:462 [inlined]
(::Dagger.##122#123{Dagger.Context,Dagger.OSProc,Int64,JuliaDB.#load_f#92{Array{Any,1},Char},Tuple{String},Channel{Any},Bool,Bool})() at ./event.jl:73
We need to be able to use multi-processing and GPUs to get speedups. The easiest route is by using Dagger. This also gives us things like out-of-core and lazy joins for free.
Work items here are:
Glob.glob for directories

For datasets with 1000s of chunks, finding the splitters can take a long time. The process is currently O(p*log(N)), where p is the number of chunks and N is the number of elements in the biggest chunk.
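To make the O(p*log(N)) cost concrete, here is a sketch of the per-splitter work (illustrative code, not the Dagger implementation): each of the p chunks answers a rank query with one binary search over its sorted index.

```julia
chunks = [sort(rand(1:1000, 100)) for _ in 1:8]   # p = 8 sorted chunks of N = 100

# Global rank of a candidate splitter x: how many elements across all
# chunks are < x?  One searchsortedfirst (binary search) per chunk.
rank(x) = sum(searchsortedfirst(c, x) - 1 for c in chunks)

rank(500)   # number of elements below 500, computed in p * log(N) steps
```

Each candidate splitter evaluated during the search multiplies this p*log(N) cost, which is why it adds up for thousands of chunks.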
Some ideas for improvements:
- reduce the number of chunks involved (p). (Not of much help in permutedims.)
- look up index rows in batches of 32, which would make the log base 32 - overall a 5-fold improvement, assuming the cost of looking up the 32 index rows is negligible in comparison.

The following example is working with a set of foreign exchange data in JuliaDB.
When attempting to access the data from a table that has only one data column, the getdatacol function currently throws an error. The equivalent (and desired) operation, using gather and accessing the data field, is shown below.
julia> bids_5m["EUR/USD",t_5m]
DTable with 1 chunks:
───────────────────────────────┬────────
"EUR/USD" 2016-06-15T00:00:00 │ 1.12111
"EUR/USD" 2016-06-15T00:05:00 │ 1.12114
"EUR/USD" 2016-06-15T00:10:00 │ 1.12116
"EUR/USD" 2016-06-15T00:15:00 │ 1.12125
"EUR/USD" 2016-06-15T00:20:00 │ 1.12107
...
julia> getdatacol(bids["EUR/USD",t_5m],1)
ERROR: type Array has no field columns
#145 at /Users/andy/.julia/v0.6/JuliaDB/src/dcolumns.jl:24
do_task at /Users/andy/.julia/v0.6/Dagger/src/compute.jl:449
#106 at ./distributed/process_messages.jl:268 [inlined]
run_work_thunk at ./distributed/process_messages.jl:56
macro expansion at ./distributed/process_messages.jl:268 [inlined]
#105 at ./event.jl:73
julia> gather(bids["EUR/USD",t_5m]).data
28-element Array{Float32,1}:
1.12105
1.11953
1.12215
1.12903
1.12903
1.12543
1.1285
1.13747
1.13382
1.1334
1.13165
1.13246
1.12548
1.12915
1.13079
1.13789
1.11447
1.10834
1.10832
1.1046
1.10559
1.10646
1.10589
1.10787
1.10855
1.10996
1.11318
1.1102
TextParse returns raw data and column names. Right now JuliaDB ignores the column names and creates the output table without named columns.
Right now the printing is very ugly.
In the TrueFX data there are currency pair and timestamp index columns. We would like to aggregate the data over 5-minute time windows, take out 6 currency pairs separately, and then match them up side by side to create a table with a timestamp index.
We may not have values available for every 5-minute time step for all currency pairs. It would be good to have a pluggable interpolation mechanism to fill in the blanks.
Note that it's also possible to deal with this situation by doing an intersection join, which keeps only the time steps for which data about all 6 currency pairs are available.
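The intersection-join alternative can be sketched in plain Julia (hypothetical helper names, not the JuliaDB API): keep only the timestamps present in every pair's series, then lay the surviving quotes out side by side, one column per pair.

```julia
# Sketch of an intersection join over per-pair time series.
# Each series maps a timestamp to a quote; only timestamps present
# in every series survive. Hypothetical helper, not JuliaDB API.
function intersection_join(series::Dict{String,Dict{Int,Float64}})
    # timestamps common to all currency pairs
    common = reduce(intersect, (keys(s) for s in values(series)))
    pairs  = sort!(collect(keys(series)))
    ts     = sort!(collect(common))
    # one row per surviving timestamp, one column per pair
    return ts, [series[p][t] for t in ts, p in pairs]
end
```

For example, a pair missing the 00:05 window simply drops that row for everyone:

```julia
eurusd = Dict(0 => 1.121, 300 => 1.122, 600 => 1.123)
gbpusd = Dict(0 => 1.42, 600 => 1.43)
ts, m = intersection_join(Dict("EUR/USD" => eurusd, "GBP/USD" => gbpusd))
```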
cc @ViralBShah
Suggested by @StefanKarpinski. It is annoying to load a dataset and then have to filter out the nulls.
Since the concrete column storage depends on the % of unique elements in each column, the N in DTable{N<:NDSparse} can become a Union across chunks.
Possible solution: store only the column eltypes in DTable.
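The proposed fix can be sketched with a hypothetical wrapper type (not the real DTable definition): parameterize on the column eltypes only, so two chunks whose concrete storage differs (pooled vs. plain arrays) still share one wrapper type.

```julia
# Hypothetical sketch, not the real DTable: the type parameter records
# only the column eltypes, so a chunk stored as a pooled array and a
# chunk stored as a plain Vector do not change the table's type.
struct EltypeTable{T<:Tuple}
    chunks::Vector{Any}   # concrete column containers may differ per chunk
end

coleltypes(::EltypeTable{T}) where {T} = T

# Two chunks with different concrete storage but the same eltype:
plain  = ["EUR/USD", "GBP/USD"]   # Vector{String}
pooled = fill("EUR/USD", 3)       # stand-in for a pooled column
t = EltypeTable{Tuple{String}}(Any[plain, pooled])
```

Because the wrapper's type depends only on `Tuple{String}`, adding a chunk with different storage never widens it to a Union.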
julia> open("tmp.csv", "w") do io
           write(io, "1, 2\n")
       end;
julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/andreasnoack/.julia/v0.6/NullableArrays/src/operators.jl:99.
julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Int], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:
│ 1 2
──┼─────
1 │ 1 2
julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Float64], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:
│ 1 2
──┼─────
1 │ 1 2
julia>
.julia/v0.6/Optim on 2aad9401 [+?]
➜ ~/juliagpu/julia
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: https://docs.julialang.org
_ _ _| |_ __ _ | Type "?help" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.6.0 (2017-06-19 13:05 UTC)
_/ |\__'_|_|_|\__'_| |
|__/ | x86_64-apple-darwin16.7.0
julia> using JuliaDB
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/andreasnoack/.julia/v0.6/NullableArrays/src/operators.jl:99.
julia> loadfiles(["tmp.csv"], ',', header_exists = false, colparsers = [Float64], usecache = false)
Metadata for 0 / 1 files can be loaded from cache.
Reading 1 csv files totalling 5 bytes in 1 batches...
DTable with 1 rows in 1 chunks:
│ 1 2
──┼───────
1 │ 1.0 2
Tracking the operations we need to build the next layer of functionality here.
Won't implement:
- where, pairs (because getindex already returns a kind of iterator)

Right now we can't update a series. We can use merge, but that is not good for histograms, where the two histograms might not have the same bins.
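One way to merge histograms with mismatched bins, sketched in plain Julia (a hypothetical helper, not an existing function): rebin both onto the union of their edges, spreading each source bin's count over the target bins in proportion to overlap, under the assumption that counts are uniform within a bin.

```julia
# Sketch: merge two histograms with different bin edges by rebinning
# both onto the union of the edges, assuming counts are spread
# uniformly within each source bin. Hypothetical helper, not an
# existing JuliaDB API.
function merge_hist(edges1, counts1, edges2, counts2)
    edges  = sort!(unique(vcat(collect(edges1), collect(edges2))))
    counts = zeros(length(edges) - 1)
    for (e, c) in ((edges1, counts1), (edges2, counts2))
        for i in eachindex(c)
            lo, hi = e[i], e[i+1]
            for j in 1:length(edges)-1
                # overlap of source bin [lo, hi) with target bin j
                ov = min(hi, edges[j+1]) - max(lo, edges[j])
                ov > 0 && (counts[j] += c[i] * ov / (hi - lo))
            end
        end
    end
    return edges, counts
end
```

Total mass is conserved, so the merged histogram sums to the sum of its inputs even when the bin edges disagree.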