duckdb.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours; please be patient!

reading arrow files

Is it possible to run queries on Arrow files? Julia has a fully featured Arrow.jl library, which we could use to pass an Arrow table to DuckDB.

Something like this would be great, allowing queries over multiple Arrow files and returning the result as an Arrow table.

arrow_files = ["a1.arrow", "a2.arrow", "a3.arrow"]
tbl = Arrow.Table(arrow_files)

db = DuckDB.open(":memory:")
con = DuckDB.connect(db)

q = "SELECT count(*) from tbl;"
r = DuckDB.execute(con, q)

arrow_result = DuckDB.toArrowTable(r)

Python example

arrow_table = pq.read_table('integers.parquet')
con = duckdb.connect()

print(con.execute('SELECT SUM(data) FROM arrow_table WHERE data > 50').fetchone())
con.execute("SELECT * FROM arrow_table").fetch_arrow_table()

Use Appender API to append DataFrame to table

I used the Appender API to write a function that appends the contents of a DataFrame to an existing table for some common data types. Maybe we should integrate it into the package? I'll open a pull request.
If not, we should at least cherry-pick the first commit fa38bfce3449bbc1b98a62780308a87ebdd427a6, which dereferences the appender object in some API ccalls. Using the wrapped API as-is leads to segmentation faults, which seems like a bug to me.
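A minimal sketch of such a helper, assuming the Appender API exposed by more recent DuckDB.jl releases (`DuckDB.Appender`, `append`, `end_row`, `close`); the exact wrapper names may differ per version, and this is not the code from the referenced commit:

```julia
using DataFrames, DuckDB

# Append each row of a DataFrame to an existing table via the Appender API.
# Assumes the table schema matches the DataFrame columns in order.
function append_dataframe!(db, table::AbstractString, df::DataFrame)
    appender = DuckDB.Appender(db, table)
    for row in eachrow(df)
        for value in row
            DuckDB.append(appender, value)
        end
        DuckDB.end_row(appender)   # finish the current row
    end
    DuckDB.close(appender)         # flush pending rows and release the appender
end
```

The appender is generally much faster than row-by-row `INSERT` statements, since it batches values through DuckDB's C API.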

Core dump trying to return DataFrame

Hi,

I have just started to use DuckDB with Julia and ran into an error.
I converted a huge CSV file to a DuckDB database and wanted to repeat, in Julia, a query that runs without errors in the command-line client.
I am using Julia 1.7.3 (2022-05-06) on Arch Linux (installed via the julia-bin AUR package).

The script to reproduce is

julia> using DuckDB
julia> db = DuckDB.open("data/test.db")
Base.RefValue{Ptr{Nothing}}(Ptr{Nothing} @0x0000000002ec5ad0)
julia> con = DuckDB.connect(db)
Base.RefValue{Ptr{Nothing}}(Ptr{Nothing} @0x0000000002f0cc50)
julia> df = DuckDB.toDataFrame(con, "select * from trainfeature limit 5")

The error is

signal (11): Segmentation fault
in expression starting at REPL[3]:1
getindex at ./array.jl:861 [inlined]
toDataFrame at /home/x/.julia/packages/DuckDB/QadPg/src/DuckDB.jl:39
toDataFrame at /home/x/.julia/packages/DuckDB/QadPg/src/DuckDB.jl:24
unknown function (ip: 0x7f5af4214caf)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:126
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:215
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:166 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:587
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:731
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:885
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:830
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:944
eval at ./boot.jl:373 [inlined]
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:150
repl_backend_loop at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:246
start_repl_backend at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:231
#run_repl#47 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:364
run_repl at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:351
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#936 at ./client.jl:394
jfptr_YY.936_35454.clone_1 at /usr/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
jl_f__call_latest at /buildworker/worker/package_linux64/build/src/builtins.c:757
#invokelatest#2 at ./essentials.jl:716 [inlined]
invokelatest at ./essentials.jl:714 [inlined]
run_main_repl at ./client.jl:379
exec_options at ./client.jl:309
_start at ./client.jl:495
jfptr__start_22567.clone_1 at /usr/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
true_main at /buildworker/worker/package_linux64/build/src/jlapi.c:559
jl_repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:701
main at julia (unknown line)
unknown function (ip: 0x7f5b9acbe28f)
__libc_start_main at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x400808)
Allocations: 7632851 (Pool: 7629215; Big: 3636); GC: 11
Segmentation fault (core dumped)

All other queries I tried produced the same error.

Can DuckDB query DataFrames.jl data directly ?

As far as I know, DuckDB can query Arrow datasets directly and stream query results back to Arrow.

Can DuckDB query DataFrames.jl data directly and stream query results back to DataFrames?
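For reference, newer DuckDB.jl releases expose a `register_data_frame` function that makes a DataFrame queryable as a named view; a sketch, assuming that API is available in your installed version:

```julia
using DataFrames, DuckDB

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Register the DataFrame under the name "df" so SQL can reference it.
# (register_data_frame is part of newer DuckDB.jl releases; older
# versions like the one in this issue predate it.)
DuckDB.register_data_frame(con, df, "df")

# Query it and materialize the result back into a DataFrame.
result = DataFrame(DBInterface.execute(con, "SELECT sum(a) AS total FROM df"))
```

Registration avoids copying: DuckDB scans the DataFrame's columns in place during query execution.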

Can't query parquet files?

I have tried the following from the DuckDB docs:

D select count(1) from parquet_scan('test.parquet');
┌──────────┐
│ count(1) │
├──────────┤
│ 15369    │
└──────────┘
D 

That's cool. And fast :O)

But from Julia, I try this

julia> using DuckDB
julia> res = DuckDB.execute(con, """SELECT * FROM parquet_scan('test.parquet');""")
Catalog Error: Table Function with name parquet_scan does not exist!
Did you mean "arrow_scan"?
LINE 1: SELECT * FROM parquet_scan('test.parquet');
                      ^
Base.RefValue{DuckDB.duckdb_result}(DuckDB.duckdb_result(Ptr{UInt64} @0x0000000000000000, Ptr{UInt64} @0x0000000000000000, Ptr{UInt64} @0x0000000000000000, Ptr{DuckDB.duckdb_column} @0x0000000000000000, Ptr{UInt8} @0x0000000004415ec0, Ptr{Nothing} @0x0000000000000000))

That's sad.

On the other hand, csv seems to work:

julia> DuckDB.toDataFrame(DuckDB.execute(con, """select * from 't1.csv' """))
4×2 DataFrame
 Row │ x      t     
     │ Int32  Int32 
─────┼──────────────
   1 │     1      1
   2 │     2      1
   3 │     3      1
   4 │     2      1

Any ideas?
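One thing worth trying: the error message suggests this build ships without the parquet extension compiled in. DuckDB's standard `INSTALL`/`LOAD` SQL can fetch and load extensions at runtime; whether the Julia binary supports loadable extensions is an assumption here, not a confirmed fix:

```julia
using DuckDB

# Attempt to install and load DuckDB's parquet extension at runtime.
# INSTALL downloads the extension; LOAD activates it for this connection.
DuckDB.execute(con, "INSTALL parquet;")
DuckDB.execute(con, "LOAD parquet;")

# If loading succeeded, parquet_scan should now resolve.
res = DuckDB.execute(con, "SELECT count(*) FROM parquet_scan('test.parquet');")
```

If `INSTALL` itself fails, the build likely lacks extension support entirely, and upgrading the package would be the way forward.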

Memory issues without close(database)

I am connecting to a big database from my Jupyter notebook and running into memory issues. For example, running this code kills my kernel.

using DuckDB

"""
Queries a big DuckDB database
"""
function query_db()
    # Establish connection to DuckDB database
    con = connect("path/to/big_db")
    # Perform query
    df = toDataFrame(con, "select avg(example_column) from example_table;")
    # Disconnect from the database
    disconnect(con)
    return df
end

result1 = query_db()
result2 = query_db()

The problem seems to be that connect(path::String) calls duckdb_open(path, database) but doesn't return the pointer to the database, while disconnect(connection) doesn't call duckdb_close(database). So there is no way to close the database when connect(path::String) is used: the memory allocated for it is never de-allocated, and running the function twice kills the kernel. Closing the database inside query_db() seems to solve the problem.

Would it make sense to let connect(path::String) also return the pointer to the database, not only the connection, so the database can be closed after use, and/or to make disconnect(connection) close the database to resolve the asymmetry?
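Until that asymmetry is resolved, a workaround under the legacy API described in this issue is to keep the handle from DuckDB.open and close it explicitly; this assumes a DuckDB.close wrapper over duckdb_close exists, as the C API suggests:

```julia
using DuckDB

# Open, connect, query, then tear down BOTH the connection and the
# database handle, so the database's memory is actually released.
function query_db(path::AbstractString, sql::AbstractString)
    db = DuckDB.open(path)         # keep the handle so it can be closed
    con = DuckDB.connect(db)
    try
        return DuckDB.toDataFrame(con, sql)
    finally
        DuckDB.disconnect(con)     # release the connection
        DuckDB.close(db)           # de-allocate the database itself
    end
end
```

The `try`/`finally` ensures the handles are released even if the query throws, which matters when the function is called repeatedly from a long-lived kernel.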
