duckdb.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours; please be patient!

reading arrow files

Is it possible to run queries on Arrow files? Julia has a fully featured Arrow.jl library, which we could use to pass an Arrow table to DuckDB.

Something like this would be great, allowing queries over multiple Arrow files and returning the result as an Arrow table.

arrow_files = ["a1.arrow", "a2.arrow", "a3.arrow"]
tbl = Arrow.Table(arrow_files)

db = DuckDB.open(":memory:")
con = DuckDB.connect(db)

q = "SELECT count(*) from tbl;"
r = DuckDB.execute(con, q)

arrow_result = DuckDB.toArrowTable(r)

Python example

arrow_table = pq.read_table('integers.parquet')
con = duckdb.connect()

print(con.execute('SELECT SUM(data) FROM arrow_table WHERE data > 50').fetchone())
con.execute("SELECT * FROM arrow_table").fetch_arrow_table()

Use Appender API to append DataFrame to table

I used the Appender API to write a function that appends the contents of a DataFrame to an existing table for some common data types. Maybe we should integrate it into the package? I'll open a pull request.
If not, we should at least cherry-pick the first commit fa38bfce3449bbc1b98a62780308a87ebdd427a6, which dereferences the appender object in some API ccalls. Using the wrapped API as-is leads to segmentation faults, which seems like a bug to me.
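A minimal sketch of such a helper, assuming the Appender API exposed by more recent DuckDB.jl releases (`DuckDB.Appender`, `append`, `end_row`, `close`); the exact wrapper names may differ per version, and this is not the code from the referenced commit:

```julia
using DataFrames, DuckDB

# Append each row of a DataFrame to an existing table via the Appender API.
# Assumes the table schema matches the DataFrame columns in order.
function append_dataframe!(db, table::AbstractString, df::DataFrame)
    appender = DuckDB.Appender(db, table)
    for row in eachrow(df)
        for value in row
            DuckDB.append(appender, value)
        end
        DuckDB.end_row(appender)   # finish the current row
    end
    DuckDB.close(appender)         # flush pending rows and release the appender
end
```

The appender is generally much faster than row-by-row `INSERT` statements, since it batches values through DuckDB's C API.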

Core dump trying to return DataFrame

Hi,

I have just started to use DuckDB with Julia and ran into an error.
I converted a huge CSV file to a DuckDB database and wanted to repeat, in Julia, a query that runs without errors in the command-line client.
I am using Julia 1.7.3 (2022-05-06) on Arch Linux (installed via the julia-bin AUR package).

The script to reproduce is

julia> using DuckDB
julia> db = DuckDB.open("data/test.db")
Base.RefValue{Ptr{Nothing}}(Ptr{Nothing} @0x0000000002ec5ad0)
julia> con = DuckDB.connect(db)
Base.RefValue{Ptr{Nothing}}(Ptr{Nothing} @0x0000000002f0cc50)
julia> df = DuckDB.toDataFrame(con, "select * from trainfeature limit 5")

The error is

signal (11): Segmentation fault
in expression starting at REPL[3]:1
getindex at ./array.jl:861 [inlined]
toDataFrame at /home/x/.julia/packages/DuckDB/QadPg/src/DuckDB.jl:39
toDataFrame at /home/x/.julia/packages/DuckDB/QadPg/src/DuckDB.jl:24
unknown function (ip: 0x7f5af4214caf)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:126
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:215
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:166 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:587
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:731
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:885
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:830
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:944
eval at ./boot.jl:373 [inlined]
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:150
repl_backend_loop at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:246
start_repl_backend at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:231
#run_repl#47 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:364
run_repl at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:351
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#936 at ./client.jl:394
jfptr_YY.936_35454.clone_1 at /usr/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
jl_f__call_latest at /buildworker/worker/package_linux64/build/src/builtins.c:757
#invokelatest#2 at ./essentials.jl:716 [inlined]
invokelatest at ./essentials.jl:714 [inlined]
run_main_repl at ./client.jl:379
exec_options at ./client.jl:309
_start at ./client.jl:495
jfptr__start_22567.clone_1 at /usr/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
true_main at /buildworker/worker/package_linux64/build/src/jlapi.c:559
jl_repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:701
main at julia (unknown line)
unknown function (ip: 0x7f5b9acbe28f)
__libc_start_main at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x400808)
Allocations: 7632851 (Pool: 7629215; Big: 3636); GC: 11
Segmentation fault (core dumped)

All other queries I tried produced the same error.

Can DuckDB query DataFrames.jl data directly ?

As far as I know, DuckDB can query Arrow datasets directly and stream query results back to Arrow.

Can DuckDB query DataFrames.jl data directly and stream query results back to DataFrames?
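For reference, newer DuckDB.jl releases expose a `register_data_frame` function that makes a DataFrame queryable as a named view; a sketch, assuming that API is available in your installed version:

```julia
using DataFrames, DuckDB

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Register the DataFrame under the name "df" so SQL can reference it.
# (register_data_frame is part of newer DuckDB.jl releases; older
# versions like the one in this issue predate it.)
DuckDB.register_data_frame(con, df, "df")

# Query it and materialize the result back into a DataFrame.
result = DataFrame(DBInterface.execute(con, "SELECT sum(a) AS total FROM df"))
```

Registration avoids copying: DuckDB scans the DataFrame's columns in place during query execution.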

Can't query parquet files?

I have tried the following from the DuckDB docs:

D select count(1) from parquet_scan('test.parquet');
┌──────────┐
│ count(1) │
├──────────┤
│ 15369    │
└──────────┘
D 

That's cool. And fast :O)

But from Julia, I try this

julia> using DuckDB
julia> res = DuckDB.execute(con, """SELECT * FROM parquet_scan('test.parquet');""")
Catalog Error: Table Function with name parquet_scan does not exist!
Did you mean "arrow_scan"?
LINE 1: SELECT * FROM parquet_scan('test.parquet');
                      ^
Base.RefValue{DuckDB.duckdb_result}(DuckDB.duckdb_result(Ptr{UInt64} @0x0000000000000000, Ptr{UInt64} @0x0000000000000000, Ptr{UInt64} @0x0000000000000000, Ptr{DuckDB.duckdb_column} @0x0000000000000000, Ptr{UInt8} @0x0000000004415ec0, Ptr{Nothing} @0x0000000000000000))

That's sad.

On the other hand, csv seems to work:

julia> DuckDB.toDataFrame(DuckDB.execute(con, """select * from 't1.csv' """))
4×2 DataFrame
 Row │ x      t     
     │ Int32  Int32 
─────┼──────────────
   1 │     1      1
   2 │     2      1
   3 │     3      1
   4 │     2      1

Any ideas?
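One thing worth trying: the error message suggests this build ships without the parquet extension compiled in. DuckDB's standard `INSTALL`/`LOAD` SQL can fetch and load extensions at runtime; whether the Julia binary supports loadable extensions is an assumption here, not a confirmed fix:

```julia
using DuckDB

# Attempt to install and load DuckDB's parquet extension at runtime.
# INSTALL downloads the extension; LOAD activates it for this connection.
DuckDB.execute(con, "INSTALL parquet;")
DuckDB.execute(con, "LOAD parquet;")

# If loading succeeded, parquet_scan should now resolve.
res = DuckDB.execute(con, "SELECT count(*) FROM parquet_scan('test.parquet');")
```

If `INSTALL` itself fails, the build likely lacks extension support entirely, and upgrading the package would be the way forward.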

Memory issues without close(database)

I am connecting to a big database from my Jupyter notebook and running into memory issues. For example, running this code kills my kernel.

using DuckDB

"""
Queries a big DuckDB database
"""
function query_db()
    # Establish connection to DuckDB database
    con = connect("path/to/big_db")
    # Perform query
    df = toDataFrame(con, "select avg(example_column) from example_table;")
    # Disconnect from the database
    disconnect(con)
    return df
end

result1 = query_db()
result2 = query_db()

The problem seems to be that connect(path::String) calls duckdb_open(path, database) but doesn't return the pointer to the database, while disconnect(connection) doesn't call duckdb_close(database). So there is no way to close the database when connect(path::String) is used: the memory allocated for it is never de-allocated, and running the function twice kills the kernel. Closing the database inside query_db() seems to solve the problem.

Would it make sense to let connect(path::String) also return the pointer to the database, not only the connection, so the database can be closed after use, and/or to make disconnect(connection) close the database to resolve the asymmetry?
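Until that asymmetry is resolved, a workaround under the legacy API described in this issue is to keep the handle from DuckDB.open and close it explicitly; this assumes a DuckDB.close wrapper over duckdb_close exists, as the C API suggests:

```julia
using DuckDB

# Open, connect, query, then tear down BOTH the connection and the
# database handle, so the database's memory is actually released.
function query_db(path::AbstractString, sql::AbstractString)
    db = DuckDB.open(path)         # keep the handle so it can be closed
    con = DuckDB.connect(db)
    try
        return DuckDB.toDataFrame(con, sql)
    finally
        DuckDB.disconnect(con)     # release the connection
        DuckDB.close(db)           # de-allocate the database itself
    end
end
```

The `try`/`finally` ensures the handles are released even if the query throws, which matters when the function is called repeatedly from a long-lived kernel.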
