arrow-odbc's Issues

ReadStrategy is not Send

Is the ConcurrentOdbcReader thread safe? I'm getting an error saying that ReadStrategy is not Send:
error[E0277]: (dyn arrow_odbc::reader::ReadStrategy + 'static) cannot be sent between threads safely

Is there a way to use this concurrently? I'm trying to use this with the new WASI component model, saving the reader across multiple calls from the component.
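
Judging by the error, the reader holds a Box<dyn ReadStrategy> without a Send bound, so the reader itself cannot cross threads. Since RecordBatch values are Send, one workaround is to keep the reader on a dedicated thread and move only the batches. A minimal sketch (make_reader is a hypothetical factory closure, not part of arrow-odbc's API):

use std::sync::mpsc;
use std::thread;

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Spawns a worker that owns the (non-Send) reader and forwards its batches.
// RecordBatch and ArrowError are both Send, so the receiver can live anywhere.
fn batches_on_worker<F, R>(make_reader: F) -> mpsc::Receiver<Result<RecordBatch, ArrowError>>
where
    F: FnOnce() -> R + Send + 'static,
    R: Iterator<Item = Result<RecordBatch, ArrowError>>,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The reader is created and consumed entirely on this thread,
        // so it never needs to implement Send.
        for batch in make_reader() {
            if tx.send(batch).is_err() {
                break; // receiver was dropped
            }
        }
    });
    rx
}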

Semicolon after INSERT fails on DB2 for IBM i.

I was originally trying to do bulk inserts with arrow-odbc-py, but the stack traces didn't show me the original database error code and message, so I switched to Rust for a simpler reproducible example.

My goal was to get insert_into_table working with DB2 for IBM i. I believe the root cause is that the function insert_statement_text hard-codes a trailing semicolon, and the ODBC driver for DB2 for IBM i does not allow statements to end in a semicolon, nor does it accept a script of statements separated by semicolons.

Code

For a table named TSTPANDAS (originally testing inserts from Pandas) with columns ROW_NUM and RAND_FLOAT.

Warning: I really don't know Rust, though it's reminding me of C++, Java, and Kotlin. I hacked this together from examples in your docs.

use std::env;
use std::sync::Arc;

use arrow::{
    array::{Float32Array, Int32Array},
    record_batch::RecordBatch,
};
use arrow_array::RecordBatchIterator;
use arrow_odbc::{
    insert_into_table,
    odbc_api::{ConnectionOptions, Environment, Error},
};

fn main() -> Result<(), Error> {
    // If you do not do anything fancy it is recommended to have only one Environment in the
    // entire process.
    let environment = Environment::new()?;

    let user = env::var("db2_user").unwrap_or_default();
    let password = env::var("db2_password").unwrap_or_default();

    // Connect using a DSN. Alternatively we could have used a connection string
    let connection = environment.connect(
        "scupddatawhse_dev",
        &user,
        &password,
        ConnectionOptions::default(),
    )?;

    // https://docs.rs/arrow/latest/arrow/#tabular-representation
    let col_1 = Arc::new(Int32Array::from_iter([1, 2, 3])) as _;
    let col_2 = Arc::new(Float32Array::from_iter([1., 6.3, 4.])) as _;

    let record_batch =
        RecordBatch::try_from_iter([("row_num", col_1), ("rand_float", col_2)]).unwrap();

    // https://docs.rs/arrow-array/44.0.0/arrow_array/trait.RecordBatchReader.html
    let batches: Vec<RecordBatch> = vec![record_batch.clone(), record_batch.clone()];

    let mut reader = RecordBatchIterator::new(batches.into_iter().map(Ok), record_batch.schema());
    let batch_size = 1000;
    insert_into_table(&connection, &mut reader, "TSTPANDAS", batch_size).expect("Inserted into TSTPANDAS");

    Ok(())
}

Output:

rcoleman12@ross-T480-linux:~/git/test_arrow_odbc$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.05s
     Running `target/debug/test_arrow_odbc`
thread 'main' panicked at 'Inserted into TSTPANDAS: PreparingInsertStatement { source: Diagnostics { record: State: 42000, Native error: -104, Message: [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0104 - Token ; was not valid. Valid tokens: <END-OF-STATEMENT>., function: "SQLPrepare" }, sql: "INSERT INTO TSTPANDAS (row_num, rand_float) VALUES (?, ?);" }', src/main.rs:42:74
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
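
For reference, a sketch of the kind of statement builder at issue. insert_statement_text is the real function name reported above, but this reconstruction and its signature are assumptions, not the crate's actual source:

// Hypothetical reconstruction of the parameterized INSERT builder. Dropping
// the trailing ';' would satisfy DB2 for IBM i, since an ODBC statement is a
// single statement anyway and needs no terminator.
fn insert_statement_text(table: &str, columns: &[&str]) -> String {
    let column_list = columns.join(", ");
    let placeholders = vec!["?"; columns.len()].join(", ");
    // No trailing semicolon: the DB2 for i driver rejects it with SQL0104.
    format!("INSERT INTO {table} ({column_list}) VALUES ({placeholders})")
}

With the semicolon dropped, the prepared text for the table above becomes INSERT INTO TSTPANDAS (row_num, rand_float) VALUES (?, ?), which the driver accepts as a single statement.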

Performance extremely sensitive to `max_text_size`

It seems like arrow-odbc's performance can be extremely sensitive to the value of max_text_size:

In [16]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=30)

In [17]: %timeit -n1 -r1 list(reader)
81.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [18]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=300)

In [19]: %timeit -n1 -r1 list(reader)
117 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [20]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=3000)

In [21]: %timeit -n1 -r1 list(reader)
6.42 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

When writing library code I have to pick a value large enough to accommodate the longest string the library will ever deal with. In practice this can easily be the 3000 bytes used in the example above, for an 80x slowdown in this case :-(

Version: arrow-odbc-py 2.0.5
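
The magnitude of the slowdown is consistent with the reader allocating its transit buffers up front: each text column needs room for batch_size rows of max_text_size bytes, regardless of how long the actual values are. Illustrative arithmetic only; the batch size and the +1 terminator are assumptions, not arrow-odbc's exact accounting:

// Illustrative buffer arithmetic (assumed allocation model): each text column
// gets roughly batch_size * (max_text_size + 1) bytes up front.
fn text_column_buffer_bytes(batch_size: usize, max_text_size: usize) -> usize {
    batch_size * (max_text_size + 1)
}

fn main() {
    // Assuming 65_536 rows per batch for illustration:
    println!("{}", text_column_buffer_bytes(65_536, 30)); // ~2 MB per text column
    println!("{}", text_column_buffer_bytes(65_536, 3_000)); // ~197 MB per text column
}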

About ConcurrentOdbcReader and Send

Problem

This is the problem I'm having:

error[E0277]: `(dyn arrow_odbc::reader::ReadStrategy + 'static)` cannot be sent between threads safely
   --> src/odbc/odbc_drive.rs:105:25
    |
105 |                 Ok(Some(Box::pin(s)))
    |                         ^^^^^^^^^^^ `(dyn arrow_odbc::reader::ReadStrategy + 'static)` cannot be sent between threads safely
    |
    = help: the trait `std::marker::Send` is not implemented for `(dyn arrow_odbc::reader::ReadStrategy + 'static)`
    = note: required for `std::ptr::Unique<(dyn arrow_odbc::reader::ReadStrategy + 'static)>` to implement `std::marker::Send`
note: required because it appears within the type `Box<dyn ReadStrategy>`
   --> /Users/simonyi/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/boxed.rs:195:12

Where

Can it be changed to something like this?

impl<C: Cursor + Send + 'static> ConcurrentOdbcReader<C> {

I think

It would be better to use the SendableRecordBatchStream type here:

pub trait RecordBatchStream: Stream<Item = Result<RecordBatch>> {
    fn schema(&self) -> SchemaRef;
}

pub type SendableRecordBatchStream = Pin<Box<dyn RecordBatchStream + Send>>;
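
Until something like that lands, a stream with those bounds can be built on top of the current API, using the same move-only-the-batches trick as in the first issue above. A sketch assuming tokio, tokio-stream, and futures as dependencies; the trait below mirrors the quote with ArrowError filled in for the unqualified Result, and make_reader is a hypothetical factory closure:

use std::pin::Pin;
use std::task::{Context, Poll};

use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use futures::Stream;
use tokio_stream::wrappers::ReceiverStream;

// Mirror of the trait quoted above; ArrowError stands in for the error type
// of the unqualified `Result` (DataFusion's version uses its own error type).
pub trait RecordBatchStream: Stream<Item = Result<RecordBatch, ArrowError>> {
    fn schema(&self) -> SchemaRef;
}
pub type SendableRecordBatchStream = Pin<Box<dyn RecordBatchStream + Send>>;

// Carries the schema alongside a channel-backed stream of batches.
struct ChannelBatchStream {
    schema: SchemaRef,
    inner: ReceiverStream<Result<RecordBatch, ArrowError>>,
}

impl Stream for ChannelBatchStream {
    type Item = Result<RecordBatch, ArrowError>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        Pin::new(&mut self.inner).poll_next(cx)
    }
}

impl RecordBatchStream for ChannelBatchStream {
    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }
}

// Builds the non-Send reader on a dedicated OS thread; only RecordBatch values
// (which are Send) ever cross the thread boundary.
pub fn sendable_batch_stream<F, R>(schema: SchemaRef, make_reader: F) -> SendableRecordBatchStream
where
    F: FnOnce() -> R + Send + 'static,
    R: Iterator<Item = Result<RecordBatch, ArrowError>>,
{
    let (tx, rx) = tokio::sync::mpsc::channel(4);
    std::thread::spawn(move || {
        for batch in make_reader() {
            if tx.blocking_send(batch).is_err() {
                break; // consumer dropped the stream
            }
        }
    });
    Box::pin(ChannelBatchStream {
        schema,
        inner: ReceiverStream::new(rx),
    })
}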

Optimize parsing of Decimal(x, 0)

I think we can make a (small?) optimization for parsing decimals with scale = 0. Snowflake sends any integers as Decimal(38, 0) by default. At the very least we can skip removing the '.' character, but maybe we can also just use one of the integer parsers, as sketched below.
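
A sketch of what that fast path could look like (the function name and surrounding structure are hypothetical; the atoi crate is already in arrow-odbc's dependency list):

// Hypothetical fast path: with scale == 0 the text from the driver is a plain
// integer, so the pass that strips the '.' can be skipped entirely.
fn parse_decimal_i128(text: &[u8], scale: u8) -> Option<i128> {
    if scale == 0 {
        // e.g. b"42" parses directly; atoi handles an optional leading sign.
        atoi::atoi::<i128>(text)
    } else {
        // Simplified general path: drop the decimal point, then parse. The real
        // conversion must also account for truncated or missing fraction digits.
        let digits: Vec<u8> = text.iter().copied().filter(|&b| b != b'.').collect();
        atoi::atoi::<i128>(&digits)
    }
}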

mysql json type

The driver reports "Display size of column 7: -4". Is this normal? When the display size is -4, the program panics.

[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 1. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'username';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 2. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'fullname';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 3. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'password';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 4. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'email';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 5. Relational type: Varchar { length: 2083 }; Nullability: Nullable; Name: 'profile_img';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 6. Relational type: TinyInt; Nullability: Nullable; Name: 'deleted';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 7. Relational type: Unknown; Nullability: Nullable; Name: 'properties';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 8. Relational type: TinyInt; Nullability: Nullable; Name: 'is_group';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 1: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 2: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 3: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 4: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 5: Varchar { length: 2083 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 7: Unknown
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Display size of column 7: -4
thread 'odbc::odbc_drive::test::test_insert_into' panicked at /Users/simonyi/mosdb/arrow-odbc/src/reader/text.rs:52:14:
called `Result::unwrap()` on an `Err` value: TryFromIntError(())
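
A negative display size is the driver's way of saying the length is unknown or unbounded (MySQL reports -4 for JSON), so the failed integer conversion should arguably fall back to a bounded buffer rather than unwrap. A sketch with hypothetical names, not arrow-odbc's actual fix:

// Hypothetical guard: a negative display size (e.g. -4 for MySQL JSON) cannot
// be converted to a buffer length, so fall back to a configured upper bound
// instead of unwrapping the failed isize -> usize conversion.
fn text_buffer_len(display_size: isize, max_text_size: usize) -> usize {
    usize::try_from(display_size).unwrap_or(max_text_size)
}

// text_buffer_len(-4, 1024) == 1024
// text_buffer_len(255, 1024) == 255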

mysql support

Is MySQL currently incompatible?

  const CONNECTION_STRING: &str = r#"
    Driver={MySQL ODBC 8.2 Unicode Driver};
    Server=127.0.0.1;
    DB=morse;
    UID=root;
    PWD=123456;
    Port=3306;
    "#;

odbcinst.ini

[MySQL ODBC 8.2 Unicode Driver]
Description=MySQL ODBC 8.2 Unicode Driver
Driver = /usr/local/lib/libmyodbc8w.so

running 1 test
Error: ODBC emitted an error calling 'SQLDriverConnect':
State: H000, Native error: 0, Message: [

Stack backtrace:
   0: std::backtrace_rs::backtrace::libunwind::trace
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/backtrace.rs:331:13
   3: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
             at /Users/simonyi/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/anyhow-1.0.75/src/error.rs:551:25
   4: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/result.rs:1963:27
   5: morse_engine::odbc::test::test
             at ./src/odbc/mod.rs:27:26
   6: morse_engine::odbc::test::test::{{closure}}
             at ./src/odbc/mod.rs:15:18
   7: core::ops::function::FnOnce::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
   8: core::ops::function::FnOnce::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
   9: test::__rust_begin_short_backtrace
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:628:18
  10: test::run_test_in_process::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:651:60
  11: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/panic/unwind_safe.rs:272:9
  12: std::panicking::try::do_call
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:552:40
  13: std::panicking::try
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:516:19
  14: std::panic::catch_unwind
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panic.rs:142:14
  15: test::run_test_in_process
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:651:27
  16: test::run_test::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:574:43
  17: test::run_test::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:602:41
  18: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/sys_common/backtrace.rs:154:18
  19: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/thread/mod.rs:529:17
  20: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/panic/unwind_safe.rs:272:9
  21: std::panicking::try::do_call
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:552:40
  22: std::panicking::try
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:516:19
  23: std::panic::catch_unwind
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panic.rs:142:14
  24: std::thread::Builder::spawn_unchecked_::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/thread/mod.rs:528:30
  25: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
  26: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/alloc/src/boxed.rs:2007:9
  27: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/alloc/src/boxed.rs:2007:9
  28: std::sys::unix::thread::Thread::new::thread_start
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/sys/unix/thread.rs:108:17
  29: __pthread_joiner_wake
test odbc::test::test ... FAILED

failures:

failures:
    odbc::test::test

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 13 filtered out; finished in 0.37s

error: test failed, to rerun pass `-p morse-engine --lib`
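
The driver's message is truncated, so this is only a guess, but note that the raw string literal above keeps its newlines and indentation, so the driver receives a connection string with embedded line breaks. Writing it on a single line rules that out:

const CONNECTION_STRING: &str =
    "Driver={MySQL ODBC 8.2 Unicode Driver};Server=127.0.0.1;Port=3306;DB=morse;UID=root;PWD=123456;";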

arrow version conflict

Cargo.lock

[[package]]
name = "arrow-odbc"
version = "3.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d2278bb408bd1cbf82f725517ff182d06b024657040f9bafac23b9bd0ca61883"
dependencies = [
 "arrow 49.0.0",
 "atoi",
 "chrono",
 "log",
 "odbc-api",
 "thiserror",
]

My Cargo.toml

[dependencies]
arrow = { version = "46.0.0", features = ["prettyprint","ffi"] }
arrow-array = { version = "46.0.0", default-features = false, features = ["chrono-tz"] }
arrow-buffer = { version = "46.0.0", default-features = false }
arrow-flight = { version = "46.0.0", features = ["flight-sql-experimental"] }
arrow-schema = { version = "46.0.0", default-features = false }
parquet = { version = "46.0.0", features = ["arrow", "async", "object_store"] }

What should I do to make arrow-odbc use arrow 46.0.0? Although arrow-odbc accepts version = ">= 29, < 50", the resolved build is as above. Why is this?
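
Cargo resolves an open range like >= 29, < 50 to the newest release it can reach, so the lockfile lands on arrow 49 even though arrow 46 is in the workspace; and because arrow bumps its major version on every release, 46 and 49 are not semver-compatible and will not be unified automatically. One way to pin the resolved copy back down (a sketch; the exact package-spec syntax depends on your cargo version):

cargo update --package arrow@49.0.0 --precise 46.0.0

After that, arrow-odbc's requirement is satisfied by 46.0.0 and the duplicate copy disappears from Cargo.lock.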

compile error

OS: macOS (M1)

-nodefaultlibs"
  = note: ld: warning: ignoring duplicate libraries: '-lc++'
          ld: library 'odbc' not found
          clang: error: linker command failed with exit code 1 (use -v to see invocation)
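
"ld: library 'odbc' not found" means the linker cannot find the unixODBC driver manager. A minimal sketch of a fix, assuming Homebrew on Apple Silicon (paths differ on Intel Macs):

brew install unixodbc
# Homebrew on Apple Silicon installs into /opt/homebrew, which the linker
# does not search by default:
export RUSTFLAGS="-L/opt/homebrew/lib"
cargo build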
          

Panic with divide by zero on empty schema

Since the release of arrow_odbc 3.0.0 (arrow_odbc_py 2.0.0), running a multi-statement query in which the second or a later statement does not generate a result set fails with a division-by-zero error while calculating the buffer_size_in_rows. The first statement returns a result set, which allows the Cursor to be created, and the statement handle still has more results; however, the buffer-size calculation now panics.

import arrow_odbc as ao

reader = ao.read_arrow_batches_from_odbc(
    connection_string="Driver={DB2};"
    "Hostname=host;"
    "Port=50000;"
    "Protocol=TCPIP;"
    "Database=db2db;"
    "UID=user;"
    "PWD=password;",
    query="""
SELECT 1 
FROM SYSIBM.SYSDUMMY1
;

DECLARE GLOBAL TEMPORARY TABLE MY_DGTT AS ( 
    SELECT *, 5 AS FIVE
    FROM SYSIBM.SYSDUMMY1 
) 
WITH DATA 
ON COMMIT PRESERVE ROWS 
NOT LOGGED 
WITH REPLACE
;

SELECT * 
FROM SESSION.MY_DGTT
;
""",
)

while True:
    print("-------------------------")
    print(reader.schema or "No Schema")
    if reader.schema:
        print("-------------------------")
        for batch in reader:
            print(batch)
    if not reader.more_results():
        break

Error Message:

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-odbc-3.0.0/src/reader/odbc_reader.rs:467:30:
attempt to divide by zero
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected Results:

-------------------------
1: int32 not null
-------------------------
pyarrow.RecordBatch
1: int32 not null
----
1: [1]
-------------------------
No Schema
-------------------------
IBMREQD: string not null
FIVE: int32 not null
-------------------------
pyarrow.RecordBatch
IBMREQD: string not null
FIVE: int32 not null
----
IBMREQD: ["Y"]
FIVE: [5]
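
A sketch of the kind of guard that would avoid the panic (names are hypothetical; the real calculation sits near odbc_reader.rs:467 in the trace): when a statement yields an empty schema, the per-row byte size is zero, so the memory-limit division must be skipped.

// Hypothetical guard: with an empty result-set schema there is nothing to
// fetch, so any row count works; skip the division instead of panicking.
fn buffer_size_in_rows(memory_limit: usize, bytes_per_row: usize, max_rows: usize) -> usize {
    if bytes_per_row == 0 {
        max_rows
    } else {
        (memory_limit / bytes_per_row).min(max_rows).max(1)
    }
}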

application run crash

let cursor = conn
    .execute("SHOW COLUMNS FROM hello", ())
    .unwrap()
    .expect("");
// Each batch shall only consist of maximum 10.000 rows.
let max_batch_size = 10_000;
// Read result set as arrow batches. Infer Arrow types automatically using the meta
// information of `cursor`.
let mut arrow_record_batches = OdbcReader::new(cursor, max_batch_size).unwrap();

When "SHOW COLUMNS FROM hello" is executed, if batch_size is set to 10000, the program will crash. If it is set to 10, it will not crash.
