Comments (13)

pacman82 commented on July 17, 2024

Hello Chris,

I am happy you are excited about this tool. Sorry it does not seem to do the job for you right now; I am trying to understand what is missing, though. The query subcommand does offer a --column-compression-default option, and it defaults to gzip.
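
For example, an invocation along these lines should produce a gzip-compressed file. Treat it as a sketch rather than exact syntax: the DSN, table name, and argument order are placeholders, only the query subcommand and the --column-compression-default option come from the discussion here.

# Export a query result to Parquet; gzip is the default column compression, shown explicitly here.
odbc2parquet query \
  --connection-string "DSN=MyAccessDb" \
  --column-compression-default gzip \
  out.parquet \
  "SELECT application, accountnumber, readsequence, device FROM readings"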

Cheers, Markus

pacman82 commented on July 17, 2024

See issue #51

chrisfw commented on July 17, 2024

Thanks for your response @pacman82. Apologies, I didn't see that option initially. It turns out the initial batch size (1) was causing the output file to be very large in spite of the gzip compression that, as you indicated, was applied by default. I tried a number of batch sizes that errored out due to memory allocation failures, but once I found a size that worked, the resulting file was much smaller and more manageable.
Some of my initial difficulties may stem from my testing the tool with the MS Access ODBC driver.

FYI, below is an example of the errors I was seeing.

2021-07-11T10:17:52-04:00 - WARN - State: 01000, Native error: 0, Message: [Microsoft][ODBC Driver Manager] The driver doesn't support the version
of ODBC behavior that the application requested (see SQLSetEnvAttr).
2021-07-11T10:17:52-04:00 - INFO - Memory usage per row is 1073741931 bytes. This excludes memory directly allocated by the ODBC driver.
memory allocation of 80530636800 bytes failed

Thanks again for your assistance.
Chris

pacman82 commented on July 17, 2024

Hello Chris,

Yeah, the tool chooses the batch size so that it does not exceed 2 GB on a 64-bit platform. The --batch-size-mib option might also be helpful to find the right size for your system, as you likely know how much memory you want to spend on each batch. I fully intend to add a row-by-row mode, which sacrifices performance for the ability to stream large values from the database. However, I would need to dedicate some time to this, and my life is pretty busy right now, so no timeline on that.
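
For example, to cap the memory fetched per batch (the 512 MiB value, the DSN, and the query are only placeholders for illustration; the flag itself is the point here):

# Limit each fetch batch to roughly 512 MiB instead of the automatically chosen size.
odbc2parquet query \
  --connection-string "DSN=MyAccessDb" \
  --batch-size-mib 512 \
  out.parquet \
  "SELECT application, accountnumber, readsequence, device FROM readings"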

Thanks for giving this tool a spin, and I am happy it worked for you after all.

Cheers, Markus

chrisfw commented on July 17, 2024

Thanks for following up, Markus, and for making this helpful utility. The 2 GB default limit makes sense. I do have a problem importing into DuckDB (either into a table or by creating a view over the file) any file exported by odbc2parquet with a batch size > 1, but I am not sure where the problem lies at this point. DuckDB reports the following error:

>>> con.execute("CREATE TABLE chist as SELECT * from parquet_scan('/Users/cwhelan/Downloads/outcompressed.par')")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: INTERNAL Error: Invalid unicode detected in segment statistics update!

This could likely be an issue on the DuckDB side, but I thought I'd make you aware of it in case you have any ideas or have seen this problem occur before.

Regards,
Chris

chrisfw commented on July 17, 2024

@pacman82, I installed parquet-tools, and cat-ting the problematic file shows the output below. Could the device fields with the \u0000 values be causing the issue?

cwhelan@Christophers-MacBook-Pro Downloads % parquet-tools cat --json outcompressed.par  
{"application":3,"accountnumber":112293300,"readsequence":2,"device":"035247N02002"}
{"application":3,"accountnumber":112294300,"readsequence":1,"device":"035446N02002"}
{"application":1,"accountnumber":112291000,"readsequence":2,"device":"035446N02002"}
{"application":1,"accountnumber":112291000,"readsequence":3,"device":"030216H04001"}
{"application":1,"accountnumber":112291000,"readsequence":4,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112292300,"readsequence":1,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112292300,"readsequence":2,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":3,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":4,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":5,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":6,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112293300,"readsequence":1,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}

When I cat the file produced where batch size = 1, all of the device values are populated with actual values.

pacman82 commented on July 17, 2024

Thanks for making me aware, but since it says INTERNAL error I am leaning towards a bug within DuckDB. However, you may want to inspect the schema you are exporting carefully; verbose mode might help. Check whether there are, for example, columns with a ridiculously large column size in there. If so, try to cast them down to something smaller in the query. 2 GB per row sounds like a lot; you may get away with much less, and DuckDB is then more likely to be able to process the file.

Right now the tool just uses the schemas reported by ODBC, but they might not always be sensible.
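
As a sketch of the casting idea (the table name and the 32-character limit are made up, and the exact cast syntax depends on your ODBC driver; MS Access may need a different conversion function):

# Cast the text column down to a bounded length so the reported per-row size shrinks.
odbc2parquet query \
  --connection-string "DSN=MyAccessDb" \
  out.parquet \
  "SELECT application, accountnumber, readsequence, CAST(device AS VARCHAR(32)) AS device FROM readings"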

pacman82 commented on July 17, 2024

Sorry for my last response, I had not seen your last post yet.

pacman82 commented on July 17, 2024

If the files contain the NULs before cat-ting them, please open a bug.

chrisfw commented on July 17, 2024

@pacman82, thanks for the info. I will investigate the schema using verbose mode as you suggested. I suspect you are correct and that the schema reported by the driver is causing the problem. If that proves true, and the NULs are still in the output file after I cast to more reasonable column sizes, I'll open a bug as you instructed.

pacman82 commented on July 17, 2024

FYI: NULs in the output are allowed, as long as they also appear in the input. That doesn't seem to be the case here, though. See #36.

Thanks for keeping me in the loop

chrisfw commented on July 17, 2024

@pacman82 You were correct. It was a schema issue, caused in this case by using the Long Text data type in MS Access. Once I changed all of the Long Text columns to Short Text, my problems were resolved, both on the generation side (batching and compression) and on the consumption side (the DuckDB import failure). Thanks for all of your assistance.

pacman82 commented on July 17, 2024

@chrisfw You are welcome.

Cheers, Markus
