Comments (13)

pacman82 commented on July 17, 2024

Hello Chris,

I am happy you are excited about this tool. Sorry it does not seem to do the job for you right now; I am trying to understand what is missing, though. The query subcommand does offer a --column-compression-default option, and it defaults to gzip.
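
For example, an invocation along these lines should produce a gzip-compressed file. Treat it as a sketch rather than exact syntax: the DSN, table name, and argument order are placeholders, only the query subcommand and the --column-compression-default option come from the discussion here.

# Export a query result to Parquet; gzip is the default column compression, shown explicitly here.
odbc2parquet query \
  --connection-string "DSN=MyAccessDb" \
  --column-compression-default gzip \
  out.parquet \
  "SELECT application, accountnumber, readsequence, device FROM readings"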

Cheers, Markus

pacman82 commented on July 17, 2024

See issue #51

chrisfw commented on July 17, 2024

Thanks for your response @pacman82. Apologies, I didn't see that option initially. It turns out the initial batch size (1) was causing the output file to be very large in spite of the gzip compression that, as you indicated, was applied by default. I tried a number of batch sizes that errored out due to memory allocation failures, but once I found a size that worked, the resulting file was much smaller and more manageable.
Some of my initial difficulties may stem from my testing the tool with the MS Access ODBC driver.

FYI, below is an example of the errors I was seeing.

2021-07-11T10:17:52-04:00 - WARN - State: 01000, Native error: 0, Message: [Microsoft][ODBC Driver Manager] The driver doesn't support the version
of ODBC behavior that the application requested (see SQLSetEnvAttr).
2021-07-11T10:17:52-04:00 - INFO - Memory usage per row is 1073741931 bytes. This excludes memory directly allocated by the ODBC driver.
memory allocation of 80530636800 bytes failed

Thanks again for your assistance.
Chris

pacman82 commented on July 17, 2024

Hello Chris,

Yeah, the tool chooses the batch size so that it does not exceed 2 GB on a 64-bit platform. The --batch-size-mib option might also be helpful to find the right size for your system, as you likely know how much memory you want to spend on each batch. I fully intend to add a row-by-row mode, which sacrifices performance for the ability to stream large values from the database. However, I would need to dedicate some time to this, and my life is pretty busy right now, so no timeline on that.
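
For example, to cap the memory fetched per batch (the 512 MiB value, the DSN, and the query are only placeholders for illustration; the flag itself is the point here):

# Limit each fetch batch to roughly 512 MiB instead of the automatically chosen size.
odbc2parquet query \
  --connection-string "DSN=MyAccessDb" \
  --batch-size-mib 512 \
  out.parquet \
  "SELECT application, accountnumber, readsequence, device FROM readings"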

Thanks for giving this tool a spin, and I am happy it worked for you after all.

Cheers, Markus

chrisfw commented on July 17, 2024

Thanks for following up, Markus, and for making this helpful utility. The 2 GB default limit makes sense. I do have a problem importing into DuckDB (either into a table or by creating a view over the file) any file exported by odbc2parquet with a batch size > 1, but I am not sure where the problem lies at this point. DuckDB reports the following error:

>>> con.execute("CREATE TABLE chist as SELECT * from parquet_scan('/Users/cwhelan/Downloads/outcompressed.par')")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: INTERNAL Error: Invalid unicode detected in segment statistics update!

This could likely be an issue on the DuckDB side, but I thought I'd make you aware of it in case you have any ideas or have seen this problem occur before.

Regards,
Chris

chrisfw commented on July 17, 2024

@pacman82, I installed parquet-tools, and cat-ting the problematic file shows the output below. Could the device fields with the \u0000 values be causing the issue?

cwhelan@Christophers-MacBook-Pro Downloads % parquet-tools cat --json outcompressed.par  
{"application":3,"accountnumber":112293300,"readsequence":2,"device":"035247N02002"}
{"application":3,"accountnumber":112294300,"readsequence":1,"device":"035446N02002"}
{"application":1,"accountnumber":112291000,"readsequence":2,"device":"035446N02002"}
{"application":1,"accountnumber":112291000,"readsequence":3,"device":"030216H04001"}
{"application":1,"accountnumber":112291000,"readsequence":4,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112292300,"readsequence":1,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112292300,"readsequence":2,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":3,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":4,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":5,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":6,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112293300,"readsequence":1,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}

When I cat the file produced where batch size = 1, all of the device values are populated with actual values.

pacman82 commented on July 17, 2024

Thanks for making me aware, but since it says INTERNAL error I am leaning towards a bug within DuckDB. However, you may want to inspect the schema you are exporting carefully; verbose mode might help. Check whether there are, for example, columns with a ridiculously large column size in there. If so, try to cast them down to something smaller in the query. 2 GB per row sounds like a lot; you may get away with much less, and DuckDB is then more likely to be able to process the file.

Right now the tool just uses the schemas reported by ODBC, but they might not always be sensible.
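
As a sketch of the casting idea (the table name and the 32-character limit are made up, and the exact cast syntax depends on your ODBC driver; MS Access may need a different conversion function):

# Cast the text column down to a bounded length so the reported per-row size shrinks.
odbc2parquet query \
  --connection-string "DSN=MyAccessDb" \
  out.parquet \
  "SELECT application, accountnumber, readsequence, CAST(device AS VARCHAR(32)) AS device FROM readings"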

pacman82 commented on July 17, 2024

Sorry for my last response, I had not seen your last post yet.

pacman82 commented on July 17, 2024

If the files contain the NULs before cat-ting them, please open a bug.

chrisfw commented on July 17, 2024

@pacman82, thanks for the info. I will investigate the schema using verbose mode as you suggested. I suspect you are correct and that the schema reported by the driver is causing the problem. If that proves true, and the NULs are still in the output file after I cast to more reasonable column sizes, I'll open a bug as you instructed.

pacman82 commented on July 17, 2024

FYI: NULs in the output are allowed, as long as they also appear in the input. That doesn't seem to be the case here, though. See #36.

Thanks for keeping me in the loop

chrisfw commented on July 17, 2024

@pacman82 You were correct. It was a schema issue, caused in this case by using the Long Text data type in MS Access. Once I changed all of the Long Text columns to Short Text, my problems were resolved, both on the generation side (batching and compression) and on the consumption side (the DuckDB import failure). Thanks for all of your assistance.

pacman82 commented on July 17, 2024

@chrisfw You are welcome.

Cheers, Markus
