Comments (13)
Hello Chris,
I am happy you are excited about this tool. Sorry it does not seem to do the job right now. I am trying to understand what is missing though. The query
subcommand does offer a --column-compression-default
option. It defaults to gzip
.
Cheers, Markus
from odbc2parquet.
See issue #51
from odbc2parquet.
Thanks for your response @pacman82. Apologies, I didn't see that option initially and it turns out that the initial batch size (1) was causing the output file to be a large size in spite of the gzip compression that as you indicated was applied by default. I tried a number of batch sizes that errored out due to memory allocation failures, but when I found a size that worked, the resulting file was much smaller and more manageable.
Some of my initial difficulties may stem from me testing the tool with the MS Access ODBC driver.
FYI, below is an example of the errors I was seeing.
2021-07-11T10:17:52-04:00 - WARN - State: 01000, Native error: 0, Message: [Microsoft][ODBC Driver Manager] The driver doesn't support the version
of ODBC behavior that the application requested (see SQLSetEnvAttr).
2021-07-11T10:17:52-04:00 - INFO - Memory usage per row is 1073741931 bytes. This excludes memory directly allocated by the ODBC driver.
memory allocation of 80530636800 bytes failed
Thanks again for your assistance.
Chris
from odbc2parquet.
Hello Chris,
yeah the tool chooses the batch size, so it does not exceed 2GB on a 64 Bit platform. The --batch-size-mib
option might also be helpful to find the right size for your system, as you are likely to know how much memory you want to spend on each batch. I fully intend to add a row by row mode, which sacrifices performance for the ability to stream large values from the database. However I would need to dedicate some time to this, and my life is pretty busy right now, so no timeline on that.
Thanks for giving this tool a spin, and I am happy it worked for you after all.
Cheers, Markus
from odbc2parquet.
Thanks for following up Markus and for making this helpful utility. The 2GB default limit makes sense. I do have a problem importing (into a table or creating a view over) any file exported by odbc2parquet with batch size > 1 into DuckDB, but I am not sure where the problem lies at this point. DuckDB reports the following error:
>>> con.execute("CREATE TABLE chist as SELECT * from parquet_scan('/Users/cwhelan/Downloads/outcompressed.par')")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: INTERNAL Error: Invalid unicode detected in segment statistics update!
This could likely be an issue on the DuckDB side, but I thought I'd make you aware of it in case you have any ideas or have seen this problem occur before.
Regards,
Chris
from odbc2parquet.
@pacman82 , I installed parquet-tools and cat-ting the problematic file shows the below. Could the device fields with the \u0000 values be causing the issue?
cwhelan@Christophers-MacBook-Pro Downloads % parquet-tools cat --json outcompressed.par
{"application":3,"accountnumber":112293300,"readsequence":2,"device":"035247N02002"}
{"application":3,"accountnumber":112294300,"readsequence":1,"device":"035446N02002"}
{"application":1,"accountnumber":112291000,"readsequence":2,"device":"035446N02002"}
{"application":1,"accountnumber":112291000,"readsequence":3,"device":"030216H04001"}
{"application":1,"accountnumber":112291000,"readsequence":4,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112292300,"readsequence":1,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112292300,"readsequence":2,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":3,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":4,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":5,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":1,"accountnumber":112292000,"readsequence":6,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
{"application":3,"accountnumber":112293300,"readsequence":1,"device":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}
When I cat the file produced where batch size = 1, all of the device values are populated with actual values.
from odbc2parquet.
Thanks for making me aware, but since it says INTERNAL error I am leaning towards an error within DuckDB. However, you may want to inspect the schema you are exporting carefully. Verbose mode might help. Check if there are for example some columns with a ridiculously large column size in there. If so, try to cast them down into something smaller in the query. 2GB per row sounds like a lot, you may get away with way less and Duck DB is then more likely to be able to process this.
Right now the tool just uses the schemas reported by ODBC, but they might not always be sensible.
from odbc2parquet.
Sorry for my last response I had not seen, your last post yet
from odbc2parquet.
If the files contain the Nuls before catting, please open a bug
from odbc2parquet.
@pacman82 , Thanks for your info, I will investigate the schema using verbose mode as you suggested. I suspect you are correct and the schema provided by the driver may be causing the problem. If that does prove true and I still have issues after casting to more reasonable column sizes and the nulls are there in the output file, I'll create a bug per your instruction.
from odbc2parquet.
FYI: Nils in the output are allowed, as long as they also appear in the input. That doesn't seem to be the case here though. See: #36
Thanks for keeping me in the loop
from odbc2parquet.
@pacman82 You were correct. It was a schema issue caused in this case by using the Long Text data type within MS Access. Once I changed all of the Long Text columns to Short Text columns, my problems both on the generation side with batching and compression and on the consumption side with the DuckDB import failure were resolved. Thanks for all of your assistance.
from odbc2parquet.
@chrisfw You are welcome.
Cheers, Markus
from odbc2parquet.
Related Issues (20)
- Issue with MySQL JSON columns HOT 8
- Reserved Column Names not Supported HOT 1
- Feature Request - Support column encryption in the generated parquet file HOT 4
- JobName as .sql file in config file HOT 4
- Parquet format version support HOT 9
- Feature suggestion: connect to URL `postgresql://username:pass@host/database` HOT 1
- What permissions are needed? - State: 42501, Native error: 1, Message: ERROR: permission denied HOT 4
- StarRocks parquet file import of parquet file generated by odbc2parquet fails with encoding error HOT 11
- Memory allocation with column-length-limit HOT 11
- Build for alpine HOT 8
- file-size-threshold generates wrong size files HOT 1
- --no-empty-file option doesn't work properly when row-groups-per-file should devide result into few files HOT 6
- MSSQL nvarchar - missing column in output file HOT 2
- Feature request: Progress bar for full table copies HOT 6
- Data source must return valid UTF16 in wide character buffer: Utf16Error HOT 4
- Write statistics HOT 14
- Make zstd the default compression HOT 4
- Build release assets for Ubuntu ARM64 as well HOT 11
- Exporter adding trailing zero's in when exporting from PostgreSQL Numeric dtype HOT 5
- thread 'main' panicked at src/query/date.rs:60:87: called `Option::unwrap()` on a `None` value HOT 7
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from odbc2parquet.