
Comments (11)

pacman82 commented on August 15, 2024

Hello Chris,

I cannot tell you what is wrong either, but here are some things you might want to try:

  • Does it fail even with a parquet file containing only a single row? You could produce one by changing the input query so that only a single row is returned. A small file would also speed up the feedback loop for the other experiments.
  • Which column causes this? Change the query so that only certain columns appear in the odbc2parquet output/StarRocks input. Maybe you can zero in on the exact column causing it.
  • --parquet-column-encoding can be used to change the encoding of a column. See odbc2parquet query --help.
  • Kindly ask the maintainers of StarRocks/Parquet to provide a better error message, detailing what exactly it is that is unsupported.
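The first three steps above could be sketched as a series of CLI runs. This is a hypothetical workflow: the connection string, table, and column names are placeholders, not from this thread; only the `odbc2parquet query` flags shown elsewhere in the thread are assumed real.

```shell
# 1) Single-row repro: does a one-row file still fail to import?
odbc2parquet query \
  -c "DSN=MyOracle;Uid=user;Pwd=secret;" \
  /tmp/one_row.parquet \
  "select * from mytable where rownum <= 1"

# 2) Bisect by column: export a subset of columns at a time.
odbc2parquet query \
  -c "DSN=MyOracle;Uid=user;Pwd=secret;" \
  /tmp/subset.parquet \
  "select col_a, col_b from mytable"

# 3) Override one column's encoding (COLUMN:ENCODING syntax,
#    as shown later in this thread).
odbc2parquet query \
  --parquet-column-encoding 'col_a:delta-byte-array' \
  -c "DSN=MyOracle;Uid=user;Pwd=secret;" \
  /tmp/encoded.parquet \
  "select col_a from mytable"
```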

Best, Markus

from odbc2parquet.

chrisfw commented on August 15, 2024

@pacman82 , thank you very much for this information.

  • Does it fail even with a parquet file containing only a single row? This could be produced by changing the input query so only a single row gets returned.
    No, I narrowed it down to a particular row set size after which the import fails: it fails on a select with no sort applied and rownum < 56322, but that row does not appear to contain any unusual text when I export it to CSV using SQL Developer.

  • Which column causes this? Change the query so only certain columns are in the odbc2parquet output/StarRocks input.
    I started down this path by only including the varchar columns, but I wasn't able to identify the culprit. I will revisit this though and try each column. Sadly there are 181 columns in this table to eliminate. :(

  • --parquet-column-encoding can be used to change the encoding of a column. See odbc2parquet query --help
    I will try this. Can you tell me what the default encoding used by odbc2parquet is? Also, do you have a suggested sequence in which to try the others: plain, bit-packed, delta-binary-packed, delta-byte-array, delta-length-byte-array, or rle?

  • Ask the maintainers of StarRocks/Parquet kindly to provide a better error message, detailing what exactly it is which is unsupported.
    I did this and I am waiting on an answer.


chrisfw commented on August 15, 2024

@pacman82 , FYI, I identified the problematic column and the specific rows in Oracle. The problem values were in a non-nullable varchar2 column and contained only a single space. I tried the encoding options, but they were either not recognized or not applicable to the column data type, e.g.

`ENCODING` must be one of: `plain`, `bit-packed`, `delta-binary-packed`, `delta-byte-array`, `delta-length-byte-array` or `rle`

odbc2parquet query --no-empty-file --driver-does-not-support-64bit-integers --parquet-column-encoding 'UTCM_REF_ACCT:bit-packed' --column-compression-default snappy -c "Driver={Oracle 19 ODBC driver};Dbq=UNITST6_UNITIL;Uid=enqjsql;Pwd=enq;" /extdatemp/futcmas.parquet "select * from futcmasview"
error: invalid value 'UTCM_REF_ACCT:bit-packed' for '--parquet-column-encoding <PARQUET_COLUMN_ENCODING>': Sorry, I do not know a column encoding called 'bit-packed'.

I worked around this issue by importing the odbc2parquet-generated parquet file into DuckDB and then exporting it back out from DuckDB as a parquet file. The newly generated file loaded into StarRocks without an issue.
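The round trip through DuckDB can be done in a single statement. A sketch, assuming the DuckDB CLI is installed; the file names are placeholders:

```shell
# Read the odbc2parquet output and rewrite it with DuckDB's own
# parquet writer. read_parquet() and COPY ... TO (FORMAT PARQUET)
# are DuckDB SQL; the CLI reads statements from stdin.
duckdb <<'SQL'
COPY (SELECT * FROM read_parquet('futcmas.parquet'))
  TO 'futcmas_duckdb.parquet' (FORMAT PARQUET);
SQL
```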

Regards,
Chris


pacman82 commented on August 15, 2024

Released odbc2parquet 2.0.2 with updated text for query --help. The help text no longer mentions the bit-packed encoding, since it is not supported.

I would be interested to learn what changed when importing/exporting the file in and out of DuckDB. The value of the field, or its schema?

Command line tools like parquet-schema and parquet-read allow you to inspect the parquet file. Alternatively, if you produce minimal parquet files containing only that column and field, one before and one after the DuckDB export, and share them with me, I could also take a look at them.
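For reference, a sketch of those inspection tools, which ship with the Rust parquet crate; the file path is a placeholder and the install feature flag may vary by crate version:

```shell
# Install the CLI binaries from the Rust parquet crate
# (the "cli" feature gates them in recent versions).
cargo install parquet --features cli

parquet-schema /tmp/futcmas.parquet   # print the file's Parquet schema
parquet-read /tmp/futcmas.parquet 5   # dump the first 5 records
```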

Did you verify that the field is a normal ASCII white space? I could imagine a bunch of control characters also rendering that way.
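One way to check is to print the code point and Unicode name of each character, since a no-break space or a control character renders just like an ASCII space in most tools. A stdlib sketch, not from the thread:

```python
import unicodedata

def describe_chars(s: str) -> list[str]:
    """Return "U+XXXX NAME" for each character, so a plain ASCII space
    can be told apart from look-alike whitespace or control characters."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<control/unnamed>')}"
        for ch in s
    ]

print(describe_chars(" "))       # a plain ASCII space (0x20)
print(describe_chars("\u00a0"))  # a no-break space that renders the same
print(describe_chars("\x0b"))    # a control character (vertical tab)
```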


chrisfw commented on August 15, 2024

@pacman82 , thank you for updating the release to exclude the invalid encoding option.

Thanks also for offering to take a look; I will create a test case for you and share the odbc2parquet- and DuckDB-generated parquet files. I did verify through a query that the column contains ASCII 32.


chrisfw commented on August 15, 2024

@pacman82 , I recreated the issue and solved it the same way. The error in StarRocks was different, but that may just be because the column was in the middle of the schema in the previous case, whereas in this scenario it is right after the primary key, with only two columns in total.
type:LOAD_RUN_FAIL; msg:Read Batch Error With Libarrow. filename: s3a://importdata/jobtest.parquet

Attached are the two files for your review. I hope this is helpful and thank you for your assistance. Please let me know if you need anything else.
jobtest.zip

Regards,
Chris


pacman82 commented on August 15, 2024

Hello @chrisfw ,

Looking at both files, I can see that the one created by odbc2parquet uses parquet format 2. More importantly, all of its columns are required. The one created by DuckDB uses parquet format 1 and contains only optional columns. I was surprised how many rows were in these files. Was a single record not enough to reproduce the issue?
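The schema difference described above would look roughly like this in the message-type notation printed by parquet-schema. A sketch: JOBACCOUNT is the column from the attached files, but the exact type annotations are illustrative:

```
# odbc2parquet output (parquet format 2): the column is required
message schema {
  REQUIRED BYTE_ARRAY JOBACCOUNT (UTF8);
}

# DuckDB output (parquet format 1): the column is optional
message schema {
  OPTIONAL BYTE_ARRAY JOBACCOUNT (UTF8);
}
```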

I recreated the issue and solved it the same way.

How did you recreate and solve the issue?

Best, Markus


chrisfw commented on August 15, 2024

Hi Markus,

I tried a couple of times, unsuccessfully, to reproduce the issue using just a single record. Both an export of a single row with normal data in the JOBACCOUNT column and an export of a single row containing only a " " in the JOBACCOUNT column imported into StarRocks without issue. As a result, I just exported the entire table to make the StarRocks import fail.

How did you recreate and solve the issue?

By "solve", I just meant that I used the workaround of importing the file into duckdb and then exporting from duckdb to parquet and finally importing the newly generated file into StarRocks successfully.

Regards,
Chris


pacman82 commented on August 15, 2024

Hello Chris,

So far StarRocks does not seem to take issue with the schema, the encoding, the parquet version, or anything else in particular. Moreover, I currently like the fact that odbc2parquet uses the (recent) Parquet format 2 and correctly identifies required columns as such. At this point I fear any random change might make a file compatible or incompatible.

I inspected the files and could not find fault with them. I really wish StarRocks would say precisely what it is they do not support, or what is wrong with the file. It is really hard for me to debug the issue further at the moment.

Best, Markus


chrisfw commented on August 15, 2024

@pacman82 , thanks for looking into this. Based on both my own investigation and yours, I totally agree that this is a StarRocks issue. The StarRocks team indicated they are using the Apache Arrow libraries to load the data, but I don't have any further information, and there doesn't seem to be anything more to be done at this point. Thanks again for your help.

Regards,
Chris


pacman82 commented on August 15, 2024

Thanks @chrisfw , feel free to reopen this or open a new issue, if you learn something actionable.

Best, Markus
