Comments (9)

yruslan commented on June 17, 2024

Currently this is not supported, and I'm not sure at this point how to add this support in a general way.

But there is a possible workaround. You can make the text field binary by specifying 'USAGE COMP' for that field. The field will then be loaded as a binary column in Spark. You can then use a Spark UDF to convert the binary field to a string based on the country code.
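A minimal sketch of such a UDF in Scala, assuming the country code is available as another column; the column names and the country-code-to-code-page mapping below are illustrative, and charset availability depends on the JVM:

import java.nio.charset.Charset
import org.apache.spark.sql.functions.{col, udf}

// Illustrative country-code-to-code-page mapping; adjust to the real data.
val codePages = Map("jpn" -> "cp300", "chn" -> "cp1388")

// Decodes the binary ('USAGE COMP') field using the code page implied by the country code.
val decodeByCountry = udf { (bytes: Array[Byte], countryCode: String) =>
  Option(bytes).flatMap { b =>
    codePages.get(countryCode).map(cp => new String(b, Charset.forName(cp)))
  }.orNull
}

// 'text_field' and 'country_code' are placeholder column names.
// val decoded = df.withColumn("text_decoded", decodeByCountry(col("text_field"), col("country_code")))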

gergelyts commented on June 17, 2024

Do you have example code for this? Thanks

gergelyts commented on June 17, 2024

Hi, your workaround does not work; I got the following error:

: scala.MatchError: AlphaNumeric(X(4026),4026,None,Some(EBCDIC),Some(X(4026))) (of class za.co.absa.cobrix.cobol.parser.ast.datatype.AlphaNumeric)

This is the copybook:

           20  I69-105-CLTX-TEXT             PIC X(4026) USAGE COMP.

yruslan commented on June 17, 2024

Hi,

You can find an example here:

// Example of raw binary data as a byte array
val data = Array(0x12.toByte, 0x34.toByte, 0x56.toByte, 0x78.toByte)

Please use the latest version of Cobrix (2.6.8), as this feature was introduced only recently.
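For reference, a minimal read sketch (paths are placeholders; the copybook is assumed to declare the field as PIC X(n) USAGE COMP so that Cobrix loads it as a binary column):

// Minimal sketch: with Cobrix 2.6.8+, a 'PIC X(n) USAGE COMP' field
// is loaded as a binary (Array[Byte]) column instead of a decoded string.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .load("/path/to/files")

df.printSchema()  // the COMP field should appear with the 'binary' type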

Beno922 commented on June 17, 2024

Hi,

I'm working with the OP (@gergelyts) on this. We've managed to get the field as binary; the problem now is that we need it as a (PySpark) string.

Could you give an example of how to do that (one that is not just a forced cast)?

Also, we think it would be easier to somehow set the output encoding (we are thinking of UTF-8, to support the languages of all the countries).

Thanks for the help in advance!

yruslan commented on June 17, 2024

Hi,

Now the binary field needs to be decoded into Unicode text (probably UTF-8 encoded). But I realized that the decoders we have in Cobrix are not available from Python, only from Scala.

I'm going to think about a solution. At first glance, it might need direct support from Cobrix (a possible interim bridge is sketched after the questions below).

1. Are all strings in each record encoded the same, or only particular fields?
2. Can you give an example of some country code to code page mapping?
3. Which code pages can be encountered in your files? (This is to check whether Cobrix supports these code pages.)
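In the meantime, a possible interim bridge (not using Cobrix's decoders, only JVM charsets) is a Java-compatible UDF written in Scala and packaged into a jar, which PySpark can register by class name via spark.udf.registerJavaFunction. The package, class name, and code-page mapping below are illustrative assumptions:

package com.example

import java.nio.charset.Charset
import org.apache.spark.sql.api.java.UDF2

// Hypothetical helper (not part of Cobrix): decodes a binary column using a
// charset chosen by a country code column. With the jar on the Spark classpath,
// PySpark can register it via
//   spark.udf.registerJavaFunction("decode_by_country", "com.example.DecodeByCountry", StringType())
// and call it as a SQL function. Charset availability depends on the JVM.
class DecodeByCountry extends UDF2[Array[Byte], String, String] {
  // Assumed country-code-to-code-page mapping; adjust to the real data.
  private val codePages = Map("jpn" -> "cp300", "chn" -> "cp1388")

  override def call(bytes: Array[Byte], countryCode: String): String =
    if (bytes == null) null
    else codePages.get(countryCode)
      .map(cp => new String(bytes, Charset.forName(cp)))
      .orNull
}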

Beno922 commented on June 17, 2024

No. Out of 10 columns, 7 use the default encoding, and the remaining 3 are decided by the country code (single- and double-byte code pages).

In the source application the country code mappings look like this: ebcidic_codepage_mapping (1).txt

For the third question see also the prior reply.

These prior requests are related to this topic:
#574
#539

If you need more information regarding the source application, feel free to contact @BenceBenedek.

yruslan commented on June 17, 2024

Yes, I see. Thank you for the context!

So ideally, you would like a mapping like this:

[
  {
    "code_field": "country_code",
    "target_fields": [ "field1", "field2", "field3" ],
    "code_mapping": {
      "jpn": "cp300",
      "chn": "cp1388"
    }
  }
]

so that the encoding of field1, field2 and field3 is determined by the column country_code. When country_code = jpn, cp300 is used, right?

If multiple country code fields are defined for each record, the mapping can be split like this:

[
  {
    "code_field": "country_code1",
    "target_fields": [ "field1", "field2" ],
    "code_mapping": {
      "jpn": "cp300",
      "chn": "cp1388"
    }
  },
  {
    "code_field": "country_code2",
    "target_fields": [ "field3" ],
    "code_mapping": {
      "japan": "cp300",
      "china": "cp1388"
    }
  }
]

So ideally, you want to be able to pass such a mapping to Cobrix and have it figure things out, right?

yruslan commented on June 17, 2024

Now that I think about it, a workaround is possible even now, although it is not very efficient:

import org.apache.spark.sql.functions.col

// Read the data twice, once per code page, keep only the matching country,
// then union the results. The usual 'copybook' option is also required (omitted here).
val df1 = spark.read.format("cobol")
  .option("ebcdic_code_page", "cp037")
  .option("field_code_page:cp300", "field1")
  .load("/path/to/files")
  .filter(col("country_code") === "jpn")

val df2 = spark.read.format("cobol")
  .option("ebcdic_code_page", "cp037")
  .option("field_code_page:cp1388", "field1")
  .load("/path/to/files")
  .filter(col("country_code") === "chn")

val df = df1.union(df2)
