Comments (9)
Currently this is not supported, and I'm not sure at this point how to add this support in a general way.
But there is a possible workaround. You can make the text field binary by adding 'USAGE COMP' to that field, so it is converted to Spark's binary format. Then you can use a Spark UDF to convert the binary field to a string based on the country code.
from cobrix.
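The decoding step of the workaround above can be sketched in plain Python; the same logic would go inside a PySpark UDF. The country codes and the cp037/cp500 code pages below are placeholders (standard Python has no cp300 or cp1388 codecs), so this is only a sketch of the per-country decoding, not Cobrix's API:

```python
import codecs

# Hypothetical mapping from country code to a Python codec name.
# cp037 (US EBCDIC) and cp500 (international EBCDIC) are stand-ins
# for whatever code pages the real data uses.
COUNTRY_TO_CODEC = {
    "usa": "cp037",
    "int": "cp500",
}

def decode_text(raw: bytes, country_code: str, default: str = "cp037") -> str:
    """Decode an EBCDIC binary field using the code page chosen by the record's country code."""
    codec = COUNTRY_TO_CODEC.get(country_code, default)
    return codecs.decode(raw, codec)

# Round trip: "HELLO" encoded in cp037, decoded back via the country code
raw = "HELLO".encode("cp037")
print(decode_text(raw, "usa"))  # HELLO
```

In PySpark this function could be wrapped with `pyspark.sql.functions.udf` and applied to the binary column together with the country-code column.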
Do you have example code for this? Thanks
Hi, your workaround does not work; I got the following error:
: scala.MatchError: AlphaNumeric(X(4026),4026,None,Some(EBCDIC),Some(X(4026))) (of class za.co.absa.cobrix.cobol.parser.ast.datatype.AlphaNumeric)
This is the copybook:
20 I69-105-CLTX-TEXT PIC X(4026) USAGE COMP.
Hi,
you can find an example here:
Please, use the latest version of Cobrix (2.6.8) as this feature was introduced only recently.
Hi,
I'm working with the OP (@gergelyts) on this. We've managed to get the field to binary; our problem now is that we need it as a (PySpark) string.
Could you give an example of how to do that (other than a forced cast)?
Also, we think it would be easier to somehow set the encoding (we are thinking of UTF-8 to support the languages of all the countries).
Thanks for the help in advance!
Hi,
Now the binary field needs to be decoded into Unicode text (probably UTF-8 encoded). But I realized that the decoders we have in Cobrix are not available from Python, only from Scala.
I'm going to think about a solution. But at first glance, it might need direct support from Cobrix.
Are all strings in each record encoded the same, or only particular fields?
Can you give an example of some country code to code page mapping?
What is the list of code pages that can be encountered in your files? (This is to check whether Cobrix supports these code pages.)
No; out of 10 columns, 7 use the default encoding, and the remaining 3 are decided by the country code (1- and 2-byte code pages).
In the source application the country code mappings look like this: (ebcidic_codepage_mapping (1).txt)
For the third question see also the prior reply.
These prior requests are related to this topic:
#574
#539
If you need more information regarding the source application, feel free to contact @BenceBenedek
Yes, I see. Thank you for the context!
So ideally, you would like a mapping like this:
[
  {
    "code_field": "country_code",
    "target_fields": ["field1", "field2", "field3"],
    "code_mapping": {
      "jpn": "cp300",
      "chn": "cp1388"
    }
  }
]
so that the encoding of field1, field2 and field3 is determined by the column country_code. When country_code=jpn, use cp300, right?
If multiple country code fields are defined for each record, it can be split like this:
[
  {
    "code_field": "country_code1",
    "target_fields": ["field1", "field2"],
    "code_mapping": {
      "jpn": "cp300",
      "chn": "cp1388"
    }
  },
  {
    "code_field": "country_code2",
    "target_fields": ["field3"],
    "code_mapping": {
      "japan": "cp300",
      "china": "cp1388"
    }
  }
]
So ideally, you want to be able to pass such a mapping to Cobrix, and it should figure things out, right?
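The semantics of such a mapping can be sketched in plain Python. This is not Cobrix code: the cp037/cp500 codecs stand in for cp300/cp1388 (which standard Python does not ship), and the record is a plain dict rather than a Spark row:

```python
import codecs
import json

# The mapping format proposed above, with stand-in codecs that
# plain Python can actually decode.
mapping = json.loads("""
[
  {
    "code_field": "country_code",
    "target_fields": ["field1", "field2"],
    "code_mapping": {"usa": "cp037", "int": "cp500"}
  }
]
""")

def apply_mapping(record: dict, mapping: list) -> dict:
    """Decode each target field using the code page selected by its code field."""
    out = dict(record)
    for rule in mapping:
        codec = rule["code_mapping"].get(record[rule["code_field"]])
        if codec is None:
            continue  # unknown country code: leave the binary fields as-is
        for field in rule["target_fields"]:
            out[field] = codecs.decode(record[field], codec)
    return out

record = {
    "country_code": "usa",
    "field1": "HELLO".encode("cp037"),
    "field2": "WORLD".encode("cp037"),
}
print(apply_mapping(record, mapping))
```

Each rule picks a codec from the record's code field and decodes only the listed target fields; all other columns pass through unchanged.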
Now that I am thinking about it, a workaround is possible even now, but it is not very efficient, since the data is read once per code page:
val df1 = spark.read.format("cobol")
  .option("ebcdic_code_page", "cp037")
  .option("field_code_page:cp300", "field1")
  .load("/path/to/files")
  .filter(col("country_code") === "jpn")

val df2 = spark.read.format("cobol")
  .option("ebcdic_code_page", "cp037")
  .option("field_code_page:cp1388", "field1")
  .load("/path/to/files")
  .filter(col("country_code") === "chn")

val df = df1.union(df2)