Comments (12)
Thanks yruslan. Will test and get back to you. In the meantime, I manually added the RDW, which is 4 bytes, to the original mainframe file (i.e. two RDWs). So my original VB file was 100 bytes (4 bytes of RDW and 96 bytes of data). I recreated it as 104 bytes (4 bytes RDW, 4 bytes RDW, and 96 bytes of data). Now when I SFTP, I get 100 bytes of data along with the RDW, with the original RDW being dropped. My RDW has hex values "01AB0000", which translates to 427, and hex "00B00000", which translates to 176. I am able to read the file without the segment option with the code below:
val cobolDataframe = spark
.read
.format("cobol")
.option("copybook", "data/test1_copybook.cob")
.option("record_format", "V")
.option("is_rdw_big_endian", "true")
.option("rdw_adjustment", -4)
.option("variable_size_occurs", "true")
.load("data/test2_data")
I am not sure though why record_format V instead of VB and rdw_adjustment -4 worked.
from cobrix.
Hi @pinakigit
.option("record_format", "V")
is only for files that have RDW headers for each record
.option("record_format", "VB")
is only for files that have BDW for record blocks, and RDW for each record.
Do I understand it correctly that records have variable size, but there is no numeric field that specifies the record size?
If PEN-REG-CODE='N' the size of the record is one fixed value, but when PEN-REG-CODE='Y' the size is different?
Hi @yruslan
Yes. If PEN-REG-CODE='N' the size of the record is one, which is kind of a fixed record length in the VB file, but when PEN-REG-CODE='Y' the size is variable depending on the OCCURS clause.
The file is a single file which has 2 different types of records: one with PEN-REG-CODE='N' and the other with PEN-REG-CODE='Y', which has an OCCURS DEPENDING clause.
The file on the mainframe is VB, but I am not sure whether the BDW and RDW are retained after FTP from the mainframe. Also I am not sure how to check the RDW and BDW values on the mainframe as well as in ADLS.
I am able to read the records without .option("record_format", "V") or .option("record_format", "VB") with .option("PKLR1-PAR-PEN-REG-CODE", "Y") or .option("PKLR1-PAR-PEN-REG-CODE", "N"), but it shows the records only until a record of the other type appears.
If I put .option("record_format", "V") it doesn't return any records. If I put .option("record_format", "VB") it throws the error below.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 255.0 failed 4 times, most recent failure: Lost task 0.3 in stage 255.0 (TID 2441) (10.189.204.152 executor 16): java.lang.IllegalStateException: BDW headers contain non-zero values where zeros are expected (check 'rdw_big_endian' flag. Header: 64,193,194,64, offset: 0.
Files with RDW headers have the record length as a binary field in the first 4 bytes of each record. Also, either the first 2 bytes are zeros or the last 2 bytes are zeros. You can check if your file has RDWs by looking at the first 4 bytes.
Here are more details on record formats: https://www.ibm.com/docs/en/zos/2.3.0?topic=files-selecting-record-formats-non-vsam-data-sets
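To illustrate the check above, here is a small Python sketch (illustrative only, not Cobrix code) that inspects the first 4 bytes for an RDW pattern; the sample values are the ones quoted earlier in this thread:

```python
def rdw_length(header: bytes):
    """Return the record length if a 4-byte header looks like an RDW, else None.
    An RDW stores the length in 2 bytes; the other 2 bytes must be zeros."""
    if len(header) != 4:
        return None
    if header[2:] == b"\x00\x00":          # big-endian RDW: length in first 2 bytes
        return int.from_bytes(header[:2], "big")
    if header[:2] == b"\x00\x00":          # little-endian RDW: length in last 2 bytes
        return int.from_bytes(header[2:], "little")
    return None

print(rdw_length(bytes.fromhex("01AB0000")))  # 427 (big-endian)
print(rdw_length(bytes.fromhex("00B00000")))  # 176

# The bytes from the BDW error above (64,193,194,64) match neither pattern;
# they decode as EBCDIC text " AB ", i.e. record data rather than a header.
print(rdw_length(bytes([64, 193, 194, 64])))  # None
print(bytes([64, 193, 194, 64]).decode("cp037"))
```

Note also the rdw_adjustment of -4 used earlier in the thread: depending on how the file was produced, the stored length may or may not include the 4 header bytes themselves.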
In the meantime, I'm going to implement an option that allows mapping between a field and record size.
Hi yruslan,
How will the option that allows mapping between a field and record size help us? I hope it's related to Query 1. Did you have a chance to look at Query 2, i.e. flattening the OCCURS clause values? That is, if the field is OCCURS 1 TO 5 TIMES
DEPENDING ON PKLR1-OUT-NO-OF-LOC and PKLR1-OUT-NO-OF-LOC has value 2, then have values for 2 occurrences and have the other 3 occurrences as NULL.
Hi @pinakigit ,
For the Query 2, please see #668
It is a similar issue and was fixed using a Cobrix option. Let me know if it works for you as well.
Query 3: No matter which filesystem the file comes from, if you are using 'V' or 'VB' you need to ensure the headers are in place. Otherwise these record formats can't be used, and you need to use some other record format and parsing options instead.
Thanks yruslan. I have rephrased my question in #668. We have already implemented the solution provided there, but we need some additional capability. Let me know if the comments there are clear; otherwise I will create some data and provide examples. Basically we want to fit that record into an RDBMS-like layout without splitting one record into multiple records.
And I hope the option that allows mapping between a field and record size will help us resolve the main issue reported here. In the meantime I will do more research on RDW. I have to somehow read the binary file in Unix to see if it has the RDW that is present on the MF.
You can flatten arrays in the output dataframe using one of these options:
- Cobrix flatten utility method SparkUtils.flattenSchema(df) (Scala only)
- Spark's array explode function (https://spark.apache.org/docs/latest/api/sql/index.html#explode). This has different semantics though.
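To make the semantic difference concrete, here is a plain-Python sketch (no Spark; field names are made up for illustration) of the two output shapes: flattening keeps one row per record and pads missing occurrences with None, while explode emits one row per array element:

```python
# Two records, each with a variable-length OCCURS-style array (max 3 here).
records = [{"id": 1, "items": [10, 20]}, {"id": 2, "items": [30]}]
MAX_OCCURS = 3

# flattenSchema-style: one row per record, one column per occurrence, None-padded
flattened = [
    {"id": r["id"],
     **{f"items_{i}": (r["items"][i] if i < len(r["items"]) else None)
        for i in range(MAX_OCCURS)}}
    for r in records
]

# explode-style: one row per array element (the row count changes)
exploded = [{"id": r["id"], "item": v} for r in records for v in r["items"]]

print(len(flattened), len(exploded))  # 2 rows vs 3 rows
```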
Thanks yruslan. It worked like a charm. Coming to the original question, our FTP from the mainframe is dropping the RDW and BDW, and probably that's the reason I am not able to use the VB option. Is there a way to FTP from the mainframe retaining the RDW? I tried LOCSITE RDW but it's not working.
I'm glad variable OCCURS worked for you. Regarding retention of RDWs, it all depends on the tools used to load files from mainframes. I can't advise you any particular tool, unfortunately.
But the record length field to size mapping that is being developed should help you even if you don't have RDW headers in mainframe files.
The mapping between record length field values and record sizes is now merged to the master branch: #674
Please let me know if it works.
Example:
val df = spark.read
.format("cobol")
.option("copybook_contents", copybook)
.option("record_format", "F")
.option("record_length_field", "SEG-ID")
.option("record_length_map", """{"A":4,"B":7,"C":8}""") // <---- this
.load(tempFile)
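For intuition only, here is a hypothetical Python sketch of the kind of splitting such a mapping enables: the first byte of each record plays the role of the length field, and the map determines each record's size (field name and sizes invented for illustration, not Cobrix internals):

```python
def split_records(data: bytes, length_map: dict) -> list:
    """Split a headerless byte stream into records, where the first byte of
    each record (the 'SEG-ID'-like field) determines that record's length."""
    records, pos = [], 0
    while pos < len(data):
        seg_id = data[pos:pos + 1]
        size = length_map[seg_id]          # e.g. {b"A": 4, b"B": 7, b"C": 8}
        records.append(data[pos:pos + size])
        pos += size
    return records

stream = b"A123B123456C1234567"
print(split_records(stream, {b"A": 4, b"B": 7, b"C": 8}))
# [b'A123', b'B123456', b'C1234567']
```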
The record length field value to record size mapping is available in spark-cobol:2.7.0