Comments (5)
Hi,
For the example above this might work:
```scala
spark.read.format("cobol")
  .option("copybook_contents", copybookContentsStr)
  .option("encoding", "ascii")
  .option("record_format", "F")        // Fixed-length records
  .option("record_length", "426")      // Record length encoded in the copybook
  .option("file_start_offset", "100")  // Skip the file header
  .option("file_end_offset", "100")    // Skip the file footer
  .load(pathToData)
```
But you also need a copybook for your record payload, which might look like:

```cobol
01 RECORD.
   10 FIELD1 PIC X(10).
   10 FIELD2 PIC X(15).
   ...
```

Remember that the first characters of each line of the copybook should be spaces.
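As an aside, the effect of the offset options on a fixed-length file can be sketched in plain Scala (no Spark and no Cobrix here; this is an illustration of what the options request, not how Cobrix is implemented):

```scala
object FixedSliceDemo {
  // Drop a fixed-size header and footer, then cut the remaining bytes
  // into fixed-length records. This mirrors what file_start_offset,
  // file_end_offset, and record_length ask the reader to do.
  def sliceFixed(bytes: Array[Byte], headerLen: Int, footerLen: Int, recordLen: Int): Seq[String] = {
    val body = bytes.slice(headerLen, bytes.length - footerLen)
    body.grouped(recordLen).map(new String(_, "ASCII")).toSeq
  }

  def main(args: Array[String]): Unit = {
    // Toy file: 10-byte header, three 5-byte records, 10-byte footer.
    val file = "HEADERxxxxAAAAABBBBBCCCCCFOOTERxxxx".getBytes("ASCII")
    println(sliceFixed(file, 10, 10, 5).mkString(", ")) // prints AAAAA, BBBBB, CCCCC
  }
}
```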
from cobrix.
I just got another requirement added to the above; if this is considered a new question, I can create a new one. There are scenarios where the record length can vary, and it is given by the first 4 bytes of each record. An example is below (there is no LF or CR character separating the records):
```
Header  -- size 100 bytes
0426xxxxmyrecorddataupto426bytes
0456xxxxxmyrecorddataupto456bytes
0435xxxxxmyrecorddataupto435bytes
Trailer -- size 100 bytes (optional)
```
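A layout like this can be walked manually in plain Scala (no Spark; an illustration only, assuming the header/footer are already stripped and that the 4-digit ASCII prefix holds the length of the whole record, prefix included):

```scala
object VarLenDemo {
  // Split a byte stream in which each record begins with a 4-digit
  // ASCII length prefix. Assumes the length covers the whole record,
  // including the 4-byte prefix itself.
  def splitRecords(bytes: Array[Byte]): Seq[String] = {
    val out = Seq.newBuilder[String]
    var pos = 0
    while (pos + 4 <= bytes.length) {
      val len = new String(bytes, pos, 4, "ASCII").toInt
      out += new String(bytes, pos, len, "ASCII")
      pos += len
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    // Two toy records: 10 bytes ("0010" + 6 data) and 8 bytes ("0008" + 4 data).
    val data = "0010ABCDEF0008WXYZ".getBytes("ASCII")
    println(splitRecords(data).mkString(" | ")) // prints 0010ABCDEF | 0008WXYZ
  }
}
```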
Does Cobrix support this kind of variable-length format? What options should I use for that?
With the new requirement the record format is now 'V', which means variable-length records. You can specify the field in the copybook that contains the length:
```scala
spark.read.format("cobol")
  .option("copybook_contents", copybookContentsStr)
  .option("encoding", "ascii")
  .option("record_format", "V")        // Variable-length records
  .option("record_length_field", "RECORD_LENGTH_FIELD")
  .option("file_start_offset", "100")  // Skip the file header
  .option("file_end_offset", "100")    // Skip the file footer
  .load(pathToData)
```
The copybook should define the first 4 bytes as a numeric field:
```cobol
01 RECORD.
   10 RECORD_LENGTH_FIELD PIC 9(4).
   10 FIELD1 PIC X(10).
   10 FIELD2 PIC X(15).
   ...
```
By default, the record length field is expected to contain the length of the full record payload. If the value in the field does not match the record length exactly, you can use an arithmetic expression, for instance:

```scala
.option("record_length_field", "RECORD_LENGTH_FIELD + 4")
```
Ok, thanks. In the copybook example you gave, the length field is defined by "RECORD_LENGTH_FIELD PIC 9(4)". The lengths of FIELD1 and FIELD2 will depend on the value in RECORD_LENGTH_FIELD, which can be different for every record. In that case, PIC X(10) and PIC X(15) may not be true all the time. My structure will be like this:
```cobol
01 RECORD.
   10 RECORD_LENGTH_FIELD PIC 9(4).
   10 BASE_SEGMENT PIC X(???). ** The size of this will come from RECORD_LENGTH_FIELD above.
```
How should I define the copybook contents for that use case? I used an arbitrary number in the PIC clause above and was able to process the file successfully using the Cobrix data source. The sample is https://github.com/jaysara/spark-cobol-jay/blob/main/src/main/java/com/test/cobol/FixedWidthApp.java
from cobrix.
Since each record type probably has a different schema, your data can be considered multisegment. In this case you can define a redefined group for each segment. So the copybook will look like this:
```cobol
01 RECORD.
   10 RECORD_LENGTH_FIELD PIC 9(4).
   15 SEGMENT1.
      20 SEG1_FIELD1 PIC X(15).
      20 SEG1_FIELD2 PIC X(10).
   15 SEGMENT2 REDEFINES SEGMENT1.
      20 SEG2_FIELD1 PIC X(5).
      20 SEG2_FIELD2 PIC X(11).
```
(note that SEGMENT2 redefines SEGMENT1)
You can also apply automatic segment filtering based on record length, like this: https://github.com/AbsaOSS/cobrix?tab=readme-ov-file#automatic-segment-redefines-filtering
You can use the record length field as the segment id.
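For intuition, REDEFINES simply means two layouts over the same payload bytes. A plain-Scala sketch (hypothetical field widths matching the copybook above; not Cobrix code):

```scala
object RedefineDemo {
  // Two interpretations of the same payload bytes, matching the
  // REDEFINES group above: SEGMENT1 = X(15) + X(10), SEGMENT2 = X(5) + X(11).
  case class Seg1(field1: String, field2: String)
  case class Seg2(field1: String, field2: String)

  def asSegment1(b: Array[Byte]): Seg1 =
    Seg1(new String(b, 0, 15, "ASCII"), new String(b, 15, 10, "ASCII"))

  def asSegment2(b: Array[Byte]): Seg2 =
    Seg2(new String(b, 0, 5, "ASCII"), new String(b, 5, 11, "ASCII"))

  def main(args: Array[String]): Unit = {
    val payload = "123456789012345ABCDEFGHIJ".getBytes("ASCII") // 25 bytes
    println(asSegment1(payload)) // prints Seg1(123456789012345,ABCDEFGHIJ)
    println(asSegment2(payload)) // prints Seg2(12345,6789012345A)
  }
}
```

In Cobrix itself you would not write such code; which redefined group applies to a record is driven by the segment filtering options described at the README link above.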