Comments (5)
Hi @D3v3sh5ingh, what is the high-level offset layout of your file?
For example:
0 - 19 Headers (to be ignored)
20 - 23 BDW
24 - 27 RDW
28 - 99 Payload
100 - 193 RDW
...
32000 Payload
32093 Footer (to be ignored)
from cobrix.
Hi @yruslan
My high-level layout looks like this:
BDW { RDW 45 bytes , RDW 1000 bytes, RDW 1000 bytes , RDW 1000 bytes ....}
BDW { RDW 1000 bytes .....}
......
BDW { RDW 1000 bytes...., RDW 45 bytes}
The 45-byte header and trailer are inside the BDW blocks, as shown above.
We want to remove this 45-byte header and trailer from the file.
The options 'file_start_offset' and 'file_end_offset' work at the file level, e.g. for cases like:
HEADER {45 bytes} BDW { RDW 1000 bytes, RDW 1000 bytes, RDW 1000 bytes , RDW 1000 bytes ....}
Since your 45-byte header is part of the record payload, you can't remove it using these options. What you can do instead is add the header as a redefined segment in your copybook, and then filter it out after you get the dataframe.
The copybook will look like this:
01 RECORD.
   05 HEADER.
      10 CONTENT PIC X(45).
   05 PAYLOAD REDEFINES HEADER.
      ... your payload goes at level 10 here
Hi,
This is a sample output for my file. The 45 bytes that I want to skip are only at the start and at the end of the file, not in each record.
If I don't use file_start_offset and file_end_offset, I am able to get the above dataframe as output, but I get two extra records (header and trailer).
But if I set these options to 45 bytes, I get an error (length of BDW block is too big).
The options 'file_start_offset' and 'file_end_offset' only drop bytes from the very beginning or the very end of a file, not from inside the payload. This is the expected behavior.
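To illustrate why skipping 45 bytes fails when the header sits inside a BDW block, here is a minimal, self-contained Python sketch. It is not Cobrix code. It assumes the common layout in which both BDW and RDW are 4 bytes long: a 2-byte big-endian length that includes the 4-byte descriptor itself, followed by 2 zero bytes. The helper names (`rdw_record`, `bdw_block`, `read_records`) and the error text are hypothetical, and real files may use a different descriptor encoding.

```python
import struct
from typing import List

HEADER_LEN = 45    # size of the special header/trailer records in this thread
RECORD_LEN = 1000  # size of the regular records in this thread


def rdw_record(payload: bytes) -> bytes:
    # Assumed RDW: 2-byte big-endian length (including the RDW) + 2 zero bytes.
    return struct.pack(">HH", len(payload) + 4, 0) + payload


def bdw_block(payloads: List[bytes]) -> bytes:
    # Assumed BDW: 2-byte big-endian block length (including the BDW) + 2 zero bytes.
    body = b"".join(rdw_record(p) for p in payloads)
    return struct.pack(">HH", len(body) + 4, 0) + body


def read_records(data: bytes, start: int = 0) -> List[bytes]:
    """Walk the BDW blocks beginning at `start` and return every record payload."""
    records = []
    pos = start
    while pos + 4 <= len(data):
        block_len, _ = struct.unpack_from(">HH", data, pos)
        block_end = pos + block_len
        if block_len < 4 or block_end > len(data):
            raise ValueError(f"BDW length {block_len} at offset {pos} is not valid")
        pos += 4
        while pos + 4 <= block_end:
            rec_len, _ = struct.unpack_from(">HH", data, pos)
            if rec_len < 4 or pos + rec_len > block_end:
                raise ValueError(f"RDW length {rec_len} at offset {pos} is not valid")
            records.append(data[pos + 4:pos + rec_len])
            pos += rec_len
        pos = block_end
    return records


# Same shape as the layout in this thread: header and trailer live inside blocks.
data = (
    bdw_block([b"H" * HEADER_LEN, b"1" * RECORD_LEN, b"2" * RECORD_LEN])
    + bdw_block([b"3" * RECORD_LEN, b"T" * HEADER_LEN])
)

# Reading from offset 0 yields all five records, header and trailer included.
lengths = [len(r) for r in read_records(data)]

# Skipping 45 bytes lands inside the header record, so payload bytes
# are interpreted as a BDW and the block length check fails.
try:
    read_records(data, start=HEADER_LEN)
    skip_error = None
except ValueError as exc:
    skip_error = str(exc)
```

In this sketch the file itself starts with a BDW, so a start offset of 45 points into the middle of the first record rather than at the next descriptor, and the two bytes read there decode to a block length far larger than the file.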
There are no options that allow dropping bytes from inside records, so possible solutions are:
- If you need to keep these special 45-byte records, you can use the modified copybook solution above.
- (probably your case) If you want to ignore these special 45-byte records, just remove these records in post-processing, e.g.
df.filter(col("COL1").isNotNull)
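The post-processing idea can also be sketched outside Spark in plain Python. This is only an analogy for the `isNotNull` filter above, under the assumption that a field located past byte 45 decodes to an empty value (`None` here) for the short 45-byte records. The field position (`COL1_OFFSET`, `COL1_LENGTH`) and the helper names are hypothetical, and ASCII is used instead of EBCDIC to keep the sketch short.

```python
from typing import List, Optional

COL1_OFFSET = 100  # hypothetical position of COL1, chosen to lie past byte 45
COL1_LENGTH = 8    # hypothetical field length


def decode_col1(record: bytes) -> Optional[str]:
    """Return COL1 as text, or None when the record is too short to contain it."""
    if len(record) < COL1_OFFSET + COL1_LENGTH:
        return None
    return record[COL1_OFFSET:COL1_OFFSET + COL1_LENGTH].decode("ascii")


def drop_special_records(records: List[bytes]) -> List[bytes]:
    # Keep only records where COL1 is present, mirroring a not-null filter.
    return [r for r in records if decode_col1(r) is not None]


# Header, two regular records, trailer (sizes taken from this thread).
records = [b"H" * 45, b"A" * 1000, b"B" * 1000, b"T" * 45]
kept = drop_special_records(records)
```

The choice of column matters: a field that lies within the first 45 bytes would be populated for the header and trailer records too, so it would not separate them from the regular records.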