Comments (42)
The first 5 characters of each line are considered comments and are ignored. At line 16 you have 3 <tab>
characters instead of 5 spaces. That's the reason for the error.
from cobrix.
Thank you. Can you please answer my other questions as well? I am working on this and am stuck; I just started working on these files. Any tips on how the copybook files need to be cleaned would help me a lot.
Thanks a lot
Appreciate your cooperation
from cobrix.
When I parsed another copybook file, the schema was parsed, but while parsing the data file I got the following error. Please let me know how I can resolve this issue in the data file.
Exception in thread "main" java.lang.IllegalArgumentException: There are some files in /user/abc_binary that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (3835 bytes per record). Check the logs for the names of the files.
at za.co.absa.cobrix.spark.cobol.source.scanners.CobolScanners$.buildScanForFixedLength(CobolScanners.scala:87)
at za.co.absa.cobrix.spark.cobol.source.CobolRelation.buildScan(CobolRelation.scala:85)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:348)
from cobrix.
This error happens when you try to load a fixed record length file where the record length does not divide the file size. In your case, the file size should be evenly divisible by 3835.
This can happen:
- when the copybook does not completely match the data file
- when the file is a multisegment variable record length file
If the file is a multisegment variable record length one, you need to add .option("is_record_sequence", "true"). In this case, the parser will expect a 4-byte RDW header for each record. The fields for that header should not be present in the copybook itself.
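For example, a minimal read using this option might look like the sketch below (assuming a SparkSession named spark; the paths are placeholders, not from this thread):

// A minimal sketch, not a definitive invocation; paths are hypothetical
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cob") // hypothetical copybook location
  .option("is_record_sequence", "true")        // expect a 4-byte RDW header per record
  .load("/path/to/data")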
from cobrix.
Hi,
Thanks a lot for your suggestion.
Please let me know how I can check the log files.
I have added .option("is_record_sequence", "true"), and with this jar I tried to execute a different file. I got the error below:
ERROR FileUtils$: File hdfs://xyz/abc IS NOT divisible by 17163.
Exception in thread "main" java.lang.IllegalArgumentException: There are some files in /user/vabc/binaryfile that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (17163 bytes per record). Check the logs for the names of the files.
at za.co.absa.cobrix.spark.cobol.source.scanners.CobolScanners$.buildScanForFixedLength(CobolScanners.scala:87)
at za.co.absa.cobrix.spark.cobol.source.CobolRelation.buildScan(CobolRelation.scala:85)
from cobrix.
Could you please send the snippet of code you use for reading the file, e.g. the line that starts with spark.read(...)?
from cobrix.
Hi, below is the .scala class code I am using for parsing the mainframe copybook and data file. Please suggest what changes I need to make in the code, the copybook, or the binary file to parse this correctly.
Thanks a lot for checking my issues and helping me to parse the mainframe file.
package com.cobrix

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import za.co.absa.cobrix.spark.cobol.source
import za.co.absa.cobrix.spark.cobol._
import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.spark.cobol.schema.{CobolSchema, SchemaRetentionPolicy}
import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

object cobrixtest extends Serializable {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("cobrixtest")
    val v_copybook = args(1)
    val v_data = args(0)
    println(v_copybook)

    val spark: SparkSession = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()
    import spark.implicits._

    val cobolDataframe = spark
      .read
      .format("cobol")
      .option("generate_record_id", false)                // adds the file id and record id columns
      .option("is_record_sequence", "true")               // use 4-byte record headers to extract records from a mainframe file
      .option("schema_retention_policy", "collapse_root") // removes the root record header
      .option("copybook", v_copybook)
      .load(v_data)

    cobolDataframe.printSchema()
    cobolDataframe.show(300, false)
  }
}
from cobrix.
Interesting. This error should not happen on variable record length files. Which version of Cobrix are you using?
from cobrix.
I am using Cobrix version 0.4.2 with Scala 2.11.8 and Spark 2.1.1. This is what I have; I can use a lower version of Scala.
Please let me know which version of Cobrix needs to be used, and also which versions of Scala and Spark.
from cobrix.
From my perspective everything looks good: the program, and the versions of Cobrix, Spark, and Scala. The strange thing is that the error message you are getting occurs only when reading fixed record length files, but .option("is_record_sequence", "true") should turn on the variable record length reader, which doesn't throw that particular error.
Is it possible to get an example data file and a copybook so we can reproduce the error on our side?
from cobrix.
I have placed the jar file on the worker node, and the copybook and binary files are in HDFS. Is this correct? Please confirm.
I am getting the errors below. To address them I added the line
000550 05 FILLER PIC X(04). AMTR010
in the copybook, but the same error keeps coming:
java.lang.IllegalStateException: RDW headers should never be zero (0,100,0,0). Found zero size record at 0.
at za.co.absa.cobrix.cobol.parser.decoders.BinaryUtils$.extractRdwRecordSize(BinaryUtils.scala:305)
at za.co.absa.cobrix.spark.cobol.reader.index.IndexGenerator$.getNextRecordSize(IndexGenerator.scala:136)
at za.co.absa.cobrix.spark.cobol.reader.index.IndexGenerator$.sparseIndexGenerator(IndexGenerator.scala:58)
from cobrix.
Can you please let me know how to fix the above issue? I am stuck here.
from cobrix.
For the 4-byte RDW header there should not be an entry in the copybook. So please remove the FILLER.
But from the values of the RDW header (0, 100, 0, 0), it looks possible that your RDW headers are big-endian. To load files that have big-endian RDWs, use this option: .option("is_rdw_big_endian", "true")
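Combined with the record sequence option from earlier, a sketch of the read would be (v_copybook and v_data are the path variables from your program):

// A minimal sketch; both options together, not a definitive invocation
val df = spark.read
  .format("cobol")
  .option("copybook", v_copybook)
  .option("is_record_sequence", "true")  // records are prefixed with 4-byte RDW headers
  .option("is_rdw_big_endian", "true")   // interpret the RDW length bytes as big-endian
  .load(v_data)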
from cobrix.
Thank you so much. I tried what you recommended:
I removed the FILLER from the copybook.
I added .option("is_rdw_big_endian", "true") and ran it.
The same error appears again.
Are there any options left for me to try for parsing my data file?
If I am able to parse these files, it will help me a lot.
from cobrix.
I would like to help, but unfortunately mainframe data files vary a lot. In order to parse a mainframe file we need to understand how records are placed in the data, which headers the data file has, be sure that the copybook properly matches the data file, etc.
We have tried different combinations of options, and I'm out of suggestions that can just be tried and checked. If you have a small example of a similar file and the corresponding copybook, we could look at it and try to figure out what is needed to parse it properly.
from cobrix.
Thank you so much for your time. According to my company's policies I cannot share the data, and I am not a mainframe guy, so I cannot generate sample data myself.
from cobrix.
How can I run the unit tests of za.co.absa.cobrix.spark.cobol with the data files in the data folder? Can they run locally, or do we need to move all the files to HDFS and then run them? If HDFS is needed, how do we run the tests there? Please also help me understand how to check the log files.
from cobrix.
All unit tests can be run using mvn test or mvn clean test at the project's root directory. This will run everything in local mode; there is no need to copy files to HDFS.
from cobrix.
Hi, I am getting the error below when packaging during the Maven lifecycle. Please let me know how to resolve this:
01:26:47.163 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
01:26:47.523 ERROR org.apache.hadoop.util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
at org.apache.hadoop.util.Shell.(Shell.java:386)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.(Groups.java:93)
at org.apache.hadoop.security.Groups.(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2427)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2427)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
at org.apache.spark.SparkContext.(SparkContext.scala:295)
from cobrix.
This is a known issue when running Spark on Windows:
https://stackoverflow.com/a/39525952/1038282
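The usual workaround from that answer is to download winutils.exe and point Hadoop at its location before the Spark session is created; for example (the C:\hadoop path is just an assumption, use wherever you placed it):

// Assuming winutils.exe has been placed at C:\hadoop\bin\winutils.exe
System.setProperty("hadoop.home.dir", "C:\\hadoop")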
from cobrix.
Thank you for your reply.
from cobrix.
I sent the files to your email. Could you do me the favor of checking them and letting me know what needs fixing, or what I should do to make the files parse?
from cobrix.
Received the files, will take a look. It will take some time, and I will likely get back to you with more questions. So far the copybook seems quite complex and the record structure of the file is not obvious. No guarantees I'll be able to figure it out.
What could also be very helpful is if you could get the first couple of records of this file in a parsed format, like CSV. It would make it easier for me to figure out where one record ends and the next begins.
from cobrix.
Hi
I have a quick question. Does Cobrix support nested OCCURS? If yes, how many levels does it support?
from cobrix.
Yes, this should be supported with an arbitrary number of levels. We haven't specifically tested this scenario, but the code is generic enough to cover this case.
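For illustration, a copybook fragment with two levels of nesting could look like this (the field names are made up):

01  TRANSACTION-RECORD.
    05  OUTER-GRP OCCURS 3 TIMES.
        10  INNER-GRP OCCURS 5 TIMES.
            15  SOME-FIELD PIC X(2).

Each level of OCCURS should then appear as a nested array in the resulting Spark schema.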
from cobrix.
For a fixed width file, when I try to parse it I get the error below. Can you please help me fix this?
/* 495328 */
/* 495329 */ mutableRow.update(0, value);
/* 495330 */ }
/* 495331 */
/* 495332 */ return mutableRow;
/* 495333 */ }
/* 495334 */ }
org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection has grown past JVM limit of 0xFFFF
at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
at org.codehaus.janino.util.ClassFile.addConstantFieldrefInfo(ClassFile.java:342)
at org.codehaus.janino.UnitCompiler.writeConstantFieldrefInfo(UnitCompiler.java:11109)
from cobrix.
It looks like you are hitting a JVM limit at Spark's code generation stage. Try creating a Spark session with codegen turned off. Something like this:
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("Example")
  .config("spark.sql.codegen.wholeStage", value = false)
  .getOrCreate()
from cobrix.
Thanks a lot. Your suggestion worked, and fixed width files are parsing now. But I have another quick question: for a fixed width file, if I have a FILLER at the end of the copybook, it is unable to parse the data file; the data file has blanks at the end. Is there a fix for this? Please suggest how to overcome it.
from cobrix.
It is great to hear that the files are parsing now, at least partially. But I'm sorry, I'm not completely following what the issue is. Could you please describe it using a simplified example?
from cobrix.
Below is an example of a fixed width file having FILLERs at the end.
Our copybook ends like:
15 CUSTOM-STATUS PIC X(01).
15 FILLER PIC X(15).
15 FILLER PIC X(773).
and the data file has blanks at the end.
I am unable to parse this file as a fixed width file. The blanks at the end are not handled, and I get the NOT DIVISIBLE ..... error.
If the copybook does not have the FILLERs at the end, I am able to parse the file.
from cobrix.
Do I understand right that the file has 773 bytes at the end that should be ignored?
There is a feature planned to be introduced - file headers and footers. Using this new feature you will be able to specify how many bytes to ignore at the beginning and at the end of a file. Will let you know when this feature is available.
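(For reference, later Cobrix releases expose this as the file_start_offset and file_end_offset options; a sketch, assuming a release that includes them:

// Skip a fixed number of bytes at the start/end of each file
.option("file_start_offset", "0")
.option("file_end_offset", "773")
)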
from cobrix.
Hi Ruslan,
I am trying to parse a copybook which has 4000 columns, together with its binary file.
I am again getting thousands of lines of generated code on the screen and the JVM error,
even after adding .config("spark.sql.codegen.wholeStage", value = false).
My Spark version is 2.1.1.
For me the only solution is to work with RDDs. Is there any other solution?
I read online that Spark 2.3 has a fix for this.
Please let me know what my options are.
Also I have another question: does Cobrix work with Spark 2.3 and Spark 2.4?
from cobrix.
Yes, newer versions of Spark handle wide dataframes (with thousands of columns) much better.
And yes, you can use Spark 2.3 and Spark 2.4, as long as you use the version built for Scala 2.11 (not Scala 2.12).
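For instance, with sbt the dependency might look like this (0.4.2 is the version mentioned earlier in this thread; %% resolves to the _2.11 artifact when scalaVersion is 2.11.x):

// build.sbt sketch; adjust the version to a current release as needed
scalaVersion := "2.11.12"
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "0.4.2"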
from cobrix.
Thank you. But is there any way to fix this issue while staying on Spark 2.1?
from cobrix.
Not sure. It depends on your exact use case. Handling wide dataframes definitely got better in 2.3, but the codegen error you got seems odd, so it is really hard to tell.
from cobrix.
The issue with 773 spaces is related to #87
from cobrix.
We renamed FILLER to something else and it worked. Thanks for your time.
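For anyone hitting the same problem, the rename might look like this (the replacement names are made up):

15 CUSTOM-STATUS PIC X(01).
15 TRAILING-PAD-1 PIC X(15).
15 TRAILING-PAD-2 PIC X(773).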
from cobrix.
Also, the issue with huge amounts of generated code scrolling in the output for copybooks with a large number of columns was solved when I used Spark 2.4. But I was looking for something for 2.1.1.
from cobrix.
I have packed decimal fields in the copybook with a picture like S9(X)V9(8). After parsing, the values for these kinds of fields come out as 0E-8 instead of 0. Is there any way we can fix this? Please advise.
from cobrix.
Short answer: 0 and 0E-8 are the same value. They are just displayed differently on the screen depending on the tool you use.
The picture S9(10)V9(8) converts to a Spark decimal(18,8) value (18 digits of precision in total, 8 of them after the decimal point). It is a fixed point decimal type. I presume that for Spark methods like df.show(), the scientific format is chosen so it would be clear to the viewer that the column has a decimal type.
What is your output format (Parquet, JSON, CSV, etc.)? Is the scientific notation present in the files themselves?
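To convince yourself that the two renderings are the same number, you can compare them directly; a small standalone sketch using java.math.BigDecimal (whose semantics Spark's decimal type follows):

// 0E-8 is just the scientific rendering of zero held at scale 8
val a = new java.math.BigDecimal("0E-8")
println(a)                                            // prints: 0E-8
println(a.toPlainString)                              // prints: 0.00000000
println(a.compareTo(java.math.BigDecimal.ZERO) == 0)  // prints: true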
from cobrix.
Thank you. I will check the files.
from cobrix.