
Comments (27)

yruslan commented on May 27, 2024

I think here is what you want. Just replace the paths:

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}
    import za.co.absa.cobrix.cobol.parser.CopybookParser
    import za.co.absa.cobrix.spark.cobol.utils.RowExtractors
    import za.co.absa.cobrix.cobol.parser.encoding.ASCII
    import za.co.absa.cobrix.spark.cobol.schema.SchemaRetentionPolicy

    val copybookPath = "data/edited1.cbl"
    val dataPath = "data/edited"

    val copybookContents = Files.readAllLines(Paths.get(copybookPath), StandardCharsets.ISO_8859_1).toArray.mkString("\n")

    val parsedCopybook = CopybookParser.parseTree(ASCII(), copybookContents, dropGroupFillers = false)

    val df = spark
      .read
      .format("za.co.absa.cobrix.spark.cobol.source")
      .option("copybook_contents", copybookContents)
      .option("encoding", "ascii")
      .option("schema_retention_policy", "collapse_root")
      .load(dataPath)

    val sparkSchema = df.schema

    val rddText = spark.sparkContext.textFile(dataPath)

    val rddRow = rddText.map( str => {
      RowExtractors.extractRecord(parsedCopybook.ast, str.getBytes(), 0, SchemaRetentionPolicy.CollapseRoot)
    })

    val dfOut = spark.createDataFrame(rddRow, sparkSchema)

    dfOut.printSchema()
    dfOut.show()

dfOut is your dataset.

yruslan commented on May 27, 2024

The support for ASCII has been added. You can try the following:

  1. Clone the branch: git clone -b feature/BTR-99-add-support-for-9-4-pics https://github.com/AbsaOSS/cobrix.git
  2. Build the snapshot version and install it into your local Maven repo: mvn clean install
  3. Change the Cobrix version of your project in pom.xml or your app: <spark.cobol.version>0.3.1-SNAPSHOT</spark.cobol.version>
  4. Use .option("encoding","ascii") when loading your file.

I've tried this:

    val df = spark
      .read
      .format("za.co.absa.cobrix.spark.cobol.source")
      .option("copybook", "data/edited1.cbl")
      .option("encoding", "ascii")
      .option("schema_retention_policy", "collapse_root")
      .load("data/edited")

    df.printSchema()
    df.show()

But it seems that the file you are trying to process is not a raw mainframe file. The file looks like it has already been processed. It looks like a text file where each line is in a fixed-width column format. Cobrix currently cannot process such files. The above app prints this:

+--------------------+----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------------+------------------------------+--------------------+----------------------+---------------------------+------------------------+
|     WORKPO_SORT_KEY|WORKPO_USAPOMTF_RECORD|         WS_FMT0_000|         WS_FMT1_S10|         WS_FMT2_S11|         WS_FMT3_S12|         WS_FMT6_S20|WORKPO_PRINCIPAL_VEND_NUM|WORKPO_PRINCIPAL_VEND_SUB_ACNT|WORKPO_WIMS_SUB_VEND|WORKPO_BROKER_VEND_NUM|WORKPO_BROKER_VEND_SUB_ACNT|WORKPO_SEND_BROKER_PO_SW|
+--------------------+----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------------+------------------------------+--------------------+----------------------+---------------------------+------------------------+
|[ACHFOUND, 079675...|  000M   8019721193...|[000, M, 80197211...|[000, M, 80197211...|[000, M, 80197211...|[000, M, 93, AC,,...|[000, M, 8019721,...|                         |                              |                    |                      |                           |                        |
|[EOR> ACHFOUN, D ...|  10000   S10    AC...|[00, S1, 0    ACH...|[00, S1, 0    ACH...|[00, S1, 0    ACH...|[00, S1,, , 87507...|[00, S1,, HF,, [7...|                         |                              |                    |                      |                           |                        |
|[<EOR> ACHFOU, ND...|  Y20000   S11    S...|[000, S, 11    Su...|[000, S, 11    Su...|[000, S, 11    Su...|[000, S,, GM,, ,,...|[000, S,, un,, [,...|                         |                           025|                8220|                026501|                           |                        |
|[<EOR> A, CHFOUND...|  65001Y30000   S12...|[01Y, 3000, 0   S...|[01Y, 3000, 0   S...|[01Y, 3000, 0   S...|[01Y, 3000,, , 3,...|[01Y, 3000,, , 0....|                         |                              |                    |                      |                           |                        |
|[<E, OR> ACHFOUND...|  0661365001Y60001 ...|[136, 5001, Y6000...|[136, 5001, Y6000...|[136, 5001, Y6000...|[136, 5001,, 0,, ...|[136, 5001,, ,, [...|                         |                              |                    |                      |                           |                        |
|[, <EOR> ACHF, OU...|  446320661365001Y6...|[320, 6613, 65001...|[320, 6613, 65001...|[320, 6613, 65001...|[320, 6613,, 20, ...|[320, 6613,, 00,,...|                         |                              |                    |                      |                           |                        |
...

This means that each record does not have a fixed size and the copybook does not exactly match the data.

The only way Cobrix can help you with this data is if you can get the data in a record sequence EBCDIC format instead of an ASCII text file.

yruslan commented on May 27, 2024

Yes, if you could get the original EBCDIC record sequence file, it should work.

I checked the exceptions. By default, Cobrix reads a file as a fixed-length record sequence file, that is, a file that contains a sequence of records, each having the same length. So the file size should be exactly the number of records multiplied by the record length. When this is not the case, EOF exceptions are raised.
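
For example, a quick sanity check you could run before loading the file (the record length here is just a placeholder; take the real record size from your copybook):

    import java.nio.file.{Files, Paths}

    val recordLength = 1010L  // placeholder; use the actual record size implied by your copybook
    val fileSize     = Files.size(Paths.get("data/edited"))

    // A fixed-length record file must be an exact multiple of the record length.
    if (fileSize % recordLength != 0)
      println(s"File size $fileSize is not a multiple of record length $recordLength: expect EOF errors")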

yruslan commented on May 27, 2024

Sure, use the RDD text reader:
spark.sparkContext.textFile(...)
You will get each line of the text file as an element of the RDD, so you can parse each line/record however you need.
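
For example, a minimal sketch (the substring positions below are placeholders; take the real offsets from the copybook):

    val rddText = spark.sparkContext.textFile("data/edited")

    val rddParsed = rddText.map { line =>
      val receiversId = line.substring(0, 10).trim   // placeholder offsets, not the real copybook positions
      val poNum       = line.substring(22, 44).trim
      (receiversId, poNum)
    }

    rddParsed.take(5).foreach(println)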

yruslan commented on May 27, 2024

The ASCII support has been added in 0.3.1.

yruslan commented on May 27, 2024

@touseefzaki, looks like a lot of work has been done :)

Yes, it is possible to combine the RDD's textFile approach with the COBOL parser to achieve what you need.

  • First, you can use CopybookParser.parseTree(copyBookContents) to get your schema's AST. From each leaf object you can get the binaryProperties attribute, which will give you the arguments for .substring(...) (see the sketch at the end of this comment).
  • The second approach is to do CopybookParser.parseTree(copyBookContents) and use RowExtractors.extractRecord(...) to extract a Row from each record. When you have an RDD[Row] and a schema you can convert it to a DataFrame.

The first approach might be easier for you to implement.
The second approach probably is cleaner. I'll help you with it. I'll post the solution once I'm done.
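
A minimal sketch of the first approach (the accessors on binaryProperties used here, offset and actualSize, are assumptions and may differ between Cobrix versions; some versions report these in bits rather than bytes):

    import za.co.absa.cobrix.cobol.parser.CopybookParser
    import za.co.absa.cobrix.cobol.parser.ast.{Group, Primitive, Statement}

    // Recursively collect (name, offset, length) for every leaf (Primitive) field of the copybook AST.
    def leafFields(stmt: Statement): Seq[(String, Int, Int)] = stmt match {
      case g: Group     => g.children.toSeq.flatMap(leafFields)
      case p: Primitive => Seq((p.name, p.binaryProperties.offset, p.binaryProperties.actualSize))
    }

    val copyBookContents = scala.io.Source.fromFile("data/edited1.cbl").mkString
    val parsedCopybook   = CopybookParser.parseTree(copyBookContents)
    val fields           = leafFields(parsedCopybook.ast)  // in some versions ast is a Seq of groups; flatMap over it instead

    // Cut each text line into named fields using offsets derived from the copybook.
    val rddText   = spark.sparkContext.textFile("data/edited")
    val rddFields = rddText.map { line =>
      fields.map { case (name, offset, len) => name -> line.substring(offset, offset + len) }
    }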

yruslan commented on May 27, 2024

I've looked at your input file again and I think what you'd like to do is possible, although it might be a bit involved.
So the field that contains '000M' or 'S12TLR' is what we call a 'segment id field'. Depending on the segment id ('000M', 'S10', 'S11', 'S12', etc.) the record can have a different schema, i.e. a different copybook should be applied for each record type. Looking at your input file, it seems obvious that different segments contain different fields.
Then you can extract each segment group into a separate dataframe/table, use autogenerated ids to join the tables, and then output XMLs using Databricks' spark-xml library (https://github.com/databricks/spark-xml).
An example of how to extract segments into separate dataframes with autogenerated segment ids is described in the 'Reading hierarchical data sets' section of README.md. A rough sketch is below.
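
(Option names follow this thread and the spark-xml documentation; the XML tag names and output paths are placeholders.)

    import spark.implicits._

    val dfAll = spark
      .read
      .format("za.co.absa.cobrix.spark.cobol.source")
      .option("copybook", "data/edited1.cbl")
      .option("encoding", "ascii")
      .option("generate_record_id", "true")               // File_Id / Record_Id columns for joining segments later
      .option("schema_retention_policy", "collapse_root")
      .load("data/edited")

    // One dataframe per segment type, selected on the segment id field.
    val dfS10 = dfAll.filter($"WS_FMT0_000.WS_FMT0_RECDID" === "S10").select($"Record_Id", $"WS_FMT1_S10")
    val dfS11 = dfAll.filter($"WS_FMT0_000.WS_FMT0_RECDID" === "S11").select($"Record_Id", $"WS_FMT2_S11")

    // Each segment dataframe can then be written with spark-xml.
    dfS10.write
      .format("com.databricks.spark.xml")
      .option("rowTag", "S10")
      .option("rootTag", "records")
      .save("data/output_s10_xml")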

touseefzaki commented on May 27, 2024

In my project there was a need to convert a copybook file into XML format. For that I used your CobolCopybookExample1 and CobolCopybookExample2.
With the example copybook and dataset I am able to run the program and it works perfectly.

But now, for my real-time need, I am trying to use Cobrix for the same purpose.

I have a flat file (input) like below:

ACHFOUND 079675882 1446320661365001Y00000 000M 8019721193 ACHFOUND 6613-144632 000875
ACHFOUND 079675882 1446320661365001Y10000 S10 ACHFOUND 875079675882 144632 11180524180525TEST1 21130 8019721193 ACHFOUND 1805241300000000000087500000000180524144632 BTALBAMA COMPANIES, LLC 9 0091372096500NATIONAL SERVICES CENTER P.O. BOX 29093 PHOENIX AZ85038 STALLA LLC A SUB. OF STALLA COMPANIES, LLC 9 0091372096613

and the schema of my COBOL copybook is:
root
|-- WORKPO_RECORD: struct (nullable = true)
| |-- WORKPO_SORT_KEY: struct (nullable = true)
| | |-- WORKPO_RECEIVERS_ID: string (nullable = true)
| | |-- WORKPO_RECEIVER_DUNS_PLUS_4: string (nullable = true)
| | |-- WORKPO_PO_NUM: string (nullable = true)
| | |-- WORKPO_COPY_IND: string (nullable = true)
| | |-- WORKPO_FACILITY: string (nullable = true)
| | |-- WORKPO_DIVISION: string (nullable = true)
| | |-- WORKPO_CORP: string (nullable = true)
| | |-- WORKPO_UPDATE_PO_SW: string (nullable = true)
| | |-- WORKPO_RECORD_TYPE: string (nullable = true)
| | |-- WORKPO_LINE_NUM: integer (nullable = true)
| |-- WORKPO_USAPOMTF_RECORD: string (nullable = true)
| |-- WS_FMT0_000: struct (nullable = true)
| | |-- WS_FMT0_RECDID: string (nullable = true)
| | |-- WS_FMT0_SEGEXT: string (nullable = true)
| | |-- WS_FMT0_USER_ID: string (nullable = true)
| | |-- WS_FMT0_PARTNER_ID: string (nullable = true)
| | |-- WS_FMT0_DOCREF_NUM: struct (nullable = true)
| | | |-- WS_FMT0_DOCREF_FACILITY: string (nullable = true)
| | | |-- WS_FMT0_DOCREF_PO_NUM: string (nullable = true)
| | |-- WS_FMT0_POS875: string (nullable = true)
| |-- WS_FMT1_S10: struct (nullable = true)
| | |-- WS_FMT1_RECDID: string (nullable = true)
| | |-- WS_FMT1_SEGEXT: string (nullable = true)
| | |-- WS_FMT1_EDI_VEND_ACNT_NUM: string (nullable = true)
| | |-- WS_FMT1_SET_IDENTIFIER: string (nullable = true)
| | |-- WS_FMT1_VEND_DUNS_AND_SUFX: string (nullable = true)
| | |-- WS_FMT1_PO_NUM_1: string (nullable = true)
| | |-- WS_FMT1_CPU_DATE_YYMMDD: integer (nullable = true)
| | |-- WS_FMT1_PROCESS_DATE_YYMMDD: integer (nullable = true)
| | |-- WS_FMT1_COMMUNICATIONS_ID: string (nullable = true)
| | |-- WS_FMT1_COMMUNICATIONS_PW: string (nullable = true)
| | |-- WS_FMT1_APPLICATION_SNDER_CD: string (nullable = true)
| | |-- WS_FMT1_PARTNER_ID: string (nullable = true)
| | |-- WS_FMT1_EDI_DATE_YYMMDD: integer (nullable = true)
| | |-- WS_FMT1_EDI_TIME_HHMM: integer (nullable = true)
| | |-- WS_FMT1_TRANS_SEQ_NUM: integer (nullable = true)
| | |-- WS_FMT1_EDI_CONTROL_NUM: integer (nullable = true)
| | |-- WS_FMT1_SET_ID_875: string (nullable = true)
| | |-- WS_FMT1_TRANS_SET_CNTL_NUM: integer (nullable = true)
| | |-- WS_FMT1_ORD_DATE_YYMMDD: struct (nullable = true)
| | | |-- WS_FMT1_ORD_DATE_YY: integer (nullable = true)
| | | |-- WS_FMT1_ORD_DATE_MM: integer (nullable = true)
| | | |-- WS_FMT1_ORD_DATE_DD: integer (nullable = true)
| | |-- WS_FMT1_PO_NUM_2: string (nullable = true)
| | |-- WS_FMT1_BILL_TO_ENTITY_CD: string (nullable = true)
| | |-- WS_FMT1_BT_OMCORP_CORP_NAME: string (nullable = true)
| | |-- WS_FMT1_BT_IDCD_QUALIFIER: string (nullable = true)
| | |-- WS_FMT1_BT_DUNS_PLUS_4: struct (nullable = true)
| | | |-- WS_FMT1_BT_DUNN_BRAD_ID: string (nullable = true)
| | | |-- WS_FMT1_BT_DIVISION: string (nullable = true)
| | | |-- WS_FMT1_BT_FILLER: string (nullable = true)
| | |-- WS_FMT1_BT_FAC_NAME: string (nullable = true)
| | |-- WS_FMT1_BT_STREET_ADDRESS: string (nullable = true)
| | |-- WS_FMT1_BT_CITY_NAME: string (nullable = true)
| | |-- WS_FMT1_BT_PROV_CD: string (nullable = true)
| | |-- WS_FMT1_BT_POSTAL_CD: string (nullable = true)
| | |-- WS_FMT1_SHIP_TO_ENTITY_CD: string (nullable = true)
| | |-- WS_FMT1_ST_OMCORP_CORP_NAME: string (nullable = true)
| | |-- WS_FMT1_ST_IDCD_QUALIFIER: string (nullable = true)
| | |-- WS_FMT1_ST_DUNS_PLUS_4: struct (nullable = true)
| | | |-- WS_FMT1_ST_DUNN_BRAD_ID: string (nullable = true)
| | | |-- WS_FMT1_ST_FACILITY: string (nullable = true)
| | |-- WS_FMT1_SHIP_WITH_PO: string (nullable = true)
| | |-- WS_FMT1_SHIP_DATE_YYMMDD: struct (nullable = true)
| | | |-- WS_FMT1_SHIP_DATE_YY: string (nullable = true)
| | | |-- WS_FMT1_SHIP_DATE_MM: string (nullable = true)
| | | |-- WS_FMT1_SHIP_DATE_DD: string (nullable = true)
| | |-- WS_FMT1_PO_PICKUP_NBR: string (nullable = true)
| |-- WS_FMT2_S11: struct (nullable = true)
| | |-- WS_FMT2_RECDID: string (nullable = true)
| | |-- WS_FMT2_SEGEXT: string (nullable = true)
| | |-- WS_FMT2_ST_FACILITY_NAME: string (nullable = true)
| | |-- WS_FMT2_ST_STREET_ADDRESS: string (nullable = true)
| | |-- WS_FMT2_ST_CITY_NAME: string (nullable = true)
| | |-- WS_FMT2_ST_PROV_CD: string (nullable = true)
| | |-- WS_FMT2_ST_POSTAL_CD: string (nullable = true)
| | |-- WS_FMT2_VN_ENTITY_ID_CD: string (nullable = true)
| | |-- WS_FMT2_VN_PRINCIPAL_NAME: string (nullable = true)
| | |-- WS_FMT2_VN_ID_CD_QUALIFIER: string (nullable = true)
| | |-- WS_FMT2_VN_DUNS_PLUS_FOUR: string (nullable = true)
| | |-- WS_FMT2_VN_BROKER_NAME: string (nullable = true)
| | |-- WS_FMT2_VN_STREET_ADDRESS: string (nullable = true)
| | |-- WS_FMT2_VN_PARSED_CITY: string (nullable = true)
| | |-- WS_FMT2_VN_PROV_CD: string (nullable = true)
| | |-- WS_FMT2_VN_POSTAL_CD: string (nullable = true)
| | |-- WS_FMT2_LOOP_ID: string (nullable = true)
| | |-- WS_FMT2_CONTACT_FUNC_CD: string (nullable = true)
| | |-- WS_FMT2_BUYER_NAME: string (nullable = true)
| | |-- WS_FMT2_BUYER_PHONE_NUM: struct (nullable = true)
| | | |-- WS_FMT2_PHONE_AREA_CD: string (nullable = true)
| | | |-- WS_FMT2_PHONE_EXCHANGE: string (nullable = true)
| | | |-- WS_FMT2_PHONE_ID: string (nullable = true)
| | |-- WS_FMT2_COM_NO_QUALIFIER: string (nullable = true)
| | |-- WS_FMT2_DUEDATE_QUALIFIER: string (nullable = true)
| | |-- WS_FMT2_DUE_DATE_YYMMDD: struct (nullable = true)
| | | |-- WS_FMT2_DUE_DATE_YY: string (nullable = true)
| | | |-- WS_FMT2_DUE_DATE_MM: string (nullable = true)
| | | |-- WS_FMT2_DUE_DATE_DD: string (nullable = true)
| | |-- WS_FMT2_1ST_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT2_MSG_1: string (nullable = true)
| | |-- WS_FMT2_MSG_2: string (nullable = true)
| | |-- WS_FMT2_2ND_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT2_MSG_3: string (nullable = true)
| | |-- WS_FMT2_MSG_4: string (nullable = true)
| | |-- WS_FMT2_CARRIER_NAME: string (nullable = true)
| |-- WS_FMT3_S12: struct (nullable = true)
| | |-- WS_FMT3_RECDID: string (nullable = true)
| | |-- WS_FMT3_SEGEXT: string (nullable = true)
| | |-- WS_FMT3_TOT_ORD: integer (nullable = true)
| | |-- WS_FMT3_TOT_ORD_UOM: string (nullable = true)
| | |-- WS_FMT3_TOT_WEIGHT: integer (nullable = true)
| | |-- WS_FMT3_TOT_WEIGHT_UOM: string (nullable = true)
| | |-- WS_FMT3_NUM_INCL_SEGMENTS_1: integer (nullable = true)
| | |-- WS_FMT3_TRANS_SET_CTRL_NO: integer (nullable = true)
| | |-- WS_FMT3_NUM_INCL_TRNS_SETS_1: integer (nullable = true)
| | |-- WS_FMT3_NUM_INCL_SEGMENTS_2: integer (nullable = true)
| | |-- WS_FMT3_DATA_INTRCHG_CTRL_NO: integer (nullable = true)
| | |-- WS_FMT3_TRANSMISSION_CTRL_NO: integer (nullable = true)
| | |-- WS_FMT3_NUM_INCL_FUNC_GROUPS: integer (nullable = true)
| | |-- WS_FMT3_NUM_INCL_TRNS_SETS_2: integer (nullable = true)
| | |-- WS_FMT3_NUM_INCL_SEGMENTS_3: integer (nullable = true)
| | |-- WS_FMT3_ORDER_STATUS_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_5: string (nullable = true)
| | |-- WS_FMT3_3RD_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_6: string (nullable = true)
| | |-- WS_FMT3_SHIP_METHOD_OF_PAY: string (nullable = true)
| | |-- WS_FMT3_MODE_OF_TRANSPORT: string (nullable = true)
| | |-- WS_FMT3_UCS_PALLET_XCHNGE_CD: string (nullable = true)
| | |-- WS_FMT3_UCS_UNIT_LOAD_OPT_CD: string (nullable = true)
| | |-- WS_FMT3_SPUE_FILE_VN_CONTACT: struct (nullable = true)
| | | |-- WS_FMT3_SPUE_AREA_CD: string (nullable = true)
| | | |-- WS_FMT3_SPUE_PHONE_NUM: string (nullable = true)
| | |-- WS_FMT2_4TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_7: string (nullable = true)
| | |-- WS_FMT2_5TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_8: string (nullable = true)
| | |-- WS_FMT2_6TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_9: string (nullable = true)
| | |-- WS_FMT2_7TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_10: string (nullable = true)
| | |-- WS_FMT2_8TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_S12_LUMPER_MSG_SW: string (nullable = true)
| | |-- WS_FMT2_9TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_S12_ONAS_MSG_SW: string (nullable = true)
| | |-- WS_FMT2_10TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_11: string (nullable = true)
| | |-- WS_FMT2_11TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_12: string (nullable = true)
| | |-- WS_FMT2_12TH_NTE_REF_CD: string (nullable = true)
| | |-- WS_FMT3_MSG_13: string (nullable = true)
| |-- WS_FMT6_S20: struct (nullable = true)
| | |-- WS_FMT6_RECDID: string (nullable = true)
| | |-- WS_FMT6_SEGEXT: string (nullable = true)
| | |-- WS_FMT6_TOT_ORD: integer (nullable = true)
| | |-- WS_FMT6_VEND_PACKAGE: string (nullable = true)
| | |-- WS_FMT6_COST_VEND: decimal(9,4) (nullable = true)
| | |-- WS_FMT6_UCS_UPC_CASE_CD: struct (nullable = true)
| | | |-- WS_FMT6_UPC_COUNTRY: integer (nullable = true)
| | | |-- WS_FMT6_UPC_SYSTEM: integer (nullable = true)
| | | |-- WS_FMT6_UPC_MANUF: integer (nullable = true)
| | | |-- WS_FMT6_UPC_SALES: integer (nullable = true)
| | |-- WS_FMT6_VN_PROD_ID_QUALIFIER: string (nullable = true)
| | |-- WS_FMT6_VEND_PROD_SERV_ID: string (nullable = true)
| | |-- WS_FMT6_PURCHSR_ID_QUALIFIER: string (nullable = true)
| | |-- WS_FMT6_PURCHASER_ID_NUM: string (nullable = true)
| | |-- WS_FMT6_SAFEWAY_CORP_ITEM_CD: struct (nullable = true)
| | | |-- WS_FMT6_CORP_ITEM_CD: integer (nullable = true)
| | |-- WS_FMT6_SAFEWAY_BR_ITEM_CD: struct (nullable = true)
| | | |-- WS_FMT6_BR_ITEM_CD: string (nullable = true)
| | |-- WS_FMT6_DESC_ITEM: string (nullable = true)
| | |-- WS_FMT6_PACK: string (nullable = true)
| | |-- WS_FMT6_SIZE: string (nullable = true)
| | |-- WS_FMT6_PALLET_PATTERN: struct (nullable = true)
| | | |-- WS_FMT6_PALLET_LAYER: integer (nullable = true)
| | | |-- WS_FMT6_PALLET_HEIGHT: integer (nullable = true)
| | |-- WS_FMT6_NON_OI_ALW_MOH: string (nullable = true)
| | |-- WS_FMT6_NON_OI_ALW_CHRGE_CD: string (nullable = true)
| | |-- WS_FMT6_NON_OI_ALW_CHRGE_AMT: decimal(9,4) (nullable = true)
| | |-- WS_FMT6_OI_METHOD_OF_HNDLING: string (nullable = true)
| | |-- WS_FMT6_OI_ALLOW_OR_CHRGE_CD: string (nullable = true)
| | |-- WS_FMT6_OI_ALLOW_CHRGE_AMT: decimal(9,4) (nullable = true)
| | |-- WS_FMT6_UPC_BASIS: string (nullable = true)
| |-- WORKPO_PRINCIPAL_VEND_NUM: string (nullable = true)
| |-- WORKPO_PRINCIPAL_VEND_SUB_ACNT: string (nullable = true)
| |-- WORKPO_WIMS_SUB_VEND: string (nullable = true)
| |-- WORKPO_BROKER_VEND_NUM: string (nullable = true)
| |-- WORKPO_BROKER_VEND_SUB_ACNT: string (nullable = true)
| |-- WORKPO_SEND_BROKER_PO_SW: string (nullable = true)

Now I am using CobolCopybookExample1 as a reference, but I am not getting the output in the correct format. I mean, the schema is there, but no data is getting loaded. Can you please help me with the correct program? Below I am sending you the output I currently get and my program.

{"WORKPO_RECORD":{"WORKPO_SORT_KEY":{"WORKPO_RECEIVERS_ID":"&","WORKPO_RECEIVER_DUNS_PLUS_4":"<<","WORKPO_PO_NUM":"","WORKPO_COPY_IND":"","WORKPO_FACILITY":"+","WORKPO_DIVISION":"","WORKPO_CORP":"","WORKPO_UPDATE_PO_SW":"","WORKPO_RECORD_TYPE":""},"WORKPO_USAPOMTF_RECORD":"|! # # | <<","WS_FMT0_000":{"WS_FMT0_RECDID":"","WS_FMT0_SEGEXT":"","WS_FMT0_USER_ID":"|!","WS_FMT0_PARTNER_ID":"#","WS_FMT0_DOCREF_NUM":{"WS_FMT0_DOCREF_FACILITY":"","WS_FMT0_DOCREF_PO_NUM":"#"},"WS_FMT0_POS875":""},"WS_FMT1_S10":{"WS_FMT1_RECDID":"","WS_FMT1_SEGEXT":"","WS_FMT1_EDI_VEND_ACNT_NUM":"|!","WS_FMT1_SET_IDENTIFIER":"","WS_FMT1_VEND_DUNS_AND_SUFX":"#","WS_FMT1_PO_NUM_1":"#","WS_FMT1_COMMUNICATIONS_ID":"","WS_FMT1_COMMUNICATIONS_PW":"","WS_FMT1_APPLICATION_SNDER_CD":"","WS_FMT1_PARTNER_ID":"","WS_FMT1_SET_ID_875":"","WS_FMT1_ORD_DATE_YYMMDD":{},"WS_FMT1_PO_NUM_2":"","WS_FMT1_BILL_TO_ENTITY_CD":"","WS_FMT1_BT_OMCORP_CORP_NAME":"","WS_FMT1_BT_IDCD_QUALIFIER":"","WS_FMT1_BT_DUNS_PLUS_4":{"WS_FMT1_BT_DUNN_BRAD_ID":"","WS_FMT1_BT_DIVISION":"","WS_FMT1_BT_FILLER":""},"WS_FMT1_BT_FAC_NAME":"","WS_FMT1_BT_STREET_ADDRESS":"","WS_FMT1_BT_CITY_NAME":"","WS_FMT1_BT_PROV_CD":"","WS_FMT1_BT_POSTAL_CD":"","WS_FMT1_SHIP_TO_ENTITY_CD":"","WS_FMT1_ST_OMCORP_CORP_NAME":"","WS_FMT1_ST_IDCD_QUALIFIER":"","WS_FMT1_ST_DUNS_PLUS_4":{"WS_FMT1_ST_DUNN_BRAD_ID":"","WS_FMT1_ST_FACILITY":""},"WS_FMT1_SHIP_WITH_PO":"","WS_FMT1_SHIP_DATE_YYMMDD":{"WS_FMT1_SHIP_DATE_YY":"","WS_FMT1_SHIP_DATE_MM":"","WS_FMT1_SHIP_DATE_DD":""},"WS_FMT1_PO_PICKUP_NBR":"|"},"WS_FMT2_S11":{"WS_FMT2_RECDID":"","WS_FMT2_SEGEXT":"","WS_FMT2_ST_FACILITY_NAME":"|! #","WS_FMT2_ST_STREET_ADDRESS":"#","WS_FMT2_ST_CITY_NAME":"","WS_FMT2_ST_PROV_CD":"","WS_FMT2_ST_POSTAL_CD":"","WS_FMT2_VN_ENTITY_ID_CD":"","WS_FMT2_VN_PRINCIPAL_NAME":"","WS_FMT2_VN_ID_CD_QUALIFIER":"","WS_FMT2_VN_DUNS_PLUS_FOUR":"","WS_FMT2_VN_BROKER_NAME":"","WS_FMT2_VN_STREET_ADDRESS":"","WS_FMT2_VN_PARSED_CITY":"","WS_FMT2_VN_PROV_CD":"","WS_FMT2_VN_POSTAL_CD":"","WS_FMT2_LOOP_ID":"","WS_FMT2_CONTACT_FUNC_CD":"","WS_FMT2_BUYER_NAME":"","WS_FMT2_BUYER_PHONE_NUM":{"WS_FMT2_PHONE_AREA_CD":"","WS_FMT2_PHONE_EXCHANGE":"","WS_FMT2_PHONE_ID":""},"WS_FMT2_COM_NO_QUALIFIER":"","WS_FMT2_DUEDATE_QUALIFIER":"","WS_FMT2_DUE_DATE_YYMMDD":{"WS_FMT2_DUE_DATE_YY":"","WS_FMT2_DUE_DATE_MM":"","WS_FMT2_DUE_DATE_DD":""},"WS_FMT2_1ST_NTE_REF_CD":"","WS_FMT2_MSG_1":"","WS_FMT2_MSG_2":"","WS_FMT2_2ND_NTE_REF_CD":"","WS_FMT2_MSG_3":"","WS_FMT2_MSG_4":"| 
<<","WS_FMT2_CARRIER_NAME":""},"WS_FMT3_S12":{"WS_FMT3_RECDID":"","WS_FMT3_SEGEXT":"","WS_FMT3_TOT_ORD_UOM":"","WS_FMT3_TOT_WEIGHT_UOM":"#","WS_FMT3_ORDER_STATUS_CD":"","WS_FMT3_MSG_5":"","WS_FMT3_3RD_NTE_REF_CD":"","WS_FMT3_MSG_6":"","WS_FMT3_SHIP_METHOD_OF_PAY":"","WS_FMT3_MODE_OF_TRANSPORT":"","WS_FMT3_UCS_PALLET_XCHNGE_CD":"","WS_FMT3_UCS_UNIT_LOAD_OPT_CD":"","WS_FMT3_SPUE_FILE_VN_CONTACT":{"WS_FMT3_SPUE_AREA_CD":"","WS_FMT3_SPUE_PHONE_NUM":""},"WS_FMT2_4TH_NTE_REF_CD":"","WS_FMT3_MSG_7":"","WS_FMT2_5TH_NTE_REF_CD":"","WS_FMT3_MSG_8":"","WS_FMT2_6TH_NTE_REF_CD":"","WS_FMT3_MSG_9":"","WS_FMT2_7TH_NTE_REF_CD":"","WS_FMT3_MSG_10":"","WS_FMT2_8TH_NTE_REF_CD":"","WS_FMT3_S12_LUMPER_MSG_SW":"","WS_FMT2_9TH_NTE_REF_CD":"","WS_FMT3_S12_ONAS_MSG_SW":"","WS_FMT2_10TH_NTE_REF_CD":"","WS_FMT3_MSG_11":"","WS_FMT2_11TH_NTE_REF_CD":"","WS_FMT3_MSG_12":"","WS_FMT2_12TH_NTE_REF_CD":"","WS_FMT3_MSG_13":""},"WS_FMT6_S20":{"WS_FMT6_RECDID":"","WS_FMT6_SEGEXT":"","WS_FMT6_VEND_PACKAGE":"","WS_FMT6_UCS_UPC_CASE_CD":{},"WS_FMT6_VN_PROD_ID_QUALIFIER":"","WS_FMT6_VEND_PROD_SERV_ID":"#","WS_FMT6_PURCHSR_ID_QUALIFIER":"","WS_FMT6_PURCHASER_ID_NUM":"","WS_FMT6_SAFEWAY_CORP_ITEM_CD":{},"WS_FMT6_SAFEWAY_BR_ITEM_CD":{"WS_FMT6_BR_ITEM_CD":""},"WS_FMT6_DESC_ITEM":"","WS_FMT6_PACK":"","WS_FMT6_SIZE":"","WS_FMT6_PALLET_PATTERN":{},"WS_FMT6_NON_OI_ALW_MOH":"","WS_FMT6_NON_OI_ALW_CHRGE_CD":"","WS_FMT6_OI_METHOD_OF_HNDLING":"","WS_FMT6_OI_ALLOW_OR_CHRGE_CD":"","WS_FMT6_UPC_BASIS":""},"WORKPO_PRINCIPAL_VEND_NUM":"","WORKPO_PRINCIPAL_VEND_SUB_ACNT":"","WORKPO_WIMS_SUB_VEND":"","WORKPO_BROKER_VEND_NUM":"","WORKPO_BROKER_VEND_SUB_ACNT":"+","WORKPO_SEND_BROKER_PO_SW":""}}
{"WORKPO_RECORD":{"WORKPO_SORT_KEY":{"WORKPO_RECEIVERS_ID":"&","WORKPO_RECEIVER_DUNS_PLUS_4":"","WORKPO_PO_NUM":"<<","WORKPO_COPY_IND":"","WORKPO_FACILITY":"&","WORKPO_DIVISION":"","WORKPO_CORP":"","WORKPO_UPDATE_PO_SW":"","WORKPO_RECORD_TYPE":""},"WORKPO_USAPOMTF_RECORD":"+ |! # # | <<","WS_FMT0_000":{"WS_FMT0_RECDID":"","WS_FMT0_SEGEXT":"","WS_FMT0_USER_ID":"|!","WS_FMT0_PARTNER_ID":"","WS_FMT0_DOCREF_NUM":{"WS_FMT0_DOCREF_FACILITY":"#","WS_FMT0_DOCREF_PO_NUM":""},"WS_FMT0_POS875":""},"WS_FMT1_S10":{"WS_FMT1_RECDID":"","WS_FMT1_SEGEXT":"","WS_FMT1_EDI_VEND_ACNT_NUM":"|!","WS_FMT1_SET_IDENTIFIER":"","WS_FMT1_VEND_DUNS_AND_SUFX":"","WS_FMT1_PO_NUM_1":"# #","WS_FMT1_COMMUNICATIONS_ID":"","WS_FMT1_COMMUNICATIONS_PW":"","WS_FMT1_APPLICATION_SNDER_CD":"","WS_FMT1_PARTNER_ID":"","WS_FMT1_SET_ID_875":"","WS_FMT1_ORD_DATE_YYMMDD":{},"WS_FMT1_PO_NUM_2":"","WS_FMT1_BILL_TO_ENTITY_CD":"","WS_FMT1_BT_OMCORP_CORP_NAME":"","WS_FMT1_BT_IDCD_QUALIFIER":"","WS_FMT1_BT_DUNS_PLUS_4":{"WS_FMT1_BT_DUNN_BRAD_ID":"","WS_FMT1_BT_DIVISION":"","WS_FMT1_BT_FILLER":""},"WS_FMT1_BT_FAC_NAME":"","WS_FMT1_BT_STREET_ADDRESS":"","WS_FMT1_BT_CITY_NAME":"","WS_FMT1_BT_PROV_CD":"","WS_FMT1_BT_POSTAL_CD":"","WS_FMT1_SHIP_TO_ENTITY_CD":"","WS_FMT1_ST_OMCORP_CORP_NAME":"","WS_FMT1_ST_IDCD_QUALIFIER":"","WS_FMT1_ST_DUNS_PLUS_4":{"WS_FMT1_ST_DUNN_BRAD_ID":"","WS_FMT1_ST_FACILITY":""},"WS_FMT1_SHIP_WITH_PO":"","WS_FMT1_SHIP_DATE_YYMMDD":{"WS_FMT1_SHIP_DATE_YY":"","WS_FMT1_SHIP_DATE_MM":"","WS_FMT1_SHIP_DATE_DD":""},"WS_FMT1_PO_PICKUP_NBR":""},"WS_FMT2_S11":{"WS_FMT2_RECDID":"","WS_FMT2_SEGEXT":"","WS_FMT2_ST_FACILITY_NAME":"|! #","WS_FMT2_ST_STREET_ADDRESS":"#","WS_FMT2_ST_CITY_NAME":"","WS_FMT2_ST_PROV_CD":"","WS_FMT2_ST_POSTAL_CD":"","WS_FMT2_VN_ENTITY_ID_CD":"","WS_FMT2_VN_PRINCIPAL_NAME":"","WS_FMT2_VN_ID_CD_QUALIFIER":"","WS_FMT2_VN_DUNS_PLUS_FOUR":"","WS_FMT2_VN_BROKER_NAME":"","WS_FMT2_VN_STREET_ADDRESS":"","WS_FMT2_VN_PARSED_CITY":"","WS_FMT2_VN_PROV_CD":"","WS_FMT2_VN_POSTAL_CD":"","WS_FMT2_LOOP_ID":"","WS_FMT2_CONTACT_FUNC_CD":"","WS_FMT2_BUYER_NAME":"","WS_FMT2_BUYER_PHONE_NUM":{"WS_FMT2_PHONE_AREA_CD":"","WS_FMT2_PHONE_EXCHANGE":"","WS_FMT2_PHONE_ID":""},"WS_FMT2_COM_NO_QUALIFIER":"","WS_FMT2_DUEDATE_QUALIFIER":"","WS_FMT2_DUE_DATE_YYMMDD":{"WS_FMT2_DUE_DATE_YY":"","WS_FMT2_DUE_DATE_MM":"","WS_FMT2_DUE_DATE_DD":""},"WS_FMT2_1ST_NTE_REF_CD":"","WS_FMT2_MSG_1":"","WS_FMT2_MSG_2":"","WS_FMT2_2ND_NTE_REF_CD":"","WS_FMT2_MSG_3":"","WS_FMT2_MSG_4":"| 
<<","WS_FMT2_CARRIER_NAME":""},"WS_FMT3_S12":{"WS_FMT3_RECDID":"","WS_FMT3_SEGEXT":"","WS_FMT3_TOT_ORD_UOM":"","WS_FMT3_TOT_WEIGHT_UOM":"","WS_FMT3_ORDER_STATUS_CD":"","WS_FMT3_MSG_5":"","WS_FMT3_3RD_NTE_REF_CD":"","WS_FMT3_MSG_6":"","WS_FMT3_SHIP_METHOD_OF_PAY":"","WS_FMT3_MODE_OF_TRANSPORT":"","WS_FMT3_UCS_PALLET_XCHNGE_CD":"","WS_FMT3_UCS_UNIT_LOAD_OPT_CD":"","WS_FMT3_SPUE_FILE_VN_CONTACT":{"WS_FMT3_SPUE_AREA_CD":"","WS_FMT3_SPUE_PHONE_NUM":""},"WS_FMT2_4TH_NTE_REF_CD":"","WS_FMT3_MSG_7":"","WS_FMT2_5TH_NTE_REF_CD":"","WS_FMT3_MSG_8":"","WS_FMT2_6TH_NTE_REF_CD":"","WS_FMT3_MSG_9":"","WS_FMT2_7TH_NTE_REF_CD":"","WS_FMT3_MSG_10":"","WS_FMT2_8TH_NTE_REF_CD":"","WS_FMT3_S12_LUMPER_MSG_SW":"","WS_FMT2_9TH_NTE_REF_CD":"","WS_FMT3_S12_ONAS_MSG_SW":"","WS_FMT2_10TH_NTE_REF_CD":"","WS_FMT3_MSG_11":"","WS_FMT2_11TH_NTE_REF_CD":"","WS_FMT3_MSG_12":"","WS_FMT2_12TH_NTE_REF_CD":"","WS_FMT3_MSG_13":""},"WS_FMT6_S20":{"WS_FMT6_RECDID":"","WS_FMT6_SEGEXT":"","WS_FMT6_VEND_PACKAGE":"!","WS_FMT6_UCS_UPC_CASE_CD":{},"WS_FMT6_VN_PROD_ID_QUALIFIER":"","WS_FMT6_VEND_PROD_SERV_ID":"#","WS_FMT6_PURCHSR_ID_QUALIFIER":"","WS_FMT6_PURCHASER_ID_NUM":"#","WS_FMT6_SAFEWAY_CORP_ITEM_CD":{},"WS_FMT6_SAFEWAY_BR_ITEM_CD":{"WS_FMT6_BR_ITEM_CD":"#"},"WS_FMT6_DESC_ITEM":"","WS_FMT6_PACK":"","WS_FMT6_SIZE":"","WS_FMT6_PALLET_PATTERN":{},"WS_FMT6_NON_OI_ALW_MOH":"","WS_FMT6_NON_OI_ALW_CHRGE_CD":"","WS_FMT6_OI_METHOD_OF_HNDLING":"","WS_FMT6_OI_ALLOW_OR_CHRGE_CD":"","WS_FMT6_UPC_BASIS":""},"WORKPO_PRINCIPAL_VEND_NUM":"","WORKPO_PRINCIPAL_VEND_SUB_ACNT":"","WORKPO_WIMS_SUB_VEND":"","WORKPO_BROKER_VEND_NUM":"","WORKPO_BROKER_VEND_SUB_ACNT":"","WORKPO_SEND_BROKER_PO_SW":""}}

    package za.co.absa.cobrix.spark.cobol.examples

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import za.co.absa.cobrix.spark.cobol.utils.SparkUtils
    import com.databricks.spark.xml._

    // This is an example Spark job that uses the COBOL data source.
    // IMPORTANT! To run this locally change the scope of all Scala and Spark libraries from 'provided' to 'compile' in pom.xml,
    // but revert it to 'provided' to create an uber jar for running on a cluster.

    object CobolSparkExample1 {

      def main(args: Array[String]): Unit = {

        val sparkBuilder = SparkSession.builder().appName("Cobol source reader example 1")
        val spark = sparkBuilder
          .master("local[*]")
          .getOrCreate()

        // This is an example read from a mainframe data file.
        // You can turn on/off the 'generate_record_id' and 'schema_retention_policy' options to see what difference they make.
        val df = spark
          .read
          .format("za.co.absa.cobrix.spark.cobol.source")
          .option("copybook", "file:///Users/admin/Desktop/Kafka/cobrix-master/cobrix-master/examples/example_data/edited1.cbl")
          .option("delimiter", "#")
          //.option("generate_record_id", true)                 // Generates File_Id and Record_Id fields for line-order-dependent data
          //.option("schema_retention_policy", "collapse_root") // Collapses the root group, returning its fields at the top level of the schema
          .load("file:///Users/admin/Desktop/Kafka/cobrix-master/cobrix-master/examples/example_data/Albertsan_Input") // My input flat file is here

        // If you get this exception:
        //   java.lang.ClassNotFoundException: Failed to find data source: cobol
        // please use the full class name of the data source:
        //   .format("za.co.absa.cobrix.spark.cobol.source")

        df.printSchema()
        //println(df.count)

        df.show(100, truncate = false)

        //df.toJSON.take(100).foreach(println)

        // TODO: convert it to XML
        df.write
          .mode(SaveMode.Overwrite)
          .format("com.databricks.spark.xml")
          .option("rowTag", "book")
          .option("rootTag", "books")
          .save("file:///Users/admin/Desktop/Kafka/cobrix-master/cobrix-master/examples/example_data/Output/newbooks.xml")

        //df.write.mode(SaveMode.Overwrite).parquet("file:///Users/admin/Desktop/Kafka Alberstsons/cobrix-master/cobrix-master/examples/example_data/Output")
      }
    }

yruslan commented on May 27, 2024

Your use case is very interesting. It looks like your input file is in ASCII already. I think you are getting no data because Cobrix tries to convert your data from EBCDIC to ASCII although it does not need to do so.

ASCII is supported in the parser. The only thing that needs to be done on our side is to add an option to specify that the input data is in ASCII.

Please attach your example copybook as well. We will check our assumptions, and if they are correct we will implement it rather quickly so that it will be available in the next release.

touseefzaki commented on May 27, 2024

CobolExampleNew.txt

In my previous comment I have shared the input file as well. This is my code file. I hope the rest of the things are all correct in my program? Also, please let me know what I can do to fix this for now. Please update on the next release; I am in urgent need of it.

touseefzaki commented on May 27, 2024

20180524_190829_PO_D20180524130814_TXT.txt

Input File

yruslan commented on May 27, 2024

The code looks good. If you could provide the copybook (edited1.cbl file) we could test it against your data file.
Unfortunately there is no workaround for this issue. But since it is so urgent, we can provide you with a fix and a snapshot version (0.3.1-SNAPSHOT) tomorrow, and it will be released by the end of this week. Usually our release cycle is 2 weeks, but 0.3.1 took longer because of the holidays.

touseefzaki commented on May 27, 2024

Sample.zip

Sending the copybook file; looking forward to the new release. Earlier I tried to submit with the snapshot jar, as well as by adding the 4 packages from your post, but that did not work, therefore I used the Scala IDE to run these programs.

touseefzaki commented on May 27, 2024

Please upload the whole project like you did at https://github.com/AbsaOSS/cobrix. Thanks so much for the help; you saved me a lot of time. I have been struggling with this transformation for over a week now.

touseefzaki commented on May 27, 2024

Hi @yruslan, thanks so much for the help. I have tried implementing this, and for some of the fields I am able to get the records, but for others I am not. It looks like, as you said, it is a file format issue, so I guess once I have the EBCDIC format file, Cobrix will convert it, right?

One more thing I observed is that while running the program I am getting the output, but it is also throwing some errors. I am attaching the error file here:
Error File.txt

touseefzaki commented on May 27, 2024

Is there a way to read ASCII files or fixed-length files in Spark?

touseefzaki commented on May 27, 2024

I am using the same schema and dataset that I shared, as I have only that for now. The EBCDIC file is not available, therefore I am trying a few things to get it to work.

Approach 1

    import sqlCtx.implicits._

    val DataReaderDF = spark.read
      .option("delimiter", "\r\n\r\n") // i am not sure if this delimiter is ok or not to be used in my ASCII file input source
      .option("header", false)
      .text("file:///Users/admin/Desktop/Kafka/cobrix-master/cobrix-master/examples/example_data/InputFile/20180524_840860__PO_D20180524130814_TXT")
      .toDF()

    val auctions = DataReaderDF.toDF()

    auctions.registerTempTable("auctions")

    val dataDF = sqlCtx.sql("select * from auctions").toDF()
    dataDF.show()

With the above code I am getting output like below, but I believe my delimiter is wrong.
+--------------------+
| value|
+--------------------+
|ACHFOUND 07967...|
|ACHFOUND 07967...|
|ACHFOUND 07967...|
|ACHFOUND 07967...|
|ACHFOUND 07967...|
|ACHFOUND 07967...|
|ACHFOUND 07967...|
|ACHFOUND 07967...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
|ACOSTA 00895...|
+--------------------+
only showing top 20 rows

I am thinking that if I can read this ASCII file data correctly using the above code, then I will create a dataframe using Cobrix to get the schema, like below:

    val df = spark
      .read
      .format("za.co.absa.cobrix.spark.cobol.source")
      .option("copybook", "file:///Users/admin/Desktop/Kafka/cobrix-master/cobrix-master/examples/example_data/edited1.cbl")
      .load("file:///Users/admin/Desktop/Kafka/cobrix-master/cobrix-master/examples/example_data/InputFile/20180524_840860__PO_D20180524130814_TXT")
      .toDF()

With the above code I am able to fetch the schema.

So what I am thinking is: if there is a way to merge DataReaderDF (the data file) with df (the schema file), please suggest how I can do this.

Approach 2:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf().setAppName("Name").setMaster("local")
    val sc = SparkContext.getOrCreate(conf)

    val sqlCtx = new SQLContext(sc)

    import sqlCtx.implicits._

    sc.setLogLevel("ERROR")

    sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n") // maybe there is a better delimiter I need to use for my dataset, please advise

    val dfnew = sc.textFile("file:///Users/admin/Desktop/Kafka/cobrix-master/cobrix-master/examples/example_data/InputFile/20180524_840860__PO_D20180524130814_TXT")
      // I found this on the internet, but it is not of much help for my dataset as my schema is huge and I cannot hardcode it.
      .map(l => (l.substring(25, 31).trim(), l.substring(50, l.length() - 5).trim())) //, l.substring(13,18).trim(), l.substring(18,22).trim()))
      .toDF()

    dfnew.show()

If I read the dataset correctly, and then read the schema like I did in Approach 1 into another dataframe, and then merge the schema and the dataset: is that possible, and if so, how?

touseefzaki commented on May 27, 2024

@yruslan, is it going to address the problem which I have with my dataset? Thanks a lot for all your help.

touseefzaki commented on May 27, 2024

Thanks a million @yruslan, you are a genius. Thanks for guiding me in the correct direction. The above code works perfectly in my use case. I will try the other approach you advised above and will come back to you with different types of datasets, 100 percent.

With this issue, other issues arose, like accepting 'Z', and ASCII support for Cobrix came up. Let's see; if I get anything new in my future datasets, I will update here. I would advise my developer community to use Cobrix when the datasets are complex, like in this case.

Thanks again :)

touseefzaki commented on May 27, 2024

Hi @yruslan, I got an additional use case where I want to leverage Cobrix in combination with Spark.
As I shared in the input data file (20180524_190829_PO_D20180524130814_TXT.txt), there are tags like 000M (it is considered to be the start of one PO) and S12TLR (it is considered to be the end of a PO). I mean to say my input file has hundreds of different business objects in one file, and I need to generate an XML file for each of these business objects from the one input data file. Will that be possible in Spark + Cobrix?

I have achieved this with core Java without using Cobrix by applying a hashmap mechanism, but it feels like there is a lot of hard-coding I needed to do to achieve this, which probably is not a good idea in a production environment. I want to do this with a Spark + Cobrix program like the one you shared in your earlier comment.

I am attaching my code here:
POTransformation.txt

touseefzaki commented on May 27, 2024

I think here is what you want. Just replace the paths:

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}
    import za.co.absa.cobrix.cobol.parser.CopybookParser
    import za.co.absa.cobrix.spark.cobol.utils.RowExtractors
    import za.co.absa.cobrix.cobol.parser.encoding.ASCII
    import za.co.absa.cobrix.spark.cobol.schema.SchemaRetentionPolicy

    val copybookPath = "data/edited1.cbl"
    val dataPath = "data/edited"

    val copybookContents = Files.readAllLines(Paths.get(copybookPath), StandardCharsets.ISO_8859_1).toArray.mkString("\n")

    val parsedCopybook = CopybookParser.parseTree(ASCII(), copybookContents, dropGroupFillers = false)

    val df = spark
      .read
      .format("za.co.absa.cobrix.spark.cobol.source")
      .option("copybook_contents", copybookContents)
      .option("encoding", "ascii")
      .option("schema_retention_policy", "collapse_root")
      .load(dataPath)

    val sparkSchema = df.schema

    val rddText = spark.sparkContext.textFile(dataPath)

    val rddRow = rddText.map( str => {
      RowExtractors.extractRecord(parsedCopybook.ast, str.getBytes(), 0, SchemaRetentionPolicy.CollapseRoot)
    })

    val dfOut = spark.createDataFrame(rddRow, sparkSchema)

    dfOut.printSchema()
    dfOut.show()

dfOut is your dataset.

The output we are getting is not correct, I think. For one PO I have segment IDs such as 000M, S10, S11, S12, S20 (3 times) and the S12TLR. What I have observed here is that each segment ID is looping over the other segment IDs present under one parent (like ACHFOUND), which should not happen. The logic should pick up only its own child line items, not the other line items. For example:

(1) 000M: 000 (correctly mapping), S10 (wrong), S11 (wrong), S12 (wrong), S20 (wrong)

(2) S10: 000 (wrong), S10 (correctly mapping), S11 (wrong), S12 (wrong), S20 (wrong)

(3) S11: 000 (wrong), S10 (wrong), S11 (correctly mapping), S12 (wrong), S20 (wrong)

...etc.

Also, S20 is creating data only once; it should create it 3 times as per the dataset (as there are 3 rows for the S20 segment id). Moreover, S12TLR is not considered a segment by this logic, I guess, so it is not written to the table. I am attaching the XML output for 1 PO from the core Java program which I have written; it produced the expected output, and this is correct as per my data team. I am clueless how it can be achieved otherwise in Spark + Cobrix. Can you please help me get it?
XML Output.txt

yruslan commented on May 27, 2024

We had a discussion about your use case. The conclusion is that at the moment I think Cobrix cannot replace your Java program that extracts XMLs.

Here is our reasoning and our plans to handle such cases in the future.

The Spark SQL API (DataFrames and Datasets) supports loading data where each row has the same schema. In a multisegment mainframe file such as yours you have a root segment and several record types as child segments, so effectively each record can have a different schema. This doesn't fit a Spark dataframe very well.

Your input file has multiple segments with a root segment and a bunch of child segments. Cobrix can extract each segment into its own dataframe. But that is not very helpful for your use case, because you need a combined XML.

We have an idea of how we can map a multisegment mainframe file onto a Spark schema. That way we can extract the whole big records from several segments at once into one Spark dataframe. Basically, child segments will be nullable arrays of structs that may or may not contain child records (see the schema sketch below).
But it will take some time to implement this in Cobrix. It will probably take several months.
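
To illustrate the idea, a sketch of what such a mapped Spark schema could look like (the SEG_* wrapper names are made up; the field names are taken from your printed schema):

    import org.apache.spark.sql.types._

    // Root segment fields at the top level, child segments as nullable arrays of structs,
    // so one parent record can carry zero or more child records of each type.
    val multisegmentSchema = StructType(Seq(
      StructField("WORKPO_SORT_KEY", StringType),
      StructField("SEG_S10", ArrayType(StructType(Seq(
        StructField("WS_FMT1_PO_NUM_1", StringType),
        StructField("WS_FMT1_SET_IDENTIFIER", StringType)
      ))), nullable = true),
      StructField("SEG_S11", ArrayType(StructType(Seq(
        StructField("WS_FMT2_ST_CITY_NAME", StringType)
      ))), nullable = true)
    ))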

Also, to extract a multisegment file in one go, the user will need to provide the parent-child relationships between segments and a copybook for each segment. For example, you have only one copybook, but each segment needs to be parsed and extracted differently. For this to work you would need to have a copybook for each of your segments. Several segments can share the same copybook, but root and child segments need to be described by separate copybooks.

touseefzaki commented on May 27, 2024

We had a discussion about your use case. The conclusion is that at the moment I think Cobrix cannot replace your Java program that extracts XMLs.

Here is our reasoning and our plans to handle such cases in the future.

The Spark SQL API (DataFrames and Datasets) supports loading data where each row has the same schema. In a multisegment mainframe file such as yours you have a root segment and several record types as child segments, so effectively each record can have a different schema. This doesn't fit a Spark dataframe very well.

Your input file has multiple segments with a root segment and a bunch of child segments. Cobrix can extract each segment into its own dataframe. But that is not very helpful for your use case, because you need a combined XML.

We have an idea of how we can map a multisegment mainframe file onto a Spark schema. That way we can extract the whole big records from several segments at once into one Spark dataframe. Basically, child segments will be nullable arrays of structs that may or may not contain child records.
But it will take some time to implement this in Cobrix. It will probably take several months.

Also, to extract a multisegment file in one go, the user will need to provide the parent-child relationships between segments and a copybook for each segment. For example, you have only one copybook, but each segment needs to be parsed and extracted differently. For this to work you would need to have a copybook for each of your segments. Several segments can share the same copybook, but root and child segments need to be described by separate copybooks.

Hi @yruslan, unfortunately in my use case I have only one schema, and that is always going to be the case. The existing system is in IIB and we are planning to move it to Kafka + Spark Streaming.
I am also trying to do it with Kafka + Kafka Streams. So far I could achieve it only in core Java.

Regarding this: "The Spark SQL API (DataFrames and Datasets) supports loading data where each row has the same schema. In a multisegment mainframe file such as yours you have a root segment and several record types as child segments, so effectively each record can have a different schema. This doesn't fit a Spark dataframe very well." --- I would like to say that if there is no record for some schema, then the XML tag remains empty; we have these kinds of cases.

Regarding this: "Your input file has multiple segments with a root segment and a bunch of child segments. Cobrix can extract each segment into its own dataframe. But that is not very helpful for your use case, because you need a combined XML." --- I do not need to read the whole file at once into XML. I need to read segment-wise for each PO. For example, ACHFOUND has 8 segment IDs (000M, S10, S11, S12, S20, S20, S20 and S12TLR); for these 8 segments I need to create 1 XML. Similarly, for the other POs (like ACOSTA, BARILLAUSA, etc.) I need to create separate XMLs.

Let me give one example; maybe it will give clarity on how to read the data.
In the COBOL sample input file, the first 47 positions are the header and the last 23 are the trailer, and in between, the record details are available. The challenge is to read the data.

The expected output is attached to this post:
ExpectedOutput.txt

yruslan commented on May 27, 2024

It seems that your case is simpler than I thought. You can use the last program I sent you and add segment-id field mappings using Spark. After that you can export it as XML and you might get the correct results.

    import spark.implicits._
    import org.apache.spark.sql.functions._
    val dfOut2 = dfOut.select($"WORKPO_SORT_KEY",
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "000", $"WS_FMT0_000"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S10", $"WS_FMT1_S10"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S11", $"WS_FMT2_S11"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S12", $"WS_FMT3_S12"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S20", $"WS_FMT6_S20")
    )
    dfOut2.show()
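
If the auto-generated column names get in the way of the XML export, one variation (our suggestion, not tested against your data) is to alias each mapped column and give it an explicit null for non-matching segments, continuing from the snippet above:

    // Named segment columns; rows of other segment types get null in each column.
    val dfOut3 = dfOut.select($"WORKPO_SORT_KEY",
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "000", $"WS_FMT0_000").otherwise(null).as("SEG_000"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S10", $"WS_FMT1_S10").otherwise(null).as("SEG_S10"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S20", $"WS_FMT6_S20").otherwise(null).as("SEG_S20")
    )
    dfOut3.show()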

touseefzaki commented on May 27, 2024

It seems that your case is simpler than I thought. You can use the last program I sent you and add segment-id field mappings using Spark. After that you can export it as XML and you might get the correct results.

    import spark.implicits._
    import org.apache.spark.sql.functions._
    val dfOut2 = dfOut.select($"WORKPO_SORT_KEY",
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "000", $"WS_FMT0_000"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S10", $"WS_FMT1_S10"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S11", $"WS_FMT2_S11"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S12", $"WS_FMT3_S12"),
      when($"WS_FMT0_000.WS_FMT0_RECDID" === "S20", $"WS_FMT6_S20")
    )
    dfOut2.show()

It gives output, but not as expected. I guess that to parse S20 or any other segments like S12 there should be some logic in the Cobrix parser, because otherwise, if you take a look at the input file, rows 5, 6 and 7 are 3 different items for S20, but in the output only the first record's (5th row's) items are picked up, which is wrong. Also, S12TLR is not in the output.

yruslan commented on May 27, 2024

Yes, I guess so. If you have a suggestion of a parser feature that would help your use case, let me know. Ideas are welcome.

touseefzaki commented on May 27, 2024

Yes, I guess so. If you have a suggestion of a parser feature that would help your use case, let me know. Ideas are welcome.

I am trying to achieve this with the help of RegEx, but the result is not 100% correct. Meanwhile, I came to a new requirement where I have 2 dataframes; I have to pull the value from one df and put it into another df. Any idea how this can be done?

    // XML data reader
    val dataDF = (new XmlReader()).withRowTag(booksFileTag1).xmlFile(sqlContext, supplyData).toDF()

    // This is reading each item from the df
    val valuesdf = dataDF.collect().foreach { row => row.mkString(",") }

    // XML schema reader
    val schemaDf = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, suppySchema).toDF()

For example: I need to create 1 final df after the transformation and merging of the 2 dfs.
In my case, out of many fields, one field relationship is like:
schemaDf.CorporationId [in schemaDf] == valuesdf.ROW.CLTCORP [in valuesdf]

Now I need to pull the values of each cell item from valuesdf and push them into schemaDf. Is that possible? If so, maybe you can give an example.
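
A minimal sketch, assuming both sides are kept as dataframes (i.e. no collect()), joining on the matching key columns:

    // Left-join the XML data dataframe onto the schema dataframe on the key relationship above,
    // then select whichever value columns need to be carried over into the final dataframe.
    val mergedDf = schemaDf.join(
      dataDF,
      schemaDf("CorporationId") === dataDF("ROW.CLTCORP"),
      "left"
    )
    mergedDf.show()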
