
Spark Excel Library

A library for querying Excel files with Apache Spark, for Spark SQL and DataFrames.


Co-maintainers wanted

Due to personal and professional constraints, the development of this library has been rather slow. If you find value in this library, please consider stepping up as a co-maintainer by leaving a comment here. Help is very welcome e.g. in the following areas:

  • Additional features
  • Code improvements and reviews
  • Bug analysis and fixing
  • Documentation improvements
  • Build / test infrastructure

Requirements

This library requires Spark 2.0+.

The following Spark versions are automatically tested:

spark: ["2.4.1", "2.4.7", "2.4.8", "3.0.1", "3.0.3", "3.1.1", "3.1.2", "3.2.4", "3.3.2", "3.4.1"]

For more details, please refer to the project CI configuration: ci.yml

Linking

You can link against this library in your program at the following coordinates:

Scala 2.12

groupId: com.crealytics
artifactId: spark-excel_2.12
version: <spark-version>_0.18.0

Scala 2.11

groupId: com.crealytics
artifactId: spark-excel_2.11
version: <spark-version>_0.13.7

Using with Spark shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

Spark compiled with Scala 2.12

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:<spark-version>_0.18.0

Spark compiled with Scala 2.11

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:<spark-version>_0.13.7

Features

  • This package allows querying Excel spreadsheets as Spark DataFrames.
  • From spark-excel 0.14.0 (August 24, 2021), there are two implementations of spark-excel:
    • The original Spark-Excel, based on the Spark data source API V1
    • Spark-Excel V2, based on the data source API V2, which supports loading from multiple files, corrupted-record handling and some improvements in handling data types. See below for further details.

To use the V2 implementation, just change your .format from .format("com.crealytics.spark.excel") to .format("excel"). See below for some details.

See the changelog for the latest features, fixes, etc.

Scala API

Spark 2.0+:

Create a DataFrame from an Excel file

import org.apache.spark.sql._

val spark: SparkSession = ???
val df = spark.read
    .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
    .option("dataAddress", "'My Sheet'!B3:C35") // Optional, default: "A1"
    .option("header", "true") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("setErrorCellsToFallbackValues", "true") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
    .option("usePlainNumberFormat", "false") // Optional, default: false, If true, format the cells without rounding and scientific notations
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "true") // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
    .option("maxByteArraySize", 2147483647) // Optional, default None. See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-
    .option("tempFileThreshold", 10000000) // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead
    .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    .option("workbookPassword", "pass") // Optional, default None. Requires unlimited strength JCE for older JVMs
    .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
    .load("Worktime.xlsx")

For convenience, there is an implicit that wraps the DataFrameReader returned by spark.read and provides a .excel method which accepts all possible options and provides default values:

import org.apache.spark.sql._
import com.crealytics.spark.excel._

val spark: SparkSession = ???
val df = spark.read.excel(
    header = true,  // Required
    dataAddress = "'My Sheet'!B3:C35", // Optional, default: "A1"
    treatEmptyValuesAsNulls = false,  // Optional, default: true
    setErrorCellsToFallbackValues = false, // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
    usePlainNumberFormat = false,  // Optional, default: false. If true, format the cells without rounding and scientific notations
    inferSchema = false,  // Optional, default: false
    addColorColumns = true,  // Optional, default: false
    timestampFormat = "MM-dd-yyyy HH:mm:ss",  // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    maxRowsInMemory = 20,  // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
    maxByteArraySize = 2147483647,  // Optional, default None. See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-
    tempFileThreshold = 10000000, // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead
    excerptSize = 10,  // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    workbookPassword = "pass"  // Optional, default None. Requires unlimited strength JCE for older JVMs
).schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
 .load("Worktime.xlsx")

If the sheet name is unavailable, it is possible to pass in an index:

val df = spark.read.excel(
  header = true,
  dataAddress = "0!B3:C35"
).load("Worktime.xlsx")

or to read in the names dynamically:

import com.crealytics.spark.excel.WorkbookReader
val sheetNames = WorkbookReader( Map("path" -> "Worktime.xlsx")
                               , spark.sparkContext.hadoopConfiguration
                               ).sheetNames
val df = spark.read.excel(
  header = true,
  dataAddress = sheetNames(0)
).load("Worktime.xlsx")

Create a DataFrame from an Excel file using custom schema

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val peopleSchema = StructType(Array(
    StructField("Name", StringType, nullable = false),
    StructField("Age", DoubleType, nullable = false),
    StructField("Occupation", StringType, nullable = false),
    StructField("Date of birth", StringType, nullable = false)))

val spark: SparkSession = ???
val df = spark.read
    .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
    .option("dataAddress", "'Info'!A1")
    .option("header", "true")
    .schema(peopleSchema)
    .load("People.xlsx")

Write a DataFrame to an Excel file

import org.apache.spark.sql._

val df: DataFrame = ???
df.write
  .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
  .option("dataAddress", "'My Sheet'!B3:C35")
  .option("header", "true")
  .option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
  .option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
  .mode("append") // Optional, default: overwrite.
  .save("Worktime2.xlsx")

Data Addresses

As you can see in the examples above, the location of the data to read or write can be specified with the dataAddress option. Currently the following address styles are supported (a combined sketch follows the list):

  • B3: Start cell of the data. Reading will return all rows below and all columns to the right. Writing will start here and use as many columns and rows as required.
  • B3:F35: Cell range of data. Reading will return only rows and columns in the specified range. Writing will start in the first cell (B3 in this example) and use only the specified columns and rows. If there are more rows or columns in the DataFrame to write, they will be truncated. Make sure this is what you want.
  • 'My Sheet'!B3:F35: Same as above, but with a specific sheet.
  • MyTable[#All]: Table of data. Reading will return all rows and columns in this table. Writing will only write within the current range of the table. No growing of the table will be performed. PRs to change this are welcome.
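
A combined sketch of two of these styles follows; the sheet name 'My Sheet', the table name MyTable and the file name Worktime.xlsx are placeholders, not part of the library's API:

// Read only the cell range B3:F35 on a specific sheet
val rangeDF = spark.read
  .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
  .option("dataAddress", "'My Sheet'!B3:F35")
  .option("header", "true")
  .load("Worktime.xlsx")

// Read an existing named table in the workbook
val tableDF = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "MyTable[#All]")
  .option("header", "true")
  .load("Worktime.xlsx")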

Excel API based on DataSourceV2

The V2 API offers several improvements for file and folder handling and works in a very similar way to built-in data sources such as csv and parquet.

To use the V2 implementation, just change your .format from .format("com.crealytics.spark.excel") to .format("excel").

The big difference is that you provide a path to read from or write to, rather than a single individual file:

dataFrame.write
        .format("excel")
        .save("some/path")
spark.read
        .format("excel")
        // ... insert excel read specific options you need
        .load("some/path")

Because folders are supported, you can read from and write to a "partitioned" folder structure, in the same way as csv or parquet. Note that writing partitioned structures is only available for Spark >= 3.0.1:

dataFrame.write
        .partitionBy("col1")
        .format("excel")
        .save("some/path")

Need some more examples? Check out the test cases or have a look at our wiki

Building From Source

This library is built with Mill. To build a JAR file simply run e.g. mill spark-excel[2.13.10,3.3.1].assembly from the project root, where 2.13.10 is the Scala version and 3.3.1 the Spark version. To list all available combinations of Scala and Spark, run mill resolve spark-excel[__].


spark-excel's People

Contributors

christianknoepfle, cristichircu, cvs-mckinsey, da-liii, danielpade-zmi, davidw76, dependabot[bot], dzlab, enverosmanov, flo076, gdubya, gwdgithubnom, habelson, kholodilov, nightscape, panaeon, pjfanning, quanghgx, scala-steward, sereneant, shoffing, skyler3d, syndic8-stephen, waiyan1612, williamdphillips


spark-excel's Issues

option for inserting userSchema is not possible

Hi, I am trying to read a date field from Excel, but your API doesn't seem to support this. So I tried to provide a user-defined schema as shown in the example in the readme, but I am still not able to pass it, since the schema is an object, not a string.

val df = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", "Worktime.xlsx")
  .option("sheetName", "Daily")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  // .option("userSchema", StructType(???)) // Optional // Not possible to pass the schema as a string
  .option("inferSchema", "true")
  .option("addColorColumns", "true")
  .option("startColumn", 0) // Optional
  .option("endColumn", 99) // Optional
  .load()

getErrorCellValue ignore NotSupportedException option

It would be awesome if there were an option for errors in Excel sheets not to throw exceptions (NotSupportedException), but to be handled as null values, with a log message that certain cells have been skipped.

infer schema only reading first ten rows?

With version 0.9.6, when I do not provide a schema and set inferSchema to true, the inferred schema seems to be based on the first ten rows only.
This means that if a column contains only numbers in the first ten rows, but contains text (that cannot be parsed as a number) in later rows, then there will be a crash when trying to parse the text as a number.
If there is non-numeric text in the first ten rows of data then the column is correctly inferred as a string type and there is no problem.
I am pretty sure that in previous versions, the schema was inferred from all rows and this problem never occurred.
I have maxRowsInMemory set to 20
The problem can be seen with the online retail dataset
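
A possible workaround sketch, based on the excerptSize option documented in the readme above (using current option names; not a confirmed fix for this issue): raise the number of rows used for schema inference, or provide an explicit schema.

// Sketch: infer the schema from more than the default 10 rows; the file name is a placeholder
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("excerptSize", 1000) // infer from the first 1000 rows instead of 10
  .load("OnlineRetail.xlsx")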

0.8.3 Version - Parameter "location" is missing in options

Hi,
I am using Spark-Excel 0.8.3 version with Spark 1.6 and Scala 2.10 and getting this exception

Exception in thread "main" java.lang.IllegalArgumentException: Parameter "location" is missing in options.
at com.crealytics.spark.excel.DefaultSource.checkParameter(DefaultSource.scala:37)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:17)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at sparkExcelClass$delayedInit$body.apply(sparkExcelClass.scala:28)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at sparkExcelClass$.main(sparkExcelClass.scala:4)
at sparkExcelClass.main(sparkExcelClass.scala)

Thanks,
Srikanth

Custom schema example

Could you please give me an example of custom schema format to be used with this library?

FileNotFoundException when trying to use com.crealytics.spark.excel

My file is in s3 and I am trying to run this code from databricks. My s3 location is accessible by other libraries. Is it a bug?

val df = sqlContext.read
.format("com.crealytics.spark.excel")
//.option("sheetName", "2018.06.30 GLSFT1 Data Tape") // Required
.option("location","s3location")
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "true") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 0) // Optional, default: 0
.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.option("skipFirstRows", 1) // Optional, default None. If set skips the first n rows and checks for headers in row n+1
//.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("GLS_Flow_201806.xlsx")

Issues with date column in excel with format dd/mm/yyyy

I'm trying to create a dataframe from a large Excel file, and I have a column with a format like this:
mm/dd/yyyy
I've tried to create a custom schema for this column using StringType and DateType, but neither of these has worked.
If I don't set any schema, I receive an error like this: "Exception in thread "main" java.lang.IllegalStateException: Cannot get a STRING value from a NUMERIC cell"

I'm loading the dataframe like this:
val df = dataFrameReader.format("com.crealytics.spark.excel")
.option("sheetName", "Hoja1")
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "true") // Optional, default: true
.option("inferSchema", "true") // Optional, default: false
//.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 1) // Optional, default: 0
//.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 1000) // Optional, default None. If set, uses a streaming reader which can help with big files
//.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
//.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.schema(schema)
.load( s"hdfs://${fileName}")

My custom schema:
val schema = StructType(
List(
StructField("NroTramite", LongType, true),
StructField("FchAlta", DateType, true),
StructField("HraAlta", StringType, true),
StructField("UsrAlta", StringType, true),
StructField("Estado", StringType, true),
StructField("Proceso", StringType, true),
StructField("Cliente", LongType, true),
StructField("Razon Social", StringType, true),
StructField("RelCnt", LongType, true),
StructField("Mid", LongType, true),
StructField("Contrato", LongType, true),
StructField("PlanComercial", StringType, true),
StructField("Categ", StringType, true),
StructField("SubCateg", StringType, true),
StructField("Tecnologia", StringType, true),
StructField("AgPromotorPlnPrincipal", StringType, true),
StructField("AgPromotorProducto", StringType, true),
StructField("ProdExt", IntegerType, true),
StructField("NombreProdExterno", StringType, true),
StructField("AgVenta", IntegerType, true),
StructField("NombreAgVenta", StringType, true),
StructField("OrigSisExterno", IntegerType, true),
StructField("FlagAcraTercero", StringType, true),
StructField("RelCntquePaga", IntegerType, true)
)
)

The error that I get when I use this schema:
Exception in thread "main" java.lang.NumberFormatException: For input string: "Fch Alta"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:284)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
at com.crealytics.spark.excel.ExcelRelation.numericValue$lzycompute$1(ExcelRelation.scala:111)
at com.crealytics.spark.excel.ExcelRelation.numericValue$1(ExcelRelation.scala:108)

I look forward to your answers, thanks in advance.

aws s3 file system not supported

I was trying to create a dataframe from an Excel file kept in an AWS S3 bucket, but the API could not process the s3a URL. Somehow the s3a URL ends up in java.io.FileInputStream.open(), which only works with local filesystem files, not HDFS or S3. I checked, and there is a slash (/) missing from the path.

val data = sqlContext.read.
format("com.crealytics.spark.excel").
option("location", s3path).
option("useHeader", "true").
option("treatEmptyValuesAsNulls", "true").
option("inferSchema","true").
option("addColorColumns", "true").
load()

ERROR:
Name: java.io.FileNotFoundException
Message: s3a:/AKIAJYDDDDDDNA:A6DDDDDDDDDwqxkRqUQyXqqNOUsONDy@my-test/test.xlsx (No such file or directory)
StackTrace: at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:212)
at java.io.FileInputStream.(FileInputStream.java:152)

Not able to read excel from hadoop location

Hi,

When I am trying to read this Excel file from Hadoop using this API, it gives file not found. As an API meant to be compatible with Spark, this should work flawlessly with Hadoop files. Please look into the issue.
Code Snippet
DataFrame df = sqlContext.read().format("com.crealytics.spark.excel").option("location", "hdfs:////localhost:54310/home/impadmin/abc.xlsx").
option("sheetName", "Sheet1")
.option("useHeader", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true")
.option("addColorColumns", "true").load();

Exception:
java.io.FileNotFoundException: hdfs:/localhost:54310/home/impadmin/StreamAnalytix/uploadjar/AMEX.csv (No such file or dir
ectory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at java.io.FileInputStream.(FileInputStream.java:93)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:28)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)

input_file_name returns empty string

Hello,

I'm trying to create a dataframe from an Excel doc and then append a column with the input file's name as returned by input_file_name. Instead of returning the file path, input_file_name is returning an empty string.

Test code

from pyspark.sql.functions import input_file_name

df = (sql_context.read.format("com.crealytics.spark.excel")
      .option('sheetName', "Sheet1")
      .option('useHeader', True)
      .load("/home/btelle/test.xlsx")
      )

df = df.withColumn('file_name', input_file_name())
df.select('file_name').show(1)

Expected result

+--------------------+
|           file_name|
+--------------------+
|file:///home/btel...|
+--------------------+

Actual result

+---------+
|file_name|
+---------+
|         |
+---------+

Out of memory exception when reading big xlsx file

I am reading a big xlsx file of 100 MB with 28 sheets (10000 rows per sheet) and creating a single dataframe out of it. I am facing an out of memory exception when running in cluster mode. My code looks like this:

def buildDataframe(spark: SparkSession, filePath: String, requiresHeader: Boolean): DataFrame = {

  var excelDF: DataFrame = null
  var flag: Boolean = false

  // getting inputstream from s3 for all sheet names
  val s3FilePath = filePath.replaceAll(filePath.substring(0, filePath.indexOf("//") + 2), "")
  val s3Bucket = s3FilePath.substring(0, s3FilePath.indexOf("/"))
  val s3key = s3FilePath.substring(s3FilePath.indexOf("/") + 1, s3FilePath.length())
  val s3Client: AmazonS3 = new AmazonS3Client()
  val s3object: S3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, s3key))
  val inputStream: java.io.InputStream = s3object.getObjectContent()
  val workbook = new XSSFWorkbook(inputStream)
  var sno = workbook.getNumberOfSheets
  var sheetName: Array[String] = new Array[String](sno)
  for ((sheet, i) <- workbook.zipWithIndex) {
    val initialDF = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", requiresHeader)
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", true)
      .option("addColorColumns", false)
      .option("sheetName", sheet.getSheetName)
      .option("path", filePath)
      .load()

    if (flag == false) {
      excelDF = initialDF
      flag = true
    } else {
      excelDF = excelDF.union(initialDF)
    }
  }

  return excelDF
}

Any help is much appreciated.
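
A possible mitigation sketch, based on the maxRowsInMemory option documented in the readme above (an assumption for this workload, not a verified fix): use the streaming reader so entire sheets are not held in memory while reading.

// Sketch: streaming read of a single sheet; sheet and file names are placeholders
val sheetDF = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'Sheet1'!A1")
  .option("header", "true")
  .option("maxRowsInMemory", 20) // streaming reader, helps with big xlsx files
  .load("s3a://my-bucket/big-file.xlsx")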

ExcelFileSaver.toCell() doesn't handle null values in DataFrame

When a DataFrame row has a null value in one of the columns, a scala.MatchError exception gets thrown by toCell. I am writing with the following write calls:

    dataFrame.write
      .format("com.crealytics.spark.excel")
      .option("sheetName", config.sheetName)
      .option("treatEmptyValuesAsNulls", true)
      .option("useHeader", true)
      .mode("overwrite")
      .save(config.path)

This is reproducible by loading in an Excel or CSV file with blanks in any of the columns and attempting to save it. I added the treatEmptyValuesAsNulls just to make sure I wasn't missing an option but looking at the code it seems that a case needs to be added to handle if a == null. It is also possible I'm doing something wrong here, and if so, I apologize!

Compatibility Issue

Do we have any Excel or spreadsheet processing dependency jar for Spark version 1.6?

I am facing the error
"java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;
at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:113)"

while processing it with Spark 1.6.

ExcelRelation read file in driver

Hi,
Basically, the buildScan method is run on the driver, not on executors.
It processes the file and then sends its content through parallelize(), which is also bad practice for large data.

Options:

  1. Use sc.hadoopFile(...) with a custom ExcelInputFormat.
     Providing parameters can be problematic because it uses reflection to create the InputFormat instance.

  2. Use sc.parallelize(..).flatMap( // read the file, get List[Row] )
     This will read the file on an executor. Unfortunately, only one executor.

Reading an Excel file with a multi-line header throws an exception.

When I tried to read a file with a multi-line header, the QueryExecution of the dataset threw an exception: Method threw 'scala.MatchError' exception. Cannot evaluate org.apache.spark.sql.execution.QueryExecution.toString()

Steps to reproduce:

SparkSession session = SparkSession.builder().getOrCreate();
String path = "testMultiLineHeader.xlsx";
Dataset<Row> dsBal = session.read().format("com.crealytics.spark.excel")
        .option("location", path)
        .option("sheetName", "Feuil1")
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "true")
        .option("addColorColumns", "false")
        .load(path);
dsBal.show();

Example file that cause this error : file that cause this error : https://files.fm/u/jjjvkvxe#/view/testMultiLineHeader.xlsx

Full stack:

scala.MatchError: Nom de l'entité <-- Name of the header first line.
Name of the entity <-- Name of the header second line.
(of class java.lang.String)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$2.apply(ExcelRelation.scala:122)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$2.apply(ExcelRelation.scala:120)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:120)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:393)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3248)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
    at Main.main(Main.java:33)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.microsoft.azure.hdinsight.spark.mock.SparkLocalRunner.runJobMain(SparkLocalRunner.java:75)
    at com.microsoft.azure.hdinsight.spark.mock.SparkLocalRunner.main(SparkLocalRunner.java:48)

How to infer schema for Int column types

I am trying to dig into your code, but I am relatively new to Scala, so I don't get why this can't infer the schema for an Integer column. Is there any specific reason you excluded this type?
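
As a workaround sketch, grounded in the custom-schema example from the readme above (column names and file name are placeholders): an explicit schema with IntegerType can be supplied instead of relying on inference.

import org.apache.spark.sql.types._

val mySchema = StructType(Array(
  StructField("Name", StringType, nullable = true),
  StructField("Age", IntegerType, nullable = true)))

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .schema(mySchema) // explicit IntegerType instead of an inferred type
  .load("People.xlsx")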

Reject bad data rows

I have a Scala app utilizing spark-excel 0.9.17 and am interested in rejecting rows that contain bad data as defined by the schema. E.g. if a row has a column with a string in an Integer column, reject the row. It seems, at least in the case of Integer types, non-numeric values are converted to zero, and the row is loaded into the data frame. Is there anything similar to mode=DROPMALFORMED used for importing csv files that can be used in this case? Such that in the example below, the second row would be rejected instead of converting the "x" to zero? Thanks.

INPUT:

Name | Age | Comment
Joe | 25 | A
John | x | B
Bill | 35 | C

OUTPUT:

dfValid.printSchema()
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Comment: string (nullable = true)

dfValid.show()
+----+---+-------+
|Name|Age|Comment|
+----+---+-------+
| Joe| 25|      A|
|John|  0|      B|
|Bill| 35|      C|
+----+---+-------+

I am unable to fix the issue with the code created by Martin Mauch (nightscape). I am also trying to create Scala code for landing Excel data from SharePoint into Hadoop and then converting it into Parquet format. I am getting an error saying value InferSchema not found.

import java.math.BigDecimal
import java.sql.{Date, Timestamp}
import java.text.NumberFormat
import java.util.Locale
import java.io._
import scala.collection.JavaConverters._
import org.apache.poi.ss.usermodel.{WorkbookFactory, Row => SheetRow, Cell, DataFormatter}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

import scala.util.Try

case class ExcelRelation(
  location: String,
  sheetName: String,
  useHeader: Boolean,
  treatEmptyValuesAsNulls: Boolean,
  inferSheetSchema: Boolean,
  addColorColumns: Boolean = true,
  userSchema: StructType = null
)
(@transient val sqlContext: SQLContext)
extends BaseRelation with TableScan with PrunedScan {
  val workbook = WorkbookFactory.create(new FileInputStream(location))
  val sheet = workbook.getSheet(sheetName)
  val headers = sheet.getRow(0).cellIterator().asScala.to[Vector]
  override val schema: StructType = inferSchema
  val dataFormatter = new DataFormatter()
  override def buildScan: RDD[Row] = buildScan(schema.map(_.name).toArray)

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val lookups = requiredColumns.map { c =>
      val columnNameRegex = s"(.*?)(_color)?".r
      val columnNameRegex(columnName, isColor) = c
      val columnIndex = schema.indexWhere(_.name == columnName)

      val cellExtractor: Cell => Any = if (isColor == null) {
        { cell: Cell =>
          val value = cell.getCellType match {
            case Cell.CELL_TYPE_NUMERIC => cell.getNumericCellValue
            case Cell.CELL_TYPE_BOOLEAN => cell.getBooleanCellValue
            case Cell.CELL_TYPE_STRING => cell.getStringCellValue
            case t => throw new RuntimeException(s"Unknown cell type $t for $cell")
          }
          castTo(value.toString, schema(columnIndex).dataType)
        }
      } else {
        _.getCellStyle.getFillForegroundColorColor match {
          case null => ""
          case c: org.apache.poi.xssf.usermodel.XSSFColor => c.getARGBHex
          case c => throw new RuntimeException(s"Unknown color type $c: ${c.getClass}")
        }
      }
      { row: SheetRow =>
        val cell = row.getCell(columnIndex)
        if (cell == null) {
          null
        } else {
          cellExtractor(cell)
        }
      }
    }.to[Vector]
    val rows = sheet.rowIterator().asScala.drop(1).map(row => lookups.map(l => l(row)))
    val result = rows.to[Vector]

    sqlContext.sparkContext.parallelize(result.map(Row.fromSeq))
  }

  // Cast a String to a Spark Data Type
  private def castTo(datum: String, castType: DataType): Any = {
    castType match {
      case _: ByteType => datum.toByte
      case _: ShortType => datum.toShort
      case _: IntegerType => datum.toInt
      case _: LongType => datum.toLong
      case _: FloatType => Try(datum.toFloat)
        .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).floatValue())
      case _: DoubleType => Try(datum.toDouble)
        .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).doubleValue())
      case _: BooleanType => datum.toBoolean
      case _: DecimalType => new BigDecimal(datum.replaceAll(",", ""))
      case _: TimestampType => Timestamp.valueOf(datum)
      case _: DateType => Date.valueOf(datum)
      case _: StringType => datum
      case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
    }
  }

  private def rowsRdd: RDD[SheetRow] = {
    parallelize(sheet.rowIterator().asScala.toSeq)
  }
  private def parallelize[T: scala.reflect.ClassTag](seq: Seq[T]): RDD[T] = sqlContext.sparkContext.parallelize(seq)
  private def inferSchema(): StructType = {
    if (this.userSchema != null) {
      userSchema
    } else {
      val firstRow = headers
      val header = if (useHeader) {
        firstRow.map(_.getStringCellValue)
      } else {
        firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
      }
      val baseSchema = if (this.inferSheetSchema) {
        val stringsAndCellTypes = sheet.rowIterator.asScala.drop(1).map(_.cellIterator.asScala.map(c => c.getCellType).toVector).toVector
        InferSchema(parallelize(stringsAndCellTypes), header.toArray)
      } else {
        // By default fields are assumed to be StringType
        val schemaFields = header.map { fieldName =>
          StructField(fieldName.toString, StringType, nullable = true)
        }
        StructType(schemaFields)
      }
      if (addColorColumns) {
        header.foldLeft(baseSchema) { (schema, header) => schema.add(s"${header}_color", StringType, nullable = true) }
      } else {
        baseSchema
      }
    }
  }
}

saving at HDFS errors

Whenever I try something like:

notInGenAge
  .write
  .format("com.crealytics.spark.excel")
  .option("sheetName", "NotInGenAge")
  .option("useHeader", "true")
  .mode("overwrite")
  .save(ml + "/genage.xlsx")

where ml is HDFS folder. I get

java.nio.channels.ClosedChannelException
  at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:291)
  at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:147)
  at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:537)
  at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:492)
  at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
  at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:41)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:69)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 109 elided

Error when reading excel-2010

If I just use apache poi to read excel-2010, that will be ok. But with spark-excel, it throws the following exception:

The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)
org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException: The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)
at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:179)
at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:237)
at org.apache.poi.openxml4j.opc.ZipPackage.(ZipPackage.java:134)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:295)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:201)
at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:111)
at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:91)
at com.monitorjbl.xlsx.StreamingReader$Builder.open(StreamingReader.java:263)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:57)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:52)
at scala.Option.map(Option.scala:146)
at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:52)
at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:63)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:270)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:269)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:97)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:

Not reading all columns

I am reading a sheet from an Excel file which has 6 columns, but I am getting only 5 columns in the resulting data frame. I set the start and end columns to 0 and 5, and I am still not getting the last column in the data frame. This code was working before.
val df = spark.read.format("com.crealytics.spark.excel")
.option("sheetName","MRI_NPM")
.option("useHeader", "false")
.option("startColumn",0)
.option("endColumn",5)
.load(configFilePath)

Allow reading only a subset of rows

This would be useful for the project I work on.
We first ingest only 1,000 rows (inferring the schema) and show the user a "preview" of the dataset.
The user can then override column types/names before ingesting the entire dataset.
If the dataset is large then it is much faster to be able to ingest only the first 1,000 records at the preview stage.
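
A partial workaround sketch, based on the dataAddress cell-range style documented above (sheet name, column range and file name are placeholders; this only limits the rows that are read for the preview):

// Sketch: read only the first 1000 data rows (plus header) for a preview
val previewDF = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'Sheet1'!A1:Z1001")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dataset.xlsx")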

Allow reading multiple files specified as a list OR by a pattern

Currently the Excel library doesn't allow reading multiple files specified as a pattern or as a list (the formats are shown below). These formats are supported for reading delimited files by DataFrameReader.

reader.format( "com.crealytics.spark.excel" )
.option( "useHeader", true )
.option( "timestampFormat", "MM-dd-yyyy HH:mm:ss" )
.load( "/Data*.xlsx" );
OR

reader.format( "com.crealytics.spark.excel" )
.option( "useHeader", true )
.option( "timestampFormat", "MM-dd-yyyy HH:mm:ss" )
.load( "/Data1.xlsx", "/Data2.xlsx" );

It would be really nice to have above features.
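
For reference, the V2 implementation described in the readme above supports loading from multiple files via a folder path. A minimal sketch, assuming the files live under a common folder (the path is a placeholder):

// Sketch: the "excel" (V2) data source accepts a folder path, like csv or parquet
val df = spark.read
  .format("excel")
  .option("header", "true")
  .load("/data/excel-files")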

NullPointerException when the first row of a sheet has only empty fields.

Hi everybody. I found the following issue using the release 0.8.2 of the library for Scala 2.11 (and Spark 1.6.2): when the selected sheet to transform to DataFrame has the first row containing only empty fields it throws a NullPointerException:

17/03/21 23:03:49 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NullPointerException
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:30)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at com.optum.sparkexcelpoc.DataFrameFromExcel$.main(DataFrameFromExcel.scala:30)
at com.optum.sparkexcelpoc.DataFrameFromExcel.main(DataFrameFromExcel.scala)

Browsing the source code, I found the root cause of the problem and how to fix it. Please let me know if you're happy to have me fix it and what procedure you follow to accept changes. Thanks a lot.

Excel writing error com.crealytics.spark.excel.DefaultSource does not allow create table as select

Hi,
We are using com.crealytics.spark.excel in my Java-Spark project. It is working well, but when I come to write the data to an Excel file I get this error:
com.crealytics.spark.excel.DefaultSource does not allow create table as select.

This is the code used to write to the Excel file:
dataFileContent.write()
    .format("com.crealytics.spark.excel")
    .option("sheetName", "Daily")
    .option("useHeader", "true")
    .mode("overwrite")
    .save("D:\\xls\\Book1.xlsx");

skip rows from excel sheet

Hi ,

My Excel file has rows at the top which need to be skipped. Can I do that using this library?

Thanks
Sri
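
A possible approach sketch using the dataAddress option documented in the readme above (sheet name, start cell and file name are placeholders): start the read at a lower anchor cell so the leading rows are skipped.

// Sketch: start reading at row 4, skipping the first three rows of the sheet
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'Sheet1'!A4")
  .option("header", "true")
  .load("report.xlsx")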

Facing issue while reading xlsx.

18/06/20 15:22:47 INFO util.Version: Elasticsearch Hadoop v6.2.2 [102cfaa226]
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.poi.ss.usermodel.Workbook.close()V
at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:81)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:270)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:269)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:97)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at fire.samplecode.ReadExcel$.main(ReadExcel.scala:31)
at fire.samplecode.ReadExcel.main(ReadExcel.scala)
18/06/20 15:23:02 INFO spark.SparkContext: Invoking stop() from shutdown hook
18/06/20 15:23:02 INFO server.ServerConnector: Stopped ServerConnector@200a26bc{HTTP/1.1}{0.0.0.0:4040}

Error when an Excel file has a cell with a formula

Hi,
My stack:
-scala 2.10
-spark 1.6
-spark-excel_2.10 version : 0.8.3

My sheet has a cell with a formula, and I get this error:
java.lang.RuntimeException: Unknown cell type 2 for NOW()+A2
at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:66)
To correct this issue, I got the latest version of the source, changed the Scala version to match my stack, and updated ExcelRelation.scala.

Could you please check my update and push a new spark-excel_2.10 version 0.8.3 to Maven Central?

See attachment.
ExcelRelation.zip

Regard

Saving arrays, structs, maps & objects

It seems that saving a DataFrame with any arrays, structs, maps or objects does not work. I'm thinking they'll need to be added to ExcelFileSaver in toCell.

I can take on the work within a few days most likely. What I envision if you had an array of Strings like:

["may", "the", "force", "be", "with", "you"]

It would be presented in the resulting Excel sheet as:

quoteWords
may, the, force, be, with, you

For Maps and Structs, maybe just putting a JSON representation of their contents?

If they're objects, I guess just using toString on them would be enough? Any input would be great, and then I can create a PR with the changes.

Temporary files not being deleted

When reading an Excel file, excel-streaming-reader creates temporary excel files in the system temp directory. These get cleaned up when the workbook object is closed. The workbook object is supposed to be autoclosable but this doesn't seem to be working, at least for me. I think that the fix is to explicitly close the workbook object.

Writing a large Dataset into an Excel file causes `java.lang.OutOfMemoryError: GC overhead limit exceeded`

Hi All,

I am trying to write a relatively large Spark Dataset (> 50000 rows) into an Excel file using spark-excel as follows:

DataFrameWriter<Row> writer = dataset.write()
                .format("com.crealytics.spark.excel")
                .option("sheetName", sheets.get(0)) // Only first sheet
                .option("useHeader", useHeader)
                .mode("overwrite");

This works fine if the Dataset contains around 20000 rows, but from approximately 35000 rows, the code yields java.lang.OutOfMemoryError: GC overhead limit exceeded. It seems that the easiest temporary solution would be bumping up the allocated memory for JVM, but I would like to avoid that since bigger Dataset may come in and cause the same issue. Any thoughts on this? Thanks.
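
A possible angle to explore, sketched from the V2 folder-writing support described in the readme above (not a verified fix for the memory issue; the output path is a placeholder): write with the V2 "excel" format to a folder path rather than a single file.

// Sketch (Scala): V2 write to a folder path, as described in the V2 section above
dataset.write
  .format("excel")
  .option("header", "true")
  .mode("overwrite")
  .save("/output/excel-export")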

Upgrade to POI 4.0.0

I created #80 - the changes are pretty small.

I have forked the xlsx-streamer to support poi 4.0.0 but the APIs are basically the same.
The spoiwo library will probably also need to be upgraded when they publish a poi 4.0.0 compatible jar (they've already merged a change to support this).

java.lang.NoSuchMethodError while trying to display the Dataframe using 'df.show'

I am new to Spark. I have an Excel file that I need to read into a DataFrame. I am using the com.crealytics.spark.excel library to achieve this. The following is my code:

     val df = hiveContext.read.format("com.crealytics.spark.excel")
             .option("useHeader", "true")
             .option("treatEmptyValuesAsNulls", "true")
             .load("file:///data/home/items.xlsx")

The above code runs without any error. And I am also able to count the number of rows in the df using df.count. But when I try to print the df using df.show, it throws an error saying:

java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;

I also tried using the StructType to define the schema and impose it during the loading the data into df:

 val newschema = StructType(List(StructField("1", StringType, nullable = true),
      StructField("2", StringType, nullable = true),
      StructField("3", StringType, nullable = true),
      StructField("4", StringType, nullable = true),
      StructField("5", StringType, nullable = true),
      StructField("6", StringType, nullable = true),
      StructField("7", StringType, nullable = true),
      StructField("8", StringType, nullable = true),
      StructField("9", StringType, nullable = true),
      StructField("10", StringType, nullable = true)))
val df = hiveContext.read.schema(newschema).format("com.crealytics.spark.excel")...

This doesn't help and I get the same error as before when I try to display the df.

I am using Spark 1.6, Java 1.8 and scala 2.10.5.
I am not sure why this is happening. How do I solve this error and look at the data in the df ?

Any help would be appreciated. Thank you.

Maturity / Production Readiness

Hi Martin,

I am enthusiastic about your library ☺︎

May I ask your opinion about the production readiness or maturity of your library? Would you recommend using it in production?

Thanks, Mark

ERROR: java.lang.IllegalStateException: Cannot get a text value from a numeric cell

I used this lib, but got an error with apache.poi. It cannot read an Excel file containing numbers.

[info]   java.lang.IllegalStateException: Cannot get a text value from a numeric cell
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.typeMismatch(HSSFCell.java:640)
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.getRichStringCellValue(HSSFCell.java:717)
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.getStringCellValue(HSSFCell.java:700)
[info]   at com.crealytics.spark.excel.ExcelRelation$$anonfun$7.apply(ExcelRelation.scala:117)
[info]   at com.crealytics.spark.excel.ExcelRelation$$anonfun$7.apply(ExcelRelation.scala:117)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
[info]   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[info]   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

excel sample:

Michael	29
Andy	30
Justin	19

FileNotFoundException when try to launch Job with master yarn and deploy-mode cluster

I am using a Hadoop cluster with the Yarn resource manager. When I run the job in fully distributed mode (master yarn and deploy-mode cluster), the job fails with the exception below. However, when I change deploy-mode to client, it runs successfully.

ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) Caused by: java.io.FileNotFoundException: /mapr/Helios/cxpdi/datasets/test/data/TEST/INPUT/data.xls (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.(FileInputStream.java:138) at java.io.FileInputStream.(FileInputStream.java:93) at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:28) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)

I would suggest using the org.apache.hadoop.fs.FileSystem API instead of java.io.FileInputStream to open the stream when the master is yarn and deploy-mode is cluster.

Dataframe write gives error in createRelation method

Hi,
I am trying to write dataframe to a file and getting following exception.

java.lang.AbstractMethodError: com.crealytics.spark.excel.DefaultSource.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/DataFrame;)Lorg/apache/spark/sql/sources/BaseRelation;
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
at org.apache.spark.sql.DataFrameWriter.dataSource$lzycompute$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter.org$apache$spark$sql$DataFrameWriter$$dataSource$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply$mcV$sp(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.executeAndCallQEListener(DataFrameWriter.scala:154)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:172)

Thanks,
Srikanth

formulas in sheets do not appear to be supported

The values read from cells in a sheet that are of type FORMULA are evaluated as the formula itself rather than its cached value.

Additionally the infer schema option fails on cells of type FORMULA.

OK to submit a PR for this?

pyspark.sql.types.DecimalType

I get the following error when trying to show a dataframe. It's OK in the case of FloatType.

: java.lang.NumberFormatException
	at java.math.BigDecimal.<init>(BigDecimal.java:494)
	at java.math.BigDecimal.<init>(BigDecimal.java:383)
	at java.math.BigDecimal.<init>(BigDecimal.java:806)
	at com.crealytics.spark.excel.ExcelRelation.bigDecimal$lzycompute$1(ExcelRelation.scala:114)
	at com.crealytics.spark.excel.ExcelRelation.bigDecimal$1(ExcelRelation.scala:114)
	at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$castTo(ExcelRelation.scala:123)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:76)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:76)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$apply$2.apply(ExcelRelation.scala:89)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$apply$2.apply(ExcelRelation.scala:84)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$5$$anonfun$apply$3.apply(ExcelRelation.scala:94)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$5$$anonfun$apply$3.apply(ExcelRelation.scala:94)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
