
Spark Excel Library

A library for querying Excel files with Apache Spark, for Spark SQL and DataFrames.


Co-maintainers wanted

Due to personal and professional constraints, the development of this library has been rather slow. If you find value in this library, please consider stepping up as a co-maintainer by leaving a comment here. Help is very welcome e.g. in the following areas:

  • Additional features
  • Code improvements and reviews
  • Bug analysis and fixing
  • Documentation improvements
  • Build / test infrastructure

Requirements

This library requires Spark 2.0+.

The following Spark versions are automatically tested:

spark: ["2.4.1", "2.4.7", "2.4.8", "3.0.1", "3.0.3", "3.1.1", "3.1.2", "3.2.4", "3.3.2", "3.4.1"]

For more details, please refer to the project CI configuration: ci.yml

Linking

You can link against this library in your program at the following coordinates:

Scala 2.12

groupId: com.crealytics
artifactId: spark-excel_2.12
version: <spark-version>_0.18.0

Scala 2.11

groupId: com.crealytics
artifactId: spark-excel_2.11
version: <spark-version>_0.13.7

Using with Spark shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

Spark compiled with Scala 2.12

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:<spark-version>_0.18.0

Spark compiled with Scala 2.11

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:<spark-version>_0.13.7

Features

  • This package allows querying Excel spreadsheets as Spark DataFrames.
  • From spark-excel 0.14.0 (August 24, 2021), there are two implementations of spark-excel:
    • The original Spark-Excel, based on the Spark data source API V1
    • Spark-Excel V2, based on the data source API V2, which supports loading from multiple files, corrupted-record handling and some improvements in handling data types. See below for further details.

To use the V2 implementation, just change your .format from .format("com.crealytics.spark.excel") to .format("excel"). See below for some details.

See the changelog for the latest features, fixes, etc.

Scala API

Spark 2.0+:

Create a DataFrame from an Excel file

import org.apache.spark.sql._

val spark: SparkSession = ???
val df = spark.read
    .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
    .option("dataAddress", "'My Sheet'!B3:C35") // Optional, default: "A1"
    .option("header", "true") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("setErrorCellsToFallbackValues", "true") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
    .option("usePlainNumberFormat", "false") // Optional, default: false, If true, format the cells without rounding and scientific notations
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "true") // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
    .option("maxByteArraySize", 2147483647) // Optional, default None. See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-
    .option("tempFileThreshold", 10000000) // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead
    .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    .option("workbookPassword", "pass") // Optional, default None. Requires unlimited strength JCE for older JVMs
    .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
    .load("Worktime.xlsx")

For convenience, there is an implicit that wraps the DataFrameReader returned by spark.read and provides a .excel method which accepts all possible options and provides default values:

import org.apache.spark.sql._
import com.crealytics.spark.excel._

val spark: SparkSession = ???
val df = spark.read.excel(
    header = true,  // Required
    dataAddress = "'My Sheet'!B3:C35", // Optional, default: "A1"
    treatEmptyValuesAsNulls = false,  // Optional, default: true
    setErrorCellsToFallbackValues = false, // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
    usePlainNumberFormat = false,  // Optional, default: false. If true, format the cells without rounding and scientific notations
    inferSchema = false,  // Optional, default: false
    addColorColumns = true,  // Optional, default: false
    timestampFormat = "MM-dd-yyyy HH:mm:ss",  // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    maxRowsInMemory = 20,  // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
    maxByteArraySize = 2147483647,  // Optional, default None. See https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-
    tempFileThreshold = 10000000, // Optional, default None. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead
    excerptSize = 10,  // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    workbookPassword = "pass"  // Optional, default None. Requires unlimited strength JCE for older JVMs
).schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
 .load("Worktime.xlsx")

If the sheet name is unavailable, it is possible to pass in an index:

val df = spark.read.excel(
  header = true,
  dataAddress = "0!B3:C35"
).load("Worktime.xlsx")

or to read in the names dynamically:

import com.crealytics.spark.excel.WorkbookReader
val sheetNames = WorkbookReader( Map("path" -> "Worktime.xlsx")
                               , spark.sparkContext.hadoopConfiguration
                               ).sheetNames
val df = spark.read.excel(
  header = true,
  dataAddress = sheetNames(0)
).load("Worktime.xlsx")

Create a DataFrame from an Excel file using custom schema

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val peopleSchema = StructType(Array(
    StructField("Name", StringType, nullable = false),
    StructField("Age", DoubleType, nullable = false),
    StructField("Occupation", StringType, nullable = false),
    StructField("Date of birth", StringType, nullable = false)))

val spark: SparkSession = ???
val df = spark.read
    .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
    .option("dataAddress", "'Info'!A1")
    .option("header", "true")
    .schema(peopleSchema)
    .load("People.xlsx")

Write a DataFrame to an Excel file

import org.apache.spark.sql._

val df: DataFrame = ???
df.write
  .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
  .option("dataAddress", "'My Sheet'!B3:C35")
  .option("header", "true")
  .option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
  .option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
  .mode("append") // Optional, default: overwrite.
  .save("Worktime2.xlsx")

Data Addresses

As you can see in the examples above, the location of the data to read or write can be specified with the dataAddress option. Currently the following address styles are supported (a combined sketch follows the list):

  • B3: Start cell of the data. Reading will return all rows below and all columns to the right. Writing will start here and use as many columns and rows as required.
  • B3:F35: Cell range of data. Reading will return only rows and columns in the specified range. Writing will start in the first cell (B3 in this example) and use only the specified columns and rows. If there are more rows or columns in the DataFrame to write, they will be truncated. Make sure this is what you want.
  • 'My Sheet'!B3:F35: Same as above, but with a specific sheet.
  • MyTable[#All]: Table of data. Reading will return all rows and columns in this table. Writing will only write within the current range of the table. No growing of the table will be performed. PRs to change this are welcome.
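
A combined sketch of two of these styles follows; the sheet name 'My Sheet', the table name MyTable and the file name Worktime.xlsx are placeholders, not part of the library's API:

// Read only the cell range B3:F35 on a specific sheet
val rangeDF = spark.read
  .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
  .option("dataAddress", "'My Sheet'!B3:F35")
  .option("header", "true")
  .load("Worktime.xlsx")

// Read an existing named table in the workbook
val tableDF = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "MyTable[#All]")
  .option("header", "true")
  .load("Worktime.xlsx")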

Excel API based on DataSourceV2

The V2 API offers several improvements for file and folder handling and works in a very similar way to built-in data sources such as csv and parquet.

To use the V2 implementation, just change your .format from .format("com.crealytics.spark.excel") to .format("excel").

The big difference is that you provide a path to read from or write to, rather than a single individual file:

dataFrame.write
        .format("excel")
        .save("some/path")
spark.read
        .format("excel")
        // ... insert excel read specific options you need
        .load("some/path")

Because folders are supported, you can read from and write to a "partitioned" folder structure, in the same way as csv or parquet. Note that writing partitioned structures is only available for Spark >= 3.0.1:

dataFrame.write
        .partitionBy("col1")
        .format("excel")
        .save("some/path")

Need some more examples? Check out the test cases or have a look at our wiki

Building From Source

This library is built with Mill. To build a JAR file simply run e.g. mill spark-excel[2.13.10,3.3.1].assembly from the project root, where 2.13.10 is the Scala version and 3.3.1 the Spark version. To list all available combinations of Scala and Spark, run mill resolve spark-excel[__].


spark-excel's People

Contributors

christianknoepfle, cristichircu, cvs-mckinsey, da-liii, danielpade-zmi, davidw76, dependabot[bot], dzlab, enverosmanov, flo076, gdubya, gwdgithubnom, habelson, kholodilov, nightscape, panaeon, pjfanning, quanghgx, scala-steward, sereneant, shoffing, skyler3d, syndic8-stephen, waiyan1612, williamdphillips


spark-excel's Issues

option for inserting userSchema is not possible

Hi, I am trying to read a date field from Excel, but your API doesn't seem to support this. So I tried to provide a user-defined schema as shown in the example in the readme, but I am still not able to pass it, since the schema is an object, not a string.

val df = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", "Worktime.xlsx")
  .option("sheetName", "Daily")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  // .option("userSchema", StructType(???)) // Optional // Not possible to pass the schema as a string
  .option("inferSchema", "true")
  .option("addColorColumns", "true")
  .option("startColumn", 0) // Optional
  .option("endColumn", 99) // Optional
  .load()

getErrorCellValue ignore NotSupportedException option

It would be awesome if there were an option for errors in Excel sheets not to throw exceptions (NotSupportedException), but to be handled as null values, with a log message that certain cells have been skipped.

infer schema only reading first ten rows?

With version 0.9.6, when I do not provide a schema and set inferSchema to true, the inferred schema seems to be based on the first ten rows only.
This means that if a column contains only numbers in the first ten rows, but contains text (that cannot be parsed as a number) in later rows, then there will be a crash when trying to parse the text as a number.
If there is non-numeric text in the first ten rows of data then the column is correctly inferred as a string type and there is no problem.
I am pretty sure that in previous versions, the schema was inferred from all rows and this problem never occurred.
I have maxRowsInMemory set to 20
The problem can be seen with the online retail dataset
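
A possible workaround sketch, based on the excerptSize option documented in the readme above (using current option names; not a confirmed fix for this issue): raise the number of rows used for schema inference, or provide an explicit schema.

// Sketch: infer the schema from more than the default 10 rows; the file name is a placeholder
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("excerptSize", 1000) // infer from the first 1000 rows instead of 10
  .load("OnlineRetail.xlsx")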

0.8.3 Version - Parameter "location" is missing in options

Hi,
I am using Spark-Excel 0.8.3 version with Spark 1.6 and Scala 2.10 and getting this exception

Exception in thread "main" java.lang.IllegalArgumentException: Parameter "location" is missing in options.
at com.crealytics.spark.excel.DefaultSource.checkParameter(DefaultSource.scala:37)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:17)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at sparkExcelClass$delayedInit$body.apply(sparkExcelClass.scala:28)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at sparkExcelClass$.main(sparkExcelClass.scala:4)
at sparkExcelClass.main(sparkExcelClass.scala)

Thanks,
Srikanth

Custom schema example

Could you please give me an example of custom schema format to be used with this library?

FileNotFoundException when trying to use com.crealytics.spark.excel

My file is in s3 and I am trying to run this code from databricks. My s3 location is accessible by other libraries. Is it a bug?

val df = sqlContext.read
.format("com.crealytics.spark.excel")
//.option("sheetName", "2018.06.30 GLSFT1 Data Tape") // Required
.option("location","s3location")
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "true") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 0) // Optional, default: 0
.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.option("skipFirstRows", 1) // Optional, default None. If set skips the first n rows and checks for headers in row n+1
//.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("GLS_Flow_201806.xlsx")

Issues with date column in excel with format dd/mm/yyyy

I'm trying to create a dataframe from a large Excel file, and I have a column with a format like this:
mm/dd/yyyy
I've tried to create a custom schema for this column using StringType and DateType, but neither of these has worked.
If I don't set any schema, I receive an error like this: "Exception in thread "main" java.lang.IllegalStateException: Cannot get a STRING value from a NUMERIC cell"

I'm loading the dataframe like this:
val df = dataFrameReader.format("com.crealytics.spark.excel")
.option("sheetName", "Hoja1")
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "true") // Optional, default: true
.option("inferSchema", "true") // Optional, default: false
//.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 1) // Optional, default: 0
//.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 1000) // Optional, default None. If set, uses a streaming reader which can help with big files
//.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
//.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.schema(schema)
.load( s"hdfs://${fileName}")

My custom schema:
val schema = StructType(
List(
StructField("NroTramite", LongType, true),
StructField("FchAlta", DateType, true),
StructField("HraAlta", StringType, true),
StructField("UsrAlta", StringType, true),
StructField("Estado", StringType, true),
StructField("Proceso", StringType, true),
StructField("Cliente", LongType, true),
StructField("Razon Social", StringType, true),
StructField("RelCnt", LongType, true),
StructField("Mid", LongType, true),
StructField("Contrato", LongType, true),
StructField("PlanComercial", StringType, true),
StructField("Categ", StringType, true),
StructField("SubCateg", StringType, true),
StructField("Tecnologia", StringType, true),
StructField("AgPromotorPlnPrincipal", StringType, true),
StructField("AgPromotorProducto", StringType, true),
StructField("ProdExt", IntegerType, true),
StructField("NombreProdExterno", StringType, true),
StructField("AgVenta", IntegerType, true),
StructField("NombreAgVenta", StringType, true),
StructField("OrigSisExterno", IntegerType, true),
StructField("FlagAcraTercero", StringType, true),
StructField("RelCntquePaga", IntegerType, true)
)
)

The error that I get when I use this schema:
Exception in thread "main" java.lang.NumberFormatException: For input string: "Fch Alta"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:284)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
at com.crealytics.spark.excel.ExcelRelation.numericValue$lzycompute$1(ExcelRelation.scala:111)
at com.crealytics.spark.excel.ExcelRelation.numericValue$1(ExcelRelation.scala:108)

I look forward to your answers, thanks in advance.

aws s3 file system not supported

I was trying to create a dataframe from an Excel file kept in an AWS S3 bucket, but the API could not process the s3a URL. Somehow the s3a URL ends up in java.io.FileInputStream.open(), which only works with local filesystem files, not HDFS or S3. I checked, and there is a slash (/) missing from the path.

val data = sqlContext.read.
format("com.crealytics.spark.excel").
option("location", s3path).
option("useHeader", "true").
option("treatEmptyValuesAsNulls", "true").
option("inferSchema","true").
option("addColorColumns", "true").
load()

ERROR:
Name: java.io.FileNotFoundException
Message: s3a:/AKIAJYDDDDDDNA:A6DDDDDDDDDwqxkRqUQyXqqNOUsONDy@my-test/test.xlsx (No such file or directory)
StackTrace: at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:212)
at java.io.FileInputStream.(FileInputStream.java:152)

Not able to read excel from hadoop location

Hi,

When I am trying to read this Excel file from Hadoop using this API, it gives file not found. As an API meant to be compatible with Spark, this should work flawlessly with Hadoop files. Please look into the issue.
Code Snippet
DataFrame df = sqlContext.read().format("com.crealytics.spark.excel").option("location", "hdfs:////localhost:54310/home/impadmin/abc.xlsx").
option("sheetName", "Sheet1")
.option("useHeader", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true")
.option("addColorColumns", "true").load();

Exception:
java.io.FileNotFoundException: hdfs:/localhost:54310/home/impadmin/StreamAnalytix/uploadjar/AMEX.csv (No such file or dir
ectory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at java.io.FileInputStream.(FileInputStream.java:93)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:28)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)

input_file_name returns empty string

Hello,

I'm trying to create a dataframe from an Excel doc and then append a column with the input file's name as returned by input_file_name. Instead of returning the file path, input_file_name is returning an empty string.

Test code

from pyspark.sql.functions import input_file_name

df = (sql_context.read.format("com.crealytics.spark.excel")
      .option('sheetName', "Sheet1")
      .option('useHeader', True)
      .load("/home/btelle/test.xlsx")
      )

df = df.withColumn('file_name', input_file_name())
df.select('file_name').show(1)

Expected result

+--------------------+
|           file_name|
+--------------------+
|file:///home/btel...|
+--------------------+

Actual result

+---------+
|file_name|
+---------+
|         |
+---------+

Out of memory exception when reading big xlsx file

I am reading a big xlsx file of 100 MB with 28 sheets (10000 rows per sheet) and creating a single dataframe out of it. I am facing an out of memory exception when running in cluster mode. My code looks like this:

def buildDataframe(spark: SparkSession, filePath: String, requiresHeader: Boolean): DataFrame = {

  var excelDF: DataFrame = null
  var flag: Boolean = false

  // getting inputstream from s3 for all sheet names
  val s3FilePath = filePath.replaceAll(filePath.substring(0, filePath.indexOf("//") + 2), "")
  val s3Bucket = s3FilePath.substring(0, s3FilePath.indexOf("/"))
  val s3key = s3FilePath.substring(s3FilePath.indexOf("/") + 1, s3FilePath.length())
  val s3Client: AmazonS3 = new AmazonS3Client()
  val s3object: S3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, s3key))
  val inputStream: java.io.InputStream = s3object.getObjectContent()
  val workbook = new XSSFWorkbook(inputStream)
  var sno = workbook.getNumberOfSheets
  var sheetName: Array[String] = new Array[String](sno)
  for ((sheet, i) <- workbook.zipWithIndex) {
    val initialDF = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", requiresHeader)
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", true)
      .option("addColorColumns", false)
      .option("sheetName", sheet.getSheetName)
      .option("path", filePath)
      .load()

    if (flag == false) {
      excelDF = initialDF
      flag = true
    } else {
      excelDF = excelDF.union(initialDF)
    }
  }

  return excelDF
}

Any help is much appreciated.
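
A possible mitigation sketch, based on the maxRowsInMemory option documented in the readme above (an assumption for this workload, not a verified fix): use the streaming reader so entire sheets are not held in memory while reading.

// Sketch: streaming read of a single sheet; sheet and file names are placeholders
val sheetDF = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'Sheet1'!A1")
  .option("header", "true")
  .option("maxRowsInMemory", 20) // streaming reader, helps with big xlsx files
  .load("s3a://my-bucket/big-file.xlsx")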

ExcelFileSaver.toCell() doesn't handle null values in DataFrame

When a DataFrame row has a null value in one of the columns, a scala.MatchError exception gets thrown by toCell. I am writing with the following write calls:

    dataFrame.write
      .format("com.crealytics.spark.excel")
      .option("sheetName", config.sheetName)
      .option("treatEmptyValuesAsNulls", true)
      .option("useHeader", true)
      .mode("overwrite")
      .save(config.path)

This is reproducible by loading in an Excel or CSV file with blanks in any of the columns and attempting to save it. I added the treatEmptyValuesAsNulls just to make sure I wasn't missing an option but looking at the code it seems that a case needs to be added to handle if a == null. It is also possible I'm doing something wrong here, and if so, I apologize!

Compatibility Issue

Do we have any Excel or spreadsheet processing dependency jar for Spark version 1.6?

I am facing the error
"java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;
at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:113)"

while processing it with Spark 1.6.

ExcelRelation read file in driver

Hi,
Basically, the buildScan method is run on the driver, not on executors.
It processes the file and then sends its content through parallelize(), which is also bad practice for large data.

Options:

  1. Use sc.hadoopFile(...) with a custom ExcelInputFormat.
     Providing parameters can be problematic because it uses reflection to create the InputFormat instance.

  2. Use sc.parallelize(..).flatMap( // read the file, get List[Row] )
     This will read the file on an executor. Unfortunately, only one executor.

Reading an Excel file with a multi-line header throws an exception.

When I tried to read a file with a multi-line header, the QueryExecution of the dataset threw an exception: Method threw 'scala.MatchError' exception. Cannot evaluate org.apache.spark.sql.execution.QueryExecution.toString()

Steps to reproduce:

SparkSession session = SparkSession.builder().getOrCreate();
String path = "testMultiLineHeader.xlsx";
Dataset<Row> dsBal = session.read().format("com.crealytics.spark.excel")
        .option("location", path)
        .option("sheetName", "Feuil1")
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "true")
        .option("addColorColumns", "false")
        .load(path);
dsBal.show();

Example file that cause this error : file that cause this error : https://files.fm/u/jjjvkvxe#/view/testMultiLineHeader.xlsx

Full stack:

scala.MatchError: Nom de l'entité <-- Name of the header first line.
Name of the entity <-- Name of the header second line.
(of class java.lang.String)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$2.apply(ExcelRelation.scala:122)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$2.apply(ExcelRelation.scala:120)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:120)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:393)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3248)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
    at Main.main(Main.java:33)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.microsoft.azure.hdinsight.spark.mock.SparkLocalRunner.runJobMain(SparkLocalRunner.java:75)
    at com.microsoft.azure.hdinsight.spark.mock.SparkLocalRunner.main(SparkLocalRunner.java:48)

How to infer schema for Int column types

I am trying to dig into your code, but I am relatively new to Scala, so I don't get why this can't infer the schema for an Integer column. Is there any specific reason you excluded this type?
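
As a workaround sketch, grounded in the custom-schema example from the readme above (column names and file name are placeholders): an explicit schema with IntegerType can be supplied instead of relying on inference.

import org.apache.spark.sql.types._

val mySchema = StructType(Array(
  StructField("Name", StringType, nullable = true),
  StructField("Age", IntegerType, nullable = true)))

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .schema(mySchema) // explicit IntegerType instead of an inferred type
  .load("People.xlsx")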

Reject bad data rows

I have a Scala app utilizing spark-excel 0.9.17 and am interested in rejecting rows that contain bad data as defined by the schema. E.g. if a row has a column with a string in an Integer column, reject the row. It seems, at least in the case of Integer types, non-numeric values are converted to zero, and the row is loaded into the data frame. Is there anything similar to mode=DROPMALFORMED used for importing csv files that can be used in this case? Such that in the example below, the second row would be rejected instead of converting the "x" to zero? Thanks.

INPUT:

Name | Age | Comment
Joe | 25 | A
John | x | B
Bill | 35 | C

OUTPUT:

dfValid.printSchema()
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Comment: string (nullable = true)

dfValid.show()
+----+---+-------+
|Name|Age|Comment|
+----+---+-------+
| Joe| 25|      A|
|John|  0|      B|
|Bill| 35|      C|
+----+---+-------+

I am unable to fix the issue with the code created by Martin Mauch (nightscape). I am also trying to create Scala code for landing Excel data from SharePoint into Hadoop and then converting it into Parquet format. I am getting an error saying value InferSchema not found.

import java.math.BigDecimal
import java.sql.{Date, Timestamp}
import java.text.NumberFormat
import java.util.Locale
import java.io._
import scala.collection.JavaConverters._
import org.apache.poi.ss.usermodel.{WorkbookFactory, Row => SheetRow, Cell, DataFormatter}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

import scala.util.Try

case class ExcelRelation(
  location: String,
  sheetName: String,
  useHeader: Boolean,
  treatEmptyValuesAsNulls: Boolean,
  inferSheetSchema: Boolean,
  addColorColumns: Boolean = true,
  userSchema: StructType = null
)
(@transient val sqlContext: SQLContext)
extends BaseRelation with TableScan with PrunedScan {
  val workbook = WorkbookFactory.create(new FileInputStream(location))
  val sheet = workbook.getSheet(sheetName)
  val headers = sheet.getRow(0).cellIterator().asScala.to[Vector]
  override val schema: StructType = inferSchema
  val dataFormatter = new DataFormatter()
  override def buildScan: RDD[Row] = buildScan(schema.map(_.name).toArray)

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val lookups = requiredColumns.map { c =>
      val columnNameRegex = s"(.*?)(_color)?".r
      val columnNameRegex(columnName, isColor) = c
      val columnIndex = schema.indexWhere(_.name == columnName)

      val cellExtractor: Cell => Any = if (isColor == null) {
        { cell: Cell =>
          val value = cell.getCellType match {
            case Cell.CELL_TYPE_NUMERIC => cell.getNumericCellValue
            case Cell.CELL_TYPE_BOOLEAN => cell.getBooleanCellValue
            case Cell.CELL_TYPE_STRING => cell.getStringCellValue
            case t => throw new RuntimeException(s"Unknown cell type $t for $cell")
          }
          castTo(value.toString, schema(columnIndex).dataType)
        }
      } else {
        _.getCellStyle.getFillForegroundColorColor match {
          case null => ""
          case c: org.apache.poi.xssf.usermodel.XSSFColor => c.getARGBHex
          case c => throw new RuntimeException(s"Unknown color type $c: ${c.getClass}")
        }
      }
      { row: SheetRow =>
        val cell = row.getCell(columnIndex)
        if (cell == null) {
          null
        } else {
          cellExtractor(cell)
        }
      }
    }.to[Vector]
    val rows = sheet.rowIterator().asScala.drop(1).map(row => lookups.map(l => l(row)))
    val result = rows.to[Vector]

    sqlContext.sparkContext.parallelize(result.map(Row.fromSeq))
  }

  // Cast a String to a Spark Data Type
  private def castTo(datum: String, castType: DataType): Any = {
    castType match {
      case _: ByteType => datum.toByte
      case _: ShortType => datum.toShort
      case _: IntegerType => datum.toInt
      case _: LongType => datum.toLong
      case _: FloatType => Try(datum.toFloat)
        .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).floatValue())
      case _: DoubleType => Try(datum.toDouble)
        .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).doubleValue())
      case _: BooleanType => datum.toBoolean
      case _: DecimalType => new BigDecimal(datum.replaceAll(",", ""))
      case _: TimestampType => Timestamp.valueOf(datum)
      case _: DateType => Date.valueOf(datum)
      case _: StringType => datum
      case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
    }
  }

  private def rowsRdd: RDD[SheetRow] = {
    parallelize(sheet.rowIterator().asScala.toSeq)
  }
  private def parallelize[T: scala.reflect.ClassTag](seq: Seq[T]): RDD[T] = sqlContext.sparkContext.parallelize(seq)
  private def inferSchema(): StructType = {
    if (this.userSchema != null) {
      userSchema
    } else {
      val firstRow = headers
      val header = if (useHeader) {
        firstRow.map(_.getStringCellValue)
      } else {
        firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
      }
      val baseSchema = if (this.inferSheetSchema) {
        val stringsAndCellTypes = sheet.rowIterator.asScala.drop(1).map(_.cellIterator.asScala.map(c => c.getCellType).toVector).toVector
        InferSchema(parallelize(stringsAndCellTypes), header.toArray)
      } else {
        // By default fields are assumed to be StringType
        val schemaFields = header.map { fieldName =>
          StructField(fieldName.toString, StringType, nullable = true)
        }
        StructType(schemaFields)
      }
      if (addColorColumns) {
        header.foldLeft(baseSchema) { (schema, header) => schema.add(s"${header}_color", StringType, nullable = true) }
      } else {
        baseSchema
      }
    }
  }
}

saving at HDFS errors

Whenever I try something like:

notInGenAge
  .write
  .format("com.crealytics.spark.excel")
  .option("sheetName", "NotInGenAge")
  .option("useHeader", "true")
  .mode("overwrite")
  .save(ml + "/genage.xlsx")

where ml is HDFS folder. I get

java.nio.channels.ClosedChannelException
  at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:291)
  at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:147)
  at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:537)
  at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:492)
  at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
  at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:41)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:69)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 109 elided

Error when reading excel-2010

If I just use apache poi to read excel-2010, that will be ok. But with spark-excel, it throws the following exception:

The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)
org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException: The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)
at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:179)
at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:237)
at org.apache.poi.openxml4j.opc.ZipPackage.(ZipPackage.java:134)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:295)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:201)
at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:111)
at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:91)
at com.monitorjbl.xlsx.StreamingReader$Builder.open(StreamingReader.java:263)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:57)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:52)
at scala.Option.map(Option.scala:146)
at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:52)
at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:63)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:270)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:269)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:97)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:

Not reading all columns

I am reading a sheet from an Excel file which has 6 columns, but I am getting only 5 columns in the resulting data frame. I set the start and end columns to 0 and 5, and I am still not getting the last column in the data frame. This code was working before.
val df = spark.read.format("com.crealytics.spark.excel")
.option("sheetName","MRI_NPM")
.option("useHeader", "false")
.option("startColumn",0)
.option("endColumn",5)
.load(configFilePath)

Allow reading only a subset of rows

This would be useful for the project I work on.
We first ingest only 1,000 rows (inferring the schema) and show the user a "preview" of the dataset.
The user can then override column types/names before ingesting the entire dataset.
If the dataset is large then it is much faster to be able to ingest only the first 1,000 records at the preview stage.
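
A partial workaround sketch, based on the dataAddress cell-range style documented above (sheet name, column range and file name are placeholders; this only limits the rows that are read for the preview):

// Sketch: read only the first 1000 data rows (plus header) for a preview
val previewDF = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'Sheet1'!A1:Z1001")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dataset.xlsx")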

Allow reading multiple files specified as a list OR by a pattern

Currently the Excel library doesn't allow reading multiple files specified as a pattern or as a list (the formats are shown below). These formats are supported for reading delimited files by DataFrameReader.

reader.format( "com.crealytics.spark.excel" )
.option( "useHeader", true )
.option( "timestampFormat", "MM-dd-yyyy HH:mm:ss" )
.load( "/Data*.xlsx" );
OR

reader.format( "com.crealytics.spark.excel" )
.option( "useHeader", true )
.option( "timestampFormat", "MM-dd-yyyy HH:mm:ss" )
.load( "/Data1.xlsx", "/Data2.xlsx" );

It would be really nice to have above features.
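
For reference, the V2 implementation described in the readme above supports loading from multiple files via a folder path. A minimal sketch, assuming the files live under a common folder (the path is a placeholder):

// Sketch: the "excel" (V2) data source accepts a folder path, like csv or parquet
val df = spark.read
  .format("excel")
  .option("header", "true")
  .load("/data/excel-files")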

NullPointerException when the first row of a sheet has only empty fields.

Hi everybody. I found the following issue using the release 0.8.2 of the library for Scala 2.11 (and Spark 1.6.2): when the selected sheet to transform to DataFrame has the first row containing only empty fields it throws a NullPointerException:

17/03/21 23:03:49 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NullPointerException
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:30)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at com.optum.sparkexcelpoc.DataFrameFromExcel$.main(DataFrameFromExcel.scala:30)
at com.optum.sparkexcelpoc.DataFrameFromExcel.main(DataFrameFromExcel.scala)

Browsing the source code, I found the root cause of the problem and how to fix it. Please let me know if you're happy to have me fix it and what procedure you follow to accept changes. Thanks a lot.

Excel writing error com.crealytics.spark.excel.DefaultSource does not allow create table as select

Hi,
We are using com.crealytics.spark.excel in my Java-Spark project. It is working well, but when I come to write the data to an Excel file I get this error:
com.crealytics.spark.excel.DefaultSource does not allow create table as select.

This is the code used to write to the Excel file:
dataFileContent.write()
    .format("com.crealytics.spark.excel")
    .option("sheetName", "Daily")
    .option("useHeader", "true")
    .mode("overwrite")
    .save("D:\\xls\\Book1.xlsx");

skip rows from excel sheet

Hi ,

My Excel file has rows at the top which need to be skipped. Can I do that using this library?

Thanks
Sri
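
A possible approach sketch using the dataAddress option documented in the readme above (sheet name, start cell and file name are placeholders): start the read at a lower anchor cell so the leading rows are skipped.

// Sketch: start reading at row 4, skipping the first three rows of the sheet
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'Sheet1'!A4")
  .option("header", "true")
  .load("report.xlsx")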

Facing issue while reading xlsx.

18/06/20 15:22:47 INFO util.Version: Elasticsearch Hadoop v6.2.2 [102cfaa226]
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.poi.ss.usermodel.Workbook.close()V
at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:81)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:270)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:269)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:97)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at fire.samplecode.ReadExcel$.main(ReadExcel.scala:31)
at fire.samplecode.ReadExcel.main(ReadExcel.scala)
18/06/20 15:23:02 INFO spark.SparkContext: Invoking stop() from shutdown hook
18/06/20 15:23:02 INFO server.ServerConnector: Stopped ServerConnector@200a26bc{HTTP/1.1}{0.0.0.0:4040}

Error when an Excel file has a cell with a formula

Hi,
My stack:
-scala 2.10
-spark 1.6
-spark-excel_2.10 version : 0.8.3

My sheet has a cell with a formula, and I get this error:
java.lang.RuntimeException: Unknown cell type 2 for NOW()+A2
at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:66)
To correct this issue, I got the latest version of the source, changed the Scala version to match my stack, and updated ExcelRelation.scala.

Could you please check my update and push a new spark-excel_2.10 version 0.8.3 to Maven Central?

See attachment.
ExcelRelation.zip

Regard

Saving arrays, structs, maps & objects

It seems that saving a DataFrame with any arrays, structs, maps or objects does not work. I'm thinking they'll need to be added to ExcelFileSaver in toCell.

I can take on the work within a few days most likely. What I envision if you had an array of Strings like:

["may", "the", "force", "be", "with", "you"]

It would be presented in the resulting Excel sheet as:

quoteWords
may, the, force, be, with, you

For Maps and Structs, maybe just putting a JSON representation of their contents?

If they're objects, I guess just using toString on them would be enough? Any input would be great, and then I can create a PR with the changes.

Temporary files not being deleted

When reading an Excel file, excel-streaming-reader creates temporary excel files in the system temp directory. These get cleaned up when the workbook object is closed. The workbook object is supposed to be autoclosable but this doesn't seem to be working, at least for me. I think that the fix is to explicitly close the workbook object.

Writing a large Dataset into an Excel file causes `java.lang.OutOfMemoryError: GC overhead limit exceeded`

Hi All,

I am trying to write a relatively large Spark Dataset (> 50000 rows) into an Excel file using spark-excel as follows:

DataFrameWriter<Row> writer = dataset.write()
                .format("com.crealytics.spark.excel")
                .option("sheetName", sheets.get(0)) // Only first sheet
                .option("useHeader", useHeader)
                .mode("overwrite");

This works fine if the Dataset contains around 20000 rows, but from approximately 35000 rows, the code yields java.lang.OutOfMemoryError: GC overhead limit exceeded. It seems that the easiest temporary solution would be bumping up the allocated memory for JVM, but I would like to avoid that since bigger Dataset may come in and cause the same issue. Any thoughts on this? Thanks.
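
A possible angle to explore, sketched from the V2 folder-writing support described in the readme above (not a verified fix for the memory issue; the output path is a placeholder): write with the V2 "excel" format to a folder path rather than a single file.

// Sketch (Scala): V2 write to a folder path, as described in the V2 section above
dataset.write
  .format("excel")
  .option("header", "true")
  .mode("overwrite")
  .save("/output/excel-export")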

Upgrade to POI 4.0.0

I created #80 - the changes are pretty small.

I have forked the xlsx-streamer to support poi 4.0.0 but the APIs are basically the same.
The spoiwo library will probably also need to be upgraded when they publish a poi 4.0.0 compatible jar (they've already merged a change to support this).

java.lang.NoSuchMethodError while trying to display the Dataframe using 'df.show'

I am new to Spark. I have an Excel file that I need to read into a DataFrame. I am using the com.crealytics.spark.excel library to achieve this. The following is my code:

     val df = hiveContext.read.format("com.crealytics.spark.excel")
             .option("useHeader", "true")
             .option("treatEmptyValuesAsNulls", "true")
             .load("file:///data/home/items.xlsx")

The above code runs without any error. And I am also able to count the number of rows in the df using df.count. But when I try to print the df using df.show, it throws an error saying:

java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;

I also tried using the StructType to define the schema and impose it during the loading the data into df:

 val newschema = StructType(List(StructField("1", StringType, nullable = true),
      StructField("2", StringType, nullable = true),
      StructField("3", StringType, nullable = true),
      StructField("4", StringType, nullable = true),
      StructField("5", StringType, nullable = true),
      StructField("6", StringType, nullable = true),
      StructField("7", StringType, nullable = true),
      StructField("8", StringType, nullable = true),
      StructField("9", StringType, nullable = true),
      StructField("10", StringType, nullable = true)))
val df = hiveContext.read.schema(newschema).format("com.crealytics.spark.excel")...

This doesn't help and I get the same error as before when I try to display the df.

I am using Spark 1.6, Java 1.8 and scala 2.10.5.
I am not sure why this is happening. How do I solve this error and look at the data in the df ?

Any help would be appreciated. Thank you.

Maturity / Production Readiness

Hi Martin,

I am enthusiastic about your library ☺︎

May I ask your opinion about the production readiness or maturity of your library? Would you recommend using it in production?

Thanks, Mark

ERROR: java.lang.IllegalStateException: Cannot get a text value from a numeric cell

I used this lib, but got an error with apache.poi. It cannot read an Excel file containing numbers.

[info]   java.lang.IllegalStateException: Cannot get a text value from a numeric cell
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.typeMismatch(HSSFCell.java:640)
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.getRichStringCellValue(HSSFCell.java:717)
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.getStringCellValue(HSSFCell.java:700)
[info]   at com.crealytics.spark.excel.ExcelRelation$$anonfun$7.apply(ExcelRelation.scala:117)
[info]   at com.crealytics.spark.excel.ExcelRelation$$anonfun$7.apply(ExcelRelation.scala:117)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
[info]   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[info]   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

excel sample:

Michael	29
Andy	30
Justin	19

FileNotFoundException when try to launch Job with master yarn and deploy-mode cluster

I am using a Hadoop cluster with the Yarn resource manager. When I run the job in fully distributed mode (master yarn and deploy-mode cluster), the job fails with the exception below. However, when I change deploy-mode to client, it runs successfully.

ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) Caused by: java.io.FileNotFoundException: /mapr/Helios/cxpdi/datasets/test/data/TEST/INPUT/data.xls (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.(FileInputStream.java:138) at java.io.FileInputStream.(FileInputStream.java:93) at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:28) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31) at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)

I would suggest using the org.apache.hadoop.fs.FileSystem API instead of java.io.FileInputStream to open the stream when the master is yarn and deploy-mode is cluster.

Dataframe write gives error in createRelation method

Hi,
I am trying to write dataframe to a file and getting following exception.

java.lang.AbstractMethodError: com.crealytics.spark.excel.DefaultSource.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/DataFrame;)Lorg/apache/spark/sql/sources/BaseRelation;
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
at org.apache.spark.sql.DataFrameWriter.dataSource$lzycompute$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter.org$apache$spark$sql$DataFrameWriter$$dataSource$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply$mcV$sp(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.executeAndCallQEListener(DataFrameWriter.scala:154)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:172)

Thanks,
Srikanth

formulas in sheets do not appear to be supported

The values read from cells in a sheet that are of type FORMULA are evaluated as the formula itself rather than its cached value.

Additionally the infer schema option fails on cells of type FORMULA.

OK to submit a PR for this?

pyspark.sql.types.DecimalType

I get the following error when trying to show a dataframe. It's OK in the case of FloatType.

: java.lang.NumberFormatException
	at java.math.BigDecimal.<init>(BigDecimal.java:494)
	at java.math.BigDecimal.<init>(BigDecimal.java:383)
	at java.math.BigDecimal.<init>(BigDecimal.java:806)
	at com.crealytics.spark.excel.ExcelRelation.bigDecimal$lzycompute$1(ExcelRelation.scala:114)
	at com.crealytics.spark.excel.ExcelRelation.bigDecimal$1(ExcelRelation.scala:114)
	at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$castTo(ExcelRelation.scala:123)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:76)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:76)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$apply$2.apply(ExcelRelation.scala:89)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$apply$2.apply(ExcelRelation.scala:84)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$5$$anonfun$apply$3.apply(ExcelRelation.scala:94)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$5$$anonfun$apply$3.apply(ExcelRelation.scala:94)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
