
spark-excel's Issues

Dataframe write gives error in createRelation method

Hi,
I am trying to write a DataFrame to a file and I'm getting the following exception.

java.lang.AbstractMethodError: com.crealytics.spark.excel.DefaultSource.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/DataFrame;)Lorg/apache/spark/sql/sources/BaseRelation;
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
at org.apache.spark.sql.DataFrameWriter.dataSource$lzycompute$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter.org$apache$spark$sql$DataFrameWriter$$dataSource$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply$mcV$sp(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.executeAndCallQEListener(DataFrameWriter.scala:154)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:172)

Thanks,
Srikanth

aws s3 file system not supported

I was trying to create a DataFrame from an Excel file kept in an AWS S3 bucket, but the API could not process the s3a URL. Somehow the s3a URL ends up in java.io.FileInputStream.open(), which only works with local filesystem files, not HDFS or S3. I also noticed that a slash (/) is missing from the path.

val data = sqlContext.read.
format("com.crealytics.spark.excel").
option("location", s3path).
option("useHeader", "true").
option("treatEmptyValuesAsNulls", "true").
option("inferSchema","true").
option("addColorColumns", "true").
load()

ERROR:
Name: java.io.FileNotFoundException
Message: s3a:/AKIAJYDDDDDDNA:A6DDDDDDDDDwqxkRqUQyXqqNOUsONDy@my-test/test.xlsx (No such file or directory)
StackTrace: at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:212)
at java.io.FileInputStream.(FileInputStream.java:152)

How to infer schema for Int column types

I am trying to dig into your code, but I am relatively new to Scala, so I don't understand why the schema can't be inferred for an Integer column. Is there any specific reason you excluded this type?
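
A possible workaround (a sketch, not part of the library; "df" and the column name are illustrative): let inference produce a numeric or string column and cast it afterwards.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Cast a column that was inferred as Double (or left as String) to IntegerType after loading.
val withInts = df.withColumn("Age", col("Age").cast(IntegerType))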

FileNotFoundException when trying to launch a job with master yarn and deploy-mode cluster

I am using a Hadoop cluster with the YARN resource manager. When I run the job in fully distributed mode (master yarn and deploy-mode cluster), it fails with the exception below. However, when I change deploy-mode to client, it runs successfully.

ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: java.io.FileNotFoundException: /mapr/Helios/cxpdi/datasets/test/data/TEST/INPUT/data.xls (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at java.io.FileInputStream.(FileInputStream.java:93)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:28)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)

I would suggest using the org.apache.hadoop.fs.FileSystem API instead of java.io.FileInputStream to open the stream when running with master yarn and deploy-mode cluster.
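
A minimal sketch of that suggestion (not the library's actual implementation; the path is illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.poi.ss.usermodel.WorkbookFactory

// Resolve the path through the Hadoop FileSystem API instead of java.io.FileInputStream,
// so HDFS/MapR-FS paths work the same in cluster deploy mode as local files do.
val path = new Path("hdfs:///datasets/test/data/TEST/INPUT/data.xls")
val in = path.getFileSystem(new Configuration()).open(path)
try {
  val workbook = WorkbookFactory.create(in)
  // ... build the relation from the workbook ...
} finally {
  in.close()
}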

formulas in sheets do not appear to be supported

Values read from cells of type FORMULA are returned as the formula string itself rather than its cached value.

Additionally, the infer-schema option fails on cells of type FORMULA.

OK to submit a PR for this?
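
For reference, a minimal POI sketch (assuming the POI 3.x CELL_TYPE_* constants used elsewhere in this project) that returns a formula cell's cached result instead of the formula string:

import org.apache.poi.ss.usermodel.Cell

// For FORMULA cells, look at the cached result type and read the cached value.
def cellValue(cell: Cell): Any = cell.getCellType match {
  case Cell.CELL_TYPE_FORMULA =>
    cell.getCachedFormulaResultType match {
      case Cell.CELL_TYPE_NUMERIC => cell.getNumericCellValue
      case Cell.CELL_TYPE_STRING  => cell.getStringCellValue
      case Cell.CELL_TYPE_BOOLEAN => cell.getBooleanCellValue
      case t => throw new RuntimeException(s"Unsupported cached formula result type $t")
    }
  case Cell.CELL_TYPE_NUMERIC => cell.getNumericCellValue
  case Cell.CELL_TYPE_STRING  => cell.getStringCellValue
  case Cell.CELL_TYPE_BOOLEAN => cell.getBooleanCellValue
  case t => throw new RuntimeException(s"Unknown cell type $t for $cell")
}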

java.lang.NoSuchMethodError while trying to display the Dataframe using 'df.show'

I am new to Spark. I have an Excel file that I need to read into a DataFrame. I am using the com.crealytics.spark.excel library to achieve this. The following is my code:

     val df = hiveContext.read.format("com.crealytics.spark.excel")
             .option("useHeader", "true")
             .option("treatEmptyValuesAsNulls", "true")
             .load("file:///data/home/items.xlsx")

The above code runs without any error. And I am also able to count the number of rows in the df using df.count. But when I try to print the df using df.show, it throws an error saying:

java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;

I also tried using a StructType to define the schema and impose it while loading the data into the df:

 val newschema = StructType(List(StructField("1", StringType, nullable = true),
      StructField("2", StringType, nullable = true),
      StructField("3", StringType, nullable = true),
      StructField("4", StringType, nullable = true),
      StructField("5", StringType, nullable = true),
      StructField("6", StringType, nullable = true),
      StructField("7", StringType, nullable = true),
      StructField("8", StringType, nullable = true),
      StructField("9", StringType, nullable = true),
      StructField("10", StringType, nullable = true)))
val df = hiveContext.read.schema(newschema).format("com.crealytics.spark.excel")...

This doesn't help and I get the same error as before when I try to display the df.

I am using Spark 1.6, Java 1.8 and Scala 2.10.5.
I am not sure why this is happening. How do I solve this error and look at the data in the df?

Any help would be appreciated. Thank you.

Maturity / Production Readiness

Hi Martin,

I am enthusiastic about your library ☺︎

May I ask your opinion about the production readiness or maturity of your library? Would you recommend using it in production?

Thanks, Mark

input_file_name returns empty string

Hello,

I'm trying to create a dataframe from an Excel doc and then append a column with the input file's name as returned by input_file_name. Instead of returning the file path, input_file_name is returning an empty string.

Test code

from pyspark.sql.functions import input_file_name

df = (sql_context.read.format("com.crealytics.spark.excel")
      .option('sheetName', "Sheet1")
      .option('useHeader', True)
      .load("/home/btelle/test.xlsx")
      )

df = df.withColumn('file_name', input_file_name())
df.select('file_name').show(1)

Expected result

+--------------------+
|           file_name|
+--------------------+
|file:///home/btel...|
+--------------------+

Actual result

+---------+
|file_name|
+---------+
|         |
+---------+

getErrorCellValue ignore NotSupportedException option

It would be awesome if there were an option so that errors in Excel sheets did not throw exceptions (NotSupportedException) but were instead handled as null values, with a log message noting that certain cells have been skipped.

Allow reading multiple files specified as a list OR by a pattern

Currently the Excel library doesn't allow reading multiple files specified as a pattern OR as a list (the formats are shown below). Both formats are supported by DataFrameReader when reading delimited files.

reader.format( "com.crealytics.spark.excel" )
.option( "useHeader", true )
.option( "timestampFormat", "MM-dd-yyyy HH:mm:ss" )
.load( "/Data*.xlsx" );
OR

reader.format( "com.crealytics.spark.excel" )
.option( "useHeader", true )
.option( "timestampFormat", "MM-dd-yyyy HH:mm:ss" )
.load( "/Data1.xlsx", "/Data2.xlsx" );

It would be really nice to have the above features.
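
A possible workaround until this is supported (a sketch; "spark" and the paths are illustrative): read each file separately and union the results.

import org.apache.spark.sql.DataFrame

// Read every workbook individually and combine them into one DataFrame.
val paths = Seq("/Data1.xlsx", "/Data2.xlsx")
val combined: DataFrame = paths.map { p =>
  spark.read
    .format("com.crealytics.spark.excel")
    .option("useHeader", "true")
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
    .load(p)
}.reduce(_ union _)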

Compatibility Issue

Do we have any Excel or spreadsheet processing dependency jar for Spark version 1.6?

I am facing this error while processing it with Spark 1.6:

"java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;
at com.crealytics.spark.excel.ExcelRelation$$anonfun$4.apply(ExcelRelation.scala:113)"

infer schema only reading first ten rows?

With version 0.9.6, when I do not provide a schema and set inferSchema to true, the inferred schema seems to be based on the first ten rows only.
This means that if a column contains only numbers in the first ten rows, but contains text (that cannot be parsed as a number) in later rows, then there will be a crash when trying to parse the text as a number.
If there is non-numeric text in the first ten rows of data then the column is correctly inferred as a string type and there is no problem.
I am pretty sure that in previous versions, the schema was inferred from all rows and this problem never occurred.
I have maxRowsInMemory set to 20.
The problem can be seen with the online retail dataset.
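
For what it's worth, the excerptSize option shown in other snippets in this thread ("number of rows to infer schema from", default 10) looks relevant here; a sketch of raising it (path and value are illustrative):

// Infer the schema from more than the default 10 rows when maxRowsInMemory is set.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("inferSchema", "true")
  .option("maxRowsInMemory", 20)
  .option("excerptSize", 1000)
  .load("/path/to/online_retail.xlsx")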

ERROR: java.lang.IllegalStateException: Cannot get a text value from a numeric cell

I used this lib but had an error with Apache POI. It cannot read an Excel file containing numbers:

[info]   java.lang.IllegalStateException: Cannot get a text value from a numeric cell
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.typeMismatch(HSSFCell.java:640)
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.getRichStringCellValue(HSSFCell.java:717)
[info]   at org.apache.poi.hssf.usermodel.HSSFCell.getStringCellValue(HSSFCell.java:700)
[info]   at com.crealytics.spark.excel.ExcelRelation$$anonfun$7.apply(ExcelRelation.scala:117)
[info]   at com.crealytics.spark.excel.ExcelRelation$$anonfun$7.apply(ExcelRelation.scala:117)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
[info]   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[info]   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

excel sample:

Michael	29
Andy	30
Justin	19

Saving arrays, structs, maps & objects

It seems that saving a DataFrame with any arrays, structs, maps or objects does not work. I'm thinking they'll need to be added to ExcelFileSaver in toCell.

I can most likely take on the work within a few days. What I envision is that if you had an array of Strings like:

["may", "the", "force", "be", "with", "you"]

It would be presented in the resulting Excel sheet as:

quoteWords
may, the, force, be, with, you

For Maps and Structs, maybe just putting a JSON representation of their contents?

If they're objects, I guess just using toString on them would be enough? Any input would be great, and then I can create a PR with the changes.
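
Until something like that lands, a possible workaround (a sketch using standard Spark SQL functions on Spark 2.x; column names are illustrative) is to flatten complex columns to strings before writing:

import org.apache.spark.sql.functions.{col, concat_ws, to_json}

// Turn an array column into a comma-separated string and a struct/map column into JSON,
// so every value handed to the Excel writer is a plain string.
val writable = df
  .withColumn("quoteWords", concat_ws(", ", col("quoteWords")))
  .withColumn("metadata", to_json(col("metadata")))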

Not reading all columns

I am reading a sheet from an excel file which has 6 columns. I am getting 5 columns in the resulting data frame. I added start and end columns as 0 and 5. Still I am not getting the last column in the data frame. This code was working before.
val df = spark.read.format("com.crealytics.spark.excel")
.option("sheetName","MRI_NPM")
.option("useHeader", "false")
.option("startColumn",0)
.option("endColumn",5)
.load(configFilePath)

Upgrade to POI 4.0.0

I created #80 - the changes are pretty small.

I have forked the xlsx-streamer to support poi 4.0.0 but the APIs are basically the same.
The spoiwo library will probably also need to be upgraded when they publish a poi 4.0.0 compatible jar (they've already merged a change to support this).

Issues with date column in excel with format dd/mm/yyyy

I'm trying to create a DataFrame from a large Excel file, and I have a column with a format like this:
mm/dd/yyyy
I've tried to create a custom schema for this column using StringType and DateType, but neither of these has worked.
If I don't set any schema I receive an error like this: "Exception in thread "main" java.lang.IllegalStateException: Cannot get a STRING value from a NUMERIC cell"

I'm loading the dataframe like this:
val df = dataFrameReader.format("com.crealytics.spark.excel")
.option("sheetName", "Hoja1")
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "true") // Optional, default: true
.option("inferSchema", "true") // Optional, default: false
//.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 1) // Optional, default: 0
//.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 1000) // Optional, default None. If set, uses a streaming reader which can help with big files
//.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
//.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.schema(schema)
.load( s"hdfs://${fileName}")

My custom schema:
val schema = StructType(
List(
StructField("NroTramite", LongType, true),
StructField("FchAlta", DateType, true),
StructField("HraAlta", StringType, true),
StructField("UsrAlta", StringType, true),
StructField("Estado", StringType, true),
StructField("Proceso", StringType, true),
StructField("Cliente", LongType, true),
StructField("Razon Social", StringType, true),
StructField("RelCnt", LongType, true),
StructField("Mid", LongType, true),
StructField("Contrato", LongType, true),
StructField("PlanComercial", StringType, true),
StructField("Categ", StringType, true),
StructField("SubCateg", StringType, true),
StructField("Tecnologia", StringType, true),
StructField("AgPromotorPlnPrincipal", StringType, true),
StructField("AgPromotorProducto", StringType, true),
StructField("ProdExt", IntegerType, true),
StructField("NombreProdExterno", StringType, true),
StructField("AgVenta", IntegerType, true),
StructField("NombreAgVenta", StringType, true),
StructField("OrigSisExterno", IntegerType, true),
StructField("FlagAcraTercero", StringType, true),
StructField("RelCntquePaga", IntegerType, true)
)
)

The error that I get when I use this schema:
Exception in thread "main" java.lang.NumberFormatException: For input string: "Fch Alta"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:284)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
at com.crealytics.spark.excel.ExcelRelation.numericValue$lzycompute$1(ExcelRelation.scala:111)
at com.crealytics.spark.excel.ExcelRelation.numericValue$1(ExcelRelation.scala:108)

I look forward to your answers, thanks in advance.

option for inserting userSchema is not possible

Hi, I am trying to read a date field from Excel, but your API doesn't provide support for it. So I tried to provide a user-defined schema as shown in the example in the README, but I am still not able to pass it, since a schema is an object, not a string.
val df = sqlContext.read
.format("com.crealytics.spark.excel")
.option("location", "Worktime.xlsx")
.option("sheetName", "Daily")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
# .option("userSchema", StructType(???)) // Optional// Not possible to pass schema as string
.option("inferSchema", "true")
.option("addColorColumns", "true")
.option("startColumn", 0) // Optional
.option("endColumn", 99) // Optional
.load()
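
A sketch of what might work instead (assuming the schema can be passed through the standard DataFrameReader.schema call, as shown in other snippets in this thread; column names are illustrative):

import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

// Pass the schema as a StructType via .schema(...) instead of a string option.
val mySchema = StructType(Seq(
  StructField("Name", StringType, nullable = true),
  StructField("StartDate", DateType, nullable = true)
))

val df = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", "Worktime.xlsx")
  .option("sheetName", "Daily")
  .option("useHeader", "true")
  .schema(mySchema)
  .load()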

Error when reading excel-2010

If I just use Apache POI to read the Excel 2010 file, it works fine. But with spark-excel, it throws the following exception:

org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException: The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)
at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:179)
at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:237)
at org.apache.poi.openxml4j.opc.ZipPackage.(ZipPackage.java:134)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:295)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:201)
at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:111)
at com.monitorjbl.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:91)
at com.monitorjbl.xlsx.StreamingReader$Builder.open(StreamingReader.java:263)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:57)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$1.apply(ExcelRelation.scala:52)
at scala.Option.map(Option.scala:146)
at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:52)
at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:63)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:270)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:269)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:97)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:

Reject bad data rows

I have a Scala app using spark-excel 0.9.17 and am interested in rejecting rows that contain bad data as defined by the schema, e.g. if a row has a string value in an Integer column, reject the row. It seems that, at least for Integer types, non-numeric values are converted to zero and the row is loaded into the data frame. Is there anything similar to the mode=DROPMALFORMED used for importing CSV files that can be applied here, such that in the example below the second row would be rejected instead of the "x" being converted to zero? Thanks.

INPUT:

Name | Age | Comment
Joe | 25 | A
John | x | B
Bill | 35 | C

OUTPUT:

dfValid.printSchema()
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Comment: string (nullable = true)

dfValid.show()
+----+---+-------+
|Name|Age|Comment|
+----+---+-------+
| Joe| 25|      A|
|John|  0|      B|
|Bill| 35|      C|
+----+---+-------+
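
A possible workaround for the example above (a sketch, not a library feature; "raw" is the DataFrame read with the default all-strings schema): cast the column and drop rows where the cast fails.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Rows whose Age value cannot be cast become null and are filtered out,
// instead of being silently converted to zero.
val valid = raw
  .withColumn("Age", col("Age").cast(IntegerType))
  .filter(col("Age").isNotNull)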

I am unable to fix the issue with the code created by Martin Mauch (nightscape). I am trying to create Scala code for landing Excel data from SharePoint into Hadoop and then converting it into Parquet format. I am getting an error saying "value InferSchema not found".

import java.math.BigDecimal
import java.sql.{Date, Timestamp}
import java.text.NumberFormat
import java.util.Locale
import java.io._
import scala.collection.JavaConverters._
import org.apache.poi.ss.usermodel.{WorkbookFactory, Row => SheetRow, Cell, DataFormatter}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

import scala.util.Try

case class ExcelRelation(
  location: String,
  sheetName: String,
  useHeader: Boolean,
  treatEmptyValuesAsNulls: Boolean,
  inferSheetSchema: Boolean,
  addColorColumns: Boolean = true,
  userSchema: StructType = null
)
(@transient val sqlContext: SQLContext)
extends BaseRelation with TableScan with PrunedScan {
  val workbook = WorkbookFactory.create(new FileInputStream(location))
  val sheet = workbook.getSheet(sheetName)
  val headers = sheet.getRow(0).cellIterator().asScala.to[Vector]
  override val schema: StructType = inferSchema
  val dataFormatter = new DataFormatter()
  override def buildScan: RDD[Row] = buildScan(schema.map(_.name).toArray)

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val lookups = requiredColumns.map { c =>
      val columnNameRegex = s"(.*?)(_color)?".r
      val columnNameRegex(columnName, isColor) = c
      val columnIndex = schema.indexWhere(_.name == columnName)

      val cellExtractor: Cell => Any = if (isColor == null) {
        { cell: Cell =>
          val value = cell.getCellType match {
            case Cell.CELL_TYPE_NUMERIC => cell.getNumericCellValue
            case Cell.CELL_TYPE_BOOLEAN => cell.getBooleanCellValue
            case Cell.CELL_TYPE_STRING => cell.getStringCellValue
            case t => throw new RuntimeException(s"Unknown cell type $t for $cell")
          }
          castTo(value.toString, schema(columnIndex).dataType)
        }
      } else {
        _.getCellStyle.getFillForegroundColorColor match {
          case null => ""
          case c: org.apache.poi.xssf.usermodel.XSSFColor => c.getARGBHex
          case c => throw new RuntimeException(s"Unknown color type $c: ${c.getClass}")
        }
      }
      { row: SheetRow =>
        val cell = row.getCell(columnIndex)
        if (cell == null) {
          null
        } else {
          cellExtractor(cell)
        }
      }
    }.to[Vector]
    val rows = sheet.rowIterator().asScala.drop(1).map(row => lookups.map(l => l(row)))
    val result = rows.to[Vector]

    sqlContext.sparkContext.parallelize(result.map(Row.fromSeq))
  }

  // Cast a String to a Spark Data Type
  private def castTo(datum: String, castType: DataType): Any = {
    castType match {
      case _: ByteType => datum.toByte
      case _: ShortType => datum.toShort
      case _: IntegerType => datum.toInt
      case _: LongType => datum.toLong
      case _: FloatType => Try(datum.toFloat)
        .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).floatValue())
      case _: DoubleType => Try(datum.toDouble)
        .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).doubleValue())
      case _: BooleanType => datum.toBoolean
      case _: DecimalType => new BigDecimal(datum.replaceAll(",", ""))
      case _: TimestampType => Timestamp.valueOf(datum)
      case _: DateType => Date.valueOf(datum)
      case _: StringType => datum
      case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
    }
  }

  private def rowsRdd: RDD[SheetRow] = {
    parallelize(sheet.rowIterator().asScala.toSeq)
  }
  private def parallelize[T: scala.reflect.ClassTag](seq: Seq[T]): RDD[T] = sqlContext.sparkContext.parallelize(seq)
  private def inferSchema(): StructType = {
    if (this.userSchema != null) {
      userSchema
    } else {
      val firstRow = headers
      val header = if (useHeader) {
        firstRow.map(_.getStringCellValue)
      } else {
        firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
      }
      val baseSchema = if (this.inferSheetSchema) {
        val stringsAndCellTypes = sheet.rowIterator.asScala.drop(1).map(_.cellIterator.asScala.map(c => c.getCellType).toVector).toVector
        InferSchema(parallelize(stringsAndCellTypes), header.toArray)
      } else {
        // By default fields are assumed to be StringType
        val schemaFields = header.map { fieldName =>
          StructField(fieldName.toString, StringType, nullable = true)
        }
        StructType(schemaFields)
      }
      if (addColorColumns) {
        header.foldLeft(baseSchema) { (schema, header) => schema.add(s"${header}_color", StringType, nullable = true) }
      } else {
        baseSchema
      }
    }
  }
}

NullPointerException when the first row of a sheet has only empty fields.

Hi everybody. I found the following issue using release 0.8.2 of the library for Scala 2.11 (and Spark 1.6.2): when the first row of the sheet being transformed to a DataFrame contains only empty fields, it throws a NullPointerException:

17/03/21 23:03:49 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NullPointerException
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:30)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at com.optum.sparkexcelpoc.DataFrameFromExcel$.main(DataFrameFromExcel.scala:30)
at com.optum.sparkexcelpoc.DataFrameFromExcel.main(DataFrameFromExcel.scala)

Browsing the source code I found the root cause of the problem and how to fix it. Please let me know if you're happy to have me fix it and what procedure you follow to accept changes. Thanks a lot.

pyspark.sql.types.DecimalType

Getting the following error when trying to show the DataFrame. It's OK in the case of FloatType.

: java.lang.NumberFormatException
	at java.math.BigDecimal.<init>(BigDecimal.java:494)
	at java.math.BigDecimal.<init>(BigDecimal.java:383)
	at java.math.BigDecimal.<init>(BigDecimal.java:806)
	at com.crealytics.spark.excel.ExcelRelation.bigDecimal$lzycompute$1(ExcelRelation.scala:114)
	at com.crealytics.spark.excel.ExcelRelation.bigDecimal$1(ExcelRelation.scala:114)
	at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$castTo(ExcelRelation.scala:123)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:76)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:76)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$apply$2.apply(ExcelRelation.scala:89)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$apply$2.apply(ExcelRelation.scala:84)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$5$$anonfun$apply$3.apply(ExcelRelation.scala:94)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$5$$anonfun$apply$3.apply(ExcelRelation.scala:94)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

Writing a large Dataset into an Excel file causes `java.lang.OutOfMemoryError: GC overhead limit exceeded`

Hi All,

I am trying to write a relatively large Spark Dataset (> 50000 rows) into an Excel file using spark-excel as follows:

DataFrameWriter<Row> writer = dataset.write()
                .format("com.crealytics.spark.excel")
                .option("sheetName", sheets.get(0)) // Only first sheet
                .option("useHeader", useHeader)
                .mode("overwrite");

This works fine if the Dataset contains around 20000 rows, but from approximately 35000 rows the code yields java.lang.OutOfMemoryError: GC overhead limit exceeded. It seems that the easiest temporary solution would be bumping up the memory allocated to the JVM, but I would like to avoid that, since bigger Datasets may come in and cause the same issue. Any thoughts on this? Thanks.

FileNotFoundException when trying to use com.crealytics.spark.excel

My file is in S3 and I am trying to run this code from Databricks. My S3 location is accessible by other libraries. Is this a bug?

val df = sqlContext.read
.format("com.crealytics.spark.excel")
//.option("sheetName", "2018.06.30 GLSFT1 Data Tape") // Required
.option("location","s3location")
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "true") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 0) // Optional, default: 0
.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.option("skipFirstRows", 1) // Optional, default None. If set skips the first n rows and checks for headers in row n+1
//.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("GLS_Flow_201806.xlsx")

ExcelRelation read file in driver

Hi,
Basically, the buildScan method runs on the driver, not on an executor.
It processes the file and then sends its content through parallelize(), which is also bad practice for large data.

Options:

  1. Use sc.hadoopFile(...) with a custom ExcelInputFormat.
     A problem could be passing parameters, because it uses reflection to create the InputFormat instance.

  2. Use sc.parallelize(..).flatMap( /* read the file, get a List[Row] */ ).
     This reads the file on an executor. Unfortunately, only one executor.
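
A rough sketch of option 2 (not the library's implementation; the path and the formatting-only row conversion are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.poi.ss.usermodel.{DataFormatter, WorkbookFactory}
import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

// Ship only the path to an executor and open the workbook there,
// so the driver never holds the file contents.
val path = "hdfs:///data/input.xlsx"
val rowsRdd = spark.sparkContext.parallelize(Seq(path), 1).flatMap { p =>
  val hadoopPath = new Path(p)
  val in = hadoopPath.getFileSystem(new Configuration()).open(hadoopPath)
  try {
    val workbook = WorkbookFactory.create(in)
    val formatter = new DataFormatter()
    workbook.getSheetAt(0).rowIterator().asScala.map { row =>
      Row.fromSeq(row.cellIterator().asScala.map(formatter.formatCellValue).toSeq)
    }.toList
  } finally in.close()
}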

Not able to read excel from hadoop location

Hi,

When I am trying to read an Excel file from Hadoop using this API, it gives file not found. As an API meant to be compatible with Spark, this should work flawlessly with Hadoop files. Please look into the issue.
Code Snippet
DataFrame df = sqlContext.read().format("com.crealytics.spark.excel")
    .option("location", "hdfs:////localhost:54310/home/impadmin/abc.xlsx")
    .option("sheetName", "Sheet1")
    .option("useHeader", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "true")
    .option("addColorColumns", "true")
    .load();

Exception:
java.io.FileNotFoundException: hdfs:/localhost:54310/home/impadmin/StreamAnalytix/uploadjar/AMEX.csv (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at java.io.FileInputStream.(FileInputStream.java:93)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:28)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)

Out of memory exception when reading big xlsx file

I am reading a big xlsx file of 100 MB with 28 sheets (10000 rows per sheet) and creating a single DataFrame out of it. I am facing an out-of-memory exception when running in cluster mode. My code looks like this:

def buildDataframe(spark: SparkSession, filePath: String, requiresHeader: Boolean): DataFrame = {

  var excelDF: DataFrame = null
  var flag: Boolean = false

  // getting inputstream from s3 for all sheet names
  val s3FilePath = filePath.replaceAll(filePath.substring(0, filePath.indexOf("//") + 2), "")
  val s3Bucket = s3FilePath.substring(0, s3FilePath.indexOf("/"))
  val s3key = s3FilePath.substring(s3FilePath.indexOf("/") + 1, s3FilePath.length())
  val s3Client: AmazonS3 = new AmazonS3Client()
  val s3object: S3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, s3key))
  val inputStream: java.io.InputStream = s3object.getObjectContent()
  val workbook = new XSSFWorkbook(inputStream)
  var sno = workbook.getNumberOfSheets
  var sheetName: Array[String] = new Array[String](sno)
  for ((sheet, i) <- workbook.zipWithIndex) {
    val initialDF = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", requiresHeader)
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", true)
      .option("addColorColumns", false)
      .option("sheetName", sheet.getSheetName)
      .option("path", filePath)
      .load()

    if (flag == false) {
      excelDF = initialDF
      flag = true
    } else {
      excelDF = excelDF.union(initialDF)
    }
  }

  return excelDF
}

Any help is much appreciated.

Temporary files not being deleted

When reading an Excel file, excel-streaming-reader creates temporary Excel files in the system temp directory. These get cleaned up when the workbook object is closed. The workbook object is supposed to be auto-closeable, but this doesn't seem to be working, at least for me. I think the fix is to explicitly close the workbook object.
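
A sketch of what that fix might look like (illustrative, not the actual patch; "inputStream" is assumed to be the Excel file's input stream):

import com.monitorjbl.xlsx.StreamingReader

// Close the streaming workbook explicitly so the temporary files created by
// excel-streaming-reader in the system temp directory are deleted even if reading fails.
val workbook = StreamingReader.builder().open(inputStream)
try {
  // ... read the sheet contents from the workbook ...
} finally {
  workbook.close()
}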

saving at HDFS errors

Whenever I try something like:

notInGenAge
  .write
  .format("com.crealytics.spark.excel")
  .option("sheetName", "NotInGenAge")
  .option("useHeader", "true")
  .mode("overwrite")
  .save(ml + "/genage.xlsx")

where ml is an HDFS folder, I get:

java.nio.channels.ClosedChannelException
  at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:291)
  at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:147)
  at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:537)
  at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:492)
  at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
  at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:41)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:69)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 109 elided

Custom schema example

Could you please give me an example of custom schema format to be used with this library?
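
For reference, a sketch of one way this can look (assuming the reader accepts a standard StructType via .schema, as in other snippets in this thread; column names and the path are illustrative):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType, TimestampType}

// Declare the column names and types explicitly instead of relying on inference.
val customSchema = StructType(Seq(
  StructField("Name", StringType, nullable = true),
  StructField("Age", IntegerType, nullable = true),
  StructField("UpdatedAt", TimestampType, nullable = true)
))

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .schema(customSchema)
  .load("/path/to/file.xlsx")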

Allow reading only a subset of rows

This would be useful for the project I work on.
We first ingest only 1,000 rows (inferring the schema) and show the user a "preview" of the dataset.
The user can then override column types/names before ingesting the entire dataset.
If the dataset is large then it is much faster to be able to ingest only the first 1,000 records at the preview stage.

Facing an issue while reading xlsx.

18/06/20 15:22:47 INFO util.Version: Elasticsearch Hadoop v6.2.2 [102cfaa226]
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.poi.ss.usermodel.Workbook.close()V
at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:81)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:270)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:269)
at com.crealytics.spark.excel.ExcelRelation.(ExcelRelation.scala:97)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at fire.samplecode.ReadExcel$.main(ReadExcel.scala:31)
at fire.samplecode.ReadExcel.main(ReadExcel.scala)
18/06/20 15:23:02 INFO spark.SparkContext: Invoking stop() from shutdown hook
18/06/20 15:23:02 INFO server.ServerConnector: Stopped ServerConnector@200a26bc{HTTP/1.1}{0.0.0.0:4040}

0.8.3 Version - Parameter "location" is missing in options

Hi,
I am using spark-excel version 0.8.3 with Spark 1.6 and Scala 2.10 and getting this exception:

Exception in thread "main" java.lang.IllegalArgumentException: Parameter "location" is missing in options.
at com.crealytics.spark.excel.DefaultSource.checkParameter(DefaultSource.scala:37)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:17)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at sparkExcelClass$delayedInit$body.apply(sparkExcelClass.scala:28)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at sparkExcelClass$.main(sparkExcelClass.scala:4)
at sparkExcelClass.main(sparkExcelClass.scala)

Thanks,
Srikanth

Excel writing error com.crealytics.spark.excel.DefaultSource does not allow create table as select

Hi,
we are using com.crealytics.spark.excel in my Java Spark project. It is working well, but when I come to write the data to an Excel file I get the error:
com.crealytics.spark.excel.DefaultSource does not allow create table as select.

This is the code used to write to the Excel file:
dataFileContent.write()
    .format("com.crealytics.spark.excel")
    .option("sheetName", "Daily")
    .option("useHeader", "true")
    .mode("overwrite")
    .save("D:\\xls\\Book1.xlsx");

Reading an Excel file with a multi-line header throws an exception.

When I tried to read a file with a multi-line header, the QueryExecution of the dataset threw an exception: Method threw 'scala.MatchError' exception. Cannot evaluate org.apache.spark.sql.execution.QueryExecution.toString()

Steps to reproduce:

SparkSession session = SparkSession.builder().getOrCreate();
String path = "testMultiLineHeader.xlsx";
Dataset<Row> dsBal = session.read().format("com.crealytics.spark.excel")
        .option("location", path)
        .option("sheetName", "Feuil1")
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "true")
        .option("addColorColumns", "false")
        .load(path);
dsBal.show();

Example file that causes this error: https://files.fm/u/jjjvkvxe#/view/testMultiLineHeader.xlsx

Full stack:

scala.MatchError: Nom de l'entité <-- Name of the header first line.
Name of the entity <-- Name of the header second line.
(of class java.lang.String)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$2.apply(ExcelRelation.scala:122)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$2.apply(ExcelRelation.scala:120)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:120)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:300)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:393)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:296)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3248)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
    at Main.main(Main.java:33)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.microsoft.azure.hdinsight.spark.mock.SparkLocalRunner.runJobMain(SparkLocalRunner.java:75)
    at com.microsoft.azure.hdinsight.spark.mock.SparkLocalRunner.main(SparkLocalRunner.java:48)

Error when an Excel file has a cell with a formula

Hi,
My stack:
-scala 2.10
-spark 1.6
-spark-excel_2.10 version : 0.8.3

My sheet has a cell with a formula, and I get this error:
java.lang.RuntimeException: Unknown cell type 2 for NOW()+A2
at com.crealytics.spark.excel.ExcelRelation$$anonfun$1$$anonfun$3.apply(ExcelRelation.scala:66)
To correct this issue, I took the latest version of the source, changed the Scala version to match my stack and updated ExcelRelation.scala.

Can you please check my update and push a new spark-excel_2.10 version 0.8.3 to Maven Central?

See attachment.
ExcelRelation.zip

Regard

skip rows from excel sheet

Hi ,

My Excel file has rows at the top which need to be skipped. Can I do that using this library?

Thanks
Sri
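
The skipFirstRows option shown elsewhere in this thread ("If set skips the first n rows and checks for headers in row n+1") appears to cover this; a sketch (path and row count are illustrative):

// Skip the first 3 rows; the header is then expected in row 4.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("skipFirstRows", 3)
  .load("/path/to/file.xlsx")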

ExcelFileSaver.toCell() doesn't handle null values in DataFrame

When a DataFrame row has a null value in one of the columns, a scala.MatchError exception gets thrown by toCell. I am writing with the following write calls:

    dataFrame.write
      .format("com.crealytics.spark.excel")
      .option("sheetName", config.sheetName)
      .option("treatEmptyValuesAsNulls", true)
      .option("useHeader", true)
      .mode("overwrite")
      .save(config.path)

This is reproducible by loading an Excel or CSV file with blanks in any of the columns and attempting to save it. I added treatEmptyValuesAsNulls just to make sure I wasn't missing an option, but looking at the code it seems that a case needs to be added to handle a == null. It is also possible I'm doing something wrong here, and if so, I apologize!
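
Until the writer handles nulls itself, a possible workaround (a sketch using the standard DataFrameNaFunctions): replace nulls before saving so every value reaching toCell is non-null.

// Fill string columns with "" and numeric columns with 0 before writing.
val writable = dataFrame.na.fill("").na.fill(0)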
