Comments (6)
Can you find out what the heap size for the master node is and if you can increase it?
from spark-excel.
I don't know how to find that. Can you help me?
Thanks
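For reference, the heap in question here is usually the Spark driver's, since schema inference runs on the driver. A minimal sketch of raising it (the 4g value and jar name are illustrative, not from this thread):

```shell
# Option 1: pass the driver heap size at submit time
spark-submit --driver-memory 4g your-app.jar

# Option 2: set it persistently in conf/spark-defaults.conf
# spark.driver.memory 4g
```

In a notebook environment such as Zeppelin, the equivalent setting would go in the interpreter's Spark properties rather than on the command line.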
This is the stack trace I am getting:
Job 20180716-132201_1796942326 is finished, status: ERROR, exception: null, result: %text java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.&lt;init&gt;(ByteArrayOutputStream.java:77)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.&lt;init&gt;(ZipInputStreamZipEntrySource.java:125)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.&lt;init&gt;(ZipInputStreamZipEntrySource.java:56)
at org.apache.poi.openxml4j.opc.ZipPackage.&lt;init&gt;(ZipPackage.java:100)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:324)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:184)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:149)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:60)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:60)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:60)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:64)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:63)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:257)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:256)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:256)
at com.crealytics.spark.excel.ExcelRelation.&lt;init&gt;(ExcelRelation.scala:84)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 7 elided
Have you tried using the maxRowsInMemory option?
https://github.com/crealytics/spark-excel#create-a-dataframe-from-an-excel-file
I think that configuration worked for me. I have only run it on a small subset of the data so far.
What would be an ideal number for maxRowsInMemory?
Thanks
It's probably not that important which exact number you use. The main difference is that using this setting switches to a streaming parser that does not keep all data in memory.
I would go for something like 100.
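Putting that advice together, a minimal sketch of a read with the streaming parser enabled (the file path, header setting, and inferSchema choice are illustrative assumptions, not from this thread):

```scala
// Sketch: reading a large .xlsx with spark-excel's streaming parser.
// maxRowsInMemory switches to a streaming reader that keeps only a
// window of rows in memory instead of the whole workbook.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")          // assumption: first row holds column names
  .option("maxRowsInMemory", 100)    // the value suggested above
  .load("/path/to/large-file.xlsx")  // hypothetical path
```

Note that the stack trace above fails inside inferSchema, which also opens the workbook; supplying an explicit schema via `.schema(...)` would additionally skip that inference pass.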