Comments (12)
Hello @aprasa02!
What values are set for --driver-memory, --executor-memory and --executor-cores when submitting your Spark job?
How many partitions does your file contain?
Usually 100 MB is not a problem for Spark.
Best,
Gregory.
from spark-excel.
We run in cluster mode with driver memory 5g, executor memory 50g and 64 executor cores. We set the number of partitions to 100. I think the issue arises when we union multiple DataFrames, which results in the memory problem. We were testing with 28 tabs, which means building a union of 28 DataFrames and returning it. Since the crealytics API has no option to create a DataFrame for all the sheets in one go, I had to use POI to get the sheet names, pass each one to the crealytics API and then union the results.
Can you try and serialize the Excel Dataframes to e.g. Parquet and see if you get the same issues when using those?
Putting multiple sheets in one Dataframe would in theory be possible, but all your sheets would need to have the same schema.
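For anyone who wants to try this, here is a minimal sketch of the idea: list the sheet names with POI, write each sheet to its own Parquet directory, then read the Parquet data back as one DataFrame. It is only an illustration under several assumptions. The option names `sheetName` and `useHeader` depend on the spark-excel version, the `SheetsToParquet` object and its two command-line arguments are hypothetical, the Excel file is assumed to be readable from the driver's local file system, and all sheets are assumed to share one schema.

```scala
import java.io.FileInputStream

import org.apache.poi.ss.usermodel.WorkbookFactory
import org.apache.spark.sql.SparkSession

object SheetsToParquet {
  // Collect the sheet names with Apache POI. The file is opened on the driver,
  // so it must be readable from the driver's local file system.
  def sheetNames(localPath: String): Seq[String] = {
    val in = new FileInputStream(localPath)
    try {
      val workbook = WorkbookFactory.create(in)
      try {
        (0 until workbook.getNumberOfSheets).map(i => workbook.getSheetName(i))
      } finally {
        workbook.close()
      }
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val Array(excelPath, parquetDir) = args // hypothetical CLI arguments
    val spark = SparkSession.builder().appName("sheets-to-parquet").getOrCreate()

    // Write every sheet to its own Parquet directory instead of unioning in memory.
    val parquetPaths = sheetNames(excelPath).zipWithIndex.map { case (sheet, i) =>
      val target = s"$parquetDir/sheet_$i"
      spark.read
        .format("com.crealytics.spark.excel")
        .option("sheetName", sheet)  // assumption: option name varies by version
        .option("useHeader", "true") // assumption: option name varies by version
        .load(excelPath)
        .write.mode("overwrite").parquet(target)
      target
    }

    // Reading several Parquet paths at once yields a single DataFrame
    // (this assumes all sheets share one schema).
    val combined = spark.read.parquet(parquetPaths: _*)
    println(s"rows across ${parquetPaths.size} sheets: ${combined.count()}")
    spark.stop()
  }
}
```

If the job still runs out of memory while writing the individual sheets, the Excel read itself is the likely culprit rather than the union.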
I am getting an out of memory error with the online retail dataset
It seems that Apache POI is not the most memory-efficient library out there, mostly due to the fact that it provides a mutable API where you can read stuff, modify some arbitrary cells and then write back to a file.
We could switch to something like https://github.com/monitorjbl/excel-streaming-reader which does only the reading part and therefore can stream the data.
I guess that would allow you to read almost arbitrarily-sized data.
@aprasa02 Not related, but FYI: in most Hadoop/Spark deployments you should be able to use the s3a file system directly and shorten your code. Instead of setting the path option, pass the s3a URL into load:
.load("s3a://bucket_name/folders/filename")
Hey @litzebauer,
I saw you created a fork and fixed this issue.
Is it working out?
Do you want to create a PR?
I created a PR for streaming here: #36
It was much harder than expected to make everything really stream and not consume the iterator accidentally, and there is still a failing test which I can hopefully fix soon.
I just released 0.9.6 to the Sonatype Maven repo.
It features a maxRowsInMemory option which, when used, switches to a streaming implementation using https://github.com/monitorjbl/excel-streaming-reader
This library doesn't support all the functionality of Apache POI, which is why I didn't make it the default (yet).
When running in a dev environment, I actually had to override the version of jackson because of conflicts with the Spark-packaged version. If you run into any problems which look like a jackson version mismatch, let me know; then we need to shade the dependency.
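A rough sketch of how the new option might be used from Scala follows. Only `maxRowsInMemory` comes from the release note above. The `useHeader` option, the value 20, the `StreamingExcelRead` object and the `streamingOptions` helper are assumptions made for illustration, and the S3 URL is the placeholder used earlier in this thread.

```scala
import org.apache.spark.sql.SparkSession

object StreamingExcelRead {
  // Hypothetical helper: build the reader options, including the streaming switch.
  def streamingOptions(maxRows: Int): Map[String, String] = Map(
    "useHeader" -> "true",                // assumption: option name varies by version
    "maxRowsInMemory" -> maxRows.toString // setting this selects the streaming reader
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-excel-read").getOrCreate()

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .options(streamingOptions(20)) // 20 is an arbitrary illustrative value
      .load("s3a://bucket_name/folders/filename")

    println(df.count())
    spark.stop()
  }
}
```

Leaving the option out should keep the default, non-streaming Apache POI code path.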
I am getting some build errors with 0.9.6...
[error] 71 errors were encountered during merge
java.lang.RuntimeException: deduplicate: different file contents found in the following:
C:\Users\dwb.ivy2\cache\com.crealytics\spark-excel_2.11\jars\spark-excel_2.11-0.9.6.jar:com/fasterxml/jackson/annotation/JsonAutoDetect$1.class
C:\Users\dwb.ivy2\cache\com.fasterxml.jackson.core\jackson-annotations\bundles\jackson-annotations-2.8.0.jar:com/fasterxml/jackson/annotation/JsonAutoDetect$1.class
So I fixed the above problem by modifying my assembly merge strategy, but then I ran into the same problem as you when running unit tests. I fixed that by adding this to build.sbt:
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.fasterxml.jackson.**" -> "shadeio.@1").inAll
)
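The merge strategy change itself is not shown in this comment. One possible shape for it, as a sketch only, is a build.sbt fragment for the sbt-assembly plugin that keeps the first copy of any duplicated Jackson class and defers to the default strategy for everything else. The choice of `MergeStrategy.first` and the path pattern are assumptions, not the settings actually used here.

```scala
// build.sbt fragment (sbt-assembly, same "in assembly" syntax as the shade rule above)
assemblyMergeStrategy in assembly := {
  // Assumption: duplicated Jackson classes are resolved by keeping the first copy found.
  case PathList("com", "fasterxml", "jackson", xs @ _*) => MergeStrategy.first
  // Everything else falls back to the plugin's default strategy.
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
```

Keeping the first copy only silences the deduplicate error. It does not resolve a real version conflict at runtime, which is what the shade rule addresses.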