
Comments (12)

gregosaurus commented on July 19, 2024

Hello @aprasa02 !

What values are set for --driver-memory, --executor-memory, and --executor-cores when submitting your Spark job?

How many partitions does your file contain?

Usually 100 MB is not a problem for Spark.

Best,

Gregory.


aprasa02 commented on July 19, 2024

Cluster-mode execution with driver memory 5g, executor memory 50g, and 64 executor cores. We set the number of partitions to 100. I suspect the issue arises when we union multiple DataFrames, which causes the memory problem: we were testing with 28 tabs, i.e. a union of 28 DataFrames that is then returned. Since the crealytics API has no option to create a DataFrame for all sheets in one go, I had to use POI to get the sheet names, pass each one to the crealytics API, and then union them all.
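
For reference, here is a minimal Scala sketch of that POI-plus-union approach. It assumes an existing SparkSession named spark and the 0.9.x-era option names (sheetName, useHeader); the file path is a placeholder.

    import java.io.FileInputStream
    import org.apache.poi.ss.usermodel.WorkbookFactory

    // List the sheet names with POI, then read each sheet with spark-excel
    // and union the per-sheet DataFrames. All sheets must share one schema.
    val workbook = WorkbookFactory.create(new FileInputStream("/path/to/file.xlsx"))
    val sheetNames = (0 until workbook.getNumberOfSheets).map(workbook.getSheetName)
    workbook.close()

    val allSheets = sheetNames
      .map { name =>
        spark.read
          .format("com.crealytics.spark.excel")
          .option("sheetName", name)
          .option("useHeader", "true")
          .load("/path/to/file.xlsx")
      }
      .reduce(_ union _)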


nightscape commented on July 19, 2024

Can you try serializing the Excel DataFrames to e.g. Parquet and see if you get the same issues when using those?
Putting multiple sheets into one DataFrame would in theory be possible, but all your sheets would need to have the same schema.
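
Something along these lines (a hedged sketch; the DataFrame and path names are placeholders):

    // Write the Excel-derived DataFrame to Parquet once, then run the
    // heavy union/aggregation work against the Parquet copy instead.
    excelDf.write.mode("overwrite").parquet("/tmp/excel_snapshot.parquet")
    val parquetDf = spark.read.parquet("/tmp/excel_snapshot.parquet")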


davidw76 commented on July 19, 2024

I am getting an out-of-memory error with the online retail dataset.


nightscape commented on July 19, 2024

It seems that Apache POI is not the most memory-efficient library out there, mostly because it provides a mutable API where you can read a file, modify arbitrary cells, and then write it back.
We could switch to something like https://github.com/monitorjbl/excel-streaming-reader, which handles only the reading part and can therefore stream the data.
I guess that would allow you to read almost arbitrarily sized data.
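
For reference, a rough sketch of how that library is used directly, based on its README; treat the builder parameters as assumptions:

    import java.io.FileInputStream
    import com.monitorjbl.xlsx.StreamingReader

    // Returns a read-only, streaming implementation of POI's Workbook;
    // only a bounded window of rows is kept in memory at any time.
    val workbook = StreamingReader.builder()
      .rowCacheSize(100)   // rows cached in memory per sheet
      .bufferSize(4096)    // bytes buffered when reading the file
      .open(new FileInputStream("/path/to/large.xlsx"))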


habelson commented on July 19, 2024

@aprasa02 Not related, but FYI: in most Hadoop/Spark deployments you should be able to use the s3a filesystem directly and shorten your code. Instead of setting the path option, pass the s3a URL into load:

.load("s3a://bucket_name/folders/filename)


nightscape commented on July 19, 2024

Hey @litzebauer,
I saw you created a fork and fixed this issue.
Is it working out?
Do you want to create a PR?


litzebauer commented on July 19, 2024


nightscape commented on July 19, 2024

I created a PR for streaming here: #36
It was much harder than expected to make everything really stream without accidentally consuming the iterator, and there is still one failing test which I can hopefully fix soon.


nightscape commented on July 19, 2024

I just released 0.9.6 to the Sonatype Maven repo.
It features a maxRowsInMemory option which, when set, switches to a streaming implementation based on https://github.com/monitorjbl/excel-streaming-reader
That library doesn't support all of Apache POI's functionality, which is why I didn't make it the default (yet).
When running in a dev environment, I actually had to override the version of jackson because of conflicts with the Spark-packaged version. If you run into any problems that look like a jackson version mismatch, let me know; then we need to shade the dependency.
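
Minimal usage sketch (the file path is a placeholder):

    // With maxRowsInMemory set, spark-excel switches to the streaming
    // reader and keeps at most this many rows per sheet in memory.
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("useHeader", "true")
      .option("maxRowsInMemory", 20)
      .load("/path/to/large.xlsx")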


davidw76 commented on July 19, 2024

I am getting some build errors with 0.9.6...

[error] 71 errors were encountered during merge
java.lang.RuntimeException: deduplicate: different file contents found in the following:
C:\Users\dwb\.ivy2\cache\com.crealytics\spark-excel_2.11\jars\spark-excel_2.11-0.9.6.jar:com/fasterxml/jackson/annotation/JsonAutoDetect$1.class
C:\Users\dwb\.ivy2\cache\com.fasterxml.jackson.core\jackson-annotations\bundles\jackson-annotations-2.8.0.jar:com/fasterxml/jackson/annotation/JsonAutoDetect$1.class


davidw76 commented on July 19, 2024

So I fixed the above problem by modifying my assembly merge strategy.
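
Roughly like this in build.sbt (a sketch that assumes picking the first of the duplicate jackson classes is acceptable):

    assemblyMergeStrategy in assembly := {
      case PathList("com", "fasterxml", "jackson", _*) => MergeStrategy.first
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }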

but then I ran into the same problem as you when running unit tests. I fixed that by adding this to build.sbt:

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shadeio.@1").inAll
)

