
Comments (12)

gregosaurus commented on July 19, 2024

Hello @aprasa02 !

What values are set for --driver-memory, --executor-memory, and --executor-cores when submitting your Spark job?

How many partitions does your file contain?

Usually 100 MB is not a problem for Spark.

Best,

Gregory.


aprasa02 commented on July 19, 2024

Cluster-mode execution with driver memory 5g, executor memory 50g, and 64 executor cores. We set the number of partitions to 100. I suspect the issue arises when we union multiple DataFrames, which causes the memory problem: we were testing with 28 tabs, i.e. a union of 28 DataFrames that is then returned. Since the crealytics API has no option to create a DataFrame for all sheets in one go, I had to use POI to get the sheet names, pass each one to the crealytics API, and then union them all.
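
For reference, here is a minimal Scala sketch of that POI-plus-union approach. It assumes an existing SparkSession named spark and the 0.9.x-era option names (sheetName, useHeader); the file path is a placeholder.

    import java.io.FileInputStream
    import org.apache.poi.ss.usermodel.WorkbookFactory

    // List the sheet names with POI, then read each sheet with spark-excel
    // and union the per-sheet DataFrames. All sheets must share one schema.
    val workbook = WorkbookFactory.create(new FileInputStream("/path/to/file.xlsx"))
    val sheetNames = (0 until workbook.getNumberOfSheets).map(workbook.getSheetName)
    workbook.close()

    val allSheets = sheetNames
      .map { name =>
        spark.read
          .format("com.crealytics.spark.excel")
          .option("sheetName", name)
          .option("useHeader", "true")
          .load("/path/to/file.xlsx")
      }
      .reduce(_ union _)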


nightscape commented on July 19, 2024

Can you try serializing the Excel DataFrames to e.g. Parquet and see if you get the same issues when using those?
Putting multiple sheets into one DataFrame would in theory be possible, but all your sheets would need to have the same schema.
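
Something along these lines (a hedged sketch; the DataFrame and path names are placeholders):

    // Write the Excel-derived DataFrame to Parquet once, then run the
    // heavy union/aggregation work against the Parquet copy instead.
    excelDf.write.mode("overwrite").parquet("/tmp/excel_snapshot.parquet")
    val parquetDf = spark.read.parquet("/tmp/excel_snapshot.parquet")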


davidw76 commented on July 19, 2024

I am getting an out-of-memory error with the online retail dataset.


nightscape commented on July 19, 2024

It seems that Apache POI is not the most memory-efficient library out there, mostly because it provides a mutable API where you can read a file, modify arbitrary cells, and then write it back.
We could switch to something like https://github.com/monitorjbl/excel-streaming-reader, which handles only the reading part and can therefore stream the data.
I guess that would allow you to read almost arbitrarily sized data.
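
For reference, a rough sketch of how that library is used directly, based on its README; treat the builder parameters as assumptions:

    import java.io.FileInputStream
    import com.monitorjbl.xlsx.StreamingReader

    // Returns a read-only, streaming implementation of POI's Workbook;
    // only a bounded window of rows is kept in memory at any time.
    val workbook = StreamingReader.builder()
      .rowCacheSize(100)   // rows cached in memory per sheet
      .bufferSize(4096)    // bytes buffered when reading the file
      .open(new FileInputStream("/path/to/large.xlsx"))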


habelson commented on July 19, 2024

@aprasa02 Not related, but FYI: in most Hadoop/Spark deployments you should be able to use the s3a filesystem directly and shorten your code. Instead of setting the path option, pass the s3a URL into load:

.load("s3a://bucket_name/folders/filename)


nightscape commented on July 19, 2024

Hey @litzebauer,
I saw you created a fork and fixed this issue.
Is it working out?
Do you want to create a PR?


litzebauer commented on July 19, 2024


nightscape commented on July 19, 2024

I created a PR for streaming here: #36
It was much harder than expected to make everything really stream without accidentally consuming the iterator, and there is still one failing test which I can hopefully fix soon.


nightscape commented on July 19, 2024

I just released 0.9.6 to the Sonatype Maven repo.
It features a maxRowsInMemory option which, when set, switches to a streaming implementation based on https://github.com/monitorjbl/excel-streaming-reader
That library doesn't support all of Apache POI's functionality, which is why I didn't make it the default (yet).
When running in a dev environment, I actually had to override the version of jackson because of conflicts with the Spark-packaged version. If you run into any problems that look like a jackson version mismatch, let me know; then we need to shade the dependency.
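
Minimal usage sketch (the file path is a placeholder):

    // With maxRowsInMemory set, spark-excel switches to the streaming
    // reader and keeps at most this many rows per sheet in memory.
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("useHeader", "true")
      .option("maxRowsInMemory", 20)
      .load("/path/to/large.xlsx")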


davidw76 commented on July 19, 2024

I am getting some build errors with 0.9.6...

[error] 71 errors were encountered during merge
java.lang.RuntimeException: deduplicate: different file contents found in the following:
C:\Users\dwb\.ivy2\cache\com.crealytics\spark-excel_2.11\jars\spark-excel_2.11-0.9.6.jar:com/fasterxml/jackson/annotation/JsonAutoDetect$1.class
C:\Users\dwb\.ivy2\cache\com.fasterxml.jackson.core\jackson-annotations\bundles\jackson-annotations-2.8.0.jar:com/fasterxml/jackson/annotation/JsonAutoDetect$1.class


davidw76 commented on July 19, 2024

So I fixed the above problem by modifying my assembly merge strategy.
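
Roughly like this in build.sbt (a sketch that assumes picking the first of the duplicate jackson classes is acceptable):

    assemblyMergeStrategy in assembly := {
      case PathList("com", "fasterxml", "jackson", _*) => MergeStrategy.first
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }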

but then I ran into the same problem as you when running unit tests. I fixed that by adding this to build.sbt:

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shadeio.@1").inAll
)

