mathsigit / spark-overflow

This project forked from allenfang/spark-overflow

A stack overflow for Apache Spark

spark-overflow

A collection of Spark information, solutions, debugging tips and more. Feel free to open a PR to contribute what you have seen or experienced with Apache Spark.

Knowledge

Spark executor memory(ref)

spark-submit --verbose(ref)

Always add the --verbose option to spark-submit to print the following information:
- All default properties
- Command line options
- Settings from spark conf file

Spark Executor on YARN(ref)

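The following settings determine executor memory on YARN:
Memory available to containers on a node - yarn.nodemanager.resource.memory-mb
Executor heap - spark.executor.memory
Memory Overhead - spark.yarn.executor.memoryOverhead
Each executor container must hold the executor heap plus the memory overhead.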
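
As a rough illustration of how these settings relate, the sketch below estimates the memory YARN has to grant for one executor container. It assumes the overhead default of max(384 MB, 10% of executor memory) that some Spark releases document; other releases use a different factor, and YARN may round the request up to its own allocation increment, which this sketch ignores. The function names are hypothetical.

```python
# Assumed defaults for spark.yarn.executor.memoryOverhead; check the
# documentation of your Spark release before relying on these constants.
MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.10


def executor_container_mb(executor_memory_mb, overhead_mb=None):
    """Memory requested from YARN for one executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead_mb


def executors_per_node(node_memory_mb, executor_memory_mb, overhead_mb=None):
    """How many such containers fit in yarn.nodemanager.resource.memory-mb."""
    return node_memory_mb // executor_container_mb(executor_memory_mb, overhead_mb)


if __name__ == "__main__":
    # A 4 GB executor on a node that offers 64 GB to containers.
    print(executor_container_mb(4096))
    print(executors_per_node(65536, 4096))
```

The point of the calculation is that --executor-memory alone understates what YARN must allocate, so a node usually fits fewer executors than node memory divided by executor memory suggests.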

Tuning

Tune the shuffle partitions

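Tune the value of spark.sql.shuffle.partitions, the number of partitions Spark SQL uses when shuffling data for joins and aggregations (default 200).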

Avoid using jets3t 1.9(ref)

It is a jar shipped by default with Hadoop 2.0
It causes inexplicably terrible performance

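Use reduceByKey() instead of groupByKey()

reduceByKey: values for the same key are combined within each partition before the shuffle, so less data moves across the network.

groupByKey: every key-value pair is shuffled, and values are only combined afterwards.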
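
The difference can be modelled without a cluster. The following is a plain-Python sketch, not Spark code: partitions are lists of (key, value) pairs, and the two hypothetical functions count how many records would cross the shuffle when summing values per key with each strategy.

```python
from collections import defaultdict


def sum_group_by_key_style(partitions):
    """groupByKey-style: every record is shuffled, then summed per key."""
    shuffled = [pair for part in partitions for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def sum_reduce_by_key_style(partitions):
    """reduceByKey-style: sum inside each partition first, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per key per partition leaves the partition.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
grouped_totals, grouped_records = sum_group_by_key_style(partitions)
reduced_totals, reduced_records = sum_reduce_by_key_style(partitions)
print(grouped_records, reduced_records)
```

Both strategies produce the same totals; only the number of shuffled records differs, and the gap grows with the number of repeated keys per partition.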

GC policy(ref)

G1GC is a newer garbage collector that you can use
Enable it with -XX:+UseG1GC

Join a large Table with a small table(ref)

The default is ShuffledHashJoin; the problem is that all the data of the large table is shuffled
Use BroadcastHashJoin instead
- It broadcasts the small table to all workers
- Set spark.sql.autoBroadcastJoinThreshold so that the small table falls under the threshold
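
To show why broadcasting avoids shuffling the large table, here is a plain-Python sketch of the idea, not Spark's implementation. The hypothetical `broadcast_hash_join` builds a lookup table from the small side once and lets every partition of the large side probe it in place.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Join each large-side partition against a shared copy of the small side."""
    # Stand-in for the broadcast: one read-only dict available to every worker.
    lookup = dict(small_table)
    joined = []
    for part in large_partitions:
        # Rows stay in their original partition; nothing is repartitioned by key.
        for key, value in part:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return joined


large = [[(1, "x"), (2, "y")], [(3, "z"), (1, "w")]]
small = [(1, "one"), (2, "two")]
result = broadcast_hash_join(large, small)
print(result)
```

This only pays off while the small table fits comfortably in memory on every worker, which is what the threshold setting guards.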

Use `foreachPartition`

If your task involves a large setup cost, use foreachPartition so the setup runs once per partition instead of once per record
For example: a DB connection, a remote call, etc.
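
The saving is easy to see in a small plain-Python model (not Spark code). `FakeConnection` is a hypothetical stand-in for an expensive resource; the two writers differ only in where the connection is opened.

```python
class FakeConnection:
    """Stand-in for a costly resource such as a DB connection."""

    opened = 0  # class-level count of connections created so far

    def __init__(self):
        FakeConnection.opened += 1
        self.rows = []

    def write(self, row):
        self.rows.append(row)

    def close(self):
        pass


def write_per_record(partitions):
    """foreach-style: setup cost is paid for every single record."""
    for part in partitions:
        for row in part:
            conn = FakeConnection()
            conn.write(row)
            conn.close()


def write_per_partition(partitions):
    """foreachPartition-style: setup cost is paid once per partition."""
    for part in partitions:
        conn = FakeConnection()
        for row in part:
            conn.write(row)
        conn.close()


def count_opens(writer, partitions):
    """Run a writer and report how many connections it opened."""
    FakeConnection.opened = 0
    writer(partitions)
    return FakeConnection.opened


data = [[1, 2, 3], [4, 5]]
print(count_opens(write_per_record, data), count_opens(write_per_partition, data))
```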

Data Serialization

The default Java serialization is slow
Use Kryo instead
- conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

Solution

java.io.IOException: No space left on device

This means /tmp is full; check spark.local.dir in spark-defaults.conf
How to fix? Mount more disk space and point spark.local.dir at it
- spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp
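
Before changing the setting, it can help to see how much room each configured directory actually has. The sketch below is one possible check in plain Python; the function name `free_space_report` is hypothetical, and it takes the same comma-separated value that spark.local.dir uses.

```python
import shutil
import tempfile


def free_space_report(local_dirs):
    """Map each directory in a comma-separated list to its free bytes."""
    report = {}
    for directory in local_dirs.split(","):
        directory = directory.strip()
        if not directory:
            continue
        try:
            report[directory] = shutil.disk_usage(directory).free
        except OSError:
            # Directory is missing or unreadable on this machine.
            report[directory] = None
    return report


if __name__ == "__main__":
    # The system temp directory is used here because it exists everywhere;
    # pass your own spark.local.dir value instead.
    for path, free in free_space_report(tempfile.gettempdir()).items():
        print(path, free)
```

Run it on each worker node, since the directories are local to every machine.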

java.lang.OutOfMemoryError: GC overhead limit exceeded(ref)

Too much time is spent in GC; you can check this in the Spark metrics
How to fix?
- Increase executor heap size by --executor-memory
- Increase spark.storage.memoryFraction
- Change the GC policy (e.g. use G1GC)

shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space(ref)

OOM on the Spark driver
This usually happens when you fetch a huge amount of data to the driver (client)
Spark SQL and Streaming are typical workloads that need a large heap on the driver
How to fix?
- Increase --driver-memory

java.lang.NoClassDefFoundError(ref)

The code compiles, but the error appears at run time
How to fix?
- Use --jars to upload jars and place them on the classpath of your application
- Use --packages to include a comma-separated list of Maven coordinates of JARs.
  EX: --packages com.google.code.gson:gson:2.6.2
  This example adds the gson jar to both the executor and driver classpaths
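
If you launch jobs from a script, the two options can be assembled programmatically. This is a minimal sketch in plain Python; `build_submit_command`, the application jar, the main class and the local jar path are all hypothetical, and only the gson coordinate comes from the text above.

```python
import shlex


def build_submit_command(app_jar, main_class, jars=(), packages=()):
    """Assemble a spark-submit argument list with optional dependencies."""
    cmd = ["spark-submit", "--class", main_class]
    if jars:
        # --jars takes one comma-separated list of jar paths.
        cmd += ["--jars", ",".join(jars)]
    if packages:
        # --packages takes one comma-separated list of Maven coordinates.
        cmd += ["--packages", ",".join(packages)]
    cmd.append(app_jar)
    return cmd


cmd = build_submit_command(
    app_jar="my-app.jar",                 # hypothetical application jar
    main_class="com.spark.demo.Main",     # hypothetical main class
    jars=["libs/extra-lib.jar"],          # hypothetical local jar
    packages=["com.google.code.gson:gson:2.6.2"],
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it safe to hand to `subprocess.run` without shell quoting issues.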

Serialization stack error

The error message looks like:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: com.spark.demo.MyClass
Serialization stack:
- object not serializable (class: com.spark.demo.MyClass, value: com.spark.demo.MyClass@6951e281)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 6)
How to fix?
- Make com.spark.demo.MyClass implement java.io.Serializable

java.io.FileNotFoundException: spark-assembly.jar does not exist

How to fix?

Upload spark-assembly.jar to Hadoop (HDFS)
Set spark.yarn.jar, either with --conf on spark-submit or with conf.set("spark.yarn.jar","hdfs://hostname:port/spark-assembly-upload-at-1st.jar") in the application

java.io.IOException: Resource spark-assembly.jar changed on src filesystem (ref)

Spark-assembly.jar does exist.
How to fix?

Upload spark-assembly.jar to Hadoop (HDFS)
Set spark.yarn.jar, either with --conf on spark-submit or with conf.set("spark.yarn.jar","hdfs://hostname:port/spark-assembly-upload-at-1st.jar") in the application
