NVIDIA RAPIDS setup guide in Spark

This is a general guide on how to set up RAPIDS for Spark on the RI2 cluster.

Environment

OS: CentOS 7.8
Java: 1.8

Tool installation

Due to Oracle's license policy, one of the easiest ways to install Java is to download the JDK locally and upload it to the RI2 cluster via sftp or a similar method. Note that Scala has good support for Java 8 and Java 11, so it is recommended to download Java 8 rather than the newest Java version. Once the Java binary is uploaded and uncompressed on the cluster, it is recommended to create one folder per tool in order to manage different versions:

mkdir java
mkdir maven
mkdir scala
mkdir sbt
mkdir javalib

# upload the downloaded JDK tarball to the cluster and extract it into ./java

# get maven
wget -P ./maven http://apache.spinellicreations.com/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
# get scala
wget -q -P ./scala http://downloads.lightbend.com/scala/2.12.1/scala-2.12.1.tgz
# get sbt
wget -q -P ./sbt https://github.com/sbt/sbt/releases/download/v1.4.0/sbt-1.4.0.tgz
# get spark
wget -P ./javalib https://mirror.olnevhost.net/pub/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

# extract the downloaded archives into their folders
tar -xvzf ./maven/apache-maven-3.6.3-bin.tar.gz -C ./maven
tar -xvzf ./scala/scala-2.12.1.tgz -C ./scala
tar -xvzf ./sbt/sbt-1.4.0.tgz -C ./sbt
tar -xvzf ./javalib/spark-3.0.1-bin-hadoop2.7.tgz -C ./javalib

Set up environment variables

In .bashrc or .bash_profile, append these lines:

#set environment variables
export SPARK_HOME=$HOME/javalib/spark-3.0.1-bin-hadoop2.7
export JAVA_HOME=$HOME/java/jdk1.8.0_261
export SCALA_HOME=$HOME/scala/scala-2.12.1
export SBT_HOME=$HOME/sbt/sbt
export MAVEN_HOME=$HOME/maven/apache-maven-3.6.3
export PATH=$PATH:$SPARK_HOME/bin:$JAVA_HOME/bin:$MAVEN_HOME/bin:$SCALA_HOME/bin:$SBT_HOME/bin
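
After appending these lines, reload the shell configuration and verify that each tool is on the PATH (the version numbers below assume the downloads from the previous step):

source ~/.bashrc
java -version           # should report 1.8.0_261
scala -version          # should report 2.12.1
mvn -version            # should report 3.6.3
spark-submit --version  # should report 3.0.1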

Build Spark-Rapids code and jars

Pull the code from the GitHub repo:

git clone https://github.com/NVIDIA/spark-rapids.git

Go to the repo directory and build it:

mvn clean install -DskipTests  # skip the tests; some of them fail on this setup and are not needed for the build
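
If the build succeeds, the combined plugin jar is typically produced under the dist module's target directory; the path below is an assumption and may vary between versions:

# check the built jar (path is a sketch for the 0.2.x line; adjust to your version)
ls dist/target/rapids-4-spark_2.12-*.jar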

Alternatively, prebuilt jars for RAPIDS and cuDF can be fetched with wget:

wget https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15.jar
wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.2.0/rapids-4-spark_2.12-0.2.0.jar

Launch spark shell with jars

spark-shell --jars '/%path to jar%/rapids-4-spark_2.12-0.2.0.jar,/%path to jar%/cudf-0.15.jar' --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.rapids.sql.incompatibleOps.enabled=true
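
To sanity-check that the plugin is actually active, a minimal smoke test is to run a trivial query and inspect its physical plan for Gpu* operators. The jar paths below are assumptions; substitute your own locations:

# pipe a one-line query into spark-shell; with the plugin working, the printed
# physical plan should contain GPU operators such as GpuHashAggregate
echo 'spark.range(1024).selectExpr("sum(id)").explain()' | \
  spark-shell --jars "$HOME/javalib/rapids-4-spark_2.12-0.2.0.jar,$HOME/javalib/cudf-0.15.jar" \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.incompatibleOps.enabled=true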

For more detailed configuration, please refer to the official NVIDIA RAPIDS for Spark guide.

For TPCx-BB benchmarking, please refer to this guide.

Launching Spark in master-slave mode

Allocate nodes for the master and the slaves. In this step, the Spark framework is not yet started:

# run this command once for each node you want to allocate
srun --partition=batch --mem=16000 -n1 -N1 --pty --preserve-env $SHELL
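
On each allocated node, record the node's IP address so it can be put into the slaves file in a later step (standard Linux and Slurm commands):

hostname -I      # IP address(es) of the current node
squeue -u $USER  # list your running Slurm allocations and their nodes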

Map localhost:8080 to the Spark web UI port on the master node (replace the placeholders with your RI2 username and login host):

ssh -N -L localhost:8080:10.3.1.XX:8080 %username%@%ri2 login host%

Configure $SPARK_HOME/conf/slaves. For the initial installation, just create a file called slaves; we will use $SPARK_HOME/sbin/start-slaves.sh to start all slave nodes at once:

# put the ip address of slaves
10.3.1.XX
10.3.1.XX
10.3.1.XX
10.3.1.XX

Create spark-env.sh:

#copy from the spark-env.sh.template
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh

Add the slave launch configuration. You can find the corresponding lines in the template and edit them; the GPU amount should not exceed the total number of GPUs on the allocated worker node:

SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=$SPARK_HOME/conf/getGpusResources.sh"
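
Other worker settings from spark-env.sh.template can be adjusted in the same file. As a hedged example sized to the 16 GB srun allocation above (both values are assumptions, not requirements):

SPARK_WORKER_CORES=4      # cores each worker offers to executors
SPARK_WORKER_MEMORY=14g   # leave some headroom below the 16 GB Slurm allocation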

Copy the GPU discovery script into $SPARK_HOME/conf/:

cp $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh $SPARK_HOME/conf/getGpusResources.sh
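
To verify the discovery script works on a GPU node, run it directly; it calls nvidia-smi and prints a JSON description of the visible GPUs:

chmod +x $SPARK_HOME/conf/getGpusResources.sh
$SPARK_HOME/conf/getGpusResources.sh
# expected output looks like: {"name": "gpu", "addresses": ["0"]}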

Start the master node. The master does not need a GPU, but it is recommended to run it on an allocated node rather than on the head node:

$SPARK_HOME/sbin/start-master.sh

Start the slave nodes. If GPU access is desired, each slave node should have at least one GPU:

$SPARK_HOME/sbin/start-slaves.sh

One important thing to check before submitting jobs is the log of each node; make sure every node connected successfully. If any node fails, use stop-all.sh to stop everything and restart. For a more detailed guide, please refer to this guide.
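
By default, the standalone daemons write their logs under $SPARK_HOME/logs on each node; the file names follow Spark's daemon naming pattern:

# on the master node
tail -n 20 $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out
# on each slave node
tail -n 20 $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out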

Once the connection is successful, you can go to localhost:8080 to see the status of the cluster; you should get a web page similar to the one below.

(screenshot of the Spark master web UI showing the connected workers)

Two things are important here: 1) make sure all workers show up on the web page, and 2) in the resources column, the GPUs should show up.

At this point, the setup is done. To submit jobs, refer to this guide; a minimal example of connecting to the cluster is sketched below.
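
A sketch reusing the jars and plugin configuration from above to connect spark-shell to the standalone master (port 7077 by default); the jar paths and the master IP are assumptions:

spark-shell --master spark://10.3.1.XX:7077 \
  --jars "$HOME/javalib/rapids-4-spark_2.12-0.2.0.jar,$HOME/javalib/cudf-0.15.jar" \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25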
