Brought to you by Lesley Cordero.
This guide was written in Python 3.5.
Follow these instructions for PySpark setup!
Spark is an engine for distributed computation on clusters that can cache intermediate results in memory between stages, which improves performance. This is in contrast to Hadoop MapReduce, which writes intermediate results to disk between stages. Spark is also a much more flexible system in that it isn't constrained to the MapReduce model.
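To see what in-memory caching looks like in practice, here is a minimal, self-contained sketch (the application name and the toy RDD are placeholders of my choosing; the setup steps that make pyspark importable follow below):

import pyspark

sc = pyspark.SparkContext("local[*]", "cache-demo")

# Square a range of numbers and keep the result in memory after the
# first time it is computed.
squares = sc.parallelize(range(1000000)).map(lambda x: x * x).cache()

# The first action computes and caches the RDD; the second reuses the
# in-memory copy instead of recomputing it from scratch.
print(squares.count())
print(squares.sum())

sc.stop()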
Now, it might seem as though Spark is a replacement for Hadoop, and though it is sometimes used as one, it can also complement Hadoop's functionality. By running Spark on top of a Hadoop cluster, you can still leverage HDFS and YARN while Spark replaces MapReduce.
Spark includes a built-in Read-Evaluate-Print Loop (REPL) in the form of a shell that can be used for interactive analysis. You can begin by entering the following into your terminal:
$SPARK_HOME/bin/pyspark
Next, you can run the following in Python:
import pyspark
sc = pyspark.SparkContext("local[*]", "demo")
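As a quick sanity check, assuming the context above was created successfully, you can run a trivial job (this check is just a suggestion, not part of the original setup):

# Run a trivial job to confirm the SparkContext is working.
print(sc.parallelize(range(100)).count())   # should print 100
print(sc.version)                           # the Spark version in use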
Note that Spark only allows one SparkContext to be active at a time, so you'll need to stop the current one before starting a new one.
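For example, to switch to a new context you can stop the active one first (the application name below is arbitrary):

# Stop the running context, then create a fresh one.
sc.stop()
sc = pyspark.SparkContext("local[*]", "demo2")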
If this doesn't work, make sure your PYTHONPATH includes the PySpark module:
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/:$PYTHONPATH
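After exporting PYTHONPATH, a fresh Python session should be able to import the module (this verification step is my own suggestion, not part of the original guide):

# Confirm that pyspark is now importable and where it was loaded from.
import pyspark
print(pyspark.__file__)   # should point somewhere under $SPARK_HOME/python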
Now in R, the process is very similar:
library(SparkR,
        lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.driver.memory = "2g"))