In this chapter, we will understand the environment setup of PySpark.
Note − This is considering that you have Java and Scala installed on your computer.
Let us now download and set up PySpark with the following steps.
Step 1 − Go to the official Apache Spark download page and download the latest version of Apache Spark available there. In this tutorial, we are using spark-2.1.0-bin-hadoop2.7.
Step 2 − Now, extract the downloaded Spark tar file. By default, it will get downloaded in Downloads directory.
# tar -xvf Downloads/spark-2.1.0-bin-hadoop2.7.tgz
It will create a directory spark-2.1.0-bin-hadoop2.7. Before starting PySpark, you need to set the following environments to set the Spark path and the Py4j path.
export SPARK_HOME = /home/hadoop/spark-2.1.0-bin-hadoop2.7 export PATH = $PATH:/home/hadoop/spark-2.1.0-bin-hadoop2.7/bin export PYTHONPATH = $SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH export PATH = $SPARK_HOME/python:$PATH
Or, to set the above environments globally, put them in the .bashrc file. Then run the following command for the environments to work.
# source .bashrc
Now that we have all the environments set, let us go to Spark directory and invoke PySpark shell by running the following command −
# ./bin/pyspark
This will start your PySpark shell.
Python 2.7.12 (default, Nov 19 2016, 06:48:10) [GCC 5.4.0 20160609] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Python version 2.7.12 (default, Nov 19 2016 06:48:10) SparkSession available as 'spark'. <<<