
Installing pyspark with Jupyter

Check List

Python is a wonderful programming language for data analytics. Normally, I prefer to write Python code inside a Jupyter Notebook (previously known as IPython Notebook), because it lets us create and share documents that contain live code, equations, visualizations, and explanatory text. Apache Spark is a fast and general engine for large-scale data processing, and PySpark is its Python API. So if you are interested in data science, Jupyter is a good place to start writing PySpark code:

IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g
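For example, once the notebook kernel is up, a first cell might look like the minimal sketch below. It assumes the pre-created SparkContext sc that PySpark provides (see the shell banners later in this post); the words are just sample data:

# Minimal word count to check that the notebook can talk to Spark.
# `sc` is the SparkContext PySpark creates for you.
words = sc.parallelize(["spark", "jupyter", "python", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())   # e.g. [('spark', 2), ('jupyter', 1), ('python', 1)]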

Install Jupyter

If you are a Python user, I highly recommend installing Anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
Go to https://www.continuum.io/downloads and follow the instructions for downloading and installing Anaconda (Jupyter will be included):

$ wget https://{somewhere}/Anaconda2-2.4.1-MacOSX-x86_64.sh
$ bash Anaconda2-2.4.1-MacOSX-x86_64.sh
$ python
Python 2.7.11 |Anaconda 2.4.1 (x86_64)| (default, Dec  6 2015, 18:57:58)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>>

You can then easily run the Jupyter Notebook:
$ jupyter notebook # Go to http://localhost:8888

Install Spark

If you are not familiar with Spark, you can start with the official Spark documentation.

Here are simple instructions for installing Spark:
# MacOS
$ brew install apache-spark
# Linux
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
$ tar zxvf spark-1.6.0-bin-hadoop2.6.tgz
$ vim .bashrc
 export PATH=/{your_path}/spark-1.6.0-bin-hadoop2.6/sbin:$PATH
 export PATH=/{your_path}/spark-1.6.0-bin-hadoop2.6/bin:$PATH
$ source .bashrc
# Run PySpark shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
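A quick smoke test confirms the pre-created sc works (4950 is the sum of the integers 0 through 99):

>>> sc.parallelize(range(100)).sum()
4950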

Launch PySpark inside IPython (Jupyter)

Launch the PySpark shell in IPython:

$ PYSPARK_DRIVER_PYTHON=ipython pyspark
or
$ IPYTHON=1 pyspark
 Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
In [1]:

Launch the PySpark shell in IPython Notebook, http://localhost:8888:

$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
or
$ IPYTHON_OPTS="notebook" pyspark
# You can also specify the executor memory
$ IPYTHON_OPTS="notebook" pyspark --executor-memory 7g

Run PySpark on a cluster inside IPython (Jupyter)

This assumes you have deployed a Spark cluster in standalone mode and that the master IP is localhost.
IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g
# You can also ship extra Python modules with --py-files
IPYTHON_OPTS="notebook" pyspark  \
--master spark://localhost:7077  \
--executor-memory 7g             \
--py-files tensorflow-py2.7.egg
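The --py-files flag ships the egg to the executors, but the import still has to happen in your code. Below is a rough sketch of calling it from the notebook; mymodule and transform are hypothetical names standing in for whatever the egg actually provides.

# Hypothetical example of using code shipped via --py-files.
# `mymodule` and `transform` are placeholders, not real names from the egg.
def apply_transform(x):
    import mymodule               # imported inside the task so it resolves on each executor
    return mymodule.transform(x)

print(sc.parallelize(range(10)).map(apply_transform).collect())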
Basics (Windows setup):

--------------------------Download winutils for your Windows version-------------------

----------------------------------------HADOOP_HOME--------------------------------------
Set HADOOP_HOME to the winutils folder (winutils.exe should be at %HADOOP_HOME%\bin\winutils.exe)

----------------------------------------SPARK_HOME------------------------------------------
Set SPARK_HOME to the Spark folder and add %SPARK_HOME%\bin to the Path variable

----------------For the error "Unknown error in handling PYTHONSTARTUP": try opening the command prompt from the Spark folder-----------------------

-----------------------------------------Set SPARK_HOME from inside Python (e.g. a notebook cell)--------------------------------------
# Configure the necessary Spark environment (Windows version)
import os
import sys

spark_path = r"D:\spark-2.1.0-bin-hadoop2.7"   # raw string so the backslashes are kept literally

spark_home = os.environ.get('SPARK_HOME', None)
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

# Make the PySpark sources and the bundled py4j importable
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
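After running that cell, a quick way to check the paths took effect is to start a throwaway local context (a sketch; the app name is arbitrary):

# Verify the environment configured above by starting a local SparkContext.
from pyspark import SparkContext
sc = SparkContext("local[*]", "env-check")
print(sc.version)                          # should match the folder, e.g. 2.1.0
print(sc.parallelize([1, 2, 3]).count())   # 3
sc.stop()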

-------------------------Install Anaconda and spark-2.0.2-bin-hadoop2.7---------------------------

-----------------------------RUN-----------------------
To run Jupyter with Spark (Windows version):

set PYSPARK_DRIVER_PYTHON=ipython
REM If your Python 2.7 IPython is launched as "ipython2", put that after the '=' sign instead:
set PYSPARK_DRIVER_PYTHON=ipython2

set PYSPARK_DRIVER_PYTHON_OPTS=notebook
set IPYTHON_OPTS=notebook
pyspark --master spark://localhost:7077 --executor-memory 7g
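Once the notebook opens, you can confirm in a cell that the kernel really is attached to the standalone master (a minimal check using the pre-created sc):

# Check which master the notebook's SparkContext is connected to.
print(sc.master)    # expect something like spark://localhost:7077
print(sc.version)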

-------------------------------------WIN32 error on the Jupyter notebook kernel-------------------------------
Install the 32-bit version of Anaconda even if you have a 64-bit machine
----------------------------------------------------------OTHER-----------------------------------
# Inspect the interpreter's module search path
import sys
from pprint import pprint as p
p(sys.path)

-----------------------------spark-submit--------------------------
from pyspark import SparkContext
from pyspark import SparkConf

# Create a local SparkContext named "test" (only one SparkContext can be active at a time)
sc = SparkContext("local", "test")
http://linbojin.github.io/2016/01/27/Hacking-pyspark-in-Jupyter-Notebook/

--Running spark-submit with a Python file and sample.txt as arguments

D:\spark-2.0.2-bin-hadoop2.7\bin>spark-submit --master spark://192.168.0.15:7077 ..\worldcount.py ..\sample.txt 1000
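The worldcount.py script itself is not included in these notes; a minimal script with the same call signature might look like the sketch below (treating the second argument, 1000, as the number of input partitions is only an assumption):

# worldcount.py -- hypothetical sketch matching the spark-submit call above:
#   spark-submit worldcount.py <input-file> <num-partitions>
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    input_path = sys.argv[1]
    num_partitions = int(sys.argv[2]) if len(sys.argv) > 2 else 2   # assumed meaning of "1000"

    sc = SparkContext(appName="worldcount")
    counts = (sc.textFile(input_path, num_partitions)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.take(20):
        print("%s\t%d" % (word, count))
    sc.stop()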

---------------------------To Run ES-Hadoop-----------------------
Add elasticsearch-hadoop-5.2.2.jar to the Spark jars folder

pyspark --master local[4] --jars D:/spark/jars/elasticsearch-hadoop-5.2.2.jar

bin\spark-submit --master local[4] --jars jars/elasticsearch-hadoop-5.2.2.jar ..\ESCount.py

pyspark --jars jars/elasticsearch-hadoop-5.2.2.jar
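With the connector jar on the classpath, PySpark writes to Elasticsearch through the Hadoop OutputFormat the connector provides; the links below describe the pattern in detail. A rough sketch, assuming Elasticsearch on localhost:9200 and a made-up index name:

# Sketch: write an RDD of (id, dict) pairs to Elasticsearch via elasticsearch-hadoop.
# "myindex/mytype" and the host/port are placeholders for your setup.
es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "myindex/mytype",
}
docs = sc.parallelize([("1", {"title": "spark"}), ("2", {"title": "jupyter"})])
docs.saveAsNewAPIHadoopFile(
    path="-",
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)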

https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
https://qbox.io/blog/elasticsearch-in-apache-spark-python

--Debug: attach a remote JVM debugger (JDWP) to the executors---------
--num-executors 1 --executor-cores 1 --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=n,address=192.168.0.15:5005,suspend=n"

# Point the environment at a Spark install (raw strings keep the Windows backslashes literal)
os.environ['SPARK_HOME'] = r"D:\spark"
sys.path.append(r"D:\spark\python")
os.environ['PYTHONPATH'] = r"D:\spark-2.0.2-bin-hadoop2.7\python"


Spark 2.1.0 does not support Python 3.6.0. To work around this, change the Python version in your Anaconda environment by running the following commands in that environment:
To install Python 3.5 in Anaconda: conda install python=3.5.0
To list the available Python versions: conda search python

List the installed Jupyter kernels:
jupyter kernelspec list

Installing Toree via Pip:
pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz

This will install a Jupyter application called toree, which is then used to register the kernel:
jupyter toree install
jupyter toree install --user

Installing Multiple Kernels:
jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL

Spark options can be passed at install time with the --spark_opts command line option:
jupyter toree install --interpreters=PySpark --spark_opts='--master=local[4]'
jupyter notebook  









