
Installing pyspark with Jupyter

Check List

Python is a wonderful programming language for data analytics. Normally, I prefer to write Python code inside a Jupyter Notebook (previously known as IPython Notebook), because it lets us create and share documents that contain live code, equations, visualizations, and explanatory text. Apache Spark is a fast and general engine for large-scale data processing, and PySpark is its Python API. So if you are interested in data science, Jupyter is a good place to start writing PySpark code:

IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g
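For example, once the notebook kernel is up, a first cell might look like the minimal sketch below. It assumes the pre-created SparkContext sc that PySpark provides (see the shell banners later in this post); the words are just sample data:

# Minimal word count to check that the notebook can talk to Spark.
# `sc` is the SparkContext PySpark creates for you.
words = sc.parallelize(["spark", "jupyter", "python", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())   # e.g. [('spark', 2), ('jupyter', 1), ('python', 1)]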

Install Jupyter

If you are a Python user, I highly recommend installing Anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
Go to https://www.continuum.io/downloads and follow the instructions for downloading and installing Anaconda (Jupyter will be included):

$ wget https://{somewhere}/Anaconda2-2.4.1-MacOSX-x86_64.sh
$ bash Anaconda2-2.4.1-MacOSX-x86_64.sh
$ python
Python 2.7.11 |Anaconda 2.4.1 (x86_64)| (default, Dec  6 2015, 18:57:58)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>>

You can then easily run the Jupyter Notebook:
$ jupyter notebook # Go to http://localhost:8888

Install Spark

If you are not familiar with Spark, you can start with the official Spark documentation.

Here are simple instructions for installing Spark:
# MacOS
$ brew install apache-spark
# Linux
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
$ tar zxvf spark-1.6.0-bin-hadoop2.6.tgz
$ vim .bashrc
 export PATH=/{your_path}/spark-1.6.0-bin-hadoop2.6/sbin:$PATH
 export PATH=/{your_path}/spark-1.6.0-bin-hadoop2.6/bin:$PATH
$ source .bashrc
# Run PySpark shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
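A quick smoke test confirms the pre-created sc works (4950 is the sum of the integers 0 through 99):

>>> sc.parallelize(range(100)).sum()
4950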

Launch PySpark inside IPython (Jupyter)

Launch the PySpark shell in IPython:

$ PYSPARK_DRIVER_PYTHON=ipython pyspark
or
$ IPYTHON=1 pyspark
 Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
In [1]:

Launch the PySpark shell in IPython Notebook, http://localhost:8888:

$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
or
$ IPYTHON_OPTS="notebook" pyspark
# You can also specify the executor memory
$ IPYTHON_OPTS="notebook" pyspark --executor-memory 7g

Run PySpark on a cluster inside IPython (Jupyter)

This assumes you have deployed a Spark cluster in standalone mode and that the master IP is localhost.
IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g
# You can also ship extra Python modules with --py-files
IPYTHON_OPTS="notebook" pyspark  \
--master spark://localhost:7077  \
--executor-memory 7g             \
--py-files tensorflow-py2.7.egg
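The --py-files flag ships the egg to the executors, but the import still has to happen in your code. Below is a rough sketch of calling it from the notebook; mymodule and transform are hypothetical names standing in for whatever the egg actually provides.

# Hypothetical example of using code shipped via --py-files.
# `mymodule` and `transform` are placeholders, not real names from the egg.
def apply_transform(x):
    import mymodule               # imported inside the task so it resolves on each executor
    return mymodule.transform(x)

print(sc.parallelize(range(10)).map(apply_transform).collect())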
Basics (Windows setup):

--------------------------Download winutils for your Windows version-------------------

----------------------------------------HADOOP_HOME--------------------------------------
Set HADOOP_HOME to the winutils folder (winutils.exe should be at %HADOOP_HOME%\bin\winutils.exe)

----------------------------------------SPARK_HOME------------------------------------------
Set SPARK_HOME to the Spark folder and add %SPARK_HOME%\bin to the Path variable

----------------For the error "Unknown error in handling PYTHONSTARTUP": try opening the command prompt from the Spark folder-----------------------

-----------------------------------------Set SPARK_HOME from inside Python (e.g. a notebook cell)--------------------------------------
# Configure the necessary Spark environment (Windows version)
import os
import sys

spark_path = r"D:\spark-2.1.0-bin-hadoop2.7"   # raw string so the backslashes are kept literally

spark_home = os.environ.get('SPARK_HOME', None)
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

# Make the PySpark sources and the bundled py4j importable
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
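After running that cell, a quick way to check the paths took effect is to start a throwaway local context (a sketch; the app name is arbitrary):

# Verify the environment configured above by starting a local SparkContext.
from pyspark import SparkContext
sc = SparkContext("local[*]", "env-check")
print(sc.version)                          # should match the folder, e.g. 2.1.0
print(sc.parallelize([1, 2, 3]).count())   # 3
sc.stop()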

-------------------------Install Anaconda and spark-2.0.2-bin-hadoop2.7---------------------------

-----------------------------RUN-----------------------
To run Jupyter with Spark (Windows version):

set PYSPARK_DRIVER_PYTHON=ipython
REM If your Python 2.7 IPython is launched as "ipython2", put that after the '=' sign instead:
set PYSPARK_DRIVER_PYTHON=ipython2

set PYSPARK_DRIVER_PYTHON_OPTS=notebook
set IPYTHON_OPTS=notebook
pyspark --master spark://localhost:7077 --executor-memory 7g
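Once the notebook opens, you can confirm in a cell that the kernel really is attached to the standalone master (a minimal check using the pre-created sc):

# Check which master the notebook's SparkContext is connected to.
print(sc.master)    # expect something like spark://localhost:7077
print(sc.version)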

-------------------------------------WIN32 error on the Jupyter notebook kernel-------------------------------
Install the 32-bit version of Anaconda even if you have a 64-bit machine
----------------------------------------------------------OTHER-----------------------------------
# Inspect the interpreter's module search path
import sys
from pprint import pprint as p
p(sys.path)

-----------------------------spark-submit--------------------------
from pyspark import SparkContext
from pyspark import SparkConf

# Create a local SparkContext named "test" (only one SparkContext can be active at a time)
sc = SparkContext("local", "test")
http://linbojin.github.io/2016/01/27/Hacking-pyspark-in-Jupyter-Notebook/

--Running spark-submit with a Python file and sample.txt as arguments

D:\spark-2.0.2-bin-hadoop2.7\bin>spark-submit --master spark://192.168.0.15:7077 ..\worldcount.py ..\sample.txt 1000
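The worldcount.py script itself is not included in these notes; a minimal script with the same call signature might look like the sketch below (treating the second argument, 1000, as the number of input partitions is only an assumption):

# worldcount.py -- hypothetical sketch matching the spark-submit call above:
#   spark-submit worldcount.py <input-file> <num-partitions>
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    input_path = sys.argv[1]
    num_partitions = int(sys.argv[2]) if len(sys.argv) > 2 else 2   # assumed meaning of "1000"

    sc = SparkContext(appName="worldcount")
    counts = (sc.textFile(input_path, num_partitions)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.take(20):
        print("%s\t%d" % (word, count))
    sc.stop()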

---------------------------To Run ES-Hadoop-----------------------
Add elasticsearch-hadoop-5.2.2.jar to the Spark jars folder

pyspark --master local[4] --jars D:/spark/jars/elasticsearch-hadoop-5.2.2.jar

bin\spark-submit --master local[4] --jars jars/elasticsearch-hadoop-5.2.2.jar ..\ESCount.py

pyspark --jars jars/elasticsearch-hadoop-5.2.2.jar
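With the connector jar on the classpath, PySpark writes to Elasticsearch through the Hadoop OutputFormat the connector provides; the links below describe the pattern in detail. A rough sketch, assuming Elasticsearch on localhost:9200 and a made-up index name:

# Sketch: write an RDD of (id, dict) pairs to Elasticsearch via elasticsearch-hadoop.
# "myindex/mytype" and the host/port are placeholders for your setup.
es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "myindex/mytype",
}
docs = sc.parallelize([("1", {"title": "spark"}), ("2", {"title": "jupyter"})])
docs.saveAsNewAPIHadoopFile(
    path="-",
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)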

https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
https://qbox.io/blog/elasticsearch-in-apache-spark-python

--Debug: attach a remote JVM debugger (JDWP) to the executors---------
--num-executors 1 --executor-cores 1 --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=n,address=192.168.0.15:5005,suspend=n"

# Point the environment at a Spark install (raw strings keep the Windows backslashes literal)
os.environ['SPARK_HOME'] = r"D:\spark"
sys.path.append(r"D:\spark\python")
os.environ['PYTHONPATH'] = r"D:\spark-2.0.2-bin-hadoop2.7\python"


Spark 2.1.0 does not support Python 3.6.0. To work around this, change the Python version in your Anaconda environment by running the following commands in that environment:
To install Python 3.5 in Anaconda: conda install python=3.5.0
To list the available Python versions: conda search python

List the installed Jupyter kernels:
jupyter kernelspec list

Installing Toree via Pip:
pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz

This will install a Jupyter application called toree, which is then used to register the kernel:
jupyter toree install
jupyter toree install --user

Installing Multiple Kernels:
jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL

Spark options can be passed at install time with the --spark_opts command line option:
jupyter toree install --interpreters=PySpark --spark_opts='--master=local[4]'
jupyter notebook  









