Installing PySpark with Jupyter
Check List
Python is a wonderful programming language for data analytics. Normally, I prefer to write Python code inside a Jupyter Notebook (previously known as IPython Notebook), because it lets us create and share documents that contain live code, equations, visualizations and explanatory text. Apache Spark is a fast and general engine for large-scale data processing, and PySpark is the Python API for Spark. So if you are interested in data science, writing PySpark code inside Jupyter is a good starting point:
IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g
Install Jupyter
If you are a Python user, I highly recommend installing Anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
Go to https://www.continuum.io/downloads and follow the instructions for downloading and installing Anaconda (Jupyter will be included):
$ wget https://{somewhere}/Anaconda2-2.4.1-MacOSX-x86_64.sh
$ bash Anaconda2-2.4.1-MacOSX-x86_64.sh
$ python
Python 2.7.11 |Anaconda 2.4.1 (x86_64)| (default, Dec 6 2015, 18:57:58)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>>
You can now easily run the Jupyter Notebook:
$ jupyter notebook # Go to http://localhost:8888
Install Spark
If you are not familiar with Spark, you can read the official Spark documentation first. Here are simple instructions for installing Spark:
# MacOS
$ brew install apache-spark
# Linux
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
$ tar zxvf spark-1.6.0-bin-hadoop2.6.tgz
$ vim .bashrc
export PATH=/{your_path}/spark-1.6.0-bin-hadoop2.6/sbin:$PATH
export PATH=/{your_path}/spark-1.6.0-bin-hadoop2.6/bin:$PATH
$ source .bashrc
# Run PySpark shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.11 (default, Dec 6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
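Once the shell is up, the SparkContext sc is ready to use. A quick sanity check (just a sketch; any small RDD computation will do):
>>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
50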
To launch the PySpark shell with IPython as the driver instead of the plain Python shell:
$ PYSPARK_DRIVER_PYTHON=ipython pyspark
or
$ IPYTHON=1 pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.11 (default, Dec 6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
In [1]:
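On top of the same sc, IPython adds tab completion and magics; a quick check (a sketch):
In [1]: rdd = sc.parallelize(range(1000))
In [2]: rdd.sum()
Out[2]: 499500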
To launch PySpark inside a Jupyter notebook instead of a shell:
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
or
$ IPYTHON_OPTS="notebook" pyspark
# You can also specify the executor memory
$ IPYTHON_OPTS="notebook" pyspark --executor-memory 7g
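In the notebook that opens, sc is already defined in every cell. A minimal example cell (a sketch; README.md is just an illustrative file path relative to where the notebook was started):
# Count the lines of a local text file using the pre-created SparkContext
lines = sc.textFile("README.md")
lines.count()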
It's assumed you have deployed a Spark cluster in standalone mode and that the master IP is localhost:
IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g
# You can also ship extra Python modules to the executors (see the sketch below)
IPYTHON_OPTS="notebook" pyspark \
--master spark://localhost:7077 \
--executor-memory 7g \
--py-files tensorflow-py2.7.egg
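Modules shipped with --py-files become importable inside transformations running on the executors. A sketch (my_module and my_module.transform are illustrative names, not something provided by the egg above):
# Illustrative only: my_module is assumed to be shipped via --py-files
import my_module
result = sc.parallelize([1, 2, 3]).map(my_module.transform).collect()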
Basics (Windows)
Download winutils for your Windows version.
HADOOP_HOME
Set HADOOP_HOME to the winutils folder.
SPARK_HOME
Set SPARK_HOME to the Spark folder and add %SPARK_HOME%\bin to the Path variable.
If you get the error "Unknown error in handling PYTHONSTARTUP", try opening the command prompt from the Spark folder.
Set SPARK_HOME from Python
# Configure the necessary Spark environment (Windows version)
import os
import sys

# Use a raw string (or forward slashes) so backslashes are not treated as escapes
spark_path = r"D:\spark-2.1.0-bin-hadoop2.7"

os.environ['SPARK_HOME'] = spark_path
# winutils.exe is expected under %HADOOP_HOME%\bin
os.environ['HADOOP_HOME'] = spark_path

# Make the PySpark libraries importable
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
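After running the cell above, a quick smoke test is to start and stop a local SparkContext (a sketch, assuming the paths point at a valid Spark install):
# Smoke test: import PySpark and spin up a local context
from pyspark import SparkContext
sc = SparkContext("local[2]", "path-check")
print(sc.version)
sc.stop()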
Install Anaconda and spark-2.0.2-bin-hadoop2.7
To run Jupyter with Spark (Windows version):
set PYSPARK_DRIVER_PYTHON=ipython
# If your Python 2.7 IPython is installed as ipython2, use that instead;
# whatever command launches your IPython shell goes after the '=' sign
set PYSPARK_DRIVER_PYTHON=ipython2
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
set IPYTHON_OPTS=notebook
pyspark --master spark://localhost:7077 --executor-memory 7g
WIN32 error on the Jupyter notebook kernel
Install the 32-bit version of Anaconda even if you have a 64-bit machine.
Other
# Inspect the interpreter's module search path
import sys
from pprint import pprint as p
p(sys.path)
spark-submit
# Create a SparkContext inside your own script before using the RDD API
from pyspark import SparkContext
from pyspark import SparkConf
sc = SparkContext("local", "test")
http://linbojin.github.io/2016/01/27/Hacking-pyspark-in-Jupyter-Notebook/
Running spark-submit with a Python file and sample.txt as arguments:
D:\spark-2.0.2-bin-hadoop2.7\bin>spark-submit --master spark://192.168.0.15:7077 ..\worldcount.py ..\sample.txt 1000
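For reference, a minimal word-count script along these lines might look as follows (a sketch only; the actual worldcount.py used above is not reproduced in these notes):
# worldcount.py (illustrative sketch): count words in the file given as argv[1]
# and print at most argv[2] of the resulting (word, count) pairs
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="worldcount")
    counts = (sc.textFile(sys.argv[1])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.take(int(sys.argv[2])):
        print(word, count)
    sc.stop()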
To run ES-Hadoop
Add elasticsearch-hadoop-5.2.2.jar to the jars folder:
pyspark --master local[4] --jars D:/spark/jars/elasticsearch-hadoop-5.2.2.jar
bin\spark-submit --master local[4] --jars jars/elasticsearch-hadoop-5.2.2.jar ..\ESCount.py
pyspark --jars jars/elasticsearch-hadoop-5.2.2.jar
https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
https://qbox.io/blog/elasticsearch-in-apache-spark-python
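Following the pattern in those two posts, reading an index through ES-Hadoop from PySpark looks roughly like this (a sketch; my_index/my_type and localhost are illustrative, and the connector jar must be on the classpath as shown above):
# Read documents from Elasticsearch as an RDD of (id, document) pairs
es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "my_index/my_type"  # index/type to read (illustrative)
}
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)
print(rdd.take(1))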
Debug
To attach a remote Java debugger to the executor JVM, add these options to spark-submit:
--num-executors 1 --executor-cores 1 --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=n,address=192.168.0.15:5005,suspend=n"
os.environ['SPARK_HOME'] = r"D:\spark"
sys.path.append(r"D:\spark\python")
os.environ['PYTHONPATH'] = r"D:\spark-2.0.2-bin-hadoop2.7\python"
Spark 2.1.0 doesn't support Python 3.6.0. To solve this, change the Python version in your Anaconda environment by running the following commands in that environment:
To install Python 3.5 in Anaconda: conda install python=3.5.0
To list available Python versions: conda search python
To list the installed Jupyter kernels:
jupyter kernelspec list
Installing Toree via Pip:
pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
This will install a Jupyter application called toree:
jupyter toree install
jupyter toree install --user
Installing Multiple Kernels:
jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL
You can configure Spark options at install time with the --spark_opts command line option:
jupyter toree install --interpreters=PySpark --spark_opts='--master=local[4]'
jupyter notebook