Spark A to Z

Simplicity, flexibility and performance are the major advantages of using Spark.

Criteria     | Hadoop MapReduce                                        | Apache Spark
-------------|---------------------------------------------------------|--------------------------------------------------------
Memory       | Does not leverage the memory of the cluster to the maximum. | Saves data in memory through RDDs.
Disk usage   | Disk oriented.                                          | Caches data in-memory and ensures low latency.
Processing   | Only batch processing is supported.                     | Supports real-time processing through Spark Streaming.
Installation | Is bound to Hadoop.                                     | Is not bound to Hadoop.

· Spark can be up to 100 times faster than Hadoop MapReduce for big data processing, as it stores data in-memory by placing it in Resilient Distributed Datasets (RDDs); a short caching sketch follows the high-availability notes below.
· Spark is easier to program, as it comes with an interactive mode.
· It provides complete recovery using the lineage graph whenever something goes wrong.

High availability in Apache Spark can be achieved by:
· Implementing single-node recovery with the local file system
· Using standby Masters with Apache ZooKeeper
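Both recovery modes are configured through SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh. Below is a minimal sketch for a standalone cluster; the recovery directory and the ZooKeeper quorum address are placeholders, not values from this document:

    # conf/spark-env.sh -- sketch only; paths and hostnames are placeholders

    # Option 1: single-node recovery backed by the local file system
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
      -Dspark.deploy.recoveryDir=/var/spark/recovery"

    # Option 2: standby Masters coordinated through an Apache ZooKeeper quorum
    # SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
    #   -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
    #   -Dspark.deploy.zookeeper.dir=/spark"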
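To make the in-memory behaviour from the comparison above concrete, here is a minimal PySpark caching sketch (the application name and data are illustrative, and a local installation of pyspark is assumed). The first action computes and caches the RDD; later actions reuse the cached partitions instead of recomputing the whole lineage:

    # minimal caching sketch -- assumes pyspark is installed locally
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-cache-demo")

    rdd = sc.parallelize(range(1000000))
    rdd.cache()                             # mark the RDD for in-memory storage

    print(rdd.count())                      # first action computes and caches
    print(rdd.map(lambda x: x * 2).sum())   # reuses the cached partitions

    sc.stop()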
Installing pyspark with Jupyter: Check List

Python is a wonderful programming language for data analytics. Normally, I prefer to write Python code inside Jupyter Notebook (previously known as IPython), because it allows us to create and share documents that contain live code, equations, visualizations and explanatory text. Apache Spark is a fast and general engine for large-scale data processing, and PySpark is the Python API for Spark. So Jupyter is a good starting point for writing PySpark code if you are interested in data science:

    IPYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g

(Note that IPYTHON_OPTS applies to Spark 1.x; from Spark 2.0 onward, set PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook before running pyspark instead.)

Install Jupyter

If you are a Python user, I highly recommend installing Anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science. Go to https://www.continuum.io/downloads and find the instructions for downloading and installing Anaconda (Jupyter is included).
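Once the notebook is up, a quick way to confirm that the kernel is wired to Spark is to run a tiny job in the first cell. A minimal sketch, assuming the notebook was launched with the pyspark command above, which pre-creates a SparkContext named sc:

    # run in the first notebook cell; `sc` is provided by the pyspark launcher
    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.map(lambda x: x * x).collect())   # expected output: [1, 4, 9, 16]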