Spark A to Z

Simplicity, Flexibility and Performance are the major advantages of using Spark.
Criteria – Hadoop MapReduce vs. Apache Spark
· Memory – MapReduce does not leverage the memory of the Hadoop cluster to the maximum; Spark lets us keep data in memory through RDDs.
· Disk usage – MapReduce is disk oriented; Spark caches data in-memory and ensures low latency.
· Processing – MapReduce supports only batch processing; Spark also supports real-time processing through Spark Streaming.
· Installation – MapReduce is bound to Hadoop; Spark is not bound to Hadoop.

· Spark can be up to 100 times faster than Hadoop MapReduce for certain big data workloads because it keeps the data in-memory, in Resilient Distributed Datasets (RDDs).
· Spark is easier to program as it comes with an interactive mode.
· It provides complete recovery using lineage graph whenever something goes wrong.
high availability in Apache Spark
· Implementing single node recovery with the local file system
· Using Standby Masters with Apache ZooKeeper.
disadvantages of using Apache Spark over Hadoop MapReduce
Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Spark's in-memory capability can at times become a major roadblock for cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.
Apache Spark stores data in-memory for faster model building and training. Machine learning algorithms require multiple iterations to generate an optimal model, and graph algorithms similarly traverse all the nodes and edges. Keeping the data in memory benefits these low-latency, iterative workloads. Less disk access and controlled network traffic make a huge difference when there is a lot of data to be processed.
RDD?
RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark; they represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are –
· Immutable – RDDs cannot be altered; transformations produce new RDDs.
· Resilient – If a node holding a partition fails, the partition can be rebuilt on another node.
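As a rough sketch (runnable in spark-shell, where a SparkContext is available as sc; the file path is hypothetical), an RDD can be created from an in-memory collection or from a file, and is split into partitions across the cluster:

val numbers = sc.parallelize(1 to 1000, numSlices = 4) // distributed across 4 partitions
val lines = sc.textFile("hdfs:///some/input.txt")      // hypothetical path

// RDDs are immutable: transformations return new RDDs instead of modifying the original.
val doubled = numbers.map(_ * 2)
println(numbers.getNumPartitions)                      // number of partitions of the RDD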
SchemaRDD
An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
Lazy Evaluation
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget, but it does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
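A minimal illustration of this laziness in spark-shell (where a SparkContext is available as sc); the side effect inside map() makes the point visible:

val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Nothing runs yet: map() only records the transformation in the lineage graph.
val squared = data.map { x =>
  println(s"squaring $x") // not printed at this point
  x * x
}

// The computation happens only when an action asks for a result.
val total = squared.reduce(_ + _)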
transformations and actions in the context of RDDs.
Transformations are functions that produce a new RDD from an existing one. They are evaluated lazily and only executed when an action is called. Some examples of transformations include map, filter and reduceByKey.
Actions trigger the execution of the RDD computations and return a result to the driver program (or write it to storage). Some examples of actions include reduce, collect, first, and take.
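A short word-count sketch in spark-shell (the input path is hypothetical) showing transformations chained lazily and actions bringing results back to the driver:

val lines = sc.textFile("hdfs:///data/input.txt")

// Transformations: each returns a new RDD and is evaluated lazily.
val words = lines.flatMap(_.split("\\s+"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Actions: trigger execution and return results to the driver.
val sample = counts.take(10)   // a small sample of the result
val totalWords = words.count() // total number of words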
languages supported by Apache Spark for developing big data applications?
Scala, Java, Python and R have official APIs; Clojure can also be used through third-party wrappers.
Spark can also access and analyse data stored in external systems such as Cassandra and Elasticsearch through their respective Spark connectors.

Cluster managers in Spark
· YARN
· Apache Mesos -Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
· Standalone deployments – Well suited for new deployments which only run Spark and are easy to set up.

Spark driver

The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
· The driver prepares the context and declares the operations on the data using RDD transformations and actions.
· The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
· The workers are where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs.
In practical terms, the driver is the program that creates the SparkContext and connects to a given Spark master. In the case of a standalone cluster, the master URL has the form spark://<host>:<port>.
The driver's location is independent of the master and slaves. You could co-locate it with the master or run it from another node. The only requirement is that it must be on a network addressable from the Spark workers.
This is how the configuration of your driver might look:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
      .setMaster("master_url") // this is where the master is specified
      .setAppName("SparkExamplesMinimal")
      .set("spark.local.ip", "xx.xx.xx.xx")    // helps when multiple network interfaces are present; the driver must be on the same network as the master and slaves
      .set("spark.driver.host", "xx.xx.xx.xx") // same as above; this duality might disappear in a future version

val sc = new SparkContext(conf)
    // etc...

Running Spark on YARN - Spark 2.1.0 Documentation - Apache Spark

Spark Standalone Mode - Spark 2.1.0 Documentation - Apache Spark

Running Spark on Mesos - Spark 2.1.0 Documentation - Apache Spark


various ways in which data transfers can be minimized 

1. Using broadcast variables – Broadcast variables enhance the efficiency of joins between small and large RDDs.
2. Using accumulators – Accumulators help update the values of variables in parallel while executing.
3. The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.
broadcast variables 

These are read-only variables that are cached in memory on every machine. When working with Spark, the usage of broadcast variables eliminates the necessity to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which enhances retrieval efficiency compared with an RDD lookup().
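A minimal sketch in spark-shell; the lookup table and the orders RDD are made up for illustration. Each task reads the lookup table from its local broadcast copy instead of shipping it with every task or joining against another RDD:

val countryNames = Map("IN" -> "India", "US" -> "United States", "DE" -> "Germany")
val broadcastNames = sc.broadcast(countryNames)

val orders = sc.parallelize(Seq(("IN", 100.0), ("US", 250.0), ("DE", 80.0)))

val withNames = orders.map { case (code, amount) =>
  (broadcastNames.value.getOrElse(code, "unknown"), amount) // lookup from the broadcast copy
}
withNames.collect().foreach(println)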

Accumulators

Accumulators are variables that are only added to through an associative and commutative operation, for example counters or sums. Tasks running on the workers update an accumulator in parallel, but only the driver program can read its value.
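A minimal sketch in spark-shell using the Spark 2.x accumulator API; the input records are made up for illustration:

val badRecords = sc.longAccumulator("badRecords")

val records = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = records.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException =>
    badRecords.add(1) // updated in parallel on the workers
    None
  }
}

parsed.count()            // an action must run before the value is final
println(badRecords.value) // only the driver reads the accumulated value
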
Lineage graph

The RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage information is used to compute each RDD on demand, so that whenever a part of a persistent RDD is lost, the lost data can be recovered using the lineage graph.


Lineage graphs are always useful to recover RDDs from a failure, but this is generally time consuming if the RDDs have long lineage chains. For such cases Spark provides a checkpointing API, and replicated storage levels can be used with persist(). However, the decision on which data to checkpoint is left to the user. Checkpointing is useful when the lineage graphs are long and have wide dependencies.
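A minimal checkpointing sketch in spark-shell; the checkpoint directory is hypothetical:

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var rdd = sc.parallelize(1 to 100)
for (_ <- 1 to 50) {
  rdd = rdd.map(_ + 1) // each iteration makes the lineage chain longer
}

rdd.checkpoint() // truncates the lineage by materialising the RDD to reliable storage
rdd.count()      // the checkpoint is actually written when an action runs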

fault tolerance
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage: an RDD always has the information on how it was built from other datasets. If any partition of an RDD is lost due to a failure, lineage helps rebuild only that particular lost partition.

Libraries in Spark

· Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
· Spark Streaming – This library is used to process real time streaming data.
· Spark GraphX – Spark API for graph parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
· Spark SQL – Helps execute SQL like queries on Spark data using standard visualization or BI tools.

Sliding Window operation

The Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
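A minimal windowed word-count sketch; the host, port, batch interval and window sizes are made up for illustration (the window and slide durations must be multiples of the batch interval):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

val windowedCounts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(20)) // last 60s, recomputed every 20s

windowedCounts.print()
ssc.start()
ssc.awaitTermination()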

DStream
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of operations –
· Transformations that produce a new DStream.
· Output operations that write data to an external system.
Transformations themselves can be stateless or stateful –
· Stateless transformations – Processing of a batch does not depend on the output of the previous batch. Examples – map(), reduceByKey(), filter().
· Stateful transformations – Processing of a batch depends on the intermediary results of the previous batch. Examples – transformations that depend on sliding windows.
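A minimal DStream sketch; the source host/port and the output prefix are hypothetical:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val events = ssc.socketTextStream("sensor-host", 7777) // a DStream of text lines

// Stateless transformation: each batch is processed independently.
val errors = events.filter(_.contains("ERROR"))

// Output operation: write every batch to an external system (here, text files).
errors.saveAsTextFiles("hdfs:///logs/errors")

ssc.start()
ssc.awaitTermination()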
Parquet file
A Parquet file is a columnar-format file that helps –
· Limit I/O operations
· Consume less space
· Fetch only the required columns.
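A minimal sketch using Spark SQL (assuming a SparkSession is available as spark, as in spark-shell for Spark 2.x); the paths and the column name are hypothetical:

// Write a DataFrame to Parquet: data is stored column by column with its schema.
val df = spark.read.json("hdfs:///data/events.json")
df.write.parquet("hdfs:///data/events.parquet")

// Reading back only the required columns limits I/O.
val names = spark.read.parquet("hdfs:///data/events.parquet").select("name")
names.show()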
persistence in Apache Spark
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are -
· MEMORY_ONLY
· MEMORY_ONLY_SER
· MEMORY_AND_DISK
· MEMORY_AND_DISK_SER
· DISK_ONLY
· OFF_HEAP
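A minimal sketch in spark-shell showing one of these levels in use; the input path is hypothetical:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/app.log")
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK) // keep in memory, spill to disk if it does not fit

errors.count()                               // first action materialises and caches the RDD
errors.filter(_.contains("timeout")).count() // reuses the cached data
errors.unpersist()                           // release the storage when no longer needed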
Logging in Spark
Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.
Scheduling in Spark
Spark basically uses Akka for scheduling: after registering, all the workers request tasks from the master, and the master assigns them. Akka is used for the messaging between the workers and the master.
Core components of a distributed Spark application
· Driver – The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
· Executor – The worker processes that run the individual tasks of a Spark job.
· Cluster Manager – A pluggable component in Spark to launch executors and drivers. The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN.
Worker node
A node that can run the Spark application code in a cluster is called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.
Executor Memory in a Spark application
Every Spark application has a fixed heap size and a fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. A Spark application typically has one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
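A minimal sketch of setting the executor memory programmatically; the values are made up for illustration, and the same setting can be passed to spark-submit as --executor-memory 4g:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ExecutorMemoryExample")
  .set("spark.executor.memory", "4g") // heap size per executor
  .set("spark.executor.cores", "2")   // cores per executor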
Spark Engine
The Spark engine schedules, distributes and monitors the data application across the Spark cluster.
