Apache Spark and Hadoop

Spark is intended to enhance, not replace, the Hadoop stack. From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon’s S3. As such, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.

We have also consistently focused on making it as easy as possible for every Hadoop user to take advantage of Spark’s capabilities. Whether you run Hadoop 1.x or Hadoop 2.0 (YARN), and whether or not you have administrative privileges to configure the Hadoop cluster, there is a way for you to run Spark! In particular, there are three ways to deploy Spark in a Hadoop cluster: standalone, YARN, and SIMR.


Standalone deployment: With the standalone deployment, one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MapReduce. The user can then run arbitrary Spark jobs against data stored in HDFS.

Hadoop YARN deployment: Hadoop users who have already deployed or are planning to deploy Hadoop YARN can simply run Spark on YARN without any pre-installation or administrative access required.

Spark In MapReduce (SIMR): For Hadoop users who are not yet running YARN, another option besides the standalone deployment is to use SIMR to launch Spark jobs inside MapReduce.


Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
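
To make the DAG idea concrete, here is a minimal plain-Python sketch. It is not the Spark API: the `Stage` class, its `cache()` method, and the sample numbers are hypothetical names chosen for illustration. It only shows the general pattern of building a pipeline lazily and letting two jobs reuse one in-memory intermediate result.

```python
from typing import Callable, List, Optional


class Stage:
    """One node in a pipeline DAG; nothing runs until collect() is called."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self._cached: Optional[List[int]] = None
        self._cache_enabled = False
        self.runs = 0  # how many times this stage was actually computed

    def map(self, fn: Callable[[int], int]) -> "Stage":
        # Add a child node that depends on this stage.
        return Stage(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred: Callable[[int], bool]) -> "Stage":
        # Add a child node that keeps only matching elements.
        return Stage(lambda: [x for x in self.collect() if pred(x)])

    def cache(self) -> "Stage":
        # Keep the result in memory after the first computation.
        self._cache_enabled = True
        return self

    def collect(self) -> List[int]:
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.runs += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result


source = Stage(lambda: list(range(1, 11)))
squares = source.map(lambda x: x * x).cache()  # shared intermediate stage

# Two separate "jobs" branch off the same cached stage.
even_squares = squares.filter(lambda x: x % 2 == 0).collect()
total = sum(squares.map(lambda x: x + 1).collect())
```

In this sketch the second job reads `squares` from memory instead of recomputing it from `source`, which is the kind of reuse the paragraph above describes.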

Spark runs on top of an existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. Spark applications can be deployed in an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), in a Hadoop v2 YARN cluster, or even on Apache Mesos.

We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop. It is intended not to replace Hadoop but to provide a comprehensive, unified solution for managing different big data use cases and requirements.

Once Spark is up and running with Hadoop, you can launch it in one of three modes: local, yarn-client, or yarn-cluster.

  • Local mode: this launches a single Spark shell with all Spark components running within the same JVM. This is good for debugging on your laptop or on a workbench. Here’s how you’d invoke Spark in local mode:
    cd $SPARK_HOME
    ./bin/spark-shell --master local

  • Yarn-cluster: the Spark driver runs within the Hadoop cluster as a YARN Application Master and spins up Spark executors within YARN containers. This allows Spark applications to run within the Hadoop cluster and be completely decoupled from the workbench, which is used only for job submission. An example:
    cd $SPARK_HOME
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 3 --driver-memory 1g --executor-memory 2g --executor-cores 1 --queue thequeue $SPARK_HOME/examples/target/spark-examples_*-1.2.1.jar

    Note that in the example above, the --queue option is used to specify the Hadoop queue to which the application is submitted.
  • Yarn-client: the Spark driver runs on the workbench itself with the Application Master operating in a reduced role. It only requests resources from YARN to ensure the Spark workers reside in the Hadoop cluster within YARN containers. This provides an interactive environment with distributed operations. Here’s an example of invoking Spark in this mode while ensuring it picks up the Hadoop LZO codec:
    cd $SPARK_HOME
    bin/spark-shell --master yarn --deploy-mode client --queue research --driver-memory 512M --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201409171947.jar
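
The three modes differ mainly in the values passed to --master and --deploy-mode, so a small wrapper can keep them straight. The following is a rough sketch only: `MODES` and `build_command` are hypothetical names, not part of Spark, and the helper uses only flags that appear in the examples above plus `--master local`. It assembles the argument list and does not launch anything.

```python
import shlex
from typing import Dict, List, Optional, Tuple

# Mode name -> (--master value, --deploy-mode value or None).
MODES: Dict[str, Tuple[str, Optional[str]]] = {
    "local": ("local", None),
    "yarn-client": ("yarn", "client"),
    "yarn-cluster": ("yarn", "cluster"),
}


def build_command(tool: str, mode: str, queue: Optional[str] = None,
                  extra: Optional[List[str]] = None) -> List[str]:
    """Return the argument list for bin/spark-shell or bin/spark-submit."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    master, deploy_mode = MODES[mode]
    args = [tool, "--master", master]
    if deploy_mode is not None:
        args += ["--deploy-mode", deploy_mode]
    if queue is not None:
        # Assumption of this sketch: a queue is only meaningful on YARN.
        if master != "yarn":
            raise ValueError("--queue only applies to YARN modes")
        args += ["--queue", queue]
    return args + list(extra or [])


shell_cmd = build_command("bin/spark-shell", "yarn-client", queue="research",
                          extra=["--driver-memory", "512M"])
print(shlex.join(shell_cmd))  # shell-quoted form, for copy and paste
```

The resulting list could be handed to `subprocess.run` with `cwd` set to your Spark home directory, assuming Spark is installed there; that step is left out so the sketch stays runnable without a cluster.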


