Apache Hadoop: It is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop is an Apache top-level project built and used by a global community of users. Hadoop services provide data storage, data processing, data access, security, and operations.
Apache Spark: It is a fast, general-purpose engine for large-scale data processing. Spark is an open-source implementation of Resilient Distributed Datasets (RDDs). It has an advanced DAG (Directed Acyclic Graph) execution engine and supports in-memory computing. It is fast, flexible, and developer friendly.
HADOOP VS SPARK
1) On the Basis of Speed/Performance:
Hadoop uses the MapReduce programming model to process large data sets. MapReduce reads from and writes to disk between processing steps, so processing takes more time.
Spark is fast because it processes data in memory; as a result, it can be much faster than MapReduce: up to 100 times for data in RAM and up to 10 times for data on disk.
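The difference between re-reading input from disk on every pass and keeping it in memory can be illustrated with a toy model. This is only a sketch in plain Python, not Hadoop or Spark code: `CountingStore`, `run_disk_based`, and `run_in_memory` are hypothetical names, and the "disk" is a list whose reads are counted.

```python
class CountingStore:
    """Stand-in for disk storage that counts how many times it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def run_disk_based(store, iterations):
    """Every pass re-reads its input from the store (assumed disk-backed)."""
    total = 0
    for _ in range(iterations):
        data = store.read()
        total += sum(data)
    return total


def run_in_memory(store, iterations):
    """Read the input once, keep it in memory, and reuse it for every pass."""
    cached = store.read()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_store = CountingStore(range(1000))
memory_store = CountingStore(range(1000))
disk_total = run_disk_based(disk_store, iterations=10)
memory_total = run_in_memory(memory_store, iterations=10)
print(disk_store.reads, memory_store.reads)  # prints: 10 1
```

Both runs compute the same total; only the number of reads from the slow store differs, which is where an in-memory engine would be expected to save time on repeated passes over the same data.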
2) On the Basis of Difficulty:
In Hadoop, the developer has to write code for each and every operation, which makes Hadoop MapReduce difficult to use. It provides low-level APIs, so the user has to rely on hand coding.
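To give a feel for what hand coding each step means, here is a minimal word count written as explicit map, shuffle, and reduce phases. It is a plain-Python sketch of the MapReduce model, not the Hadoop API; the function names and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    pairs = []
    for line in lines:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data big cluster", "data cluster data"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```

Even in this small form, every phase has to be spelled out by the programmer; in a real MapReduce job each phase also needs its own class and job configuration.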
Spark offers many high-level operators on Resilient Distributed Datasets (RDDs), which makes programming easier. It provides rich APIs in Java, Python, Scala, and R.
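The appeal of high-level operators is that a whole job can be written as one short chain of calls. The sketch below imitates that style in plain Python. `MiniDataset` is a hypothetical, in-process stand-in written for this example and is not Spark's API; its method names are only loosely modeled on the RDD operators `flatMap`, `map`, `reduceByKey`, and `collect`.

```python
class MiniDataset:
    """Hypothetical, in-process stand-in for an RDD; not Spark's API."""

    def __init__(self, items):
        self._items = list(items)

    def flat_map(self, fn):
        """Apply fn to each item and flatten the resulting sequences."""
        return MiniDataset(out for item in self._items for out in fn(item))

    def map(self, fn):
        """Apply fn to each item."""
        return MiniDataset(fn(item) for item in self._items)

    def reduce_by_key(self, fn):
        """Combine the values of (key, value) pairs that share a key."""
        merged = {}
        for key, value in self._items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return MiniDataset(merged.items())

    def collect(self):
        """Return the items as a plain list."""
        return list(self._items)


lines = MiniDataset(["big data big cluster", "data cluster data"])
counts = (
    lines.flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
    .collect()
)
print(sorted(counts))  # prints: [('big', 2), ('cluster', 2), ('data', 3)]
```

The word count itself takes four chained calls; the grouping and merging logic lives inside the operators instead of in user code.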
3) On the Basis of Fault Tolerance:
Both Hadoop and Spark are fault tolerant. As a result, we do not have to restart the application from scratch if a failure occurs.
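One way to picture this is a job split into independent tasks, where only the failed task is rerun. The following is a simplified sketch under that assumption, in plain Python; `run_with_task_retry` and `flaky_sum` are hypothetical names, and the failure is simulated. The real recovery mechanisms in Hadoop and Spark are more involved.

```python
def run_with_task_retry(partitions, task, max_attempts=3):
    """Run task on each partition, retrying only the partition that failed."""
    results = []
    attempts_log = []
    for index, partition in enumerate(partitions):
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(task(index, partition, attempt))
                attempts_log.append((index, attempt))
                break
            except RuntimeError:
                # Give up only after the last allowed attempt.
                if attempt == max_attempts:
                    raise
    return results, attempts_log


def flaky_sum(index, partition, attempt):
    """Simulated task: partition 1 fails on its first attempt only."""
    if index == 1 and attempt == 1:
        raise RuntimeError("simulated node failure")
    return sum(partition)


partitions = [[1, 2], [3, 4], [5, 6]]
results, attempts_log = run_with_task_retry(partitions, flaky_sum)
print(results)       # prints: [3, 7, 11]
print(attempts_log)  # prints: [(0, 1), (1, 2), (2, 1)]
```

Partitions 0 and 2 run once each; only partition 1 is attempted twice, and the job still finishes with the full result.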
4) On the Basis of Scalability:
Hadoop MapReduce is highly scalable because we can keep adding nodes to the cluster. The largest known Hadoop cluster has 14,000 nodes.
Spark is also highly scalable because we can keep adding nodes to the cluster. The largest known Spark cluster has 8,000 nodes.
5) On the Basis of Hardware Requirement:
Hadoop MapReduce runs well on commodity hardware.
Spark needs mid-level to high-level hardware.
6) On the Basis of Machine Learning:
Hadoop requires a separate machine learning tool, for example Apache Mahout.
Spark has its own set of machine learning libraries (MLlib).
7) On the Basis of Real-Time Analysis:
Hadoop is mainly designed to perform batch processing on large amounts of data. Therefore, Hadoop MapReduce is not suited to real-time data processing.
Spark can process real-time data efficiently. Programmers can also modify the data through real-time processing.
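One common way to get near-real-time results from a batch-style engine is to cut the incoming stream into small batches and update the result after each one; Spark Streaming works along these lines. The sketch below shows the idea in plain Python. The function names, the batch size, and the sample events are assumptions for illustration, not Spark code.

```python
def micro_batches(events, batch_size):
    """Yield consecutive small batches from an incoming sequence of events."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def running_totals(events, batch_size):
    """Update a running total after each micro-batch, not only at the end."""
    total = 0
    snapshots = []
    for batch in micro_batches(events, batch_size):
        total += sum(batch)
        snapshots.append(total)
    return snapshots


snapshots = running_totals([5, 1, 4, 2, 8, 3, 7], batch_size=3)
print(snapshots)  # prints: [10, 23, 30]
```

A pure batch job would report only the final value, 30; the micro-batch version produces an up-to-date total after every few events.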
8) On the Basis of Security:
Hadoop MapReduce is more secure because it supports Access Control Lists (ACLs).
Spark is less secure because the only authentication it supports is shared-secret (password) authentication.
9) On the Basis of Lines of Code:
Hadoop 2.0 has 120,000 lines of code.
Spark was developed in merely 20,000 lines of code.
10) On the Basis of Cost:
Hadoop is the cheaper option in terms of cost.
Spark requires a lot of RAM to run, which increases the cost of the cluster.