Apache Hadoop: It is an open-source software framework used for distributed storage and distributed processing of Big Data sets on computer clusters built from commodity hardware. Hadoop is an Apache top-level project built and used by a global community of users. Hadoop services provide for data storage, data processing, data access, security and operations.

Apache Spark: It is a fast and general engine for large-scale data processing.
Spark is an open-source implementation of Resilient Distributed Datasets (RDDs). It has an advanced DAG (Directed Acyclic Graph) execution engine and supports in-memory computing. It is fast, flexible and developer friendly.

HADOOP VS SPARK

1) On the Basis of Speed/Performance: Hadoop uses the MapReduce programming model to process Big Data sets. MapReduce reads from and writes to disk, due to which it takes more time to process data. Spark is fast because it performs in-memory processing; as a result, Spark is much faster: up to 100 times for data in RAM and up to 10 times for data in storage.

2) On the Basis of Difficulty: In Hadoop, the user or developer has to write code for each and every operation, which makes Hadoop MapReduce difficult to use.
It provides low-level APIs, due to which the user has to rely on hand coding. In Spark, many high-level operators on Resilient Distributed Datasets (RDDs) are available, which makes programming easier. It provides rich APIs in Java, Python, Scala and R.

3) On the Basis of Fault Tolerance: Both Hadoop and Spark are fault tolerant. As a result, we do not have to restart the application from scratch in case of failure.

4) On the Basis of Scalability: Hadoop MapReduce is highly scalable, as we can add any number of nodes to the cluster. The largest known Hadoop cluster is of about 14,000 nodes. Spark is also highly scalable, as we can add any number of nodes to the cluster.
The largest known Spark cluster is of about 8,000 nodes.

5) On the Basis of Hardware Requirement: Hadoop MapReduce runs well on commodity hardware. Spark needs mid-level to high-level hardware.

6) On the Basis of Machine Learning: Hadoop requires external machine learning tools, for example Apache Mahout. Spark has its own machine learning library, MLlib.
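The MapReduce model described in points 1 and 2 above can be sketched in plain Python. This is an illustrative stand-in, not actual Hadoop code: in real Hadoop, the output of each phase is written to disk, which is the main source of the latency mentioned in point 1.

```python
from collections import defaultdict

# Sketch of the MapReduce model: a map phase emits (key, value) pairs,
# a shuffle groups them by key, and a reduce phase aggregates each group.

def map_phase(lines):
    """Emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big clusters", "spark and hadoop process big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Even for this simple word count, the developer must hand-code every phase, which is the difficulty point 2 refers to.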
7) On the Basis of Real-Time Analysis: Hadoop is mainly designed to perform batch processing on large amounts of data. Therefore, Hadoop MapReduce fails at real-time processing. Spark can process real-time data efficiently, and programmers can also modify the data through real-time processing.

8) On the Basis of Security: Hadoop MapReduce is more secure because it supports Access Control Lists. Spark is less secure because it supports only authentication through a shared secret (password authentication).
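The real-time processing in point 7 works by micro-batching: Spark Streaming collects incoming records into small batches and processes each one with the same operators used for batch jobs. A minimal pure-Python sketch of the idea (hypothetical code, not the real Spark Streaming API):

```python
from collections import Counter

def micro_batches(stream, batch_size):
    """Cut an incoming stream of records into small fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly partial, batch

# A stand-in for a live event stream arriving over time.
events = ["click", "view", "click", "buy", "view", "click", "buy"]

running = Counter()
for batch in micro_batches(events, 3):
    running.update(batch)  # process each small batch as it arrives
print(running["click"])  # 3
```

Because each micro-batch is small, results are available with low latency, whereas a MapReduce job must wait for the whole input before producing output.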
9) On the Basis of Lines of Code: Hadoop 2.0 has about 120,000 lines of code. Spark is developed in merely 20,000 lines of code.
10) On the Basis of Cost: Hadoop is the cheaper option in terms of cost. Spark requires a lot of RAM to run, which increases the cluster requirements and hence the cost.
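To make the high-level operators of point 2 concrete, the chained RDD style that Spark offers can be imitated with a toy in-memory class. This is a hypothetical sketch, not the real PySpark API, but the method names mirror Spark's transformations and actions:

```python
from functools import reduce

class MiniRDD:
    """A toy in-memory stand-in for a Resilient Distributed Dataset."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return MiniRDD(fn(x) for x in self.data)

    def filter(self, pred):
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        return reduce(fn, self.data)

numbers = MiniRDD(range(1, 11))
total = (numbers
         .filter(lambda x: x % 2 == 0)   # keep even numbers: 2, 4, 6, 8, 10
         .map(lambda x: x * x)           # square them: 4, 16, 36, 64, 100
         .reduce(lambda a, b: a + b))    # sum them
print(total)  # 220
```

A pipeline like this replaces what would otherwise be several hand-coded MapReduce jobs, which is why Spark is considered easier to program.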