Hadoop 1 vs 2 vs 3: Know the Differences Quickly
Apache Hadoop’s accessibility and simplicity give it an edge when it comes to writing and running large distributed programs. The progression from Hadoop 1’s more restricted processing model of batch-oriented MapReduce jobs to the more interactive and specialized processing models of Hadoop 2 positioned the Hadoop ecosystem as the dominant big data analysis platform. Hadoop 2 was a major change, as it matured Hadoop from an initial solution into one of the most important Big Data platforms.
Apache Hadoop 3 incorporates a number of enhancements over Hadoop 2.x. Several major upgrades to the shell scripts, dependencies, NameNodes, replication and other core functionality are expected, and the beta release has already caught the imagination of big data engineers worldwide. The new version of the world’s most popular Big Data platform is set to become leaner and more agile, to boost storage and retrieval capabilities, and to add the much-needed support for erasure coding in HDFS. Here’s a look at what a Big Data engineer can expect from this exciting new release.
Comparison: Hadoop 2.x vs Hadoop 3.x
Hadoop 3.0’s erasure coding feature will save Hadoop customers big bucks on hardware infrastructure: they can either shrink their Hadoop cluster to half its size and store the same amount of data, or keep their current cluster hardware and store double the amount of data.
Hadoop 3.0: Additional Features
• JDK 8 as the minimum runtime for Hadoop
• Erasure coding
• Shell scripts rewrite
• More powerful YARN
• API simplification with REST APIs
• Multiple NameNodes to maximize fault tolerance
JDK 8: The Minimum Runtime
Hadoop 3.0 JAR files are compiled to run on JDK 8. This allows Hadoop 3.0 to upgrade its dependencies to modern versions, as many of the libraries it relies on now support only Java 8. If you are still working with an older JDK, it is time to upgrade to JDK 8 to make the most of the enhancements coming in Hadoop 3.0. Java 7 had become a roadblock, since Oracle no longer provides public updates for it and many libraries no longer support it and work better with Java 8.
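Because the runtime baseline is JDK 8, application code written for a Hadoop 3 cluster can rely on Java 8 language features such as lambdas and the Streams API. The snippet below is not Hadoop API code, just a minimal, self-contained illustration (the class name is made up) of the Java 8 style that Hadoop 3-era applications can use:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: a plain Java 8 word count using lambdas and streams.
// The point is that code targeting a Hadoop 3 cluster can assume the
// Java 8 language level, since JDK 8 is the minimum runtime.
public class Java8WordCount {
    public static void main(String[] args) {
        String[] lines = {"hadoop yarn hdfs", "hadoop hdfs"};

        Map<String, Long> counts = Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```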
Erasure Coding
Replication consumes a great deal of storage space; erasure coding, a technique traditionally applied to less frequently accessed data, reduces that overhead drastically. Up to Hadoop 2.x, HDFS stored data with a default replication factor of three: every piece of data was stored three times, ensuring very high reliability against data loss (around 99.99%). With support for erasure coding in Hadoop 3.0, physical disk usage is cut in half (3x disk space consumption drops to roughly 1.5x) while the fault tolerance level increases by about 50%, since an erasure-coded block group can survive three simultaneous failures instead of the two that triple replication survives.
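To make the 3x versus 1.5x figures concrete, here is a small sketch of the arithmetic, assuming a Reed-Solomon layout with 6 data blocks and 3 parity blocks (the split behind the commonly cited 1.5x figure; the actual policy in a given cluster is configurable). The class name and numbers are illustrative only:

```java
// Illustrative arithmetic only; the class name and figures are made up for this sketch.
public class StorageOverhead {
    public static void main(String[] args) {
        double logicalTerabytes = 100.0;          // user data to store

        // 3-way replication: every block is stored three times.
        double replicated = logicalTerabytes * 3.0;

        // Reed-Solomon with 6 data blocks + 3 parity blocks:
        // 9 blocks of raw storage hold 6 blocks of data -> 9/6 = 1.5x overhead.
        double erasureCoded = logicalTerabytes * (6.0 + 3.0) / 6.0;

        System.out.printf("Replication (3x):      %.1f TB raw%n", replicated);
        System.out.printf("Erasure coding (1.5x): %.1f TB raw%n", erasureCoded);
    }
}
```

Running it shows 300 TB of raw storage under 3x replication versus 150 TB under RS(6,3) for the same 100 TB of user data, which is exactly the halving of disk usage described above.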
Shell Scripts Rewrite
In previous versions the shell scripts suffered from long-standing bugs, documentation errors and inconsistencies, which Hadoop 3.0 resolves by rewriting them. The rewritten scripts add new features as well as improvements to existing functionality.
More Powerful YARN
A major improvement in Hadoop 3.0 relates to the way YARN works and what it can support:
YARN Timeline Service v2.0 is the next major iteration of the Timeline Server, following v1.0 and v1.5.
It handles and manages the cluster better, improving scalability, reliability and usability through flows and aggregation. YARN now also offers better resource isolation for disk and network, improved resource utilization, a better user experience, Docker support and more elasticity.
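The flow and aggregation data of Timeline Service v2.0 is served through its own reader endpoints, but to give a feel for how applications talk to YARN programmatically, here is a minimal sketch using the generic YarnClient Java API to list running applications. It assumes a reachable ResourceManager and the usual Hadoop client dependencies on the classpath, and it is not specific to the v2.0 Timeline Service:

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

// Minimal sketch: connect to the YARN ResourceManager and list applications.
// Assumes the ResourceManager address is available via the default Hadoop
// configuration on the classpath; error handling is kept to a bare minimum.
public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        try {
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + "\t"
                        + app.getName() + "\t"
                        + app.getYarnApplicationState());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```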
API Simplification with REST APIs
Application Programming Interfaces are the common channels through which different online and offline applications exchange information, and the Hadoop developers have emphasized simplifying them and easing connectivity between Hadoop and other applications, especially REST-compliant web services.
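As one concrete example of a REST entry point, the WebHDFS API lets any HTTP client browse the file system. The sketch below lists a directory over WebHDFS; the host, port and path are placeholders for a typical single-node setup (note that in Hadoop 3 the NameNode web port default moved to 9870, from 50070 in Hadoop 2):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch: list a directory over the WebHDFS REST API.
// The host, port and path below are placeholders for a typical
// single-node setup; adjust them to match your cluster.
public class WebHdfsListStatus {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9870/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // JSON FileStatuses response
            }
        } finally {
            conn.disconnect();
        }
    }
}
```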
Multiple NameNodes to Maximize Fault Tolerance
Fault tolerance was limited in Hadoop 2.x, as HDFS could run only a single active NameNode and a single standby NameNode. Hadoop 3.0 addresses this limitation: it supports two or more standby NameNodes, providing additional fault tolerance for HDFS.
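The extra standbys are declared through the usual HDFS high-availability properties, now simply listing more than two NameNode ids. The sketch below shows the relevant keys using the Hadoop Configuration API; the nameservice id, NameNode ids and host names are placeholders, and in a real cluster these settings normally live in hdfs-site.xml rather than in code:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the HDFS HA settings involved in running more than one standby
// NameNode. "mycluster", the NameNode ids and the host names are placeholders.
public class MultiStandbyHaConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        conf.set("dfs.nameservices", "mycluster");
        // Three NameNodes: one active plus two standbys (possible in Hadoop 3).
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");

        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "master1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "master2.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn3", "master3.example.com:8020");

        System.out.println("NameNodes: " + conf.get("dfs.ha.namenodes.mycluster"));
    }
}
```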