Stable releases:
2.10.x: 2.10.2 / May 31, 2022[2]
3.2.x: 3.2.4 / July 22, 2022[2]
3.3.x: 3.3.6 / June 23, 2023[2]
Repository: Hadoop Repository
Written in: Java
Operating system: Cross-platform
Type: Distributed file system
License: Apache License 2.0
Website: hadoop.apache.org
Apache Hadoop (/həˈduːp/) is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use.[3] It has since also found use on clusters of higher-end hardware.[4][5] All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.[6]
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. This approach takes advantage of data locality,[7] where nodes manipulate the data they have local access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.[8][9]
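The map, shuffle, and reduce phases described above can be illustrated with the classic word-count example. The sketch below is a single-process conceptual simulation, not the Hadoop API: in a real cluster, each input block resides on a different node and the phases run in parallel there.

```python
from collections import defaultdict

def map_phase(block):
    """Emit (key, value) pairs from one input block, as a mapper would."""
    for line in block:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key; Hadoop performs this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's list of values, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}

# Two "blocks" standing in for file blocks stored on separate nodes.
blocks = [["the quick brown fox"], ["the lazy dog"]]
pairs = [p for block in blocks for p in map_phase(block)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # "the" is counted across both blocks
```

Data locality enters where this simulation cannot show it: Hadoop schedules each `map_phase` call on a node that already holds the corresponding block, so only the much smaller intermediate (key, value) pairs cross the network during the shuffle.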
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;[10][11]
Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
Hadoop Ozone – (introduced in 2020) an object store for Hadoop.
The term Hadoop is often used to refer not only to the base modules and sub-modules but also to the ecosystem,[12] the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Apache Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.[13]
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on MapReduce and Google File System.[14]
The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user's program.[15] Other projects in the Hadoop ecosystem expose richer user interfaces.
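Hadoop Streaming works by feeding input lines to an arbitrary executable over standard input and reading tab-separated key/value lines from its standard output. A minimal sketch of a mapper and reducer that follow this line protocol is shown below; here they are wired together locally on in-memory text (mapper | sort | reducer), whereas under Hadoop the framework would invoke them as separate programs and perform the sort itself.

```python
import io
from itertools import groupby

def mapper(stdin, stdout):
    # Streaming protocol: read raw input lines, emit "key\tvalue" lines.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # Hadoop sorts mapper output by key before the reducer sees it,
    # so consecutive lines sharing a key can be summed directly.
    parsed = (line.rstrip("\n").split("\t") for line in stdin)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")

# Local simulation of the streaming pipeline: mapper | sort | reducer.
map_out = io.StringIO()
mapper(io.StringIO("the quick fox\nthe dog\n"), map_out)
sorted_lines = sorted(map_out.getvalue().splitlines(keepends=True))
result = io.StringIO()
reducer(iter(sorted_lines), result)
print(result.getvalue())
```

On a cluster, the same two functions (saved as standalone scripts) would be supplied to the streaming jar via its -mapper and -reducer options; because the contract is only stdin/stdout text, the scripts could equally be written in Perl, Ruby, or any other language.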
^ "Hadoop Releases". apache.org. Apache Software Foundation. Retrieved 28 April 2019.
^ "Apache Hadoop". Retrieved 27 September 2022.
^ Judge, Peter (22 October 2012). "Doug Cutting: Big Data Is No Bubble". silicon.co.uk. Retrieved 11 March 2018.
^ Woodie, Alex (12 May 2014). "Why Hadoop on IBM Power". datanami.com. Datanami. Retrieved 11 March 2018.
^ Hemsoth, Nicole (15 October 2014). "Cray Launches Hadoop into HPC Airspace". hpcwire.com. Retrieved 11 March 2018.
^ "Welcome to Apache Hadoop!". hadoop.apache.org. Retrieved 25 August 2016.
^ "What is the Hadoop Distributed File System (HDFS)?". ibm.com. IBM. Retrieved 12 April 2021.
^ Malak, Michael (19 September 2014). "Data Locality: HPC vs. Hadoop vs. Spark". datascienceassn.org. Data Science Association. Retrieved 30 October 2014.
^ Wang, Yandong; Goldstone, Robin; Yu, Weikuan; Wang, Teng (October 2014). "Characterization and Optimization of Memory-Resident MapReduce on HPC Systems". 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE. pp. 799–808. doi:10.1109/IPDPS.2014.87. ISBN 978-1-4799-3800-1. S2CID 11157612.
^ "Resource (Apache Hadoop Main 2.5.1 API)". apache.org. Apache Software Foundation. 12 September 2014. Archived from the original on 6 October 2014. Retrieved 30 September 2014.
^ Murthy, Arun (15 August 2012). "Apache Hadoop YARN – Concepts and Applications". hortonworks.com. Hortonworks. Retrieved 30 September 2014.
^ "Continuuity Raises $10 Million Series A Round to Ignite Big Data Application Development Within the Hadoop Ecosystem". finance.yahoo.com. Marketwired. 14 November 2012. Retrieved 30 October 2014.
^ "Hadoop-related projects at". Hadoop.apache.org. Retrieved 17 October 2013.
^ Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons. 19 December 2014. p. 300. ISBN 9781118876220. Retrieved 29 January 2015.
^ "[nlpatumd] Adventures with Hadoop and Perl". Mail-archive.com. 2 May 2010. Retrieved 5 April 2013.