Saturday, January 21, 2012

Apache Hadoop


Apache Hadoop is a framework for running applications on large cluster built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.

General Information

  • HBase, a Bigtable-like structured storage system for Hadoop HDFS
  • Apache Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
  • Hive a data warehouse infrastructure which allows sql-like adhoc querying of data (in any format) stored in Hadoop
  • ZooKeeper is a high-performance coordination service for distributed applications.
  • Hama, a Google's Pregel-like distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.
  • Mahout, scalable Machine Learning algorithms using Hadoop

User Documentation

Setting up a Hadoop Cluster

Tutorials

MapReduce

The MapReduce algorithm is the foundational algorithm of Hadoop, and is critical to understand.

Contributed parts of the Hadoop codebase

  • These are independent modules that are in the Hadoop codebase but not tightly integrated with the main project -yet.

Developer Documentation

No comments:

Post a Comment

Please feel free to contact or comment the article

Search This Blog