Once upon a Time I needed this information...: Apache Hadoop

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
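To make the "executed or re-executed" idea concrete, here is a toy sketch in plain Python. It is not Hadoop code and does not reflect how Hadoop actually schedules work; the function names (split_into_fragments, run_with_retries, flaky_sum, run_job) and the simulated failure are all assumptions made up for illustration. The point it shows is that when each fragment is independent and free of side effects, a failed attempt can simply be thrown away and the fragment run again.

```python
from typing import Callable, List, Sequence


def split_into_fragments(records: Sequence[int], size: int) -> List[List[int]]:
    """Divide the input into small, independent fragments of work."""
    return [list(records[i:i + size]) for i in range(0, len(records), size)]


def run_with_retries(task: Callable[[List[int], int], int],
                     fragment: List[int],
                     max_attempts: int = 3) -> int:
    """Run one fragment, re-executing it if an attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(fragment, attempt)
        except RuntimeError:
            # The failed attempt is discarded and the fragment is rerun.
            continue
    raise RuntimeError(f"fragment failed after {max_attempts} attempts")


def flaky_sum(fragment: List[int], attempt: int) -> int:
    """Hypothetical task: sums a fragment, but fails on the first attempt
    for any fragment containing the value 3 (a stand-in for a node failure)."""
    if attempt == 1 and 3 in fragment:
        raise RuntimeError("simulated node failure")
    return sum(fragment)


def run_job(records: Sequence[int], fragment_size: int = 2) -> int:
    """Split the input, run every fragment with retries, combine the results."""
    fragments = split_into_fragments(records, fragment_size)
    partials = [run_with_retries(flaky_sum, f) for f in fragments]
    return sum(partials)
```

Calling run_job([1, 2, 3, 4, 5]) still produces the full sum even though the fragment containing 3 fails once, because that fragment is rerun from its original input.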

General Information

Official Apache Hadoop Website: download, bug-tracking, mailing-lists, etc.
Overview of Apache Hadoop
FAQ: Frequently Asked Questions
What Hadoop is not
Distributions and Commercial Support for Hadoop (RPMs, Debs, AMIs, etc.)
Presentations, books, articles and papers about Hadoop
PoweredBy, a growing list of sites and applications powered by Apache Hadoop
Support
- Getting help from the Hadoop community.
- People and companies for hire.
Hadoop Community Events and Conferences
- HadoopUserGroups (HUGs)
- HadoopSummit
- HadoopWorld
- HadoopMeetupAtApacheCon

Related-Projects

HBase, a Bigtable-like structured storage system for Hadoop HDFS
Apache Pig, a high-level data-flow language and execution framework for parallel computation, built on top of Hadoop Core
Hive, a data warehouse infrastructure that allows SQL-like ad hoc querying of data (in any format) stored in Hadoop
ZooKeeper, a high-performance coordination service for distributed applications
Hama, a distributed computing framework similar to Google's Pregel, based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations
Mahout, scalable machine learning algorithms using Hadoop

User Documentation

Available Java Runtime Environments for Hadoop
ImportantConcepts
GettingStartedWithHadoop (lots of details and explanation)
QuickStart (for those who just want it to work now)
Command Line Options for the Hadoop shell scripts.
Hadoop Code Overview
Troubleshooting: what to do when things go wrong

Setting up a Hadoop Cluster

Setting up a Hadoop Cluster
Running_Hadoop_On_OS_X_10.5_64-bit_(Single-Node_Cluster)
HowToConfigure Hadoop software
WebApps for monitoring your system
How to handle name node failure
How to get metrics into ganglia
Tips for managing a large cluster
Disk Setup: some suggestions
Performance: getting extra throughput
Topology Scripts / Rack Awareness
Virtual Clusters including Amazon AWS
- Virtual Hadoop: the theory
- How to set up a Virtual Cluster
- Running Hadoop on AmazonEC2
- Running Hadoop with AmazonS3

Tutorials

Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) A tutorial on installing, configuring and running Hadoop on a single Ubuntu Linux machine.
Cloudera basic training
Hadoop Windows/Eclipse Tutorial: How to develop Hadoop with Eclipse on Windows.
Yahoo! Hadoop Tutorial: Hadoop setup, HDFS, and MapReduce

MapReduce

MapReduce is the foundational programming model of Hadoop, and understanding it is critical to using the framework well.
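As a rough illustration of the model, the sketch below runs the three classic phases (map, shuffle, reduce) for a word count in a single Python process. It is only a teaching sketch under stated assumptions: it does not use the Hadoop API, nothing is distributed, and the function names (map_phase, shuffle, reduce_phase, word_count) are invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases over an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))
```

For example, word_count(["the quick fox", "the lazy dog"]) counts "the" twice and every other word once. In a real cluster the map calls and the reduce calls would run as separate tasks on different nodes, with the framework performing the grouping step between them.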

HadoopMapReduce
HadoopMapRedClasses
HowManyMapsAndReduces
TaskExecutionEnvironment
HowToDebugMapReducePrograms
Examples
Benchmarks
- Hardware benchmarks
- Data processing benchmarks

Contributed parts of the Hadoop codebase

These are independent modules that are in the Hadoop codebase but not tightly integrated with the main project yet.
- HadoopStreaming (Useful for using Hadoop with other programming languages)
- DistributedLucene, a Proposal for a distributed Lucene index in Hadoop
- MountableHDFS, Fuse-DFS & other Tools to mount HDFS as a standard filesystem on Linux (and some other Unix OSs)
- HDFS-APIs in Perl, Python, PHP and other languages.
- Chukwa, a data collection, storage, and analysis framework
- The Apache Hadoop Plugin for Eclipse (An Eclipse plug-in that simplifies the creation and deployment of MapReduce programs with an HDFS Administrative feature)
- HDFS-RAID: erasure coding in HDFS
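For the HadoopStreaming entry above, here is a minimal sketch of what a mapper and reducer written in another language can look like, using Python. It assumes the usual streaming convention of one record per line with the key and value separated by a tab, and it assumes the reducer receives its input already sorted by key. The function names (mapper, reducer, run) are made up for this example and are not part of Hadoop.

```python
from itertools import groupby
from typing import Iterable, Iterator, TextIO


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1)
              for line in sorted_lines if line.strip())
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run(role: str, stream_in: TextIO, stream_out: TextIO) -> None:
    """Hypothetical entry point: apply the chosen step to a text stream."""
    step = mapper if role == "map" else reducer
    for record in step(stream_in):
        stream_out.write(record + "\n")
```

A small script that calls run("map", sys.stdin, sys.stdout), and a second one that calls run("reduce", sys.stdin, sys.stdout), could then be handed to the streaming jar through its -mapper and -reducer options. The same pair can be tried locally by sorting the mapper output before feeding it to the reducer.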

Saturday, January 21, 2012


