Wednesday, June 27, 2012

A little bit about Big Data - Hadoop

One of the ways that Big Data of the sort discussed in this last post is implemented is with an open source technology stack called Hadoop.

Hadoop consists of a two main parts:
  • HDFS - Hadoop File System
  • MapReduce infrastructure
These allow data processing jobs to be divided among multiple nodes and then aggregated into a single result. In essence, this constructs a large, parallel computer from many smaller computers - basically the opposite of virtualization.

Schematically, a Hadoop cluster looks like this:

10,000 ft view of how it works

  • The Job Tracker on the master server gets a job
  • The Job Tracker breaks up the job using the map function
    • Basic queuing ensures that any one node is not overloaded with tasks
    • The are tasks preferentially distributed to the nodes nearest the data on which the task must operate to minimize file transfer overhead*.
      • nearest = same node as the data resides.
      • next nearest = different node but behind the same switch (so that data transfer is localized to that network segment).
  • The Job Tracker gets status for all tasks via the Task Trackers as they run. 
    • If a node stops reporting, the Job Tracker will redistribute that node's tasks to another node.
  • When all tasks are complete for a job, the Job Tracker has the nodes execute the reduce function to generate a single result from the tasks' output.
  • The final output may then be used by other applications directly or as the input to another MapReduce iteration.

I am certain that I missed almost every important detail in terms of the actual engineering implementation of a Hadoop cluster. But for those who just need a cursory understanding of the technology in order to make sufficient sense of what engineering is actually talking about to sanity check proposals,  I hope this hits the mark.

* HDFS and the awareness of data's physical location is very important in dealing with large data.
To make compute and storage capacity scale linearly with cost, the data must be spread around all the nodes in the cluster and a record kept of where all the data actually is. It can't be kept centrally or data transfer becomes a serious bottleneck to computational performance. I.e. it could take longer to move the data to a compute node over a network than it takes to actually process it. 
This distribution is managed by HDFS.

HDFS holds and replicates data in the system in order to minimize the chance of a bottleneck. It does this by keeping at least 3 copies of the data:

  • an original
  • a copy on another node behind the same switch as the original
  • a copy on another node on a different switch from the original

This replication attempts to strike a balance between:

  • keeping enough copies of the data to minimize the queue size on each node (to ensure timely completion of jobs) and to ensure robust execution despite failed tasks on dead nodes 
  • reducing the total storage capacity of the system by duplicating data.