Cursory Knowledge: June 2012

Wednesday, June 27, 2012

A little bit about Big Data - Hadoop

One of the ways that Big Data of the sort discussed in this last post is implemented is with an open source technology stack called Hadoop.

Hadoop consists of a two main parts:

HDFS - Hadoop File System
MapReduce infrastructure

These allow data processing jobs to be divided among multiple nodes and then aggregated into a single result. In essence, this constructs a large, parallel computer from many smaller computers - basically the opposite of virtualization.

Schematically, a Hadoop cluster looks like this:

10,000 ft view of how it works

The Job Tracker on the master server gets a job
The Job Tracker breaks up the job using the map function

Basic queuing ensures that any one node is not overloaded with tasks
The are tasks preferentially distributed to the nodes nearest the data on which the task must operate to minimize file transfer overhead*.

nearest = same node as the data resides.
next nearest = different node but behind the same switch (so that data transfer is localized to that network segment).

The Job Tracker gets status for all tasks via the Task Trackers as they run.

If a node stops reporting, the Job Tracker will redistribute that node's tasks to another node.

When all tasks are complete for a job, the Job Tracker has the nodes execute the reduce function to generate a single result from the tasks' output.
The final output may then be used by other applications directly or as the input to another MapReduce iteration.

I am certain that I missed almost every important detail in terms of the actual engineering implementation of a Hadoop cluster. But for those who just need a cursory understanding of the technology in order to make sufficient sense of what engineering is actually talking about to sanity check proposals, I hope this hits the mark.

---------------------------------------------------------------------------------------------------
* HDFS and the awareness of data's physical location is very important in dealing with large data.
To make compute and storage capacity scale linearly with cost, the data must be spread around all the nodes in the cluster and a record kept of where all the data actually is. It can't be kept centrally or data transfer becomes a serious bottleneck to computational performance. I.e. it could take longer to move the data to a compute node over a network than it takes to actually process it.
This distribution is managed by HDFS.

HDFS holds and replicates data in the system in order to minimize the chance of a bottleneck. It does this by keeping at least 3 copies of the data:

an original
a copy on another node behind the same switch as the original
a copy on another node on a different switch from the original

This replication attempts to strike a balance between:

keeping enough copies of the data to minimize the queue size on each node (to ensure timely completion of jobs) and to ensure robust execution despite failed tasks on dead nodes

-and-

reducing the total storage capacity of the system by duplicating data.

Thursday, June 21, 2012

A little bit about Big Data

The inspection tools that I work with are capable of churning out enormous amounts of data - on the order of terabytes an hour. To handle that data volume, we have done what every company did (up until now) which was sample from the data and reformat it to fit into a gigabyte sized database. This lets the data be accessed for useful analyses but creates a problem in that much of the data is actually lost, ultimately limiting what can be learned.

This is traditional data processing.

To store and handle more data, we swap out the existing hardware with bigger (read: more expensive) hardware. This works only up to a point as the cost of bigger h/w does not rise linearly with capacity. So you reach a limit to what is cost effective pretty quickly.

Schematically it looks something like this:

Several data sources structure the data and put it into a database. Programs running on the compute resources access the data from the database and provide some analysis. Scaling the system means getting bigger h/w.

Big Data changes how this can be done.

At its heart, Big Data is about making the data storage size and computing power scale in a linear way with cost. This is done using a few technologies which I will describe in more detail later.

Schematically it looks something like this:

One system coordinates the actions of many nodes in order to generate a desired computing result. Each node contains both compute and storage.
The entire system works in the same basic way regardless of how many nodes are present. So if more data storage or more computing power is required it can be added by provisioning more nodes instead of replacing the entire system with larger nodes. This makes it easy for a company to scale its costs with actual business volume or to handle burst loads via a hybrid cloud approach (i.e. provisioning additional nodes on demand as an IaaS offering) to avoid large capital expenditures due to over provisioning.

Through this architecture, Big Data brings significant change to the limits of how much data can be handled in a timely manner. IBM has a great summary of this principle in its "three Vs".

Volume : Petabytes instead of terabytes.
Velocity : Analyzed in seconds rather than in minutes, hours or days.
Variety : Coming from many sources, including unstructured data sources (i.e. things that don't fit into a relational database very well).

So, instead of throwing away most of the inspector data as we do today, we could keep the data and build a system at reasonable cost which could actually process it. With hard work on new algorithms which could take advantage of the new data would come new insights into the phenomena behind the data.

Not trivial but newly possible.

Friday, June 8, 2012

What is Agile Development an Answer to?

I got to thinking about some of the problems I see at work around software development & roadmap and decided to apply some systems thinking to the situation. This is the result.

The key learning:

Really good use case validation is probably the largest leverage point.
Agile development can be an alternative to really good use case validation.

No surprises there but it is interesting to see the dynamics that lead to those conclusions. The feedback loops suggest alternative paths to address the customer acceptance problem when neither use case validation improvements nor agile development are feasible. For example:

What if you refused to add late features and managed the initial urgency to gain product acceptance? As long as the gaps are fixed in the medium term, the improved roadmap credibility may be enough to gain acceptance in the face of gaps next time because the customer believes your roadmap claims.
If apps and product managers are failing to validate use cases sufficiently, can you increase scrutiny on requirements by engineering and increase insistence on complete test case details by SQC to minimize factors which cause schedule slips and perhaps offset the slips caused by feature adds?