Thursday, June 21, 2012

A little bit about Big Data

The inspection tools that I work with are capable of churning out enormous amounts of data - on the order of terabytes an hour. To handle that data volume, we have done what nearly every company has done (up until now): sample from the data and reformat it to fit into a gigabyte-sized database. This makes the data accessible for useful analyses, but creates a problem in that much of the data is simply lost, ultimately limiting what can be learned.
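To make the "sample and shrink" step concrete, here is a minimal sketch using reservoir sampling, one common way to keep a fixed-size sample from an arbitrarily long stream. The data sizes and names are illustrative, not our actual pipeline - the point is simply that the sample stays small no matter how much raw data streams past, which is exactly where the loss comes from.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from an arbitrarily long stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = random.randrange(i + 1)    # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

measurements = range(1_000_000)            # stands in for terabytes of inspector data
kept = reservoir_sample(measurements, 1000)
print(len(kept))                           # 1000 items kept; the rest are discarded
```

However clever the sampling scheme, everything outside the reservoir is gone for good.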

This is traditional data processing.

To store and handle more data, we swap out the existing hardware for bigger (read: more expensive) hardware. This works only up to a point, since the cost of bigger h/w does not rise linearly with capacity. So you reach the limit of what is cost effective pretty quickly.
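A toy calculation shows why this bites so quickly. The prices and the cost exponent below are made up purely for illustration; the only assumption that matters is that one big machine's price grows faster than linearly with capacity, while identical commodity nodes are added at a flat per-node price.

```python
import math

def scale_up_cost(capacity_tb, base_price=5000.0, exponent=1.6):
    """Cost of one big machine (superlinear in capacity; exponent is illustrative)."""
    return base_price * capacity_tb ** exponent

def scale_out_cost(capacity_tb, node_capacity_tb=2.0, node_price=8000.0):
    """Cost of enough identical commodity nodes to reach the same capacity."""
    nodes = math.ceil(capacity_tb / node_capacity_tb)
    return nodes * node_price

for tb in (2, 20, 200):
    print(f"{tb:>4} TB: scale-up ${scale_up_cost(tb):>12,.0f}  "
          f"scale-out ${scale_out_cost(tb):>12,.0f}")
```

At small sizes the single big box can even be cheaper; the gap opens up - and keeps opening - as capacity grows.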

Schematically it looks something like this:

Several data sources structure the data and put it into a database. Programs running on the compute resources access the data from the database and provide some analysis. Scaling the system means getting bigger h/w.

Big Data changes how this can be done.

At its heart, Big Data is about making data storage and computing power scale linearly with cost. This is done using a few technologies, which I will describe in more detail later.

Schematically it looks something like this:

One system coordinates the actions of many nodes in order to generate a desired computing result. Each node contains both compute and storage.
The entire system works in the same basic way regardless of how many nodes are present. So if more data storage or more computing power is required, it can be added by provisioning more nodes instead of replacing the whole system with bigger ones. This makes it easy for a company to scale its costs with actual business volume, or to handle burst loads via a hybrid cloud approach (i.e. provisioning additional nodes on demand from an IaaS offering) and so avoid large capital expenditures due to over-provisioning.

Through this architecture, Big Data dramatically raises the limits on how much data can be handled in a timely manner. IBM has a great summary of this principle in its "three Vs".

  1. Volume : Petabytes instead of terabytes.
  2. Velocity : Analyzed in seconds rather than in minutes, hours or days.
  3. Variety : Coming from many sources, including unstructured data sources (i.e. things that don't fit into a relational database very well).

So, instead of throwing away most of the inspector data as we do today, we could keep it and build a system at reasonable cost that could actually process it. With hard work on new algorithms that take advantage of the richer data would come new insights into the phenomena behind it.
Not trivial, but newly possible.