Sunday, September 23, 2012

InfoSphere Big Data - A Cursory Overview

After a half day of talking with IBM reps about Big Data products and some use cases, here is my summary of how the pieces fit together.

At time 0 you collect everything and analyze it for correlations to determine which data items are valuable and how they relate to each other (the BigInsights platform). Then you build a control model.

Learning from time 0 is used to configure a "real time" strategy for data analysis and system control.
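The "time 0" step might look something like this in miniature, a sketch in plain Python with illustrative field names and data (BigInsights itself would run this kind of analysis at scale over Hadoop, not like this):

```python
# Hypothetical sketch of the "time 0" step: collect everything, then rank
# pairs of data items by correlation to decide which ones are valuable
# and how they relate. Field names and values are made up for illustration.
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Everything collected at time 0 (one sample list per data item)
collected = {
    "temp":     [20.1, 21.3, 22.8, 24.0],
    "pressure": [1.00, 1.05, 1.11, 1.16],
    "noise":    [3.2, 1.1, 4.4, 0.9],
}

# Rank item pairs by correlation strength; strongly related items
# are candidates for the control model and for warehousing.
ranked = sorted(
    ((a, b, pearson(collected[a], collected[b]))
     for a, b in combinations(collected, 2)),
    key=lambda t: -abs(t[2]),
)
```

Here `temp` and `pressure` come out strongly correlated while `noise` does not, which is the kind of finding that would shape the real-time strategy below.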

  • Streams provides real-time processing of data "on the wire" - nothing need be stored. The output of this is threefold:
    • "Live" reports for users
    • A data subset to feed to the data warehouse
    • Control signals to feed back to the data collectors to adjust behavior (if needed).
  • Netezza (Data Warehouse) provides a location where "fast" analysis on a "limited" subset of the data can occur.
  • Hadoop holds everything else, so that longer term analysis with full data sets is possible. This could be used to:

  • Adjust the control models
  • Change which data subsets are warehoused 
  • Perform ad hoc deep dive analysis. 
  • Perform regular analysis on data sets which are too large to reasonably warehouse (e.g. raw scan data).
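To make the Streams piece concrete, here is a hypothetical sketch of the three-way split it performs on each event as it passes by: a live report, a trimmed-down record for the warehouse, and an optional control signal back to the collector. None of this is the actual Streams API (which uses SPL operators); the names (`WAREHOUSE_FIELDS`, `PRESSURE_LIMIT`, `throttle`) are invented for illustration.

```python
# Hypothetical sketch of the three outputs of real-time stream processing.
# Events are handled one at a time "on the wire" - nothing is stored here.
live_reports = []      # 1. "live" reports for users
warehouse_queue = []   # 2. data subset headed for the warehouse (Netezza)
control_signals = []   # 3. feedback to the data collectors

WAREHOUSE_FIELDS = ("sensor_id", "reading")  # the "limited" subset to warehouse
PRESSURE_LIMIT = 100.0                       # threshold from the time-0 control model

def process(event):
    """Handle a single event and fan it out three ways."""
    # Live report for users
    live_reports.append(f"{event['sensor_id']}: {event['reading']}")
    # Forward only the warehoused fields; drop the bulky raw payload
    warehouse_queue.append({k: event[k] for k in WAREHOUSE_FIELDS})
    # Control signal back to the collector, if the model calls for it
    if event["reading"] > PRESSURE_LIMIT:
        control_signals.append({"sensor_id": event["sensor_id"],
                                "action": "throttle"})

for e in ({"sensor_id": "s1", "reading": 42.0, "raw": "..."},
          {"sensor_id": "s2", "reading": 150.0, "raw": "..."}):
    process(e)
```

The key design point from the IBM pitch survives even in this toy form: the full event (including the raw payload) never touches disk, only the chosen subset reaches the warehouse, and the control loop closes back to the collectors.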