Cursory Knowledge: August 2012

Friday, August 31, 2012

Yes! We Have NoSQL Today.

Once you have a system which can process huge amounts of data (big data), you need a place to store all of that data. This is what databases are for.

Traditionally, this has meant a relational database. But relational databases place many constraints on how the data is modeled ("normal forms"), and those constraints fit poorly with the high-volume data sources that need to be analyzed (e.g. all the webpages in the world, all the legal documents in your company, or all the tweets being posted each day).

Relational DBs require that data be modeled into a set of tables that contain unique entities (rows), described by attributes (columns) which are arranged in such a way as to describe one aspect of each entity in each table with no redundancies.

Said another way:

Each row has a primary key made from one or more columns, and each column holds a single value (1NF).
Every non-key column depends on the complete primary key, not just part of it (2NF).
Every non-key column depends only on the key, not on other non-key columns in that table (3NF).
To add columns which do not fit these constraints, you must put them in another table and join the tables together.

A good example of what this means is here.

Said yet another way: the key, the whole key and nothing but the key.

These restrictions allow for optimal query structuring and performance while minimizing anomalies due to data changes. However, they do not easily accommodate data sets whose contents lack a simple, regular structure.
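As a small illustration of the normal forms described above, here is a minimal sketch using Python's built-in sqlite3 module. The customers/orders schema and the sample rows are made up for this example; the point is only that each fact lives in one table and a join rebuilds the combined view.

```python
import sqlite3

# Hypothetical schema (not from any real system): customers and their orders
# are kept in separate tables so that each fact is stored exactly once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- the key
        name        TEXT NOT NULL,
        city        TEXT NOT NULL          -- depends only on customer_id
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        item        TEXT NOT NULL          -- one value per column (1NF)
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Ada", "London"), (2, "Grace", "New York")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, "book"), (11, 1, "pen"), (12, 2, "lamp")])


def orders_with_city(connection):
    """Join the two tables to rebuild the combined view on demand."""
    return connection.execute("""
        SELECT o.order_id, c.name, c.city, o.item
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        ORDER BY o.order_id
    """).fetchall()


# Changing a customer's city touches one row, not every order row.
conn.execute("UPDATE customers SET city = 'Paris' WHERE customer_id = 1")
rows = orders_with_city(conn)
```

Had the city been copied into every order row instead, the same update would have to touch several rows, which is the kind of change anomaly normalization is meant to avoid.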

Non-Relational (NoSQL) DBs remove the restrictions on data normalization and focus, instead, on optimizing around data that does not fit well into the normalized structure which relational DBs (mostly) require. Because there are different analyses of interest and different data sources which "best" embody the data of interest, there are different types of NoSQL databases.

Below is a diagram showing the various database types.

Key Value / Wide Column (aka Bigtable)

Data is stored in a GIANT ordered table of rows and columns.

Rows and columns still serve the same general purpose as in the relational DB case:

rows = unique entities
columns = attributes.
...but normalization is not required (or expected)...

Data may be sparsely populated in the columns.

I.e. a given row may only have data values for a small fraction of the columns (because most of the columns don't apply to the entity this row describes).
Columns may be VERY large in number and depend on what the DB is structured to query for.

e.g. all unique word pairs for the entities in the database

Google originally developed this technology for searching through web pages to fulfill search criteria. Roughly speaking:

rows = web pages
columns = search terms
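To make the "sparse, ordered table" idea concrete, here is a toy sketch in Python. It is not how Bigtable or any real product is implemented; the SparseTable class, its method names and the sample pages are all invented for illustration. Each row only stores the columns it actually has, and rows can be scanned in key order.

```python
from collections import defaultdict


class SparseTable:
    """Toy wide-column table: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Missing cells cost nothing to store; they are simply absent.
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start_key, end_key):
        """Yield (row_key, columns) for start_key <= row_key < end_key, in key order."""
        for key in sorted(self._rows):
            if start_key <= key < end_key:
                yield key, dict(self._rows[key])

    def rows_with(self, column):
        """Return the sorted keys of rows that have a value in this column."""
        return sorted(k for k, cols in self._rows.items() if column in cols)


# Hypothetical data: rows are web pages, columns are search terms,
# values are how often the term appears on the page.
table = SparseTable()
table.put("example.com/a", "nosql", 3)
table.put("example.com/a", "database", 5)
table.put("example.com/b", "banana", 1)
table.put("example.org/x", "nosql", 1)

pages_mentioning_nosql = table.rows_with("nosql")
example_com_pages = [key for key, _ in table.scan("example.com", "example.org")]
```

Note that no row carries a value for every column, and adding a brand new column is just another put call; nothing has to be declared up front.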

Document

Entire documents are stored in a searchable format.

JSON (JavaScript Object Notation)
BSON (Binary JavaScript Object Notation)
XML (eXtensible Markup Language)

Queries search through the documents to identify the information of interest and return statistics or the document IDs.
Good for finding actual documents which contain specific information or summarizing the information contained in a set of documents.
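A rough sketch of the two query styles just mentioned (returning document IDs, and summarizing a set of documents), using plain JSON strings in a Python dict. The document contents, IDs and helper names are hypothetical, and a real document database would use indexes rather than parsing every document on each query.

```python
import json

# Hypothetical documents, stored as JSON text keyed by document ID.
STORE = {
    "doc-1": json.dumps({"type": "contract", "parties": ["Acme", "Globex"], "year": 2011}),
    "doc-2": json.dumps({"type": "memo", "author": "J. Smith", "year": 2012}),
    "doc-3": json.dumps({"type": "contract", "parties": ["Acme", "Initech"], "year": 2012}),
}


def find_ids(store, predicate):
    """Return the sorted IDs of documents whose parsed content satisfies predicate."""
    return sorted(doc_id for doc_id, raw in store.items() if predicate(json.loads(raw)))


def count_by(store, field):
    """Summarize the store: how many documents carry each value of field."""
    counts = {}
    for raw in store.values():
        value = json.loads(raw).get(field)
        counts[value] = counts.get(value, 0) + 1
    return counts


# Documents need not share the same fields, so the query uses .get() defaults.
acme_contracts = find_ids(
    STORE,
    lambda d: d.get("type") == "contract" and "Acme" in d.get("parties", []),
)
per_year = count_by(STORE, "year")
```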

Graph

Stores information about relationships between entities (objects) in the DB
Good for finding objects that are related to each other according to certain criteria.

e.g. find people (entities) who are members of the YMCA (another entity) who lived in New York in 1999.
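The YMCA example can be sketched with a plain list of labeled edges. This is only an illustration of the idea, not how a real graph database stores or queries data; the people, the relationship names and the date ranges are invented.

```python
# Hypothetical edge list: (source, relationship, target, properties).
EDGES = [
    ("alice", "MEMBER_OF", "YMCA", {}),
    ("bob", "MEMBER_OF", "YMCA", {}),
    ("carol", "MEMBER_OF", "Chess Club", {}),
    ("alice", "LIVED_IN", "New York", {"from": 1997, "to": 2003}),
    ("bob", "LIVED_IN", "Boston", {"from": 1995, "to": 2005}),
    ("carol", "LIVED_IN", "New York", {"from": 1998, "to": 2000}),
]


def sources(edges, relationship, target, keep=lambda props: True):
    """Entities with an edge of the given type to target whose properties pass keep."""
    return {
        source
        for source, rel, tgt, props in edges
        if rel == relationship and tgt == target and keep(props)
    }


def ymca_members_in_new_york(edges, year):
    """People who are YMCA members and lived in New York during the given year."""
    members = sources(edges, "MEMBER_OF", "YMCA")
    residents = sources(
        edges, "LIVED_IN", "New York",
        keep=lambda props: props["from"] <= year <= props["to"],
    )
    return sorted(members & residents)


result_1999 = ymca_members_in_new_york(EDGES, 1999)
```

The query is an intersection of two relationship lookups, which is the sort of question that takes several joins in a relational schema.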

How does this relate to Big Data?
Many NoSQL DBs are built to operate on distributed file systems and process queries via distributed computing. In fact, the data being looked at is so large and so unstructured (hard to normalize) that it could be stored and handled in no other way. Some examples of the scale:

operating on >100 petabytes and ingesting >500 terabytes/day for Facebook
processing >20 petabytes/day for Google
processing >340 million tweets/day for Twitter (~44 GB/day, but a huge number of entities)
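For the Twitter figure above, the back-of-the-envelope arithmetic can be checked in a few lines. The 140 bytes per tweet is an assumption (140 one-byte characters, no metadata), so treat the result as a rough lower bound on the raw text volume.

```python
TWEETS_PER_DAY = 340_000_000   # figure quoted above
BYTES_PER_TWEET = 140          # assumption: 140 one-byte characters, no metadata
SECONDS_PER_DAY = 86_400

bytes_per_day = TWEETS_PER_DAY * BYTES_PER_TWEET
gb_decimal = bytes_per_day / 10**9    # 1 GB  = 10^9 bytes
gib_binary = bytes_per_day / 2**30    # 1 GiB = 2^30 bytes
tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY

print(f"{gb_decimal:.1f} GB/day, {gib_binary:.1f} GiB/day, "
      f"{tweets_per_second:.0f} tweets/s on average")
```

Under that assumption the text comes to about 47.6 GB (about 44.3 GiB) per day, which matches the ~44 figure when gigabytes are counted in powers of two. The byte volume is small; the challenge is the several thousand new entities arriving every second.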

Sunday, August 19, 2012

Can't Test This... A/B Testing

A/B testing is a very powerful tool for developing certain kinds of products. Here are a few thoughts on where it does and doesn't work.

Below is a high level flow of the testing cycle.

A/B testing works when:

The cost of implementing versions A and B +
The cost of collecting "enough" data about A and B +
The cost of rolling out the "best" version,

is less than:

The cost of visiting "enough" of your key customers +
The cost of spending "enough" time with each of them to understand the full requirements.

Or framed a different way...

If it costs a significant amount to develop, test or deploy the thing you want to evaluate or

If you cannot get adequate information back from the customer base or

If you cannot get information back in a reasonable amount of time,

then there may be better development approaches than A/B testing.
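The cost comparison above can be written down as a tiny model. This is just a sketch of the inequality; the class names, field names and the sample figures (in arbitrary cost units) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ABTestCosts:
    implement_variants: float   # build version A and version B
    collect_data: float         # gather "enough" data about A and B
    roll_out_winner: float      # deploy the "best" version to everyone

    def total(self) -> float:
        return self.implement_variants + self.collect_data + self.roll_out_winner


@dataclass
class CustomerResearchCosts:
    visit_customers: float      # reach "enough" of your key customers
    time_with_customers: float  # time spent understanding the full requirements

    def total(self) -> float:
        return self.visit_customers + self.time_with_customers


def ab_testing_is_cheaper(ab: ABTestCosts, research: CustomerResearchCosts) -> bool:
    """True when the A/B route costs less than talking to customers directly."""
    return ab.total() < research.total()


# Hypothetical numbers: a web page tweak versus a hardware change.
web_tweak = ABTestCosts(implement_variants=2, collect_data=1, roll_out_winner=1)
hardware_change = ABTestCosts(implement_variants=500, collect_data=200, roll_out_winner=300)
research = CustomerResearchCosts(visit_customers=20, time_with_customers=30)

web_result = ab_testing_is_cheaper(web_tweak, research)
hardware_result = ab_testing_is_cheaper(hardware_change, research)
```

The model deliberately ignores the time dimension mentioned above; slow feedback could be folded into the data collection cost if it needs to be counted.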

Monday, August 13, 2012

The Essence of a Marketing Requirements Document (MRD)

A few thoughts about writing MRDs.

NOTE: The MRD should describe WHAT needs to happen overall and between parts. The MRD should not (usually) describe HOW all the parts get implemented - that is for the engineering design document.

Describe the end-to-end scope of the problem to be solved
Break the problem into logical sub-problems
Describe the inputs required to resolve each sub-problem. This includes:

human interfaces for data input

one time
interactive / iterative

machine / data inputs from external data
machine / data inputs from internal (transient) data

Describe what output should be generated by resolving each sub-problem. This includes:

which data is needed as "the" output, i.e. the "permanent" data.

What input format do the consumer(s) of this data expect?

which data is needed to address another sub-problem, i.e. transient data.
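One way to sanity-check the breakdown described above is to write the sub-problems down as data and look for transient outputs that nothing consumes. The sketch below is only an illustration of that idea; the SubProblem structure, its field names and the example sub-problems are hypothetical and not part of any MRD template.

```python
from dataclasses import dataclass, field


@dataclass
class SubProblem:
    name: str
    inputs: set = field(default_factory=set)              # human, external or transient inputs
    permanent_outputs: set = field(default_factory=set)   # "the" output
    transient_outputs: set = field(default_factory=set)   # feeds another sub-problem


def unconsumed_transient_outputs(sub_problems):
    """Map each sub-problem name to transient outputs no other sub-problem takes as input."""
    dangling = {}
    for sp in sub_problems:
        consumed_elsewhere = set()
        for other in sub_problems:
            if other is not sp:
                consumed_elsewhere |= other.inputs
        leftover = sp.transient_outputs - consumed_elsewhere
        if leftover:
            dangling[sp.name] = leftover
    return dangling


# Hypothetical MRD broken into two sub-problems.
ingest = SubProblem(
    name="ingest",
    inputs={"uploaded file"},
    transient_outputs={"parsed records", "debug trace"},
)
report = SubProblem(
    name="report",
    inputs={"parsed records", "report settings"},
    permanent_outputs={"weekly report"},
)

gaps = unconsumed_transient_outputs([ingest, report])
```

Anything the check flags is either not really needed or points at a sub-problem that is missing from the document.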

All of this should be written with an eye to the system in which the functionality described by the MRD lives.

Every input is the output of another system, ideally described by an MRD (reference it if you can).
Other systems may need the output of the system described by your MRD. Include these systems as examples in your MRD to give color to the bigger picture problem being solved.
Human input interfaces (User Interfaces), Machine Input interfaces (APIs) and Permanent Data stores (HDDs or Databases) may be shared between multiple systems. If they are, or should be, note that explicitly.

One obvious challenge with the recursive approach to MRD writing described here is figuring out where to stop.

How big should the scope of THE problem be?

My experience: when in doubt, make the scope too big. Then scale back the scope during reviews based on feedback from the stakeholders.