Once you have a system which can process huge amounts of data (big data), you need a place to store all of that data. This is what databases are for.
Traditionally, this has meant a relational database. But relational databases place many constraints on how the data is modeled ("normal forms"), and those constraints are a poor fit for the high-volume data sources which need to be analyzed (e.g. all the webpages in the world, all the legal documents in your company or all the tweets being posted each day).
- Relational DBs require that data be modeled into a set of tables that contain unique entities (rows), described by attributes (columns) which are arranged in such a way as to describe one aspect of each entity in each table with no redundancies.
- Said another way:
- Each row has a primary key made from one or more columns, and each column holds a single value per row (1NF).
- Every non-key column depends on the complete primary key, not just on part of it (2NF).
- No non-key column depends on another non-key column in the same table (3NF).
- To add more columns which do not fit these constraints, you must put them in another table and join the tables together.
- Said yet another way: every column depends on the key, the whole key and nothing but the key.
- Non-Relational (NoSQL) DBs remove the restrictions on data normalization and focus, instead, on optimizing around data that does not fit well into the normalized structure which relational DBs (mostly) require. Because there are different analyses of interest and different data sources which "best" embody the data of interest, there are different types of NoSQL databases.
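To make the normalization rules above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented for illustration. Customer attributes are stored once in their own table, and a join puts them back next to each order.

```python
import sqlite3

# In-memory database; table and column names are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        item        TEXT NOT NULL
    );
""")

# Each customer's name and city are stored exactly once (no redundancy).
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, "keyboard"), (11, 1, "monitor"), (12, 2, "mouse")],
)

# A join recombines the two tables at query time.
rows = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.item
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

for row in rows:
    print(row)
```

If the city were stored on every order row instead, it would depend on the customer rather than on the order key, which is the kind of redundancy the normal forms rule out.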
Below is a diagram showing the various database types.
- Key-Value / Wide Column (aka Bigtable)
- Data is stored in a GIANT ordered table of rows and columns.
- Rows and columns still serve the same general purpose as in a relational DB case
- rows = unique entities
- columns = attributes.
- ...but normalization is not required (or expected)...
- Data may be sparsely populated in the columns.
- I.e. a given row may only have data values for a small fraction of the columns (because most of the columns don't apply to the entity this row describes).
- Columns may be VERY large in number and depend on what the DB is structured to query for.
- e.g. all unique word pairs for the entities in the database
- Google originally developed this technology for searching through web pages to fulfill search criteria. Roughly speaking:
- rows = web pages
- columns = search terms
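The sparse rows-and-columns idea above can be sketched in plain Python as a dictionary of dictionaries. This is only an illustration of the data model, not how Google's system is implemented. The page URLs, the search terms and the helper names `put` and `rows_with_column` are all invented for the example.

```python
from collections import defaultdict

# Hypothetical sparse table: row key -> {column name -> value}.
# Only cells that actually hold data are stored, so empty columns cost nothing.
table = defaultdict(dict)


def put(row_key, column, value):
    """Store one cell; the row and the column are created on demand."""
    table[row_key][column] = value


def rows_with_column(column):
    """Return the sorted row keys that have a value in the given column."""
    return sorted(key for key, cells in table.items() if column in cells)


# rows = web pages, columns = search terms, values = term counts (made up).
put("example.com/a", "database", 3)
put("example.com/a", "nosql", 1)
put("example.com/b", "database", 7)
put("example.com/c", "graph", 2)

# Every column that appears anywhere in the table.
all_columns = {col for cells in table.values() for col in cells}

print(rows_with_column("database"))
print(len(all_columns), "columns in total;",
      len(table["example.com/a"]), "populated for example.com/a")
```

Here three rows share three columns, but only four of the nine possible cells are populated. At real scale the fraction of populated cells would be far smaller.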
- Document
- Entire documents are stored in a searchable format.
- e.g. XML (eXtensible Markup Language)
- Queries search through the documents to identify the information of interest and return statistics or the document IDs.
- Good for finding actual documents which contain specific information or summarizing the information contained in a set of documents.
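As a rough sketch of the two query styles just described (returning document IDs and summarizing a set of documents), the following uses Python's standard `xml.etree.ElementTree`. The three tiny XML documents, their tags and the function names are made up for illustration. A real document database would index the contents rather than re-parse every document per query.

```python
import xml.etree.ElementTree as ET

# Hypothetical document store: document ID -> XML text.
documents = {
    "doc-1": "<contract><party>Acme</party><year>1999</year></contract>",
    "doc-2": "<contract><party>Globex</party><year>2004</year></contract>",
    "doc-3": "<memo><party>Acme</party><year>2004</year></memo>",
}


def find_documents(tag, text):
    """Return IDs of documents containing an element <tag> with the given text."""
    matches = []
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        if any(elem.text == text for elem in root.iter(tag)):
            matches.append(doc_id)
    return matches


def count_by_root_tag():
    """Summarize the collection: how many documents of each top-level type."""
    counts = {}
    for xml_text in documents.values():
        root_tag = ET.fromstring(xml_text).tag
        counts[root_tag] = counts.get(root_tag, 0) + 1
    return counts


print(find_documents("party", "Acme"))
print(count_by_root_tag())
```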
- Graph
- Stores information about relationships between entities (objects) in the DB
- Good for finding objects that are related to each other according to certain criteria.
- e.g. find people (entities) who are members of the YMCA (another entity) who lived in New York in 1999.
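The YMCA example above can be sketched as a small list of labeled edges in Python. The people, the dates and the helper names `members_of` and `lived_in` are invented. A real graph database would provide its own query language and indexes for this kind of traversal.

```python
# Hypothetical graph stored as edges: (subject, relation, object, attributes).
edges = [
    ("alice", "member_of", "YMCA", {}),
    ("alice", "lived_in", "New York", {"from": 1995, "to": 2001}),
    ("bob", "member_of", "YMCA", {}),
    ("bob", "lived_in", "Boston", {"from": 1990, "to": 2005}),
    ("carol", "lived_in", "New York", {"from": 1998, "to": 2000}),
]


def members_of(org):
    """People connected to the organization by a member_of edge."""
    return {s for s, rel, o, _ in edges if rel == "member_of" and o == org}


def lived_in(city, year):
    """People with a lived_in edge to the city whose date range covers the year."""
    return {
        s
        for s, rel, o, attrs in edges
        if rel == "lived_in" and o == city and attrs["from"] <= year <= attrs["to"]
    }


# Members of the YMCA who lived in New York in 1999.
result = sorted(members_of("YMCA") & lived_in("New York", 1999))
print(result)
```

With this sample data only `alice` satisfies both relationships: `bob` is a member but lived in Boston, and `carol` lived in New York but is not a member.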
Many NoSQL DBs are built to operate on distributed file systems and process queries via distributed computing. In fact, the data being looked at is so large that it could not be stored or processed on a single machine:
- Facebook: operating on >100 petabytes and ingesting >500 terabytes/day
- Google: processing >20 petabytes/day
- Twitter: processing >340 million tweets/day (~44 GB/day, but a huge number of entities)
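One way to arrive at a figure of roughly 44 GB/day for the tweets is the back-of-the-envelope calculation below. It assumes that every tweet is 140 characters at one byte per character and that "GB" means 2^30 bytes. Neither assumption is stated in the text above, so treat this as a plausibility check rather than the source of the number.

```python
TWEETS_PER_DAY = 340_000_000   # figure quoted above
BYTES_PER_TWEET = 140          # assumption: 140 one-byte characters per tweet

bytes_per_day = TWEETS_PER_DAY * BYTES_PER_TWEET
gib_per_day = bytes_per_day / 2**30  # assumption: 1 GB = 2**30 bytes

print(f"{bytes_per_day:,} bytes/day is about {gib_per_day:.1f} GB/day")
```

The byte volume is modest. What makes it hard is the number of separate entities.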