In dealing with "big data" projects at work, I have come to a definition that seems useful.
Big Data is composed of several systems that work together:
- The hardware that enables all the other algorithms needed to handle and analyze the data of interest.
- The algorithms which efficiently load, transform, and store data from varied, high-volume data flows.
- The algorithms which retrieve and manage the stored data in a way that allows other algorithms and software tools to act on it efficiently.
- Correlation algorithms which automatically comb through the data looking for data items that are related in ways that might be "interesting" to the end user.
- The visualization tools used to examine the automatically flagged "interesting" subset of data in order to determine whether it is actually useful and, if so, how.
- The actionable information that the end user extracts from the system and uses to further whatever goals originally justified implementing the big data system.
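To make the middle layers of that list concrete, here is a minimal, hypothetical sketch of the pipeline stages: load, store, and correlate/flag. The record format, function names, and sample data are all invented for illustration; the "correlation" step is just a stand-in that flags sources whose latest reading deviates sharply from their own mean.

```python
def extract(raw_lines):
    """Load: parse raw 'source,value' records, skipping malformed rows."""
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) == 2:
            try:
                yield parts[0], float(parts[1])
            except ValueError:
                continue

def store(records):
    """Store: group readings by source so later stages can query them."""
    table = {}
    for source, value in records:
        table.setdefault(source, []).append(value)
    return table

def correlate(table, threshold=2.0):
    """Flag: mark sources whose latest reading deviates from their mean
    by more than the threshold -- a stand-in for real correlation logic."""
    flagged = []
    for source, values in table.items():
        mean = sum(values) / len(values)
        if abs(values[-1] - mean) > threshold:
            flagged.append(source)
    return flagged

raw = ["a,1.0", "a,1.1", "a,9.0", "b,2.0", "b,2.1", "bad row"]
flagged = correlate(store(extract(raw)))  # only source "a" is flagged
```

The flagged subset is what would then go to the visualization tools for a human to confirm or dismiss.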
The hardware, ETL (extract, transform, load), and data storage parts have been addressed by a fairly large number of vendors using proprietary and open-source methods. You could say that the data-handling platform is becoming a commodity, because the movement of data is an undifferentiated requirement for all big data users.
What is still hard is algorithmically filtering the flood of incoming data to pull out the nuggets of interest so that someone can confirm their meaning. The aspects which qualify something as interesting differ from industry to industry and company to company, so coming up with a common, turn-key solution may not be possible. Until that statement is proven false, the seller's market for data scientists will continue.
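One toy illustration of what such filtering might look like, under the assumption that "interesting" means "far from the stream's own baseline": surface only the readings more than a chosen number of standard deviations from the mean, so a human only has to confirm a handful of candidates. The two-sigma cutoff and the sample stream are arbitrary choices for this sketch; in practice the definition of interesting is exactly the part that varies by domain.

```python
import math

def interesting(values, sigmas=2.0):
    """Return the values lying more than `sigmas` standard deviations
    from the mean of the stream -- a crude 'interestingness' filter."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        return []  # a constant stream has no outliers
    return [v for v in values if abs(v - mean) > sigmas * std]

stream = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1, 9.8]
candidates = interesting(stream)  # only the 25.0 reading survives
```

Everything up to this function is commodity plumbing; deciding what the cutoff should be, and whether a flagged value actually matters, is where the domain knowledge (and the data scientist) comes in.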