As anyone who has spent any time in IT knows, buzzwords are big. And nowhere are they bigger than in data management. Big data hype, driven by phrases like "Data scientists running real-time, in-memory predictive analytics on big data will surely be a game-changer for your business!", is commonplace in today's IT market. In reality, however, the actual product or service behind these data buzzwords may disappoint. Gleaning important insights from your data is a difficult, labor-intensive and often tedious process.
This is not to say that these phrases are meaningless, but it can be tough to tell the difference between meaningful technical terms and buzzwords. The former is a marker of expertise, while the latter is indicative of sloppy thinking. I don't consider myself an expert in statistics or data science, but I know enough about the actual concepts and techniques behind these buzzwords to know that each comes with its own trade-offs and pitfalls. If we're not clear on what these techniques really are when embarking on data-driven projects, we run a real risk of false conclusions and project failure.
For example, what we often refer to as "predictive analytics" is actually a set of algorithms that find potential correlations and trends in data. Predictive algorithms aren't actually predictive. At best, they tell you what will probably happen if the future is like the past. At worst, they are highly susceptible to false positives (Type I errors), in which correlations that don't actually exist are mistakenly identified.
False correlations appear because of random variation in the data or an error in the analysis method. They don't indicate a real-world phenomenon. Tools that make predictive algorithms accessible and easier to run can exacerbate the false positive problem, because the more analyses are run, the more likely it is that random error will produce the appearance of a correlation. Untrained operators (and even trained operators, as described in this extensive and interesting article) have a tendency to forget the analyses that found no significant correlation and to remember the ones that got a "result." But finding correlations is inevitable when running enough analyses: at the usual 5% significance threshold, roughly one in 20 analyses of purely random data will look significant. True analytics software should be smart enough to recognize this.
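To make the point about repeated analyses concrete, here is a minimal simulation sketch. It is an illustration under assumptions, not any vendor's method: the function names, the sample sizes and the seed are all hypothetical, and the p-value uses the Fisher z approximation rather than an exact test. Every pair of series is pure random noise, so any "significant" correlation it reports is a false positive.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z approximation (an assumption)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(num_analyses=200, sample_size=30, alpha=0.05, seed=1):
    """Run many analyses on unrelated random data and count apparent 'results'."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(num_analyses):
        # Two independent noise series: there is no real relationship to find.
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        p_values.append(approx_p_value(pearson_r(xs, ys), sample_size))
    naive = sum(p < alpha for p in p_values)
    # Bonferroni correction: divide the threshold by the number of analyses.
    corrected = sum(p < alpha / num_analyses for p in p_values)
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = count_false_positives()
    print(f"'Significant' at 0.05: {naive}; after Bonferroni correction: {corrected}")
```

The naive count should land somewhere near 5% of the analyses run, while the corrected count never exceeds it and is typically far smaller. A Bonferroni correction is only one simple way for software to account for how many analyses were run; it is used here because it fits in one line.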
Worse, error is rarely random. There is always a process by which the data was gathered and consolidated, and every step in that process is an opportunity to introduce errors into the data. These errors tend to introduce false correlations. For example, you might do an analysis on profitability data, but that data happens to be missing sales figures for several products from the western U.S. because of a bug that was introduced to the system earlier in the year. Your predictive analytics software will show that the region's profit contribution has been going downhill and will probably continue down that path. In reality, the problem is that your analysis is missing sales data but including cost and overhead data for these products. Your software might make it look like you should cut overhead to compensate. What it should really do is remind you to check your data, or point out that the software deployment that introduced the bug seems to coincide with the change in the region's performance and the disappearance of revenue for several products. Software vendors haven't included this kind of analysis functionality in their software because it's very hard to engineer and it doesn't address the buzzwords that are currently driving software industry sales.
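The missing-sales scenario can be sketched in a few lines. This is a hypothetical illustration only: the record layout, the field names (`month`, `product`, `revenue`, `cost`) and the figures in `SAMPLE_RECORDS` are invented for the example and are not taken from any real system. The idea is that a basic completeness check run before the trend analysis would surface the gap instead of a misleading decline.

```python
# Hypothetical, made-up records for one region: product B loses its revenue
# figure in the second month, while its cost is still recorded.
SAMPLE_RECORDS = [
    {"month": "2013-01", "product": "A", "revenue": 100, "cost": 60},
    {"month": "2013-01", "product": "B", "revenue": 80, "cost": 50},
    {"month": "2013-02", "product": "A", "revenue": 100, "cost": 60},
    {"month": "2013-02", "product": "B", "revenue": None, "cost": 50},
]


def naive_profit_by_month(records):
    """Profit per month, silently treating missing revenue as zero (the trap)."""
    profit = {}
    for rec in records:
        revenue = rec.get("revenue") or 0
        cost = rec.get("cost") or 0
        profit[rec["month"]] = profit.get(rec["month"], 0) + revenue - cost
    return profit


def find_revenue_gaps(records):
    """Return (month, product) pairs that carry a cost but no revenue figure."""
    gaps = []
    for rec in records:
        has_cost = rec.get("cost") is not None
        has_revenue = rec.get("revenue") is not None
        if has_cost and not has_revenue:
            gaps.append((rec["month"], rec["product"]))
    return gaps


if __name__ == "__main__":
    print(naive_profit_by_month(SAMPLE_RECORDS))  # looks like a sharp decline
    print(find_revenue_gaps(SAMPLE_RECORDS))      # shows where the data is missing
```

In this made-up data the naive calculation shows profit falling from 70 to -10, while the gap check points straight at product B in the second month. A real check would need to be far more thorough, which is part of why this kind of functionality is hard to engineer.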
On that note, I'd like to introduce a few buzzwords of my own that could help unlock the potential of data if they were a part of our software and our methods:
- Honesty: Always showing the data as it is to the best of our ability. For example, showing error bars on our charts and making sure that we don't imply a level of accuracy that doesn't exist, either in the data or in visualizations based on our data. The predictive software mentioned above might give the impression that the data is reliable when it is not. This software would fail an honesty test.
- Integrity: Making sure that our data directly reflects reality. This means expending effort to avoid a situation in which the measurement, collection and preparation methods we use introduce their own trends into our data. The missing sales figures in the example above show a lack of integrity in our data preparation.
- Transparency: Ensuring the honesty and integrity of our data. Ideally, when a person is looking at any data, the details of every step of the process -- from measurement, to collection and aggregation, to visualization -- should be available in an accessible manner so that the viewer can assess the quality of the data. For example, our analytic software mentioned above that shows a profit margin trend-line for the western U.S. should also show information about the source of the data, which might lead an operator to notice that the beginning of the downward trend coincided with the introduction of a new software deployment. This kind of transparency requires maintaining meaningful data lineage information and making that information directly available in the analytic context.
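As a rough sketch of how honesty and transparency might show up in software, the example below keeps an uncertainty estimate and a lineage trail attached to every reported figure. All of the names here (`ReportedFigure`, `summarize`, the lineage step labels) are hypothetical, and using the standard error of the mean as the error bar is one simple assumption among many possible choices.

```python
import math
import statistics
from dataclasses import dataclass, field


@dataclass
class ReportedFigure:
    """A number that travels with its uncertainty and its data lineage."""
    label: str
    mean: float
    std_error: float
    sample_count: int
    lineage: list = field(default_factory=list)

    def describe(self):
        # Show the error bar and the processing steps next to the value itself.
        steps = " -> ".join(self.lineage) or "unknown"
        return (f"{self.label}: {self.mean:.1f} +/- {self.std_error:.1f} "
                f"(n={self.sample_count}; lineage: {steps})")


def summarize(label, values, lineage):
    """Summarize raw values without discarding how uncertain the summary is."""
    mean = statistics.mean(values)
    # Standard error of the mean; requires at least two values.
    std_error = statistics.stdev(values) / math.sqrt(len(values))
    return ReportedFigure(label, mean, std_error, len(values), list(lineage))


if __name__ == "__main__":
    figure = summarize(
        "Margin",
        [10, 12, 14, 16],  # made-up values for illustration
        ["point-of-sale export", "nightly aggregation", "report"],
    )
    print(figure.describe())
```

The design choice worth noting is that the uncertainty and the lineage are part of the figure itself rather than stored somewhere else, so a chart or report built from it has no excuse for dropping them.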
The bar I'm setting here is high, perhaps, but here's the takeaway: The persistent and most common problem in the data management business isn't handling size, providing speed or automatically predicting the future. The problem is getting quality data in front of experts in an honest, transparent format that provides good interactions with the data and helps them draw their own conclusions with confidence.
Buzzwords like "big data," "real-time," "in-memory" and "predictive analytics" don't provide business value on their own, but in the service of honesty, integrity and transparency, they can make a major contribution to the value of our business data.
This was first published in January 2014