Would you be surprised if you flipped a coin seven times and got seven heads? You should be, since the probability of that occurring with a fair coin is less than 1% (0.0078125). Should you still be surprised if I told you that the streak of seven heads was part of a sequence of 1,000 coin tosses? You shouldn’t be, since it’s a virtual certainty that such a streak would appear somewhere in that sequence. And therein lies one of the problems with trying to make decisions around big data (not that 1,000 data points has much to do with big data).
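Both probabilities can be checked directly. Here is a small sketch (the function name and structure are mine, for illustration) that uses a dynamic program over the length of the current run of heads to compute the exact chance of seeing at least one run of seven heads in a given number of fair tosses:

```python
def prob_run_of_heads(n_flips: int, run_len: int, p_heads: float = 0.5) -> float:
    """Exact probability of at least one run of `run_len` heads in `n_flips` tosses."""
    # state[k] = probability that the trailing run of heads has length k
    # and no run of `run_len` has occurred yet; `done` absorbs successes.
    state = [0.0] * run_len
    state[0] = 1.0
    done = 0.0
    for _ in range(n_flips):
        new = [0.0] * run_len
        for k, pr in enumerate(state):
            if pr == 0.0:
                continue
            # Tails: the current run resets to length 0.
            new[0] += pr * (1.0 - p_heads)
            # Heads: the run grows; reaching run_len is absorbing.
            if k + 1 == run_len:
                done += pr * p_heads
            else:
                new[k + 1] += pr * p_heads
        state = new
    return done

print(prob_run_of_heads(7, 7))     # exactly 0.5**7 = 0.0078125
print(prob_run_of_heads(1000, 7))  # comes out around 0.98
```

For seven tosses the answer is exactly 0.5⁷, matching the figure above; for 1,000 tosses it comes out around 98%, which is why the streak should no longer surprise you.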
I recently wrote about my concern that people misinterpret what superficially appear to be actionable results coming from mining big data. That is only part of the problem, since big data increasingly relies on complex analysis and extraction methods that can obscure the logic behind the results. There is a school of thought that argues that big data results eliminate the need for theory. This is evident in the search for new drugs via high-throughput screening, where candidates are discovered by testing for reactions in literally millions of micro-experiments. While this may be valuable in some circumstances, I believe there are some inherent dangers that must be addressed.
The best course of action is to think carefully about the purpose of your work and act in a way that’s consistent with that purpose. If you’re looking for interesting hypotheses, think about the methods for testing them. If you’re looking to test and implement an intervention, consider the costs of being wrong. The degree of certainty needed to act should reflect the consequences of all possible outcomes. If you’re using complex tools, you may need to develop simpler models to justify the logic of even valid findings. These are not insurmountable obstacles; they are reasonable activities that go hand in hand with big data and its associated tools.
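One way to make "consider the costs of being wrong" concrete is a simple expected-value rule. The function and the numbers below are my own illustration, not a prescribed method: act only when the certainty of the finding, weighed against the stakes, tips the balance in your favor.

```python
def should_act(p_effect_real: float, benefit_if_real: float, cost_if_wrong: float) -> bool:
    """Act only when the expected benefit outweighs the expected cost.

    All three inputs are assumptions the analyst must supply; the rule
    itself is just an illustrative expected-value comparison.
    """
    expected_gain = p_effect_real * benefit_if_real
    expected_loss = (1.0 - p_effect_real) * cost_if_wrong
    return expected_gain > expected_loss

# A 90%-certain finding is worth acting on when the downside is modest...
print(should_act(0.90, benefit_if_real=100, cost_if_wrong=500))   # True: 90 > 50
# ...but not when being wrong is catastrophic.
print(should_act(0.90, benefit_if_real=100, cost_if_wrong=5000))  # False: 90 < 500
```

The same 90% certainty justifies action in one setting and not the other; the data alone cannot tell you which setting you are in.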
In short, powerful tools that one can unleash on interesting data repositories are not substitutes for thinking. A few years ago I was working in a facility with huge technology resources. I carefully set up an experiment to test a hypothesis, using essentially every tool we had to churn through the data overnight. When I came in the next morning and scanned the results, I realized that, had I thought about the problem for five minutes, the answer would have been obvious and easily articulated. Just because we have cannons doesn’t mean they are the right tool for every job.