A really informative article about the role of research design and causal theorizing in big data. It's interesting - these "problems" with big data are the same issues social scientists have been dealing with for decades. Describing the actions of individuals is not as easy as aggregating what they say, for many reasons that can be boiled down to one concept - we're HUMAN. Welcome to our world, tech community.
What we say or what we intend is not always what we really do - just ask yourself how often you go to the gym...then keep track for a week or so. We run into that problem when asking people who they voted for or how they voted in the last election.
We are subject to social pressures, even from complete strangers (or survey-takers. *especially* survey-takers.) We're socialized into giving acceptable answers that are often not the truth. I saw that issue when using data from a 2007 Pew study of Muslim-Americans. How do you think a Muslim living in America during the Bush administration would answer a random-digit-dial survey asking whether they felt discriminated against by the government? The survey designers spent a great deal of time and effort addressing that issue (see the section on Survey Methodology here). However, that is the exception, not the norm, in the current trend of acquiring as much data as you can and crunching it as soon as possible.
This is more than just "correlation does not imply causation." In my Research Design class, I spend more time talking about theory craft and proper design than I do teaching students to slog away at Stata or R. Good research is like anything else; you need a proper foundation. Simply put, does your hypothesis pass a logic test? Does it manifest itself in reality? Is there a measurable counterfactual? What's the story, in words, not numbers? Does your data measure what you're looking for? How externally valid (i.e., broadly applicable) are your findings? Is your sample representative of your population?
In the Big Data universe, I see that last issue manifesting itself in online reviews. Our responses to situations, both good and bad, are functions of how our brains work. Unless something is superlatively great or absolutely terrible, we don't think about it. When was the last time you went online to review your tried and true brand of toothpaste? This can lead to perfectly acceptable establishments, or products, or whatever, simply not getting the kinds of reviews that reflect their actual usage or potential.
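To make the sampling problem concrete, here is a toy simulation in Python. It is a sketch built entirely on made-up assumptions, not real review data: satisfaction sits on a 1 to 5 scale and clusters around "fine" (the 3.5 mean and 0.7 spread are invented), and only people with an extreme experience (2.0 or below, 4.5 or above) bother to post. The function name `simulate_reviews` and both cutoffs are hypothetical.

```python
import random
import statistics


def simulate_reviews(n_customers=10_000, seed=0):
    """Return (all experiences, the subset that gets posted as reviews)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Assumption: most experiences are "fine", clipped to a 1-5 scale.
    experiences = [
        min(5.0, max(1.0, rng.gauss(3.5, 0.7))) for _ in range(n_customers)
    ]
    # Assumption: only terrible (<= 2.0) or superlative (>= 4.5)
    # experiences motivate anyone to write a review.
    reviewed = [x for x in experiences if x <= 2.0 or x >= 4.5]
    return experiences, reviewed


experiences, reviewed = simulate_reviews()
share = len(reviewed) / len(experiences)
print(f"share of customers who post a review: {share:.1%}")
print(f"spread of all experiences (stdev): {statistics.stdev(experiences):.2f}")
print(f"spread of posted reviews (stdev):  {statistics.stdev(reviewed):.2f}")
```

Under these assumptions, the posted reviews come from a small, self-selected slice of customers and look far more polarized than the experiences of the population they are supposed to describe. The middle of the distribution, the toothpaste you never think about, simply never shows up in the data.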
Brace yourselves: the Big Data backlash is coming (and has already started). Arm yourselves with proper design and causal theory!