Did Big Data Fail Us in the Presidential Election?

At Accel.ai's Demystifying AI conference, I gave (what was supposed to be) a lightning talk on Big Data and the Presidential Election. The awesomely engaged audience kept it going well over time, and I came out of it with great insights and thoughts to put down in a blog post.


What appealed to the audience was my 'post-mortem' approach, a name that I find deliciously macabre. The goal of the approach is to look at a project after the fact and analyze its successes and failures at every step of the way. In this case, I looked at the presidential election and called the 'project' of predicting the outcome a 'failure', given our astounding inability to predict that Donald Trump would win more Electoral College votes than Hillary Clinton. Let's unpack the discussion, slide by slide.

SLIDE 1: Big Data Failed Us This Election

The premise of the talk. What we know is that not a single major poll was able to accurately predict the outcome of this election. In fact, we failed SPECTACULARLY. 

It's worth understanding that there are two kinds of polls. Standalone polls, like Gallup, come from groups who conduct their own polling. Metapolls, such as RealClearPolitics, are amalgamations of other polls. This is an important distinction, because the former are responsible for executing their own surveys and locating their own samples, while the latter are aggregates that rely on an 'ensemble' of polls.

Also worth noting is that this election was supposed to be the launch of Votecastr, which was meant to provide accurate, real-time insights into the election outcome. It also failed to predict the outcome accurately.

As a result, there has been a lot of post-election big data backlash, some of the more dramatic headlines being the inspiration for this talk's title. Let's unpack why. 

SLIDE 2: Our Understanding of Big Data Failed Us This Election

Even within the polling community, predictions were inconsistent. The New York Times Upshot model, for example, gave Clinton a ~85% chance of winning, while FiveThirtyEight gave her a ~72% chance. That's a pretty big difference.

Let's look deeper. While the polls all agreed that she would win, they disagreed on the methodology to predict how. Most publicly, Nate Silver got into a heated battle with Huffington Post on his use of trend adjustment, which HuffPo called "changing the results of polls to fit what he thinks the polls are, rather than simply entering the poll numbers into his model and crunching them." 


Rather than taking a simple average -- like RealClearPolitics does -- Silver’s model weights polls by his team’s assessment of their quality, and also performs several “adjustments” to account for things like the partisanship of a pollster or the trend lines across different polls. Still other models take historical trends and demographic shifts into account. There is no clear consensus on a 'best' model.
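To make the difference between those two aggregation styles concrete, here is a minimal sketch contrasting a simple average with a quality-weighted average. The poll margins and quality weights are made-up numbers for illustration only, and the weighting scheme is an assumption, not any forecaster's actual method.

```python
# Hypothetical polls: (candidate lead in points, quality weight).
# Both the leads and the weights are invented for illustration.
polls = [(4.0, 1.0), (2.0, 0.5), (6.0, 0.25), (1.0, 0.25)]

def simple_average(polls):
    """Treat every poll equally, as a plain poll average would."""
    margins = [margin for margin, _ in polls]
    return sum(margins) / len(margins)

def weighted_average(polls):
    """Let higher-quality polls count for more in the aggregate."""
    total_weight = sum(weight for _, weight in polls)
    return sum(margin * weight for margin, weight in polls) / total_weight

simple = simple_average(polls)
weighted = weighted_average(polls)
print(f"simple: {simple:.3f}, weighted: {weighted:.3f}")
```

Same polls, two different headline numbers. The choice of weights is a modeling decision, which is exactly where the disagreement comes from.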

SLIDE 3: Our Explanation of Big Data Failed Us This Election

Talking data to media outlets is a dangerous game of telephone. In my opinion, it is the data scientist's responsibility to be as clear and unambiguous as possible about the true meaning of their model. What is the error margin? What is the degree of confidence? What does a "75% chance" mean? (Hint: it doesn't mean that a win is guaranteed.)

Of course, this is sometimes at odds with current trends of clickbait journalism. "Clinton win likely with a 62 to 89 percent probability" is not as eye-catching or click-inducing as "Clinton 90% likely to win." What to some may be semantics is to us the meat of the discussion.  

As scientists, we got caught up with selling precision. Polls are notoriously flawed, and predictive models built on polls have a wide margin of error. At best, we over-reported how good our models were; at worst, people used that margin of error to their advantage.
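One way to explain what a "75% chance" means is to simulate it. The sketch below is a toy illustration, not a forecasting model: it assumes each simulated election is an independent draw with a fixed win probability, and the function name and trial count are my own choices.

```python
import random

def simulate_elections(win_probability, n_trials, seed=0):
    """Return the share of simulated elections the favorite wins."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wins = sum(rng.random() < win_probability for _ in range(n_trials))
    return wins / n_trials

favorite_win_rate = simulate_elections(0.75, 100_000)
upset_rate = 1 - favorite_win_rate
print(f"favorite wins {favorite_win_rate:.1%} of the time, loses {upset_rate:.1%}")
```

The favorite loses roughly one simulated election in four. An upset at those odds is not a broken model; it is the model working as described.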

SLIDE 4: Our Understanding Of How We Collect Big Data Failed Us This Election

Polling, as I mentioned above, is notoriously flawed. As a master's student (about a decade ago!) I sat in on many discussions and panels about declining response rates, sample biases, the rise of do-not-call lists and how to get people to tell the truth in polls. The share of households that agreed to participate in a telephone survey by the Pew Research Center dropped to 14 percent by 2012 from 43 percent in 1997. This was before the contention and mistrust sown by the current social and political climate.

Long story short, these problems have (some) methodological workarounds, but are far from solved. In fact, some of them are worse. Depending on where you live, there may be a strong incentive to lie about your vote to align with your region's preference.

In other words - GIGO, or garbage in, garbage out. If the data going into our models is flawed, the analyses coming out are not trustworthy.

SLIDE 5: We Failed Big Data This Election

There was a great post-election quote by Erik Brynjolfsson along the lines of "if you understand how models work, you weren't surprised by this election" - apologies that I can't find the source. He was referring to understanding that a probability isn't a certainty, but it globally applies to this election as a prediction project. 

Ultimately, if we understand this as a data science project, we failed on all counts: 

- we failed to bring in good data that we had faith in
- we failed to build a model that was accurate and delivered good results
- we failed to validate our model
- we failed to communicate our results properly to our audience

What is our takeaway? Humility and introspection. We are only as good as the models we build and the quality of work we produce. 

The Case for Humanity in Data Science

This is a post I've been working on for some time, and it was sparked by a lot of different undercurrents in data science. The first is the "will algorithms replace us?" question. The next is our current talk of 'racist' algorithms.

First, let's discuss how amazing this data science thing is. Data science is undoubtedly having a significant impact in all aspects of our lives and will continue to. At least, I hope so...I'm a data scientist. In order to continue this progress, we have to have a degree of trust in the system. We have to share our data, provide personal information, and have faith that the people and the artificial intelligence behind our constant technological advances will protect us. 

When I put it that way, being a data scientist sounds more like being a superhero. To paraphrase Uncle Ben, with big data comes big responsibility. 

When I frame data science this way, it's easy to see how I feel about the "will algorithms replace us" question. Short answer - no. Long answer - for a job to be fully machine-replaceable, it has to fit the following criterion: it cannot make decisions that no human can intervene in and that could have negative repercussions on a person. In other words, the decisions it makes cannot have the potential to negatively affect a human being, even via 'butterfly effect.'

While that sounds easy enough, when we give more consideration to this stipulation, we are hard pressed to find cases in which this is true. One of the most well-publicized cases was reported by ProPublica, where an algorithm predicted black defendants to be more likely to re-offend. In their language: "There’s software used across the country to predict future criminals. And it’s biased against blacks."

Similarly, other quotes from articles: 
"We’ve Hit Peak Human and an Algorithm Wants Your Job. Now What?" - Wall Street Journal
"Can Computers Be Racist? The Human-Like Bias Of Algorithms" - NPR
"It's no surprise that inequality in the U.S. is on the rise. But what you might not know is that math is partly to blame." - CNN Money

Political philosophy me (yes, that was my subfield) cringes at the language. It's a very subtle shift of responsibility called moral outsourcing. The subject of my talk this Thursday at the Women Catalyst group, moral outsourcing is the shifting of moral decision-making to another entity (in this case, algorithms).

The humanizing language in the sentences above has the convenient outcome of shifting blame from humans. Algorithms and artificial intelligence are only as unbiased as the humans behind them. Data scientists have internalized the mantra borrowed from our engineering halves - GIGO (garbage in, garbage out). But the comfortable and convenient thing about code is that we know when it doesn't work - we error out, and the thing we've programmed doesn't happen.

Algorithms also suffer from GIGO, but the 'garbage out' part is significantly more difficult to discern, and sometimes cannot be understood until post-deployment if we don't know what to look for.

This leads me to the second part of this post. The data science community is wonderful in its desire to do good. In fact, most of these 'biased' algorithms are the product of well-intentioned data science teams.

When I started Metis for Good, our internal pro-bono group, our first project was with Invisible Institute, the group behind the Citizens Police Data Project. Behind CPDP is public domain complaint data against police officers - complete with badge numbers and names.

One of our first considerations as a group was to think through how to use our data in modeling. We decided immediately not to build any sort of classification or predictive model on officer-level data. In data science-speak, we're not willing to have any model error. Let's say we had a model to 'predict' violent officers. I'm not willing to have our model misclassify a single officer. Instead, a team member suggested we move a level up in our unit of analysis - rather than officer-level, we use precinct-level predictions. 

We're hosting a hackathon this Saturday. We'll be building on the great work Invisible Institute has already done, and contributing some of our own. I'm proud of the strong ethical consideration that goes into constructing and developing our data science at Metis. All data scientists know that bias is nearly unavoidable. What is equally unavoidable is our moral responsibility to implement our models ethically.



Pain Free Data Science: Recommender Systems and how to Perfect Them

I'm coming off a data science buzz from Open Data Science Conference (ODSC) in Santa Clara. I was a huge fan of the crowd that was there: genuinely curious people with very substantive questions. We had some wonderful conversations, and I appreciate the insight and thought a lot of my colleagues are putting into their work. I was invited for two talks: one 4-hour training session on Data Science 101, and a 45-minute talk on recommendation systems.

The goal of my talk was to go through the thought process and critical thinking in developing a recommender system and refining the model. My slides are below, and pretty self-explanatory. What I wanted to talk about were some of the great followup questions and emails I got, particularly around developing a baseline and normalizing your data. 

There are a lot of tutorials about how to execute a basic recommendation system, but few on how to refine one. Let's assume you've built out your recommendation system, and are able to validate it. One way of measuring validity is with a basic root mean squared error (RMSE) measure. I won't get into the debate on whether it's a *good* error measurement or not, but let's say it's the one we use. Here's an article on Data Science Central about it, if you're so inclined.
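For reference, RMSE is the square root of the average squared gap between predicted and actual ratings. Here is a minimal sketch in plain Python; the ratings are made-up values for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out ratings and the model's predictions for them.
actual_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.0, 4.0, 4.0, 3.0]

error = rmse(actual_ratings, predicted_ratings)
print(error)
```

Every prediction here is off by exactly one star, so the RMSE is one star. Because errors are squared before averaging, a few large misses hurt the score more than many small ones.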

What you will likely find is an okay model that has more error than you'd like. That is, your model is good, but not great, at predicting what your audience wants. Normalization is a great first step (and in some cases, the most impactful step) in fine-tuning your model. Normalization will be the most effective if you have a very diverse group of users or items, and/or you do not have many data points per item/user. In data science speak, both contribute to your variance. 

Normalization is one way to approach this problem. The idea is to create a baseline for 'normal' and then produce an offset for that item or user. The theory is that each user and item can vary from the expectation. Think about it this way - some people are just naturally more cheerful or more grumpy. In terms of items, some items are just perceived better or worse than others - for example, there is a huge cult of Mac that will just love every Apple branded product. Brand recognition will impact an item's perception even before the rating happens.

Normalization is a fairly easy process. For all users, get some mid-range (probably median) rating value across all users. For all products, get some sort of median rating value across all products. So you'll come out with "people tend to rate products at 3.5 on average" or "products tend to get a 3.5 rating on average".

Next, take the median score for each user across all products, and the median score across all ratings for each product. Yes, that means you will need some minimum number of ratings per product and per user, so this works better the more data you have. You'll come out with "user x tends to rate products at a 3.8" or "product x tends to have a rating of 3.3".

Third, you subtract the score from the baseline to get the offset. In my example above, User x would have an offset of 3.5 - 3.8 = -0.3, and Product x would have an offset of 3.5 - 3.3 = 0.2.

To implement, you take your predicted value from your model - so that's a user/item prediction in your matrix - and add in the offsets. Let's say I predicted that User x would give a rating of 4.0 to Product x in my naive model. With normalization, I would add both offsets and arrive at 4.0 - 0.3 + 0.2 = 3.9. My actual predicted value for User x reviewing Product x is 3.9.
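The steps above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not production code: the ratings are invented, a single global median stands in for both the user baseline and the product baseline, and the offsets follow the sign convention used in this post (baseline minus the typical score).

```python
from collections import defaultdict
from statistics import median

# Hypothetical ratings as (user, product, rating); not real data.
ratings = [
    ("ann", "laptop", 5), ("ann", "phone", 4), ("ann", "tablet", 5),
    ("bob", "laptop", 3), ("bob", "phone", 2), ("bob", "tablet", 3),
    ("cat", "laptop", 4), ("cat", "phone", 3), ("cat", "tablet", 4),
]

def compute_offsets(ratings):
    """Return the global baseline plus per-user and per-product offsets."""
    baseline = median(r for _, _, r in ratings)
    by_user, by_product = defaultdict(list), defaultdict(list)
    for user, product, rating in ratings:
        by_user[user].append(rating)
        by_product[product].append(rating)
    # Sign convention from the text: offset = baseline - typical score.
    user_offset = {u: baseline - median(v) for u, v in by_user.items()}
    product_offset = {p: baseline - median(v) for p, v in by_product.items()}
    return baseline, user_offset, product_offset

def adjust(prediction, user, product, user_offset, product_offset):
    """Add both offsets to a naive prediction; unseen users or products get 0."""
    return prediction + user_offset.get(user, 0) + product_offset.get(product, 0)

baseline, user_offset, product_offset = compute_offsets(ratings)
adjusted = adjust(3.0, "bob", "phone", user_offset, product_offset)

# The worked numbers from the text: 4.0 + (3.5 - 3.8) + (3.5 - 3.3), about 3.9.
worked_example = 4.0 + (3.5 - 3.8) + (3.5 - 3.3)
print(baseline, user_offset, product_offset, adjusted, round(worked_example, 1))
```

In practice you would also enforce a minimum number of ratings before trusting a user's or product's median, as noted above.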

Of course, this nuance has even further nuance. It's a bit hand-wavy to say 'baseline is average across all users or products.' There may be category-specific characteristics that are a better way to develop a baseline. For example, maybe a new laptop should be judged against the baseline of all laptops, and not all products. This is particularly important if you're, say, Amazon and have everything from crayons to chandeliers.
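A small sketch of why the choice of baseline matters. The categories, products, and ratings are hypothetical, and the offset uses the same convention as in this post (baseline minus the product's typical score).

```python
from collections import defaultdict
from statistics import median

# Hypothetical catalog ratings as (category, product, rating).
ratings = [
    ("laptops", "laptop_a", 4.5), ("laptops", "laptop_b", 4.0), ("laptops", "laptop_c", 3.5),
    ("crayons", "crayon_a", 3.0), ("crayons", "crayon_b", 2.5), ("crayons", "crayon_c", 2.0),
]

def category_baselines(ratings):
    """Median rating within each category."""
    by_category = defaultdict(list)
    for category, _, rating in ratings:
        by_category[category].append(rating)
    return {category: median(values) for category, values in by_category.items()}

baselines = category_baselines(ratings)
global_baseline = median(r for _, _, r in ratings)

# A laptop that typically scores 3.5, judged against each baseline.
laptop_score = 3.5
offset_vs_global = global_baseline - laptop_score
offset_vs_category = baselines["laptops"] - laptop_score
print(offset_vs_global, offset_vs_category)
```

Against the whole catalog this laptop looks above average, but against other laptops it looks below average, so the two baselines push the prediction in opposite directions.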

Normalization is just one of many ways to refine your model, and I talk about others in the presentation. I'll also post the video from my talk when it's available.


Pain-Free Data Science Tutorial: Principal Components Analysis

PCA is a much-used and poorly understood way of reducing the number of features in your analysis. At this year's PyBay, I gave a lecture on the intuition behind PCA. What drew me to this lecture was the challenge of explaining something that appears mathematically intimidating in a more accessible way. 

You can look through the slides for the lesson, but what I enjoyed about designing this talk was actively trying to stay away from equations and code. It's easy to fall back on yet another Python tutorial or yet another linear algebra explanation. Let's be honest, most newcomers to data science will often apply tools like PCA without fully understanding what they do and why. Most will tell you the following:

1) You reduce dimensions the same way you do feature selection, but without 'losing information' (whatever that means!)

2) You type in a line of code and voila! instant dimensions reduced, and plow ahead with your classification or clustering. 

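from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # keep the first two principal components
X_train_pca = pca.fit_transform(X_train)  # PCA is unsupervised, so no labels are passed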


A deeper understanding helps you become a better data scientist. It's easy and tempting to plug and play code, but a good data scientist knows when not to use a tool:

First, a data scientist should understand the appropriate use of dimensionality reduction. In many cases, it's approached incorrectly - a user will say, "I want to project my data down to n dimensions." Instead, check your scree plot for the cumulative percent variance explained with each additional dimension, to make sure you're picking up an adequate amount of variance. 
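The scree-plot check boils down to a cumulative sum. Here is a minimal sketch in plain Python, assuming you already have the variance captured by each component (in scikit-learn, a fitted PCA exposes these shares as explained_variance_ratio_). The variances and the helper names are hypothetical.

```python
def cumulative_explained_variance(component_variances):
    """Cumulative share of total variance, largest components first."""
    total = sum(component_variances)
    shares, running = [], 0.0
    for variance in sorted(component_variances, reverse=True):
        running += variance
        shares.append(running / total)
    return shares

def components_needed(component_variances, target):
    """Smallest number of components whose cumulative share reaches target."""
    for k, share in enumerate(cumulative_explained_variance(component_variances), start=1):
        if share >= target:
            return k
    return len(component_variances)

# Hypothetical per-component variances (eigenvalues) for a 5-feature dataset.
variances = [5.0, 3.0, 1.0, 0.5, 0.5]

curve = cumulative_explained_variance(variances)
k = components_needed(variances, target=0.85)
print(curve, k)
```

Here the data, not a number picked in advance, says three components are needed to keep at least 85% of the variance. The 85% target itself is still a judgment call.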

Second, PCA doesn't change your data. It's just a shift in perspective used to combat the curse of dimensionality. In my talk, I use the example of the duck/rabbit optical illusion to illustrate how a different perspective can make a projection look completely different. 

PCA = Ducks and Bunnies


We comprehend dimensionality reduction every day - on our phones, televisions, laptops and movie theaters. All an image is, is a 2-dimensional representation of a 3-dimensional scene. Think about the picture below - it is clearly three dimensional, but we are still able to understand what is imparted, as the third dimension, though useful, is not necessary for our understanding of what the image portrays. In PCA terms, we're able to collapse the data by removing one dimension without sacrificing information:


Finally, it's important to note that you do lose interpretability for a non-data science audience. A simple way to explain it is that you separate the signal from the noise. However, it's much easier to explain a beta coefficient of a linear model than to explain the eigenvalue and eigenvector behind each component. It's quite important to keep that in mind when choosing to perform dimensionality reduction on your model. For people in sensitive data environments (banking or healthcare), you may have to stick to basic feature selection.


"Woman-Lecturing" won't help women in the workplace - solidarity will.

I'm currently getting involved with organizations and individuals who promote a diverse and inclusive environment for women and minorities. Metis has been an amazing and pro-diversity (of all kinds) environment - we offer a $2000 scholarship for women, military, and underrepresented minorities. However, I want us to push the envelope even more. Financial assistance is great, but we need to change culture.

I'm getting involved with networking groups and organizations that actively promote women and minorities. Ellevate Network is one of the groups that I reached out to - they have assertive, powerful, and intelligent women driving their leadership, and it seemed like a great place to put my influence in, as well as bring in female students from my classes. 

And then I read this article, written by Sallie Krawcheck, the founder of Ellevate. It starts with the not-so-true 'truism' that asserts that women don't help each other in the workplace (don't we? let's see the numbers!), and then proceeds to tear apart women as the primary source of blame for not getting ahead. 

Ladies, take note - our obstacle #1 is "Queen Bee"! No, not the systematic boys club that creates a negative environment, not society's pressures to be a 'good mom' or 'good wife' - your #1 obstacle is...another woman. At this point, it's obliquely mentioned that yes, okay, there are some sort of obstacles to women in positions of power, but it's clearly prioritized that this woman is the problem. 

I posted more here on LinkedIn


Thought Experiment

I'm currently working on a fascinating project to test the effectiveness of privately-owned vs. government-operated sanitation facilities/businesses - whether or not privately-owned businesses improve access to sanitation facilities. Like all non-laboratory experiments, it's a bit imperfect. Here's what we've got: 

1) Treatment villages started at different times - ranging from 2008 to 2013

2) We have sanitation access data, monthly, from 2008 to 2014

3) We have identifying information and a plethora of demographics, descriptors, and the like.

So, we are going with a propensity score matching and difference-in-differences model here. What that means is we will have a treatment and control group, we will match up the two on multiple characteristics, then do a simple A/B test to see if there is a statistically significant difference in access to sanitation in treatment vs. control groups.
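As a rough sketch of the difference-in-differences part of that plan: compare the change in the treatment group with the change in the control group over the same period. The code below assumes the matching step has already produced comparable groups, skips the significance test, and uses invented access percentages, not our project data.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical % of households with sanitation access, per village.
treated_pre = [20.0, 30.0]
treated_post = [45.0, 55.0]
control_pre = [22.0, 28.0]
control_post = [33.0, 37.0]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

The treated villages improved by 25 points and the controls by 10, so the estimate credits 15 points to the treatment. The key assumption is that both groups would have followed the same trend without it.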

Here's the issue we are running into. Since we have staggered start times, we don't have a clean time 0 (as it were). We thought of simply getting rid of chronological time and lining up the villages by start time, time n+1, time n+2 (measurements taken quarterly). 
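The re-indexing idea for treatment villages can be sketched as follows. The village names, start periods, and access values are hypothetical, and periods are simple integer indices counted from the first observation.

```python
from collections import defaultdict
from statistics import mean

def to_event_time(series, start_period):
    """Re-index a {period: value} series so the start period becomes time 0."""
    return {period - start_period: value for period, value in series.items()}

# Hypothetical treatment villages with staggered start periods.
villages = {
    "village_a": {"start": 3, "access": {1: 10.0, 2: 12.0, 3: 15.0, 4: 20.0}},
    "village_b": {"start": 5, "access": {3: 8.0, 4: 9.0, 5: 11.0, 6: 16.0}},
}

aligned = {
    name: to_event_time(info["access"], info["start"])
    for name, info in villages.items()
}

# Average across villages at each point relative to their own start.
by_event_time = defaultdict(list)
for series in aligned.values():
    for t, value in series.items():
        by_event_time[t].append(value)
average_by_event_time = {t: mean(values) for t, values in sorted(by_event_time.items())}
print(average_by_event_time)
```

Negative times are pre-treatment periods and time 0 is each village's own start. This only works for villages that have a start period at all.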

We run into the following problem - how do we then pair the treatment groups with control groups, since the control groups have no time 0? 

The answer...as soon as I figure it out :) 

The Importance of Research Design and Theory

A really informative article about the role of research design and causal theorizing in big data. It's interesting - these "problems" with big data are the same issues social scientists have been dealing with for decades. Describing the actions of individuals is not as easy as aggregating what they say, for many reasons that can be boiled down to one concept - we're HUMAN. Welcome to our world, tech community. 

What we say or what we intend is not always what we really do - just ask yourself how often you go to the gym...then keep track for a week or so. We run into that problem when asking people who they voted for or how they voted in the last election. 

We are subject to social pressures, even by complete strangers (or survey-takers. *especially* survey-takers.) We're socialized into giving acceptable answers that are often not the truth. I saw that issue when using data from a 2007 Pew study of Muslim-Americans. How do you think a Muslim living in America during the Bush administration would answer a random-digit dial asking them if they felt discriminated against by the government? Survey designers spent much time and effort in addressing that issue (see the section on Survey Methodology here). However, that is the exception, and not the norm, in the current trend of acquiring as much data as you can and crunching it as soon as possible.  

This is more than just "correlation does not imply causation." In my Research Design class, I spend more time talking about theory craft and proper design than I do teaching students to slog away at Stata or R. Good research is like anything else; you need a proper foundation. Simply put, does your hypothesis pass a logic test? Does it manifest itself in reality? Is there a measurable counterfactual? What's the story, in words, not numbers? Does your data measure what you're looking for? How externally valid (i.e., broadly applicable) are your findings? Is your sample representative of your population?

In the Big Data universe, I see that last issue manifesting itself in online reviews. Our responses to situations, both good and bad, are functions of how our brains work. Unless something is superlatively great or absolutely terrible, we don't think about it. When was the last time you went online to review your tried and true brand of toothpaste? This can lead to perfectly acceptable establishments, or products, or whatever, simply not getting the kinds of reviews that reflect their actual usage or potential.

Brace yourselves, Big Data backlash is coming (and has already started). Arm yourselves with proper design and causal theory!


Grand internet debut!

I'm adding to the already well-developed discussion on women and the work-life balance. TPM is one of my favorite political methods blogs to follow, so I'm pleased to be making my grand internet debut with a splash. 

Diversity and Political Methodology: A Graduate Student’s Perspective

The evolution of our academic and work institutions to reflect the changing role of women (and women as mothers) is fascinating. My thoughts are currently on the role of perception. One point I wish I could have elaborated on was the fact that the majority opinion, frankly, doesn't matter in these situations. 

I think that's a hard pill for any of us to swallow. We like to think that (a) we are nice people who do not discriminate, and that (b) our opinions matter. To elaborate - when a minority (and I use minority because this applies to racial minorities as well) is "inadvertently" marginalized, for example, as the only person of that gender or color in a room, the sentiment is that unless someone is being overtly discriminatory, the onus is on the minority individual to "get over it." In more extreme cases, you can be accused of being hyper-sensitive - that's a favorite rebuttal for women.

Here's a thought, people in power (which, yes, can sometimes be me) - maybe what YOU think of the minority in the situation isn't what matters. Maybe it's what the minority feels that matters more. Maybe it's the role of institutions and individuals in power to recognize that and reform so that maybe next time, that email from Mei Chen or Deepak Patel actually is answered?