How Dumb is Your Data?

Recently there seems to be an escalation of hype around AI. One recent article by Nan Li, entitled The New Intelligence, starts like this:

Modern AI and the fundamental undoing of the scientific method

The days of traditional, human-driven problem solving — developing a hypothesis, uncovering principles, and testing that hypothesis through deduction, logic, and experimentation — may be coming to an end. A confluence of factors (large data sets, step-change infrastructure, algorithms, and computational resources) are moving us toward an entirely new type of discovery, one that sits far beyond the constraints of human-like logic or decision-making: driven solely by AI, rooted in radical empiricism. The implications — from how we celebrate scientific discovery to assigning moral responsibility to those discoveries — are far-reaching.

I share this passage because it embodies some of the popular trends in thinking about AI, namely:

  • Human-driven problem solving is becoming less important.
  • Machine learning tools can automatically make sense of the data.
  • Strong AI is on the doorstep.

In Part One of this series we saw how a deep learning neural network can take large quantities of unstructured data with little human help and answer useful queries such as "which breed of dog is this?" In these contexts, the answers are often better than human-level performance. As significant as these breakthroughs are, these problems are typically about correlating unstructured inputs (pixels, text, audio) with associated outputs. There are no spurious correlations or confounding variables to untangle in these problems. Those challenges require us to reason about the world, and ultimately fall back to us as humans.

The limitations of the data-centric approach are discussed by Judea Pearl in his latest book, The Book of Why. Pearl, one of the pioneers of Bayesian networks, points out that much of our knowledge as humans is based on causal understanding rather than mere data or facts. He identifies three levels of causal sophistication, placing modern techniques such as deep learning on the lowest rung (association):

  1. Association: The level of seeing and observing. How would seeing X change my belief in Y?
  2. Intervention: The level of doing and intervening. What would Y be if I do X?
  3. Imagining: The level of retrospection and understanding. What if X had not occurred? What if I had acted differently?

In the words of Pearl:

Some readers may be surprised to see that I have placed present-day learning machines squarely on rung one of the Ladder of Causation, sharing the wisdom of an owl. We hear almost every day, it seems, about rapid advances in machine learning systems–self-driving cars, speech-recognition systems, and, especially in recent years, deep-learning algorithms (or deep neural networks). How could they still be only at level one?

The successes of deep learning have been truly remarkable and have caught many of us by surprise. Nevertheless, deep learning has succeeded primarily by showing that certain questions or tasks we thought were difficult are in fact not. It has not addressed the truly difficult questions that continue to prevent us from achieving humanlike AI. As a result the public believes that “strong AI”, machines that think like humans, is just around the corner or maybe even here already. In reality nothing could be further from the truth.

Whether or not you agree with Pearl’s assessment that “…tasks we thought were difficult are in fact not”, the causal approach he champions is a profound and recent development in the history of statistics. Historically, statistics has been a data-centric discipline that emphasised correlation but avoided any notion of causation.

To appreciate the limitations of the data-centric approach, consider a simple example. Table One shows data for a hypothetical observational study of the impact of a drug on the risk of heart attack.

Table One: Observational Study One

        Control Group                     Treatment Group
        Heart Attack     No Heart Attack  Heart Attack     No Heart Attack
Total   13               47               11               49

We don’t need a deep neural network or machine learning to analyse the data. The proportion of people in the treatment group who suffered a heart attack was 18.3% (11/60) as compared with the proportion who had heart attacks in the control group 21.7% (13/60). So it seems intuitive to conclude that the drug reduces the risk of heart attack.
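The arithmetic above can be reproduced in a few lines. This is only a minimal sketch: the counts are copied from Table One, while the names `table_one` and `heart_attack_rate` are illustrative and not taken from the original study.

```python
# Counts from Table One: group -> (heart attack, no heart attack).
table_one = {
    "control": (13, 47),
    "treatment": (11, 49),
}


def heart_attack_rate(heart_attack, no_heart_attack):
    """Proportion of a group that suffered a heart attack."""
    return heart_attack / (heart_attack + no_heart_attack)


# Unadjusted (aggregate) heart attack rate for each group.
aggregate_rates = {
    group: heart_attack_rate(*counts) for group, counts in table_one.items()
}

for group, rate in aggregate_rates.items():
    print(f"{group}: {rate:.1%}")
```

On these counts the treatment group has the lower rate, which is the naive reading the rest of the article goes on to question.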

But what if you had some more information? Table Two adds gender for the same study.

Table Two: Observational Study One

        Control Group                     Treatment Group
        Heart Attack     No Heart Attack  Heart Attack     No Heart Attack
Female  1                19               3                37
Male    12               28               8                12
Total   13               47               11               49

Notice the totals are the same, but now we also have the results based on gender. For women the rate of heart attack was 7.5% (3/40) under treatment vs 5% (1/20) in control and for men 40% (8/20) under treatment vs 30% (12/40) under control. So now it seems the drug increases the risk for women and for men, but overall decreases the risk.
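The reversal can be checked directly from the counts. The following is a small sketch using the Table Two figures; the structure `table_two` and the helper names are assumptions made for illustration.

```python
# Counts from Table Two: gender -> group -> (heart attack, no heart attack).
table_two = {
    "female": {"control": (1, 19), "treatment": (3, 37)},
    "male": {"control": (12, 28), "treatment": (8, 12)},
}


def rate(heart_attack, no_heart_attack):
    """Proportion of a group that suffered a heart attack."""
    return heart_attack / (heart_attack + no_heart_attack)


def pooled(group):
    """Sum one group's counts across both genders (the Total row)."""
    heart_attack = sum(table_two[g][group][0] for g in table_two)
    no_heart_attack = sum(table_two[g][group][1] for g in table_two)
    return heart_attack, no_heart_attack


# Within each gender: is the heart attack rate higher under treatment?
treatment_worse_by_gender = {
    gender: rate(*groups["treatment"]) > rate(*groups["control"])
    for gender, groups in table_two.items()
}

# In aggregate: is the heart attack rate higher under treatment?
treatment_worse_overall = rate(*pooled("treatment")) > rate(*pooled("control"))

print(treatment_worse_by_gender)  # treatment looks worse within each gender
print(treatment_worse_overall)    # yet looks better in aggregate
```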

How can that be?

This is the first element of the problem that confronted statistician Edward Simpson in 1951 and became known as Simpson’s Paradox (see footnote 1).

The first part of the paradox is the reversal of proportions when comparing the data in aggregate to the data stratified on gender. This may seem surprising at first, but if you look at the proportion of men in the treatment group and the control group you can clearly see how this has happened. Men are under-represented in the treatment group and over-represented in the control group. They are also at much higher risk of heart attack. This is what has made the aggregate heart attack rate under treatment look encouraging by comparison to the control group.
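Both halves of that explanation, the uneven mix of men across the groups and their higher baseline risk, can be read off the Table Two counts. A rough sketch follows; the variable names are illustrative.

```python
# Counts from Table Two: gender -> group -> (heart attack, no heart attack).
table_two = {
    "female": {"control": (1, 19), "treatment": (3, 37)},
    "male": {"control": (12, 28), "treatment": (8, 12)},
}

# Share of each group that is male.
share_male = {}
for group in ("control", "treatment"):
    men = sum(table_two["male"][group])
    everyone = sum(sum(table_two[gender][group]) for gender in table_two)
    share_male[group] = men / everyone

# Overall heart attack rate for each gender, ignoring treatment.
rate_by_gender = {}
for gender, groups in table_two.items():
    heart_attacks = sum(counts[0] for counts in groups.values())
    people = sum(sum(counts) for counts in groups.values())
    rate_by_gender[gender] = heart_attacks / people

print(share_male)      # men are two thirds of control, one third of treatment
print(rate_by_gender)  # men have a much higher heart attack rate than women
```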

Viewed in this light it’s clear that in this observational study, gender is a confounding variable. Moreover, we can remove the confounding effect by stratifying on gender and reweighting each stratum so that the treatment and control groups are the same size. This practice is known as Inverse Probability of Treatment Weighting (IPTW).

Table Three: Observational Study One (IPTW)

        Control Group                     Treatment Group
        Heart Attack     No Heart Attack  Heart Attack     No Heart Attack
Female  1 x 2 = 2        19 x 2 = 38      3                37
Male    12               28               8 x 2 = 16       12 x 2 = 24
Total   14               66               19               61

Adjusting for gender, the heart attack rate is 23.8% (19/80) under treatment vs 17.5% (14/80) in control, as compared with unadjusted rates of 18.3% and 21.7% respectively. More significantly, our conclusion has reversed: the drug now appears to increase the risk of heart attack rather than reduce it.
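The weighting can also be written as code. This sketch uses the textbook form of the weight, 1 / P(group | gender), estimated from the Table Two counts. Those weights (1.5 and 3) differ from the multipliers in Table Three (1 and 2) only by a constant factor, which cancels when the rates are computed. The function name `iptw_rate` is an assumption for illustration.

```python
# Counts from Table Two: gender -> group -> (heart attack, no heart attack).
table_two = {
    "female": {"control": (1, 19), "treatment": (3, 37)},
    "male": {"control": (12, 28), "treatment": (8, 12)},
}


def iptw_rate(table, group):
    """Heart attack rate for one group after inverse probability weighting."""
    weighted_events = 0.0
    weighted_total = 0.0
    for groups in table.values():
        stratum_size = sum(sum(counts) for counts in groups.values())
        events, non_events = groups[group]
        group_size = events + non_events
        # Weight = 1 / P(group | stratum), estimated from the counts.
        weight = stratum_size / group_size
        weighted_events += weight * events
        weighted_total += weight * group_size
    return weighted_events / weighted_total


adjusted_treatment = iptw_rate(table_two, "treatment")  # 19/80 in Table Three
adjusted_control = iptw_rate(table_two, "control")      # 14/80 in Table Three

print(adjusted_treatment, adjusted_control)
```

After weighting, each gender contributes equally to both groups, so the comparison is no longer distorted by the different proportions of men.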

In Part One of this series we saw the magic of deep learning in action. In just 30 lines of code we shoveled in the data and the neural network figured out all the answers, better than most humans could. But a deep neural network operates at the level of association, so it can’t help here. If we feed it the wrong associations, the conclusions will also be wrong.

OK. But what if we adjust the data before doing machine learning? Let’s consider one final example. The data in Table Four is exactly the same as in the previous study, except this time the scenario is different. The drug in question lowers blood pressure. The subgroups are now defined by blood pressure rather than gender.

Table Four: Observational Study Two

                     Control Group                     Treatment Group
                     Heart Attack     No Heart Attack  Heart Attack     No Heart Attack
Low Blood Pressure   1                19               3                37
High Blood Pressure  12               28               8                12
Total                13               47               11               49

In the previous study, gender was a confounding variable, but what about blood pressure? Looking at the data in Table Four, the drug appears to be lowering blood pressure: two thirds of participants in the treatment group have low blood pressure, as compared with one third in the control group. Additionally, the rate of heart attack in the low blood pressure group is much lower (4/60 vs 20/60 in the high blood pressure group). This appears consistent with an effective drug.

Does it make sense to adjust the data as we did above, and conclude that the drug is bad?

Clearly not. In this case there is no confounding and it doesn’t make sense to adjust for blood pressure. This is the second part of Simpson’s paradox. Same data, different conclusion. The differences between the two scenarios are captured by the causal relationships shown in Figures One and Two.

Figure One: Causal Diagram - Observational Study One

Figure Two: Causal Diagram - Observational Study Two

In Figure One, gender is a confounder because it affects both the outcome (heart attack) and the likelihood of taking the drug. In Figure Two, blood pressure is a mediator: the drug affects blood pressure, which in turn affects the risk of heart attack, and this is likely the key mechanism driving the outcome. Adjusting for blood pressure would completely block this effect.
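One way to make the contrast concrete is to let the assumed causal role decide which estimate is used. The sketch below is illustrative only: `risk_difference` and its `role` argument are hypothetical names, the adjusted estimate is a simple standardisation over the strata (which gives the same rates as the IPTW table on these counts), and the unadjusted estimate is only trustworthy if there are no other confounders, which the counts alone cannot confirm.

```python
# The same counts appear in Table Two and Table Four:
# stratum -> group -> (heart attack, no heart attack).
counts = {
    "stratum_a": {"control": (1, 19), "treatment": (3, 37)},   # female / low blood pressure
    "stratum_b": {"control": (12, 28), "treatment": (8, 12)},  # male / high blood pressure
}


def pooled_risk(table, group):
    """Unadjusted risk: pool the strata and compare the groups as they are."""
    events = sum(table[s][group][0] for s in table)
    total = sum(sum(table[s][group]) for s in table)
    return events / total


def adjusted_risk(table, group):
    """Stratum risks averaged over the whole-study stratum distribution."""
    study_size = sum(sum(c) for groups in table.values() for c in groups.values())
    risk = 0.0
    for groups in table.values():
        stratum_size = sum(sum(c) for c in groups.values())
        events, non_events = groups[group]
        risk += (stratum_size / study_size) * events / (events + non_events)
    return risk


def risk_difference(table, role):
    """Treatment risk minus control risk, given the stratifying variable's role."""
    if role == "confounder":
        estimator = adjusted_risk   # adjust, as for gender in Figure One
    elif role == "mediator":
        estimator = pooled_risk     # do not adjust, as for blood pressure in Figure Two
    else:
        raise ValueError(f"unknown causal role: {role}")
    return estimator(table, "treatment") - estimator(table, "control")


print(risk_difference(counts, "confounder"))  # positive: drug looks harmful
print(risk_difference(counts, "mediator"))    # negative: drug looks beneficial
```

The data passed in is identical in both calls. Only the causal assumption changes, and with it the sign of the estimated effect.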

These examples show that given the same data, the conclusions can be different depending on the causal model. Which brings us to the key point: what matters is not the data, but the data generating process. Understanding the latter allows us to interpret the former. Radical empiricism has had some spectacular successes. But rely on it alone at your peril!


1 Simpson’s original paper, The Interpretation of Interaction in Contingency Tables, published in the Journal of the Royal Statistical Society, presents examples using cards and mortality rates but is somewhat esoteric. Pearl’s examples from The Book of Why are more enlightening, and have been used here instead of Simpson’s.
