Jackson Curtis

'Objective Ignorance' and the Limits of Predictability

I'm currently listening to the book Noise: A Flaw in Human Judgment. It's not a book I can wholeheartedly recommend (Thinking Fast and Slow by one of the same authors was much better), but it gives a name to a concept I think a lot about: objective ignorance.

Objective ignorance is an acknowledgement that every prediction problem has an upper bound: no matter how many observations we have and how good our statistical model is, our predictive ability will have some natural limit. Let's use the following table to demonstrate objective ignorance:

Gender | Height (inches) | Weight (pounds)
------ | --------------- | ---------------
Male   | 74              | 165
Female | 65              | 120
Male   | 68              | 180
Female | 67              | 140
Male   | 67              | 140

Now consider a model that predicts gender from height and weight. Any model that takes height and weight as inputs must output the same gender prediction for rows four and five, because their inputs are identical, yet their genders differ. This data has inherent objective ignorance -- in other words, we don't have enough information to make accurate predictions for every row.
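To see this concretely, here's a minimal sketch in Python (scikit-learn is my assumed tooling here, not something from the table itself). Whatever classifier you swap in, rows four and five get the same prediction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The five rows from the table: features are [height, weight].
X = np.array([[74, 165], [65, 120], [68, 180], [67, 140], [67, 140]])
y = np.array(["Male", "Female", "Male", "Female", "Male"])

model = LogisticRegression().fit(X, y)
preds = model.predict(X)

# Rows four and five have identical inputs, so any deterministic model
# must give them the same label; at most 4 of 5 rows can be correct.
print(preds[3] == preds[4])   # True, no matter how the model is fit
print((preds == y).mean())    # <= 0.8 for any possible model
```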


Thinking in terms of objective ignorance makes me a better data scientist. I find it useful to think of objective ignorance as a limit I'm trying to get as close as possible to. Conceptually, "objective ignorance" is the inaccuracy of a prediction given an infinite number of observations and the best possible model. For the prediction problem above, an infinite amount of data would let us fit a statistical model that optimally defines the cutoff region between predicting male and female, but if we applied that model to a large out-of-sample population, our predictive accuracy would still be far from 100% (perhaps around 75%). Our goal in modeling should not be to maximize accuracy so much as to build a model that efficiently gets as close as possible to that objective ignorance level of 75%.
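To make the infinite-data thought experiment concrete, here's a hedged simulation sketch. The gender-specific height and weight distributions below are invented for illustration, not estimates from real data; with these made-up parameters the out-of-sample accuracy happens to land near the 75% figure above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical overlapping populations. The distribution parameters are
# invented for illustration; the overlap, not the exact numbers, is the point.
def sample(n, mean_height, mean_weight, label):
    heights = rng.normal(mean_height, 4.0, n)
    weights = rng.normal(mean_weight, 25.0, n)
    return np.column_stack([heights, weights]), np.full(n, label)

# "Effectively infinite" training data.
X_m, y_m = sample(100_000, 69, 170, 1)  # male
X_f, y_f = sample(100_000, 65, 150, 0)  # female
X = np.vstack([X_m, X_f])
y = np.concatenate([y_m, y_f])

model = LogisticRegression(max_iter=1000).fit(X, y)

# A fresh out-of-sample draw: accuracy plateaus well below 100% no matter
# how much training data we add. That gap is the objective ignorance.
X_test_m, y_test_m = sample(10_000, 69, 170, 1)
X_test_f, y_test_f = sample(10_000, 65, 150, 0)
print(model.score(np.vstack([X_test_m, X_test_f]),
                  np.concatenate([y_test_m, y_test_f])))  # ~0.74 here
```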


Ways Data Scientists Fool Themselves

I think it's a very common problem that data scientists fool themselves into thinking there is much less objective ignorance than there actually is. There are two primary ways people deceive themselves into thinking they can get around objective ignorance, both of which boil down to the problem of modeling noise:

  1. They fit overly complex models to simple datasets, increasing the model's vulnerability to noise.

  2. They introduce additional variables with no underlying relationship to the variable of interest, which increases the model's opportunity to fit noise.

Consider the following adjustment to the table above:

Gender | Height (inches) | Weight (pounds) | Hair Color
------ | --------------- | --------------- | ----------
Male   | 74              | 165.5           | Brown
Female | 65              | 120.0           | Blond
Male   | 68              | 180.5           | Blond
Female | 67              | 140.5           | Brown
Male   | 67              | 140.0           | Red

We've "enriched" our dataset in two ways: (1) we've measured weight on a finer scale, increasing the possibility of modeling noise, and (2) we've introduced a completely spurious variable (hair color) that we can use as a predictor.


Consider how many different approaches could make the objective ignorance from the first example seem to disappear. By specifying weight more finely, we've opened up possibilities for non-linear models to distinguish between rows four and five. A decision tree trained on this data could give us: if height > 67, predict male; if height < 67, predict female; otherwise, if weight > 140, predict female, else male. Similarly, we could get the same result by incorporating the spurious hair color variable (if height = 67, then if hair color is red, predict male, else female).
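As a sketch of how easy this is, the snippet below (scikit-learn assumed again, with hair color one-hot encoded by hand) fits an unconstrained decision tree to the five enriched rows and scores a perfect 100% on its own training data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Enriched table, with hair color one-hot encoded by hand:
# columns are [height, weight, brown, blond, red].
X = np.array([
    [74, 165.5, 1, 0, 0],
    [65, 120.0, 0, 1, 0],
    [68, 180.5, 0, 1, 0],
    [67, 140.5, 1, 0, 0],
    [67, 140.0, 0, 0, 1],
])
y = np.array(["Male", "Female", "Male", "Female", "Male"])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.score(X, y))  # 1.0 -- the objective ignorance "disappears"...
# ...but only because the tree split on a half-pound of weight noise
# or on the spurious hair-color columns.
```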


Obviously there's a fleet of tools for identifying and correcting overfitting to noise, the most important being cross-validation and regularization, but the real problem is in the mind of the data scientist. If the model builder doesn't believe there is some level of objective ignorance in his data, nothing will stop him from finding more and more ways to hide its true extent. Our minds LOVE to see patterns that aren't real ("if males and females are the same height, it appears the females will weigh a little more on average") and our minds LOVE to generate hypotheses to fit data ("red hair is possibly a little more common in males, so maybe his red hair makes him more likely to be male"). The art of data modeling is to reasonably select which hypotheses are worth pursuing and then to evaluate those hypotheses in a way that avoids fitting noise and increases accuracy.
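Sticking with the toy example above, a quick leave-one-out cross-validation sketch shows one of those tools doing its job: the memorized tree scores 100% on the rows it has seen and falls apart on the row it hasn't:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same enriched table as the previous sketch:
# columns are [height, weight, brown, blond, red].
X = np.array([
    [74, 165.5, 1, 0, 0],
    [65, 120.0, 0, 1, 0],
    [68, 180.5, 0, 1, 0],
    [67, 140.5, 1, 0, 0],
    [67, 140.0, 0, 0, 1],
])
y = np.array(["Male", "Female", "Male", "Female", "Male"])

# Each row is predicted by a tree trained only on the other four rows.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())
print(scores.mean())  # far below the 100% training accuracy
```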


Key Takeaways
  • Every prediction problem is going to have some upper limit of predictability, and that limit is rarely 100%.

  • Adding additional predictor variables to your model may raise this limit, but it might also fool you into thinking the limit is much higher than it is.

  • Be wary of people who think that the right statistical model or right variables will dramatically increase predictive accuracy. This is a recipe for failure and missed expectations.

  • Be wary of using predictive accuracy as the measuring stick for every problem; even cross-validation and regularization can be fooled by small datasets and flexible models (see the sketch after this list).

  • Embrace the noisy world! Prediction is hard and being humbly cautious is better than being over-confidently wrong.
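As a hedged illustration of that fourth takeaway, the sketch below scores 200 purely random features against purely random labels and keeps the best cross-validation score. Everything here is synthetic; the point is that searching many spurious hypotheses inflates even an honest validation metric:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 20
y = rng.permutation(np.repeat([0, 1], n // 2))  # balanced random labels
X_noise = rng.normal(size=(n, 200))             # 200 spurious features

# Score each useless feature by 5-fold cross-validation, keep the best.
# Because we searched 200 times, the winner's score is inflated by chance.
best = max(
    cross_val_score(DecisionTreeClassifier(), X_noise[:, [j]], y, cv=5).mean()
    for j in range(X_noise.shape[1])
)
print(best)  # typically well above the true 50% -- CV fooled by selection
```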


