Jackson Curtis

Ranking NBA Refs - Pt. 2: Building a Bayesian Model

Updated: Jan 16, 2022



In Part 1 of my work on ranking the NBA refs, I showed the results of a Bayesian model to rank NBA refs by accuracy. In this post, I want to go in depth on the model building process and explore how the model was conceptualized, the assumptions that go into the model, and the code to fit the model using Stan. I hope that this post will be helpful to people who have some knowledge of Bayesian statistics, but wouldn't call themselves model fitting experts.


Conceptualization

As mentioned in my previous post, my Twitter bot that was posting about the refs for each Jazz game was just using a simple average of how many bad calls were made while each ref was on the court. This was naive for two reasons: (1) different refs spent dramatically different amounts of time on the court during Last Two Minute Reports, and (2) just because a ref was on the court doesn't mean he or she was to blame for the bad call; getting paired with crappy refs would unfairly hurt your ranking.


By building a model, I knew we could address these issues, but I had to decide what kind of model to build. Wanting to understand the "bad call rate" was what motivated me to explore Poisson models. If events (e.g. bad calls) happen at a constant rate, then the number of events that happen in a given time period (e.g. the last two minutes) will be Poisson distributed. Therefore, a Poisson likelihood seemed like a natural fit for the data.
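As a quick sanity check of that logic, here is a small simulation sketch (my own illustration with a made-up rate, not part of the original analysis): events that occur at a constant rate have exponential waiting times between them, and the count of events in a fixed window comes out Poisson distributed.

set.seed(1)
rate_per_minute <- 0.3   # made-up bad-call rate
window <- 2              # minutes covered by a Last Two Minute report
n_sim <- 100000

count_in_window <- function(rate, window) {
  t <- 0; n <- 0
  repeat {
    t <- t + rexp(1, rate)         # constant-rate waiting time until the next bad call
    if (t > window) return(n)
    n <- n + 1
  }
}

counts <- replicate(n_sim, count_in_window(rate_per_minute, window))

# Simulated distribution of counts vs. the Poisson(rate * window) probabilities
round(rbind(simulated = table(factor(counts, levels = 0:4)) / n_sim,
            poisson   = dpois(0:4, rate_per_minute * window)), 3)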


I originally explored using a (frequentist) Poisson generalized linear model. However, I realized this didn't match nicely with my desire to decompose the total bad calls into a share of the blame for each individual ref. If we gave each ref his or her own parameter in the model, the best-fitting model would have negative parameters for some refs, which causes conceptual and practical difficulties (if you have three refs in one game with negative rates, what do you predict for the number of bad calls?). I realized taking a Bayesian approach would allow me to solve two problems at once: (1) I could constrain each parameter to be positive using the prior, so it really would decompose the overall bad calls into three separate chunks, and (2) I could take advantage of Bayesian shrinkage to ensure that refs for whom we have few observations would regress to the mean and not end up at the extremes of the rankings.
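To illustrate the shrinkage point with a toy calculation (a single-ref simplification I made up for this post, ignoring the shared hyperparameter and the other two refs on the court): with a Poisson likelihood and an Exponential prior, which is just a Gamma(1, rate) prior, the posterior mean is a compromise between the prior mean and the raw rate, so a ref with very little observed court time gets pulled strongly toward the prior.

prior_rate <- 10   # Exponential(10) prior, i.e. prior mean of 0.1 bad calls per minute

posterior_mean <- function(bad_calls, minutes) {
  (1 + bad_calls) / (prior_rate + minutes)   # Gamma(1 + y, rate + minutes) posterior mean
}

posterior_mean(bad_calls = 1,  minutes = 2)    # raw rate 0.50 -> shrunk to about 0.17
posterior_mean(bad_calls = 50, minutes = 100)  # raw rate 0.50 -> stays around 0.46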


Assumptions

The model I am proposing comes with plenty of assumptions, both about the game of basketball and about the statistics. I like to lay these out as neatly as possible so they can be debated and discussed. While I don't believe all of these assumptions are strictly true, I have not thought of a more realistic model that could be practically fit given the data we have (but I am open to suggestions!).


Basketball Assumptions

  • The Last Two Minute Report is an unbiased source of truth - this is probably the biggest and least realistic assumption. We are using the L2M report to define what an incorrect call is. In reality, the L2M is just one set of people criticizing another set of people with the help of slow-motion cameras and multiple angles. If you don't put much stock in the L2M reports (Andy Larsen sure doesn't), you probably shouldn't put much stock in my final rankings.

  • The last two minutes of close games is a random sample of all reffing performance - this assumption lets us apply what we learn about refs from their L2M reports to their performance in general. If we want to constrain our inferences to "this is how referees do in the L2Ms," then we don't need this assumption, but if we want to generalize and say things like "we expect this ref to make 15 bad calls a game," then we need to assume a ref's last-two-minute behavior tells us something about their performance overall.

  • Each call is exactly one ref's to make (or not), and there's no spillover help from the other two refs on the court - This makes the total bad calls decomposable. By assuming that each ref has his or her own "bad call rate" and that the game's total bad call rate is the sum of the three refs' rates, we make the math tractable. From a basketball perspective, this means that refs do not "cover" for each other (if Ref A misses a foul, Ref B will not blow his whistle and call it). While that may not seem very realistic, any model that posits an "interaction" effect between refs on the court is going to need orders of magnitude more data to make meaningful inferences. And while that more complex model may seem more desirable, I think most people would agree that my proposed model and the more complex model would have highly correlated rankings of the refs, so it seems like a directionally safe assumption even though it's not strictly true.

Mathematical Assumptions

  • Each ref makes bad calls at a constant rate, independently of the others on the court - This ties directly into the last point above: by assuming a constant rate we can assume a Poisson distribution for each individual ref, and by assuming independence we can assume a Poisson distribution for each game (since the sum of three independent Poisson variables is another Poisson variable; the simulation sketch below checks this property numerically).

  • Exponential priors are appropriate for this data - in my model we assume

y_i ~ Poisson(minutes_i * (r_1 + r_2 + r_3))

where y_i is the number of bad calls in game i, minutes_i is the length of the report (usually two minutes, but sometimes longer with overtime), and r_1, r_2, and r_3 are the per-minute rate parameters for the three refs on the court in that game. This makes up our likelihood. Now we just need to add priors to our model parameters. We make our model hierarchical by adding one parameter that controls "league-wide reffing ability" and so define the remaining priors as follows:

r_j ~ Exponential(rate = 1/theta) for each ref j (prior mean theta)
theta ~ Exponential(rate = 10) (prior mean 0.1)

Theta is the mean of the rate of bad calls per minute, and we set the prior mean for this to 0.1. These priors say that reffing skill is distributed exponentially among the population of all potential NBA refs. This is probably too right-skewed (the NBA would fire anyone too far out in the right tail) for inferences regarding the population of refs, but our primary question of interest is the individual refs' ranks, not the hypothetical population distribution of all refs that could exist in the NBA. I wish I could cite an authority figure here, but I believe many Bayesians would agree that a model can still be good and useful if it gives you unreasonable predictions from one aspect of the model (the population distribution) but reasonable predictions from another part of the model (the individual ranks), as long as your question of interest lies in the reasonable part of the model - but feel free to disagree with me on that!
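To get a feel for what these priors imply, here is a small simulation sketch (my own illustration with made-up crew rates, not code from the original analysis). It draws rates from the hierarchical prior above and also checks numerically that summing three refs' independent Poisson counts matches a single Poisson with the summed rate, as assumed in the first bullet.

set.seed(2)
n_draws <- 100000

theta <- rexp(n_draws, rate = 10)         # hyperprior: mean 0.1 bad calls per minute
r     <- rexp(n_draws, rate = 1 / theta)  # a ref's per-minute rate, exponential with mean theta

mean(r)                                   # implied marginal prior mean for a ref's rate (~0.1)
quantile(r, c(0.5, 0.95))                 # prior median and 95th percentile

# Sum of three independent Poisson counts vs. one Poisson with the summed rate
r_crew <- c(0.05, 0.10, 0.20)             # made-up per-minute rates for a three-ref crew
sum_of_three <- rpois(n_draws, 2 * r_crew[1]) +
  rpois(n_draws, 2 * r_crew[2]) +
  rpois(n_draws, 2 * r_crew[3])
one_poisson  <- rpois(n_draws, 2 * sum(r_crew))
c(mean(sum_of_three), mean(one_poisson))  # both close to 0.7 bad calls per report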


(Editor's note: if you compare the graph at the top this week to the graph at the top last week, you'll notice some slight differences. This is because I had a bug in my code setting the prior for the rs equal to exp(theta) instead of exp(1/theta). A very dramatic difference in priors resulted in a very small change in the final result. I think this is because both priors shared the same mode, and the data quickly ruled out the huge mean I accidentally set in the prior, so it ended up being not so different after all.)
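A quick check of that bug (my own illustration, not from the original code) shows just how different those two priors are when theta is around 0.1, even though both have their mode at zero:

theta <- 0.1
c(buggy_mean = 1 / theta, intended_mean = theta)   # rate = theta implies mean ~10; rate = 1/theta implies mean ~0.1

curve(dexp(x, rate = 1 / theta), from = 0, to = 1, ylab = "prior density")  # intended prior
curve(dexp(x, rate = theta), from = 0, to = 1, add = TRUE, lty = 2)         # buggy prior; both modes are at zero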


Code

One reason this model was exciting was that it was one of my first experiences using Stan where (after fixing typos) the model ran quickly and without convergence issues. This is very rare for me! The Stan file looks like this:

data {
  int<lower=0> N_games;
  int<lower=0> N_refs;
  int bad_calls[N_games];
  vector[N_games] minutes;
  matrix[N_games, N_refs] refs;
}

The data passed to the Stan program was an integer for the number of games (400), an integer for the total number of refs (79), a vector of the number of bad calls per game, a vector of the number of minutes judged in each game, and a matrix of 1s/0s where each row is a game and a 1 in a ref's column means that ref worked that game.
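For context, the stan_data list could be assembled something like this (a rough sketch with made-up column names like ref1/ref2/ref3; the actual preprocessing code is on Github):

# games: one row per L2M report, with hypothetical columns bad_calls, minutes, ref1, ref2, ref3
ref_names <- sort(unique(c(games$ref1, games$ref2, games$ref3)))

# 1/0 matrix: one row per game, one column per ref, 1 if the ref worked that game
refs_matrix <- sapply(ref_names, function(ref) {
  as.integer(games$ref1 == ref | games$ref2 == ref | games$ref3 == ref)
})

stan_data <- list(
  N_games   = nrow(games),
  N_refs    = length(ref_names),
  bad_calls = games$bad_calls,
  minutes   = games$minutes,
  refs      = refs_matrix
)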


Now we define our parameters:

parameters {
  vector<lower=0>[N_refs] r;
  real<lower=0> theta;
}

transformed parameters{
  vector<lower=0>[N_games] mu;
  for (i in 1:N_games){
    // expected bad calls in game i: minutes times the sum of the three refs' rates
    mu[i] = minutes[i]*dot_product(r, row(refs, i));
  }
}

We declare one r for each ref along with theta, and then we declare a transformed vector of mu parameters, where mu for each game is the number of minutes multiplied by the sum of the three participating refs' rates.


The last step is to set up our model. This is the first time I really grasped and used Stan's "target" concept. The idea of target is that all of the pieces going into your final posterior calculation (likelihood, prior, hyperprior) can be neatly defined as contributions to the final target log posterior. I'm not sure if there are computational benefits to using target instead of the alternative sampling-statement notation (e.g. bad_calls[i] ~ poisson(mu[i])), but the syntax (once I understood it) seemed conceptually elegant in defining the model:

model {
  // likelihood: bad calls in each game are Poisson with mean mu[i]
  for (i in 1:N_games){
    target += poisson_lpmf(bad_calls[i] | mu[i]);
  }
  // prior: each ref's rate is exponential with mean theta
  target += exponential_lpdf(r | 1/theta);
  // hyperprior on the league-wide mean rate (prior mean 0.1)
  target += exponential_lpdf(theta | 10);
}

That's it - the whole Stan file comes in at a little less than 30 lines of code. From there I use the rstan package to pass it the data, and a new package I just learned about called shinystan to diagnose the MCMC:


library(rstan)
fit1 <- stan(seed = 2,
  file = "PoissonModel.stan",  # Stan program
  data = stan_data,    # named list of data
  chains = 4,             # number of Markov chains
  warmup = 4000,          # number of warmup iterations per chain
  iter = 8000,            # total number of iterations per chain
  cores = 4               # number of cores
)

shinystan::launch_shinystan(fit1)

The launch_shinystan function launches a Shiny app with some really awesome ways to make sure your chains converged and to see the estimated parameters and associated uncertainty (see below). It also seems like it will work for any Stan model out of the box, so I highly recommend giving it a try.
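Once convergence checks out, a natural next step (my own sketch, not from the original post) is to pull the posterior draws and rank the refs by their posterior mean bad-call rate. This assumes the columns of r line up with the hypothetical ref_names vector from the data-assembly sketch above:

post <- rstan::extract(fit1)       # post$r is an iterations x N_refs matrix

rate_means <- colMeans(post$r)     # posterior mean bad-call rate per minute for each ref
ranking <- order(rate_means)       # fewest expected bad calls first

head(data.frame(ref = ref_names[ranking],
                rate_per_minute = round(rate_means[ranking], 3)))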


Conclusion

I hope this post is helpful in getting a glimpse of the kind of process I go through when model building. One thing I love about statistics, and particularly Bayesian statistics, is understanding how our assumptions about the real world find their way into a mathematical model, and, conversely, how assumptions we have to make on the modeling side (for convenience or mathematical tractability) may affect the real-world application of our model.


Next week I plan to do one more post on this model about credible intervals. The great thing about a fully Bayesian model is the simplicity of calculating complex credible intervals. Not only can we calculate credible intervals on our parameters, but also on other quantities of interest like the ranks themselves. Spoiler alert: there is a huge amount of uncertainty in the model due to the sparsity of the data. We can use our model to show just how much more data the NBA would need to provide to be reasonably confident about the overall rankings of the officials.


As always, you can check out the full code for this project on Github.

