Although the full operation is fascinating, this post will primarily focus on the evaluation of the predictions. The particular model in question has had a decent start to the summer. So how can we judge the accuracy of these picks? And what does that tell us about the feasibility of betting on sports?

While much of this post will seem straightforward, answering these questions gave me an increased appreciation for the variability in sporting outcomes with respect to gambling. I’ve posted the code here, in case anyone else is interested in using a similar process with their own projections.

*********

First, some background. The data consists of 659 picks made versus the game’s opening money line since the start of the 2017 season. Each pick is based on a model-estimated probability for each team in each game, which is then compared to that team’s market probability. There have been about 950 MLB games thus far, which means that the model has taken a team in about 7 of every 10 contests. On the remaining games, probabilities for each team are too close to the market’s price to have an edge. Those games were dropped from the data.
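As a rough illustration of the comparison described above, here is a minimal sketch of turning an American money line into an implied probability and measuring the model’s edge against it. The function names and the example prices are hypothetical, and the post does not state how large an edge must be before a pick is made, so no threshold is shown.

```python
def implied_probability(money_line: int) -> float:
    """Break-even win probability implied by an American money line."""
    if money_line < 0:
        # Favorite: risk |line| to win 100.
        return -money_line / (-money_line + 100)
    # Underdog: risk 100 to win `line`.
    return 100 / (money_line + 100)


def edge(model_prob: float, money_line: int) -> float:
    """Model probability minus the market's implied probability."""
    return model_prob - implied_probability(money_line)


# Hypothetical game: home favorite at -130, road underdog at +120.
home = implied_probability(-130)
away = implied_probability(120)
print(round(home, 3), round(away, 3))   # 0.565 0.455
print(round(home + away - 1, 3))        # 0.02, the market's built-in margin
print(round(edge(0.50, 120), 3))        # 0.045 if the model made the underdog a coin flip
```

The two implied probabilities sum to more than 1; that excess is the vig that shows up later as the negative center of the simulated seasons.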

The data also contain the observed differences between the model estimated probability and implied probability, relative investments (made assuming an equal balance prior to all games), the amount to be won or lost depending on the game’s result, the actual game results (win or lose), closing money line prices, and the difference in implied team probabilities between the opening and closing odds. Note that bets are made on “units” – this could be dollars, pistachio shells, or whatever your mind can imagine. Generally, higher units are placed on bigger edges; the average unit per pick is about 0.60. Note that the highest unit is capped at 1.0, which is done given the non-zero chance that probabilities are off on account of lineup or pitching changes.

Next, some summary statistics. While a nearly identical number of picks have backed the away team as have backed the home team (51% to 49%), nearly twice as many underdogs have been backed compared to favorites (64% to 36%). Altogether, the model is up about 27 units thus far, which reflects roughly a 7% return on investment. Game results have been most kind towards backing the home team (+24.5 units) compared to the visiting team (+2.5 units), with underdogs slightly more profitable than favorites (+19.9 to +7.1 units). While a deeper investigation could look into whether these differences are meaningful, that’s not a primary goal.

*********

One thing I picked up quickly is how variable results can appear over short periods of time. Here’s the cumulative profit from day one of the season (shown in red). In the background are 200 simulated season-to-date profits, done using the given market implied probabilities as the true probabilities for each team.

Within any given week (say, 75 picks), profits could vary by as much as 15 or so units. And at certain time points (say, between picks 100 and 210), all appears lost, with picks going into a deep dive. Even for me, as someone whose job entails having a decent understanding of randomness, it’s tempting to look for patterns in the red line, even though none likely exist.

Relative to random season outcomes simulated using the opening market probabilities, model picks currently stand in the 96th percentile. That is, only about 4% of sequences using random game outcomes would be doing this well if the opening market probabilities reflected the true probabilities. And note the center of the above sequences: roughly -10 units, which accounts for vig taken in by betting markets.
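The simulation behind that percentile can be sketched roughly as follows. This is not the posted code: the slate of picks is made up (659 identical bets), and it assumes the “market probability” is the implied probability with the vig removed by normalizing the two sides, which is one way to end up with simulated profits centered below zero.

```python
import random


def implied_probability(money_line):
    if money_line < 0:
        return -money_line / (-money_line + 100)
    return 100 / (money_line + 100)


def win_amount(stake, money_line):
    """Profit on a winning bet of `stake` units."""
    if money_line < 0:
        return stake * 100 / -money_line
    return stake * money_line / 100


def simulate_profit(picks, rng):
    """picks: (true win probability, stake, money line) for each bet."""
    profit = 0.0
    for p, stake, line in picks:
        profit += win_amount(stake, line) if rng.random() < p else -stake
    return profit


def percentile(observed, simulated):
    """Percent of simulated profits that fall below the observed profit."""
    return 100 * sum(s < observed for s in simulated) / len(simulated)


# Hypothetical slate: 659 picks on a +105 team whose opponent is priced at -125.
raw_pick, raw_opp = implied_probability(105), implied_probability(-125)
fair_pick = raw_pick / (raw_pick + raw_opp)   # assumption: vig removed by normalizing
picks = [(fair_pick, 0.6, 105)] * 659

rng = random.Random(2017)
sims = [simulate_profit(picks, rng) for _ in range(200)]
print(round(sum(sims) / len(sims), 1))   # average simulated profit, below zero
print(percentile(27.0, sims))            # where a +27 unit season would rank
```

Swapping the model’s probabilities in for `fair_pick` gives the second kind of simulation, where the expectation is positive but individual sequences can still finish in the red.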

*********

In addition to the chart above, I made a similar one (not shown) with one important difference; instead of market-implied prices as the truth, I used the model-generated probabilities. In expectation, this simulation will yield positive profits. But in what was a total shocker for me, it was still reasonable – it happened about 5% of the time – for such a model to turn a *negative* profit through 650 picks. That is, even with known, better than market probabilities for each game outcome, it’s still feasible to lose money across 650 games. First thoughts that went through my mind:

-650 games is *three* NFL seasons’ worth. That is, an NFL bettor taking every game could have three straight losing seasons while still having better than market odds for each of his or her picks.

-Related: I could not be a professional gambler.

*********

I thought it would be interesting to take a look at which team the model has picked most often (both for and against). Here’s that plot. On the x-axis is the total investment made, either for (on the left) or against (on the right) each team, and the y-axis is the season-to-date profit.

This particular model continues to back the Padres and Mets at most opportunities, while picking against the Red Sox. Altogether, those picks have mostly broken even.

Meanwhile, the model has had some success taking the Rockies, White Sox, and Rays, while likewise performing well when fading the Indians, Giants, and Blue Jays. Picking the Phillies has not been so fruitful, nor has picking against the Diamondbacks.

*********

Our final check looks at how the model has done relative to line movement. If the model can “predict” the direction where prices will go in the moments leading up to the game, that would generally be a good thing. From what I’ve been told, closing market prices are generally more efficient than opening numbers.

Here’s a histogram showing line movement (on the probability scale). Positive changes reflect movement in the direction of the model’s chosen team.

Among the picks to date, about 1 in 20 opening lines precisely match closing lines. A tick under 58% of games have moved in the direction of the model’s team, while about 37% have moved against.

Across all contests, the average price has moved about 0.6% in the direction of the model’s chosen team. While this seems like a small number, across several hundred games, that type of advantage would seemingly add up.

There’s also a decent link between the model’s projected edge for a team and the likelihood of movement in the direction of that team. The average game moved 0.25% among games with smaller-sized edges, 0.5% on games with medium-sized edges, and a full 1.0% on games with the largest edges (putting about 200 games in each of these categories).

*********

Assorted final notes:

-Log-loss is a proper scoring rule for binary outcomes, but it is less evident how log-loss can precisely evaluate this model, given that some picks are made with more of an edge than others (perhaps a weighted log-loss?). Additionally, there’s no immediate interpretability to log-loss. In any case, the average log-loss is -0.6845 for the market implied probabilities and -0.6836 for the model estimated probabilities (closer to 0 is better).

-It is tempting to tie team allocations (as far as supporting or fading) to changes to the game that have been seen this summer. This includes the supposed juiced ball and increases to HR/FB ratio. Something to keep an eye on.

-How do others evaluate picks, either their own or from others? My prior is to trust the market until proven otherwise, and that’s a very strong prior.
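On the log-loss note above, here is a small sketch of the calculation, using the sign convention from the post (average log-likelihood, so closer to 0 is better). The optional `weights` argument is only one possible reading of the “weighted log-loss” idea, and the picks and stakes are invented.

```python
import math


def mean_log_score(probs, outcomes, weights=None):
    """Average log-likelihood of binary outcomes; closer to 0 is better."""
    if weights is None:
        weights = [1.0] * len(probs)
    total = sum(
        w * math.log(p if won else 1 - p)
        for p, won, w in zip(probs, outcomes, weights)
    )
    return total / sum(weights)


# Hypothetical picks: probability the chosen team wins, and whether it did.
probs = [0.55, 0.48, 0.60, 0.52]
won = [True, False, True, False]
units = [0.8, 0.3, 1.0, 0.5]   # hypothetical stakes, used as weights

print(round(mean_log_score([0.5] * 4, won), 4))   # -0.6931, the coin-flip baseline
print(round(mean_log_score(probs, won), 4))
print(round(mean_log_score(probs, won, units), 4))
```

The coin-flip baseline gives some interpretability: both -0.6845 and -0.6836 sit only slightly above log(0.5), about -0.6931, which is what calling every game 50/50 would score.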

]]>

The amount of preparation that team scouts and analysts put in is overwhelming. This includes sleeping at the office, 20-hour work days, and hours upon hours of poring over film and interviewing players and their coaches. With teams wanting to learn just about everything there is to know about a player, no stone is left unturned.

Additionally, the effort that teams place on evaluating players has grown leaps and bounds over the past half century. In the 1970s, for example, Washington famously went a decade without a first round pick. Given the differences between now and then, one would expect that at some level, teams can better draft players now than they could decades ago.

But have teams improved at drafting?

In this post, we’ll look into the evolution of NFL drafting ability over time, and compare it to other North American Leagues.

**************************************************

Our interest lies in the link between where a player was drafted (pick number) and how well he performs. No player-level metric is perfect, but Pro Football Reference’s career approximate value (CAV) provides a decent snapshot of a player’s talent. We’ll use that as our outcome.

Not surprisingly, the distribution of CAV is strongly skewed right, with most players between 0 and 20 but a handful of stars rating above 100. Thus, we’ll prefer a non-parametric tool to a parametric one, so as to avoid making assumptions about CAV’s underlying distribution.

One possibility would be to set binary cutoffs, as Chase does here, to assess the percentage of a draft’s CAV that falls within a certain range of picks. Alternatively, rank correlations (Spearman, Kendall’s Tau) can help us assess the (hopefully monotonic) dependency between draft spot and performance while maintaining all of a draft’s information as far as which players were ranked better.

Looking back, here are the yearly Spearman rank correlations between draft position and CAV, separated by round. Values of 1 would reflect a perfectly monotonic link between draft spot and performance, while values near 0 would reflect no link. The blue line reflects possible non-linear trends over time, with the grey area reflecting our uncertainty.
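For readers who want to see the statistic itself, here is a minimal, dependency-free sketch of a Spearman rank correlation (a Pearson correlation computed on ranks, with ties sharing an average rank). The five-player draft and its CAV values are made up, and pick numbers are negated on the assumption that a value of 1 should mean earlier picks performed better.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical five-pick draft: pick number and career approximate value.
picks = [1, 2, 3, 4, 5]
cav = [60, 110, 20, 25, 3]

# Negate picks so that 1 means "earlier picks did better" (assumed convention).
rho = spearman([-p for p in picks], cav)
print(round(rho, 3))   # 0.8
```

Because only ranks enter the calculation, the 110-CAV star counts no more than any other player who outperformed his draft slot, which is why a handful of outliers affects this measure far less than it affects Pearson’s coefficient.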

There’s no evidence that in any round, NFL teams are doing a better job at selecting the best players early. In 2013, for example, the 1st round rank correlation coefficient was about -0.2, indicating a year where picking earlier in the draft was linked to *worse* performance. Of the first ten picks from that draft, only one has made a Pro Bowl, compared to five of the final ten selections.

Additionally, note that for rounds 4-6, there’s little evidence of a difference between rank correlation and 0, which suggests that by that point, there’s not a big benefit to picking earlier in the round.

Adding a positional separation allows us both to assess whether any changes over time are tied to a specific type of player, and to account for the fact that if teams draft for positional need, that may supersede taking the best overall player available.

Per CAV, there’s been a *slight* improvement in the drafting of running backs and wide receivers, and, after a dip in the 1980s, tight ends. The latter is potentially related to how teams may be more apt to draft receiving tight ends earlier in the draft, with less of an emphasis on blocking tight ends. Receiving tight ends may be easier to evaluate, for example, or may score higher on CAV.

For most positions, though, the link between positional CAV rank and draft position is as noisy as it was four decades ago. Interestingly, there does not appear to be any one specific position where teams are better at identifying talent.

Note that in using CAV, I was able to chart rank correlations all the way through the 2016 draft. However, if anything, this likely overestimates the recent link between draft position and performance, since teams are more likely to give their earlier picks playing time in their first few years. Once lower drafted players have more time to establish themselves, we would expect the link between draft position and performance to lessen, which could lower the recent scores.

**************************************************

While it’s easy to pick at the NFL’s inability to noticeably improve player evaluation over time, it’d be more telling if we could find that other professional leagues *have* gotten better over time.

Using the same metrics described here, I charted the link between pick number and player performance in MLB, the NBA, NFL, and NHL. I focused on each league’s first 60 choices (64 for the NFL), which matches the current length of the NBA draft.

A few things stand out.

First, the NBA bests all other leagues in overall performance, which isn’t surprising given the steepness of its draft curve and how much more its best players matter relative to the league average.

Second, over time, the link between performance and draft position has grown stronger in both MLB and the NHL. While improved drafting ability is one possibility in both sports, in MLB, changes to the draft structure may also be responsible. Specifically, big market teams are no longer allowed to award big bonuses to players later in rounds, which could have been pulling correlations closer to 0.

In the NBA, after an early possible spike, there doesn’t seem to be any improvement over time. However, given that the NBA is already starting with rank correlations closer to 1, there’s also less room for improvement.

Altogether, it’s certainly feasible that the NHL and MLB have gotten better at drafting, while the NBA may have already reached its peak. In the NFL, meanwhile, drafting ability has either reached its limited peak, or involves so much noise that it’s difficult to identify a substantial, league-wide improvement.

**************************************************

Postscripts:

-I dropped specialists in the NFL for the position level chart, given that so few are drafted each year.

-For those of you who gave Chase’s article a read, there was a tangible difference between using rank correlation (as I did above) and the traditional correlation coefficient (Pearson’s). With the latter, there does appear to be an improvement over time, potentially driven by a few outlying observations drafted early.

-There may be other reasons for the NHL’s apparent improvement over time besides an improvement in player evaluation, or it could be tied to my choice of player outcome (games played). Separating by rounds, the greatest efficiency improvement appeared to be in round 1.

-Code is available here. This includes the scraping code, so it could take a few minutes. Feel free to play around.

]]>

Tiebreakers and divisional qualification rules notwithstanding, both the Islanders and Lightning finished a point out of the playoffs. That’s a difference between what would likely be at least a 1 in 25 chance at a Stanley Cup *and* at least two games of home playoff revenue, or an early start to golf season. That point difference was immense.

But there’s a problem with using the standings above – the points aren’t equivalent. Specifically, I’ll argue in this post that the 94 points from the Islanders is, all else equal, likely more impressive than the 95 points for the Leafs, given the caliber of each team’s schedule.

**The NHL’s unbalanced schedule.**

First, some background. NHL teams play intra-division opponents either four *or* five times, inter-division/intra-conference opponents three times, and all inter-conference opponents two times.

This is a small but notable difference. The Islanders play in the NHL’s Metro Division, one stacked this year with two really good teams (Pittsburgh, Columbus) and one of the best teams in the last decade (Washington). Moreover, the Islanders faced the unenviable task of being one of two teams this season to face the Capitals five times (NYI added five games against Carolina, too). Meanwhile, the Leafs faced each Metro team only three times apiece, while adding five-game sets against the Florida Panthers and Montreal Canadiens.

Does that make a difference? Surely it does.

Here’s a chart showing the estimated impact of the NHL’s unbalanced, division and conference-loaded schedule. Each team is shown on the *x*-axis, and the *y*-axis corresponds to the net benefit (or loss) in standings points, in expectation, comparing the NHL’s unbalanced schedule to one in which opponents are randomly assigned (and allowing for the fact that teams cannot play themselves). The plot is faceted by division.

The differences are small (note the *y*-axis), but they are notable and follow our intuition. The Islanders’ schedule difficulty likely cost the team about 1.3 points, on average, relative to a league-average schedule. Meanwhile, the Leafs’ schedule was worth somewhere around +0.6 points. That difference, of course, is larger than the gaps that we observed in the standings. If each franchise had played a balanced schedule, ignoring all other information, we’d have expected the Islanders to finish a nose ahead of Toronto.

On a division level, all Metro teams faced a more difficult than average schedule, led by the Devils, who faced the Rangers and Penguins five times apiece. Meanwhile, nearly all Western Conference teams benefitted from being in the same conference as the Colorado Avalanche, Vancouver Canucks, and Arizona Coyotes (and from not being in the same conference as the Capitals). In particular, it was a good year to be in the Pacific – the contrast between that division and the Metro division is startling.

We know already that the NHL’s divisional format for the playoff qualification has put an unfair burden on teams in top divisions. It appears that the scheduling format, to a far lesser degree, only works to make that burden more difficult to overcome.
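To make the idea of a schedule effect concrete, here is a deliberately simplified sketch. It is not the simulation used for the chart: it works with expected wins rather than standings points, uses an assumed logistic link between strength differences and win probability, and all team strengths and the ten-game schedule are hypothetical.

```python
import math


def win_probability(strength_a, strength_b):
    """Assumed logistic link between a strength gap and a win probability."""
    return 1 / (1 + math.exp(-(strength_a - strength_b)))


def expected_wins(team, opponents, strengths):
    return sum(win_probability(strengths[team], strengths[o]) for o in opponents)


# Hypothetical four-team league; team A sees the strong team B most often.
strengths = {"A": 0.4, "B": 0.3, "C": -0.1, "D": -0.6}
actual_schedule = ["B"] * 5 + ["C"] * 3 + ["D"] * 2

# Balanced benchmark: every game is against an "average" other team.
others = [t for t in strengths if t != "A"]
balanced_per_game = sum(
    win_probability(strengths["A"], strengths[o]) for o in others
) / len(others)

schedule_effect = (
    expected_wins("A", actual_schedule, strengths)
    - len(actual_schedule) * balanced_per_game
)
print(round(schedule_effect, 2))   # negative: the unbalanced schedule costs team A wins
```

Converting this to standings points would also require handling the point awarded for overtime losses, which is part of why the post relies on simulated seasons instead.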

**Final notes**

-Results stem from 100,000 simulated season point totals in which opponents were generated at random, relative to simulated season point totals using this year’s actual schedule. As a result, they may not reflect the true, actual differences in schedule difficulty. I used such a large number of simulations because at smaller numbers, there was a bit too much inconsistency in the resulting charts for my preference. Additionally, the goal (for now) here is just the typical difference in points. Across simulations, it’s not uncommon for teams to jump by as many as 10 points in one direction or the other.

-The Islanders finished as a top-8 team in the conference (what I’ll call making the playoffs) about 4% more often (48%, versus 44%) during iterations when I used random scheduling, compared to the current version. Not a big difference, but roughly what I’d expect.

-In 100,000 simulated seasons with the current schedule, the Avalanche made the playoffs 8 times, and the Capitals missed the playoffs 278 times.

-Code is here.

]]>

In particular, I was left curious as to the end impact of the NHL’s randomness. If most game outcomes are near coin flips, what impact does that have on season outcomes? In this post, I’ll reflect back on the NHL’s final standings, with the goal of better understanding the underlying differences in team strengths, and what that means about where teams finish.

**Why final standings?**

Sports leagues use final regular season standings to both determine playoff eligibility and provide a rough sense of where other teams will pick in future player drafts. Indeed, there’s no simpler mechanism by which we judge a team’s regular season success than by where it finished in the standings. So when the Washington Capitals finish the season with 118 points, we are left to assume that the Capitals are a 118 point team.

However, standings are a function of several hard-to-define inputs, including team talent, schedule difficulty, timing, injuries, and, of course, luck. And so while the Capitals finished with 118 points, it’d also be exceedingly unlikely for the Capitals to finish on that exact 118 number if we were to somehow replay the regular season again under an identical set of circumstances.

But how many points could that 118-point Capitals team finish with? And what meaningful differences between teams can we extract from where they finish in the standings?

**Resampling a season**

Even though the NHL only plays each season one time, it doesn’t mean that we also face the same restriction. Indeed, perhaps the only way to revisit differences in league standings that could have been observed would be to replay games. If we can assume that the probability of each game’s outcome is known — admittedly, an unjustifiable assumption — then the resampling of each season is quite straightforward. Given each game’s probability, we can simulate each regular season contest, impute the corresponding final standings, and repeat this process several times.

Here’s what the 2016-17 NHL season would look like when replayed 1000 times (assumptions are provided at the end). The chart below shows imputed point totals for each team, provided in overlaid, Joy Division inspired density plots.
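The replay procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind the chart: the three-team schedule and its game probabilities are invented, and it reads the overtime assumption as “24% of games go to overtime, where the winner is a coin flip and the loser still earns a point.”

```python
import random
from collections import Counter

OT_RATE = 0.24   # share of games assumed to reach overtime


def replay_season(games, rng):
    """games: (home, away, probability the home team wins in regulation play)."""
    points = Counter()
    for home, away, p_home in games:
        if rng.random() < OT_RATE:
            # Overtime treated as a coin flip; the loser keeps one point.
            winner, loser = (home, away) if rng.random() < 0.5 else (away, home)
            points[loser] += 1
        else:
            winner, loser = (home, away) if rng.random() < p_home else (away, home)
            points[loser] += 0   # make sure the loser appears in the table
        points[winner] += 2
    return points


# Hypothetical 60-game mini-league.
games = [("A", "B", 0.60), ("B", "C", 0.55), ("C", "A", 0.45)] * 20

rng = random.Random(1617)
seasons = [replay_season(games, rng) for _ in range(1000)]
team_a = sorted(season["A"] for season in seasons)
print(team_a[0], team_a[len(team_a) // 2], team_a[-1])   # min, median, max points for A
```

Even in this toy league, the same team with the same game probabilities lands on a wide range of point totals, which is the pattern the density plots show.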

**What do we learn?**

To start, that 118-point Capitals team could have easily been a 108-point Capitals team or a 128-point Capitals team. In fact, for all teams, swings of 10 points in either direction are not that surprising. For the Capitals, those 10 points could be the difference between being the top-seed in the Metro Division and finishing as the third seed. For several other franchises, 10 points is the difference between making the playoffs and staying home.

Additionally, while the standings tell us that the Capitals were the league’s best team in 2016-17, there’s enough overlap between Washington and several other teams, including Chicago, Minnesota, Pittsburgh, and even Montreal, that it’s insufficient to use standings alone to justify arguing that Washington is the best team. Given the standings and the format of the league’s schedule, we know that the Caps were better than Vancouver and that they were probably better than Toronto. However, we’d hardly have any idea if they were better than Chicago when looking at the standings alone.

**Postscripts**

Here are the assumptions I used to replay the 16-17 NHL season.

-Team strengths were estimated using a Bradley-Terry model with a fixed home advantage. While this posits that wins alone are the best way to analyze hockey teams, I’m okay with that for this exercise, as it means that our imputed standings will roughly be centered around this year’s observed standings. Game-level probabilities were extracted using each team’s estimated team strength, while providing a fixed advantage to the home team.

-If anything, I’m likely underestimating the amount of randomness in league standings. In estimating probabilities, I assumed that team strengths, as estimated using the Bradley-Terry model, were known. Of course, that’s not the case, and a more detailed imputation would account for the uncertainty in these parameter estimates.

-I assumed that overtime outcomes are random, with OT occurring in 24% of games. This rate matches the fraction of 16-17 contests that have gone to OT. Note that OT outcomes are not entirely random (see here), but they are probably close enough for our purposes here. Recall that the NHL’s scoring system awards a point to the overtime loser, so this is our way of accounting for that.

-Stay tuned for a future post, in which I’ll look at the role of the NHL’s unbalanced schedule in determining final standings. I’ll also share the code at that point.
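For the first assumption above, this is roughly how a Bradley-Terry model with a fixed home advantage turns two team strengths into a game probability. The parameter values are hypothetical rather than estimates from the 2016-17 season, and fitting the strengths is not shown.

```python
import math


def home_win_probability(home_strength, away_strength, home_advantage):
    """Bradley-Terry on the log-odds scale with a fixed home-ice term."""
    log_odds = home_advantage + home_strength - away_strength
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical strengths for a strong, average, and weak team.
strength = {"strong": 0.45, "average": 0.05, "weak": -0.50}
HOME_ADVANTAGE = 0.20   # hypothetical value

print(round(home_win_probability(strength["strong"], strength["weak"], HOME_ADVANTAGE), 3))
print(round(home_win_probability(strength["weak"], strength["strong"], HOME_ADVANTAGE), 3))
print(round(home_win_probability(strength["average"], strength["average"], HOME_ADVANTAGE), 3))
```

With strengths on this scale, even the most lopsided matchup stays well short of a sure thing, which is the source of the overlap between teams in the replayed standings.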

]]>

Unfortunately, not everyone was enamored.

While it’s tempting to deride conclusions like Pete’s, it’s also too easy of a way out. And, to be honest, I share a small piece of his frustration, because there’s a lingering secret behind win probability models:

Essentially, they’re all wrong.

But win probability models can still be useful.

To examine more deeply, I’ll compare 6 independently created win probability models using projections from Super Bowl 51. Lessons learned can help us better understand how these models operate. Additionally, I’ll provide one example of how to check a win probability model’s accuracy, and share some guidelines for how we should better disseminate win probability information.

**So, what is a win probability?**

A win probability is the likelihood that, given any time-state in the game, a certain team will win the game.

Win probabilities can be either subjective (“This game feels like a toss-up”) or objective (“My statistical model gives the Falcons a 50% chance of winning”). This post focuses on the latter type, which has become increasingly popular across sports over the last decade.

**What are some NFL win probability models?**

Here are the models that I’ll compare in this post.

*Pro Football Reference* (PFR): Stemming from research by Wayne Winston and Hal Stern, PFR’s model uses the normal approximation and expected points to quantify team chances of winning. Read more in Neil Paine’s post here.

*ESPN*: ESPN’s predictions, provided by Henry Gargiulo and Brian Burke, are derived from an ensemble of machine learning models.

*PhD Football*: An open-sourced creation of Andrew Schechtman-Rook built using Python, this model uses logistic regression to predict game outcomes.

*nflscrapR*: An R package from graduate students at Carnegie Mellon, win probabilities stem from a generalized additive model of game outcomes.

*Lock and Nettleton*: Probabilities generated via a random forest, as done by Dennis Lock and Dan Nettleton in the Journal of Quantitative Analysis in Sports, implemented with data from Armchair Analysis.

*Gambletron*: Created by Todd Schneider, Gambletron uses real time betting market data to impute probabilities.

Before we start, a particular thanks to PFR for this and all of their public work, Brian and Hank, Andrew, Ron and Maksim, and a student of mine (Derrick) for their help in either sharing or pulling in the data. I greatly appreciate their work and/or willingness to share. Sadly, not everyone was so helpful.* Additionally, note that the 6 models used 6 unique approaches, which demonstrate the variety of ways that people have thought about win probability.

Finally, R code – and predictions from a few models – are up on my Github page.

**How’d probabilities look in the Super Bowl?**

One interesting way to start is to visualize how each model viewed the Super Bowl. Here’s a chart of New England’s play-by-play win probabilities, using a different color for each set of predictions.**

*Super Bowl win probabilities (New England’s).*

For most of the game, there’s at least a 5% gap between New England’s lowest and highest projections, and at several points, the gap is as high as 10%.

With six unique models, it’s not surprising to see these differences, but I’d also argue that this type of variation is not an attractive property for disseminating win probability information.

**How big of a comeback was it?**

It’s obvious that by the third quarter, New England’s chances were slim. Of course, with probabilities clustered near zero, it’s a bit difficult to precisely identify differences between the models in the initial chart. So, I converted the probabilities to odds to get a better sense of how the models viewed New England’s comeback.

Here’s that chart, and I added dotted lines to identify the point in the game when each model gave New England its longest odds of a comeback.

*Odds of winning Super Bowl 51, relative to New England. Second half only shown.*

Here, gaps between models are more substantial, which is not surprising given that odds are not robust to small changes for probabilities near 0. At multiple points, *PFR* gave New England about a 1 in 1000 chance of winning (1000:1 odds) while projections from *Gambletron* (which is arguably serving a different purpose with its numbers) barely crossed 25:1.
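The sensitivity of odds near zero is easy to see with a quick conversion. A minimal sketch (the probabilities are illustrative, not any model’s actual output):

```python
def odds_against(p):
    """Odds against an event with probability p, expressed as 'x to 1'."""
    return (1 - p) / p


for p in (0.04, 0.01, 0.002, 0.001):
    print(f"{p:.3f} -> {odds_against(p):.0f}:1")
# 0.040 -> 24:1
# 0.010 -> 99:1
# 0.002 -> 499:1
# 0.001 -> 999:1
```

A shift of one tenth of a percentage point, from 0.2% to 0.1%, doubles the odds, so two models that look nearly identical on the probability chart can tell very different stories once converted.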

Thus, the wow-factor of the Patriots comeback depends on your source. If you choose *Gambletron*, it’s a one Super Bowl every two or three decades type of comeback. If you choose *PFR’s*, it’s a Super Bowl comeback we’ll only see once every millennium. From a communication perspective, this is a weakness of win probability models, and one that shows up frequently given that, for better or worse, people most often look at win probability charts after a major comeback.

Finally, it’s worth noting that the moments in the game when New England was given the longest odds of winning also differ, varying from midway through the third quarter to midway through the fourth. Indeed, your definition of how much of a comeback it was isn’t just limited by your choice of a model, but by your identification of which time in the game to start at, too.

**How about win probability added?**

Brian Burke makes an important point on the Ringer that win probability models are perhaps best used for understanding in-game decision making. Often, this is done by comparing win probability from one play to the next using a metric called win probability added (WPA).

In the Super Bowl, leaps or drops in New England’s WPA are also somewhat dependent on model choice. As an example, New England’s second play from scrimmage, a 9 yard completion on 2nd-10 from Tom Brady to Julian Edelman, helped the Patriots according to *PhD Football* (+3%) but hurt the Patriots according to *nflscrapR* (-2%).

Here’s a chart that compares each pair of models’ WPA.***

*Win probability added, contrasted between 5 NFL win probability models in Super Bowl 51.*

The figures in the bottom left show pairwise scatter plots between each models’ WPA, with correlations listed in the top right. Histograms of WPA (relative to New England) are shown on the diagonal.

There’s a moderately strong link in WPA between each pair of predictions; the strongest correlation coefficient is between *ESPN* and *PhD Football* (0.85), with the weakest between *PFR* and *nflscrapR* (0.45).

However, there’s still a decent amount of variability with respect to how each model sees the helpfulness of each play. For example, of the 125 plays shown, on fewer than half (57, or 46%) did all five models agree that the outcome either helped or hurt the Patriots. This is another humbling aspect of win probability models: there’s uncertainty not only in team chances at any one point in time, but also in how those chances change from one play to the next.
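The agreement count above comes down to two small steps: differencing each model’s win probabilities to get WPA, then checking whether the signs match on each play. Here is a sketch with two invented models and four invented time points; it is not the R code from the Github page.

```python
def win_probability_added(win_probs):
    """Change in win probability from each play to the next."""
    return [after - before for before, after in zip(win_probs, win_probs[1:])]


def plays_with_agreement(wpa_by_model):
    """Count plays where every model agrees the play helped, or agrees it hurt."""
    agree = 0
    for play in zip(*wpa_by_model.values()):
        if all(v > 0 for v in play) or all(v < 0 for v in play):
            agree += 1
    return agree


# Hypothetical win probabilities for the same team at four consecutive moments.
models = {
    "model_a": [0.50, 0.53, 0.49, 0.40],
    "model_b": [0.52, 0.50, 0.47, 0.41],
}
wpa = {name: win_probability_added(wp) for name, wp in models.items()}
print(plays_with_agreement(wpa), "of", len(wpa["model_a"]))   # 2 of 3
```

Note that the two models disagree on the first play here even though their win probabilities never differ by more than three percentage points.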

**So how can we know if a model is useful?**

Let’s take a break for a fun anecdote.

The typical NFL season has about 40,000 plays. Let’s imagine that you flip a fair coin 40,000 times to find the proportion of heads. We know the true probability of heads — it’s 50% — but if we use the results from our 40,000 flips, the average distance we can expect between our estimate of heads and the truth is about 0.2%. That is, we can’t predict a fair coin much better than +/- 0.2% in 40,000 trials. And if we can’t get precise probability estimates from coin tosses — which don’t have variables like the offensive team, defensive team, score, down, distance, spread, clock time, and timeouts attached to them — how can we expect our NFL win probabilities to be any more accurate?
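The 0.2% figure can be checked two ways: with the normal approximation (the expected absolute error is the standard error times the square root of 2/π) and by brute-force simulation. The seed and the number of repetitions below are arbitrary choices.

```python
import math
import random

N_FLIPS = 40_000

# Normal approximation: E|p_hat - 0.5| = sqrt(2 / pi) * standard error.
standard_error = math.sqrt(0.5 * 0.5 / N_FLIPS)
expected_abs_error = math.sqrt(2 / math.pi) * standard_error
print(f"{expected_abs_error:.4f}")   # 0.0020

# Simulation: each random bit is one fair coin flip.
rng = random.Random(51)
errors = []
for _ in range(2000):
    heads = bin(rng.getrandbits(N_FLIPS)).count("1")
    errors.append(abs(heads / N_FLIPS - 0.5))
simulated_abs_error = sum(errors) / len(errors)
print(f"{simulated_abs_error:.4f}")
```

Both routes land near 0.002, or about 0.2 percentage points.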

So, whether a win probability at a certain time is off by 10% or 0.2%, it’s off. Humbly, it’s why all models are wrong.

So how can we know if a model is useful?

Well, the best way to judge projections of any outcome is to use events that are yet to take place. For football games, this corresponds to using past games (termed training data) to derive predictions for future games (test data). If the probabilities are reasonable, those predictions should match future game outcomes.

So, that’s what I did using Lock and Nettleton’s random forest model. I started by using the 2005-2015 seasons as training data. Next, I sampled 5 plays in each quarter in each game from the 2016 season to use as test data (5340 total). Sampling plays in this manner will ensure that I have the same number of plays from each game (to weigh games equally) and that I haven’t overfit (there are no overlapping plays in the test and training data). It’s also how the Super Bowl projections above were made.

Here’s a chart of how well the Lock & Nettleton model predictions did in 2016, aggregated by quarter. I included points that average offensive team probabilities to the nearest 0.05, as well as the corresponding fraction of games in that bin when the offensive team won. The closer projections are to the diagonal line, the better. If you want to see the R code for this chart (and the ones above), see my Github page.

*Observed versus estimated win probability for a sample of 2016 NFL plays. Predictions derived from Lock and Nettleton’s random forest model. By and large, projections match reality, as demonstrated by the line of best fit roughly corresponding to the line y = x.*

In Lock and Nettleton’s model, results are fairly reasonable. Across most bins in most quarters, probabilities reflect reality. It’s not that projections are perfect – teams with low win probabilities in the first quarter win more often than expected, for example – but it’s difficult to identify any precise location where the model is off by more than what we’d expect due to chance alone. Third quarter probabilities, as an example, look particularly reasonable. I’d also argue that this model’s performance is more impressive given that no games from 2016 were used to train it, even though including them might have helped the model pick up on recent changes to the game.

Charts like the above don’t ensure that our probabilities are correct, as that’s impossible. Instead, they are there to provide warning signs if, for certain types of game situations, probabilities were off.****
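The binning behind a calibration chart like this one can be sketched in a few lines. This is not the R code from the Github page; it is a rough Python illustration, and the function name and toy inputs are made up for the example. Each prediction is assigned to the nearest multiple of 0.05, and the share of wins in each bin is then compared with the bin’s probability.

```python
from collections import defaultdict

def calibration_table(predicted, outcomes, bin_width=0.05):
    """Bin predictions to the nearest bin_width and report observed win rates.

    predicted: estimated win probabilities for the offensive team (0 to 1)
    outcomes:  1 if the offensive team went on to win, else 0
    Returns a list of (bin probability, number of plays, observed win rate).
    """
    n_bins = round(1 / bin_width)
    bins = defaultdict(list)
    for p, won in zip(predicted, outcomes):
        # Index of the nearest multiple of bin_width, kept inside [0, n_bins]
        idx = min(n_bins, max(0, round(p / bin_width)))
        bins[idx].append(won)
    rows = []
    for idx in sorted(bins):
        results = bins[idx]
        rows.append((round(idx * bin_width, 2), len(results), sum(results) / len(results)))
    return rows

# Toy inputs, purely hypothetical
toy_predicted = [0.11, 0.09, 0.52, 0.48, 0.51, 0.88]
toy_outcomes = [0, 0, 1, 0, 1, 1]
for bin_prob, n_plays, win_rate in calibration_table(toy_predicted, toy_outcomes):
    print(bin_prob, n_plays, round(win_rate, 2))
```

With real data, a well-calibrated model would show observed win rates close to each bin’s probability, which is what points near the diagonal mean.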

**Practical recommendations**

Given the above, here is a set of recommendations for those of us creating or citing win probabilities.

- Avoid over-precision. Using too many digits (e.g., 60.51%) belies the true difficulty of predicting unrealized outcomes in sports. Cap probabilities to the nearest percentage. (Excellent example: 538).
- Embrace uncertainty. Instead of “There’s a 2% chance”, use “There’s about a 2% chance” or “About 1 in 50.”
- Take extra care when presenting surprising results. It’s difficult to believe that New England’s comeback was a once a millennium type of result, but it was often presented as such.
- Model check, and share results. This is an easy thing to do, and it’s the only way to know if predictions are close to accurate.
- Update models over time. Sports leagues are ever-evolving (as examples, NBA teams shoot more 3’s and NFL teams pass more often than ever before), and so if a model isn’t updated over time, predictions could go from wrong to really wrong.

***************************

*I also emailed numberFire and asked for their projections. The response was as follows:

*Unfortunately we will not be able to share with you our predictive model. However you can review the perks from our premium services to see how it all works and what we have to offer. If you have any further questions do not hesitate to reach out to us.*

That’s bullshit. The chart’s literally right here, with the probabilities shown when hovering over. Those probabilities are shown to four decimal places.

**I dropped overtime plays. There’s enough extrapolation in win probabilities as it is, and extending to rarely played overtime events seems unwise. Additionally, note that there may not be perfect alignment in the charts with respect to Gambletron’s data, which works by real time (and not clock time).

***This chart only shows runs and passes, as there were too many irregularities in how each model ordered and timed special teams plays. Gambletron is not shown given that its time stamps reflect real time, and not clock time.

*****One of my goals this summer will be to make Lock & Nettleton’s model more public, but I’ll need to check with the authors, first. It’s a fairly reasonable model to fit in R, and it would be great to have an NFL win probability Shiny app where those unfamiliar with R could enter in constants to get probabilities.

*Note: An earlier version of this post pointed to possible limitations of PFR’s win probability model. However, after some offseason tuning, things appear to be more readily in order.*

]]>

It’s called the replication crisis, and it’s an issue that has challenged psychology, engulfed economics, and been identified as a disease in a field full of them (medicine).

One area where replication has not been widely discussed is sports analytics, which, while more limited in scope than the disciplines listed above, takes center stage at this weekend’s Sloan Sports Analytics Conference (SSAC), with more than 3000 practitioners, fans, and professional staffers gathering in Boston.

One of the more attractive aspects of SSAC is its research paper contest, which generally features outstanding papers, provides researchers widespread press for their work, and awards a top prize of $20000 to one submission. As a result, it was with optimism that I read that in 2017, SSAC would be doing its best to ensure the validity of its contest submissions.* Via the rules page, “research will be evaluated on but not necessarily limited to the following: novelty of research, academic rigor, and reproducibility.”

Specifically, for reproducibility, the conference asks: “Can the model and results be replicated independently?”

This is an important definition, and one that mimics the work of Prasad Patil, Roger Peng, and Jeff Leek, who recently went to extensive lengths to precisely define both reproducibility and replicability. Argue Patil et al: research is *reproducible* if a different analyst can generate the same results using the same code and data, and research is *replicable* if a different analyst can obtain consistent estimates when recollecting the data and re-doing the analysis plan. In other words, look for data and code, and ideally you’ll see both.

So, how did the 2017 finalists fare by these definitions?

Not great.

Here’s a chart summarizing the 2017 contest.** Each paper is identified by keywords from its title, and the columns reflect the data source, whether or not the data is (or appears to be) publicly available, if code was provided, and whether or not the overall paper is, by definition, *reproducible*. Note that two papers are yet to be posted on the SSAC site.

*Summary** of the 2017 Sloan research papers, including data source, if the data is publicly accessible, if code is provided, and if the paper meets the definition of reproducible*

Of this year’s 21 listed finalists, fewer than half cite publicly available data that could be used by outsiders, as most submissions use proprietary data or do not give sufficient detail behind how the data was gathered. Even among those obtaining public data, however, only two (the Lahman database in an MLB paper, and a Google Doc from an NHL paper) are accessible without writing one’s own computer program (note that the scrapers to obtain the data were also not shared) or doing extensive searching. At best, five or six papers boast any chance of being *replicable*, which, sadly, is only a few more than the number of papers that don’t share any information about where their data came from.

As for code, only Adam Levin, writer of a PGA tour paper, shared some (link here). Adam also deserves credit as his data is available from ShotLink with an application. In fact, that application is as close as we get in the SSAC contest to *reproducibility*. With publicly shared passing project data, Ryan and Matt’s NHL paper would appear to be the next closest. Additionally, a separate NHL paper made reference to code, but none was shown on the author’s website.

There are several consequences to the lack of openness. First, it increases the chances of mistakes. While most of these errors have likely been innocuous, there’s no way of knowing what’s real and what’s bullshit at Sloan, which means that the latter is sometimes rewarded. As one example, a 2015 presentation showed an impossible-to-be-true chart about profiting on baseball betting, capped with a question-and-answer session in which the speaker handed out free tee-shirts.*** Next, it stunts growth of the field, which is a shame because, as Kyle Wagner wrote, sports analytics has been stuck in the fog for a few years running. Finally, while citations aren’t the end-goal for many SSAC paper writers, the lack of reproducible research means lower chances of papers being referenced in the future.

SSAC likes to point out that it’s a pioneer in its domain. Given that the growth of sports analytics is in the best interest of everyone in attendance, I’d recommend that the conference either start enforcing one of the criteria it claims to look for, or lose the disguise that it cares about properly advancing and vetting research.

* In full disclosure, I’ll note that I was part of a paper with Greg & Ben (code and details here) that was rejected from the 2017 contest.

** If I’ve made a mistake in the table, please let me know and I’ll update. There may be links or explainers that I missed.

*** If you were making money by betting on sports, the last thing you’d do is get up on stage at a famous conference and tell anyone about it.

*Note*: Thanks to Gregory Matthews for his help with this post

]]>

Turns out, NHL players may not be the only ones who come to the defense of their own; there’s a decent chance that its referees do, too.

Let’s return to January of 2016, when Calgary’s Dennis Wideman leveled linesman Don Henderson with a vicious cross-check.

Despite — or perhaps given — his reputation as a high character player and the fact that post-concussive symptoms may have played a role in his mental state, Wideman was given a 20-game suspension, eventually returning to Calgary in March of last season.

But losing Wideman for 20 games wasn’t the only way in which Calgary felt the impact of the Wideman hit. From that game on, the Flames also found themselves in the penalty box significantly more often than beforehand.

Using data from Micah McCurdy (as well as some visual inspiration), I plotted the number of taken and drawn non-matching minor penalties in all Flames games since the start of the 2014-15 season. This includes Calgary’s 129 regular season games prior to and the 92 games since the Wideman hit.

Each game is represented by a point, and the curved lines reflect local polynomial regression curves, shown separately for games before and after the hit (along with errors).

A few things stand out.

Prior to the Wideman hit, Calgary was consistently called for about one fewer penalty per game than opponents. However, while the rate of penalties drawn by Calgary has remained fairly consistent over the last 2+ seasons (shown in grey), after Wideman’s hit, there’s an immediate bump in penalties taken by Calgary (red). Comparing games pre and post-hit, the Flames jumped from 2.1 to 3.3 non-matching minors per-game. That’s … substantial, and a practically significant increase.

As additional evidence, we note that prior to the Wideman hit, roughly 1 in 10 Calgary games included no taken penalties. In the 92 games after the Wideman hit, the Flames only had one such game. Moreover, the jump in Calgary’s penalties corresponded with a *drop* in the league-wide infraction rate.

In addition to the comparison of the curves above, we can assess the significance of the Flames’ increase using the Poisson distribution. Initially linked to hockey more than a decade ago by Alan Ryder, the Poisson distribution is appropriate for penalty outcomes given the fixed amount of time in each game and the discrete counts. Sure enough, the 55% rate increase is statistically significant when comparing mean penalties pre and post-hit, and it is quite unlikely that the difference could be accounted for by chance alone. For those scoring at home, the p-value is less than 0.0001, and the 95% confidence interval for the rate increase goes from 19% to 106%.
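For readers who want to try a Poisson comparison of this kind, here is a minimal sketch. It is not the analysis used in this post. It assumes a Wald interval on the log of the rate ratio for two independent Poisson counts, and the counts are back-of-the-envelope reconstructions from the rounded per-game means (2.1 and 3.3) rather than game-level data. It therefore need not reproduce the interval quoted above, which came from the actual data and may reflect a different method.

```python
import math

def poisson_rate_ratio(count_pre, games_pre, count_post, games_post, z=1.96):
    """Compare two Poisson rates using a Wald interval on the log rate ratio."""
    rate_pre = count_pre / games_pre
    rate_post = count_post / games_post
    ratio = rate_post / rate_pre
    # Standard error of log(ratio) for two independent Poisson counts
    se_log = math.sqrt(1 / count_pre + 1 / count_post)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    # Two-sided p-value from the normal approximation
    z_stat = math.log(ratio) / se_log
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return ratio, (lower, upper), p_value

# ASSUMPTION: illustrative totals rebuilt from rounded means, not the real counts
count_pre = round(2.1 * 129)   # minors taken in the 129 games before the hit
count_post = round(3.3 * 92)   # minors taken in the 92 games after the hit

ratio, (lower, upper), p_value = poisson_rate_ratio(count_pre, 129, count_post, 92)
print(f"rate ratio: {ratio:.2f}, 95% CI: ({lower:.2f}, {upper:.2f}), p = {p_value:.2g}")
```

A ratio above 1 with an interval that excludes 1 points the same direction as the result described above: more penalties taken per game after the hit than chance alone would easily explain.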

Assuming we can rule out luck (or bad-luck) of the draw, what does this suggest?

- Officials are implicitly making more calls against Calgary to get revenge. This is more feasible given how much subjectivity is involved in several NHL violations. We already know that refs are prone to make-up calls, and that they base penalty decisions on other silly factors – so we shouldn’t be surprised that they’d take a measure of revenge, either. Wideman’s hit was egregious, and refs may be punishing his team for it.
- Another variable is responsible for the jump, but we don’t know what that variable is. As a related sports example, a few years back, an NFL analyst argued that time-varying Patriots fumble rates were a sign that New England was cheating. What was missing in the initial analysis is that there were several *other* reasons (e.g., more kneel downs, red zone plays, plays with the lead) driving the Pats’ low rates. Indeed, part of the reason why the Patriots didn’t fumble is because they were running plays that generally did not lead to fumbles. Could we be missing a similar confounding variable here, one that is artificially responsible for the increased penalties? Maybe. I’m open to ideas. In this respect, it’s particularly interesting that Calgary’s drawn penalties have stayed the same. The Flames don’t appear to be playing a more aggressive game since January of last year.
- Calgary wasn’t the same team after the Wideman suspension. This would stand if the jump in penalties matched the length of the Wideman suspension. In other words, perhaps Wideman’s replacements were aggressive players. However, Wideman was suspended 20 games, and the spike in penalty calls has lasted much longer.

As one additional sign that (1) is responsible for part of Calgary’s jump in taken penalties, it’s worth revisiting the chart. If you look at the last month of play, Calgary’s taken penalty average has dipped.

At more than one penalty a game across more than a full season of play, it’s easy, albeit unsafe, to extrapolate and estimate that Calgary’s jump in penalties was worth about 20 goals against. This is an incredible total. Even if only part of Calgary’s increase in penalties was due to a revenge factor, the biggest impact of Wideman’s hit on the Flames wasn’t felt in his suspension, but in the penalty box.
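The extrapolation can be written out as a short calculation. The per-game jump (3.3 minus 2.1) and the 92 games come from the numbers above. The power-play conversion rate is an assumption added here for illustration: roughly 18%, in the neighborhood of a typical league-wide rate, and not a figure from the data used in this post.

```python
def extra_goals_against(extra_minors_per_game, games, power_play_conversion):
    """Rough estimate of goals allowed because of additional minor penalties."""
    extra_minors = extra_minors_per_game * games
    return extra_minors * power_play_conversion

# 3.3 - 2.1 = 1.2 extra minors per game over the 92 games since the hit
# ASSUMPTION: opponents score on about 18% of power plays (illustrative value)
estimate = extra_goals_against(3.3 - 2.1, 92, 0.18)
print(f"about {estimate:.0f} extra goals against")  # about 20
```

Changing the assumed conversion rate moves the estimate proportionally, which is one reason the extrapolation is unsafe.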

*Postscript: A loyal reader asked me to compare the rates of all teams, both pre and post Wideman hit. Here’s that chart, using data from the nhlscrapr package in R.*

*We can also look at all team-seasons across several years to get a distribution of changes in penalty rates before and after when Wideman’s hit occurred (roughly halfway through the season, on January 27). Here’s a histogram of those differences. No one’s near Calgary in 15-16.*

]]>

In the past decade, sports analytics moved from the fringes of popular consciousness to the mainstream. The typical media narrative tells us that data is changing the game. To some extent, that’s true. The majority of professional teams in the five major sports leagues have at least one person on staff or on retainer tasked with delving into details and applying numbers to performance, and nearly all NBA, NHL and MLB franchises have sent at least one representative to the Sloan Sports Analytics Conference.

Noah and I wanted to find out more details about the job, the lifestyle and how analytics are being used, so we developed an informal survey and asked people who work or had worked on the sports analytics staffs of professional teams to participate. We used a combination of social media and personal email to contact staffers who we knew worked with teams or who were mentioned in ESPN’s analytics feature. A total of 61 respondents answered questions anonymously. A pdf of the survey can be found here. We also interviewed a half-dozen by phone, either on or off the record, to get anecdotes about what it’s like to be part of a professional franchise.

Some of the responses were predictable. Our survey wasn’t perfect, and our sample isn’t necessarily representative of the industry, but the respondents were 95 percent male, 95 percent white, and 92 percent fell into the 19-to-45 age group. Additionally, respondents from Major League Baseball teams report the highest average number of full-time staffers working for their teams – 3.6 – compared with roughly two apiece per team in the NFL, NHL, and NBA. (These numbers could be skewed because teams without any current or former analytics staffers would not be able to respond to the survey.) That makes sense, considering that everyone we spoke with believes that baseball teams are generally the furthest along in the development of their analytics departments. But professional teams across all major sports are increasingly investing in their analytics departments, slowly but surely adding to their budgets as executives place more trust in numbers.

Respondent breakdown:

NBA 18

MLB 16

NHL 7

NFL 6

Professional soccer 5

Other/multisport 4

(Five respondents left “sport” blank.)

**For the love of the game**

Working for a team isn’t a 9-to-5 job. The hours are long, with analytics staffers reporting that they work anywhere from an average of 53 hours per week in soccer to 66 in the NHL, with MLB, the NFL and the NBA averaging 60. (One MLB staffer reported working 95 hours per week.) “There are no holidays,” Bill Petti, a consultant for a number of MLB teams, said about the life of an analytics staffer. “You’re working nights. You’re sitting in the office at 10 p.m. in case somebody has a question.”

Aaron Barzilai, who worked as director of analytics at the Philadelphia 76ers until last February, agreed. “Salaries are depressed. You can’t be working for a team if you don’t love it because you need to be getting some psychic benefit from working for a team.”

Salaries are decent and occasionally well above those of the typical white-collar worker. According to our survey, medians ranged from $75,000 (NFL) to $100,000 (MLB) per year. Three respondents said their annual wages were greater than $200,000. For most, it’s not a bad living, but consider that nearly anyone with the skills to get one of the few jobs as an analytics person on a professional sports team could also make significantly more working at Google, Facebook, Microsoft, or dozens of other firms. This leads to lots of turnover, especially at more junior positions, which are lower paid and offer little, if any, opportunity to advance into a more senior role because those jobs rarely become available. Many recent college graduates work for teams for a few years before moving on to tech firms or other companies where they can earn more for their skill set.

**Determining value**

The people we spoke with said that teams undervalue their analytics staff and invest accordingly. While employees anywhere would say that their departments deserve more money and more resources, it’s not unreasonable to think that more computing power or another data set could produce results that were cheap by comparison. “They are spending hundreds of millions on players, tens of millions on coaches and staff, but $10,000 is a large expenditure to get a computer or some data,” a sports analytics expert who’s worked with NBA teams said. “It’s ridiculous. It’s two different budgets.” (For what it’s worth, one study found that the average price of an MLB win is $1,016,674 in player salary, an NBA win is $1,572,768, and an NFL win is $11,878,369.)

As some teams mature and develop systems to handle routine reporting like data gathering, they may be able to think about building teams to handle some of the other stuff. Those that don’t will find themselves hitting a wall. In the past, teams could get away with having Excel, a computer and a staffer or two. Advances in sports analytics overall mean that groundbreaking work requires increasing talent levels, computing power and time to experiment.

One NHL analyst told us about this dream staff:

*“You need a couple different people unless you are a one-man team willing to put in 20-hour days. Just the data handling alone — the NHL is pretty archaic in their data to begin with — is one full-time person. If you wanted to expand from coaching and tactical to GM trades and to the draft/strategic long term, you need one data person, two or three analytical people whose job is formulating and communicating analysis, and then I would have two or three developers working on dashboards and tool sets. If you’re going to give a GM a sheet of paper with a recommendation, he may or may not pay attention. But if you give him a tool that he can play with, then it’s not your idea. It’s his idea.”*

A staff of one wouldn’t have the resources to develop and build that tool.

But rather than hire a number of positions, many teams still seek unicorns. Teams want one person who can fill all of those roles and then also have the skills to communicate results to others. This Marlins job posting asks for an intern with scripting ability, database management and statistical proficiency. Those tools sometimes overlap, but an actual combination of all three is rare, and most people with such skills can make a substantial amount of money elsewhere. And keep in mind, baseball is *ahead* of most other sports. A college student with this background can get paid $7,200 or more a month with housing to work at Microsoft or Facebook, or make $12.50 an hour for the Marlins.

**Nothing matters if no one is listening**

Finally, analytics can only be effective if the decision makers use what they are given. “There seems to be too much focus on results, and not enough focus on the quality of the process,” one respondent said.

Some analysts expressed concern that teams didn’t pay attention to their work.

“If the GM or president of basketball operations doesn’t want to read the results, it doesn’t really matter how talented your Ph.D. in the basement is,” said Barzilai. “You can have organizations that are using analytics well even if they aren’t doing cutting-edge analytics just by relying on what people might think of as fundamental analytics or the stuff that was coming out five years ago or stuff that is public.”

Another added that it’s frustrating: “When the numbers are so overwhelmingly in favor of one decision and it doesn’t happen due to someone’s feelings about public perception or a ‘well, it’s always been this way’ attitude.”

Consider The New York Times’s Fourth Down Bot, a simple formula that tells readers when teams should go for it on fourth down. The bot believes that coaches are too conservative, and would universally benefit from going for it more often. If a coaching staff listened to the bot, they’d benefit in the long term.

Generally, the decision makers in the front office are more receptive than others to input from the analytics staffers. In the NBA, NFL and NHL, at least 50 percent report weekly correspondence with the general manager, while just 20 percent in MLB do. Just 10 percent of the overall sample says they have weekly correspondence with players.

Team officials say they are continually working to refine the processes, to incorporate all the information they receive. “We don’t see ourselves as having an ‘analytics team’ or ‘process of incorporating analytics,’” an assistant general manager of an NHL team said. “We look at the best information we have when making decisions, and everything’s assimilated similarly whether it’s one scout’s eye test or another (or even the same) scout’s or an analyst’s tracking data or historical comparisons that some might call ‘analytics.’”

Some teams are better than others at applying what they learn. According to our respondents, the Spurs (NBA), Maple Leafs (NHL), Dolphins and Browns (NFL) are leading the analytics push in the less-than-quick-to-adapt leagues, with the Browns the most open about their ambitions. Said one: “A team like Toronto is doing it by the book in terms of how analytics should be impacting teams. They are … eating everyone’s lunch. It will pay off in the next two to five years and teams will start to say wow, look at their talent pool. Teams will start copying them.”

Despite analytics’ increasing role and media attention, it’s still early days. There are algorithms to write, data to dissect and knowledge to create. We asked respondents to say what percentage of the most important questions in their sport have been answered. The results, averaged by sport:

MLB – 56 percent

NBA – 38 percent

NHL – 32 percent

NFL – 31 percent

Soccer – 17 percent

Sports analytics has developed a great deal in the past few decades, but there’s still plenty more to discover.

]]>

The photos begin after page 177. While perfect for putting names to faces, the insert’s location also means that if you read too fast, you’ve missed my favorite part of the book.

In the pages before and after the pictures, Lindbergh and Miller link their situation in running the Stompers’ season to quotes made by Huston Street, the Angels closer who was asked about pitching in non-save situations.

“I’ll retire if [pitching in non-save situations] ever happens,” Street is quoted as saying. “It’s a ridiculous idea, it really is.”

The quote hits home for Lindbergh and Miller, who, until that point in the season, had been similarly using their closer Sean Conroy in only save situations. Although Lindbergh and Miller had wanted to bring Conroy in during any high leverage spot, the team’s manager, Feh Lentini, was to that point resistant. Lindbergh, author of this particular chapter, writes [emphasis mine]:

*That last comment really rankles, because it cuts to the heart of why we’ve come to Sonoma: to put “on paper” ideas like this into practice. Thus far, it seems as if those who side with Street are right. Not because the idea of less restrictive roles doesn’t work — it did work, before league bullpens became hyper-specialized — but because everyone is so convinced it wouldn’t work that they aren’t even willing to try it. The idea is disqualified because it can’t pass a test that no one will allow it to take.*

Street’s comments weigh on Lindbergh for days, leading to friction with Feh, eventually ending with the manager’s firing halfway through the Stompers season.

Altogether, Lindbergh and Miller’s entire book is worth reading. It’s the game all of us stats-nerds wish we had the chance to play – actually calling the shots, instead of just commenting about them. And the authors make the characters and events personal. You’re rooting for Conroy, you’re rooting for a random infield shift to work, you’re rooting for Christopher Long’s spreadsheet to spit out the best possible players, and you’re rooting for a Stompers title, all things you otherwise wouldn’t have known existed.

But it was Lindbergh’s summary of Street’s comments – and the personal and oft-tempestuous back and forth between baseball tradition and analytics strategy — which has stayed with me months after reading their book.

Indeed, statistical findings have changed the way in which teams across baseball, and to a lesser extent other sports, have assessed free-agents, implemented game-day strategy, and drafted future players. Stat-heads are also a relative bargain, as Rob and Ben suggested here, boasting strong ties to future improvement.

It’s not that one can guarantee that implementing analytical strategies will lead to success, for the same reason that Lindbergh and Miller couldn’t guarantee that every infield shift would work to the Stompers’ advantage. Moreover, not every statistical thought that is once assumed true will end up being correct. But it’s the actuality of implementing a test – the trying of something to see whether or not it will work, and the ability to live either way with the consequences – that eats at us in sports analytics on a near daily basis. Don’t disqualify an idea because you’re not willing to try it.

And as a result, it’s when the most basic of tests can’t be taken that frustration boils over.

Statistics can beat the smartest of us in chess and poker, know what friend you’ll connect with on Facebook, predict what you’ll buy on Amazon, and finish what you are searching for on Google. If implemented properly, it will also allow sports teams to make better decisions across nearly all facets of the game.

It just needs the ability to take the test.

*You can read more about Ben and Sam’s book here, or buy it on Amazon here.*

]]>

The 2016 NFL regular season has ended, and with it has come the usual coaching carousel in which many franchises have opted to fire their head coach.

As of January 2nd, six of the league’s 32 teams have openings, with five of those coming by way of a fired predecessor (Denver’s retiring Gary Kubiak being the lone exception). But it’s not like 2016 is any type of outlier; roughly 4 coaches per year have been canned since the early 1980’s.

What’s interesting, though, is that despite the frequent, franchise-altering decisions made across the league, it’s mostly unknown whether or not this choice benefits long-term franchise prospects. (*Postscript: Today, Brian Burke looks at the identical question here, finding similar answers to what I find below.*) As one exception to the rule, one study found that sacking a manager in soccer offered no tangible benefit to the future performance of a club.

So, does firing a coach cause teams to improve?

The point of this blog post will be to look back at past firings, and to use some standard causal inference tools to help us identify if the choice of whether or not to fire a coach has been a helpful one.

Estimating causes and effects when it comes to coach firings, unfortunately, is in no way straightforward.

The easiest strategy would be to compare the performance of franchises who fired their coaches in the seasons pre and post firings. For example, since 1982, the 130 teams who have fired their coach (using end-of-season firings) boasted an average improvement in winning percentage of 0.10, or the equivalent of about 1.6 games in a 16 game season. That’s a notable and statistically significant improvement.

Of course, that simple strategy is also a misleading one. The teams who got rid of their head coach only averaged about 5 wins per season prior to the firing, so on account of reversion towards the mean, we would’ve expected most of these teams to improve, anyways.
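A small simulation shows how strong reversion towards the mean can be on its own. This is a deliberately extreme sketch, not a model of the NFL: every simulated team is assumed to be exactly average (a 50% chance to win each of 16 games), so nothing about a coach or roster differs between teams. The function name, the 5-win cutoff, and the seed are illustrative choices.

```python
import random

def simulate_regression_to_mean(n_teams=20_000, games=16, cutoff=5, seed=1):
    """Simulate two seasons for equally skilled teams (true win probability 0.5).

    Returns the average wins in season one and in season two for the teams
    that won `cutoff` or fewer games in season one.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_teams):
        wins_one = sum(rng.random() < 0.5 for _ in range(games))
        wins_two = sum(rng.random() < 0.5 for _ in range(games))
        if wins_one <= cutoff:
            # Keep only the teams that looked bad in season one
            first.append(wins_one)
            second.append(wins_two)
    return sum(first) / len(first), sum(second) / len(second)

before, after = simulate_regression_to_mean()
print(f"season one: {before:.1f} wins, season two: {after:.1f} wins")
```

In this setup the bad-looking teams bounce back to roughly 8 wins with no changes at all. Real teams do differ in talent, so actual reversion is only partial, but the direction is the same, and it is why a raw before-and-after comparison overstates the benefit of a firing.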

We can and should do better.

Let’s introduce some causal inference lingo.

In an ideal world we’d observe two outcomes, (i) the future performance of a team that fired its coach and (ii) the future performance of that same team that kept its coach. These are termed potential outcomes, and if we knew both potential outcomes, it would of course be easy to pick an optimal strategy.

Alas, short of building a time machine, knowing both potential outcomes is infeasible, and we’re only left with knowing the path chosen by each franchise. This is what’s known as the fundamental problem of causal inference; we want to be able to contrast an observed outcome with something that can’t be observed.

As it turns out, this also makes causal inference a missing data problem – the missing data is the missing potential outcome. In our case, for a team that fired its coach, the missing outcome is the path that would have been observed had that team kept its coach. Likewise, for a team that kept its coach, the missing outcome is what would’ve occurred had the coach been canned.
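In symbols, the setup can be sketched as follows. The notation is the standard potential outcomes convention and is introduced here for illustration: $T_i = 1$ if team $i$ fired its coach and $T_i = 0$ otherwise, and $Y_i(1)$ and $Y_i(0)$ are that team’s future performance with and without a firing.

```latex
% Effect of a firing for team i: never directly observable,
% because only one of the two potential outcomes is realized
\tau_i = Y_i(1) - Y_i(0)

% What we actually observe for team i
Y_i^{\text{obs}} = T_i \, Y_i(1) + (1 - T_i) \, Y_i(0)
```

Whichever potential outcome does not appear in $Y_i^{\text{obs}}$ is the missing data.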

Causal inference tools, initially stemming from Jerzy Neyman’s work in the 1920’s with randomized designs, have become quite popular for estimating these missing potential outcomes. Under certain – but important – assumptions, if we can estimate the missing potential outcomes, we can likewise estimate the causes and effects of a treatment, including those from observational data.

The most popular causal tools are individual or full matching, subclassification, and weighting, each of which has its own strengths and weaknesses. In the sections below, I’ll give an overview of how to use 1:1 matching with a data set of NFL coach firings.

If you are in search of a broader look of causal inference tools, I’d start with Elizabeth Stuart’s excellent review in Statistical Science.

The data I’m using comes courtesy of Harrison Chase and Kurt Bullard, former and current members of the Harvard Sports Analytics Club. Along with Harvard professor Mark Glickman, Harrison helped write an article on coaching turnover in sports, published recently in Significance Magazine. In addition to their data, their model assessing when teams fire their coaches was the impetus behind this post.

Harrison and Mark used a combination of logistic regression and classification trees to fit a model of coach firing (Yes/No) as a function of several team-level covariates. Their final model, chosen from roughly 25 candidate covariates, includes, but is not limited to, each team’s past win percentage, divisional win percentage, the coach’s experience, strength of schedule in the prior season, number of rings that the coach averaged, and whether or not the team also experienced a GM change.

Using their final variables, I used logistic regression to model each coach firing decision between 1982 and 2015. Here are the fitted probabilities from that model, separated by the teams that did and did not fire their coach. Points are jittered to account for overlap.

Altogether, the chart isn’t surprising. Most teams in most years aren’t firing their coach, and these teams are shown in the cluster of points in the top left of the graph. Meanwhile, teams that fire their coach tend to have predicted probabilities evenly spaced between 0 and 0.9.

At this point, we know what we suspected to begin with; the teams that fired their coach are, by and large, different from those that did not. This is a problem for most statistical tools. Basic comparisons like *t*-tests wouldn’t be able to account for these baseline differences, and even regression adjustment would be prone to bias given that the two groups (teams that fired their coaches and those that didn’t) are different from one another on several of the covariates that we would want to use in a model. Moreover, regression would be sensitive to model choice, and like most applications of statistics, the *true* model specification is unknown.

Here’s where causal inference comes in.

The probabilities depicted above are examples of propensity scores, defined as the conditional probability of receiving a treatment (in our case, of a team firing its coach) given the observed covariates. A nice property of propensity scores is that if two teams have the same propensity score, they also have, in expectation, the same distribution of observed covariates. This is really important. More technically, the distribution of covariates, conditional on the propensity score, is independent of whether or not a team chose to fire its coach.
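For readers who want to see the mechanics, here is a rough Python sketch of estimating a propensity score with logistic regression. It is not the model from this post: it uses a single made-up covariate (prior win percentage) instead of the full covariate set, fits by plain gradient ascent rather than a statistics library, and the helper names `fit_logistic` and `propensity` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, n_iter=5000):
    """Fit logistic regression by batch gradient ascent on the log-likelihood.

    X is a list of feature lists, y a list of 0/1 treatment indicators.
    Returns [intercept, coef_1, ..., coef_p].
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            pred = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
            err = yi - pred
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity(beta, xi):
    """Estimated probability of treatment (firing) given covariates xi."""
    return sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))


# Made-up data: prior-season win percentage and whether the coach was fired.
win_pct = [[0.25], [0.31], [0.38], [0.44], [0.50], [0.56], [0.63], [0.69]]
fired = [1, 1, 0, 1, 0, 0, 1, 0]

beta = fit_logistic(win_pct, fired)
scores = [propensity(beta, x) for x in win_pct]
```

In this toy data, teams with worse records get higher estimated firing probabilities, which is the same qualitative pattern as in the chart above.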

The next critical part of propensity scores ties back to our potential outcome notation from earlier. Let’s assume that, conditional on the propensity score, the distribution of the set of potential outcomes is independent of treatment assignment (whether or not a team fired its coach), an assumption known as unconfoundedness. In other words, if I can find two teams with the same coach firing probability, where only one team actually fired its coach, the difference in those teams’ outcomes is an unbiased, unit-level estimate of the effect of firing a coach. Moreover, I can aggregate those differences across groups (say, every team that fired its coach) to provide an estimate of the causal effect of firing a coach (in this case, the effect of firing a coach among teams that actually fired their coach).
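As a small illustration of that aggregation step, the Python sketch below averages the within-pair differences across matched pairs, which estimates the effect among teams that fired their coach. The pairs and the function name `att_from_matched_pairs` are made up for illustration; they are not results from the coach firing data.

```python
def att_from_matched_pairs(pairs):
    """Average of (fired team's outcome - matched kept team's outcome)."""
    diffs = [y_fired - y_kept for y_fired, y_kept in pairs]
    return sum(diffs) / len(diffs)


# Made-up next-season win percentages: (team that fired, its matched team that kept).
pairs = [(0.375, 0.438), (0.500, 0.438), (0.313, 0.375)]
att = att_from_matched_pairs(pairs)
print(round(att, 3))
```

Each difference is only a fair comparison if the two teams in a pair really did have similar firing probabilities, which is what the matching step is meant to deliver.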

These properties of propensity scores have made them widely applicable in fields like economics and government. There haven’t been many applications to sports, however, much to my chagrin.

The propensity score allows us to estimate the missing potential outcome that we don’t observe.

One way of doing this is to use matching, in which subjects receiving the treatment (those that fired a coach) are matched to those that didn’t. Using the `Matching` package in R, I matched teams that fired their coach to those that didn’t. Here’s the same plot as above, only now I use different colors (and shadings) to reflect observations that were and were not matched.

A few things to point out.

First, I used 1:1 matching with replacement, meaning that each coach who was fired (bottom row) was matched to one who wasn’t (top row), but it was possible for a coach who was kept to be matched to more than one coach who was fired. Second, the set of coaches with a high probability of being fired who were actually fired (bottom right, in red) ended up not being part of my matched cohort. By and large, this is a good thing; there was no coach that was kept with a corresponding probability of being fired, and inference to this set of coaches would require extrapolation.
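The two points above can be sketched in a few lines of Python. This is not the `Matching` package and not necessarily the rule used for the plot: it is a simple nearest-neighbor match on the propensity score, and the caliper of 0.1 (the maximum allowed distance between matched scores) is an assumption chosen only to show how fired coaches with no comparable kept coach drop out. The scores are made up.

```python
def match_with_replacement(treated_ps, control_ps, caliper=0.1):
    """1:1 nearest-neighbor matching on the propensity score, with replacement.

    Returns, for each treated unit, the index of its matched control,
    or None when no control lies within the caliper.
    """
    matches = []
    for ps_t in treated_ps:
        # Controls are never removed from the pool, so they can be reused.
        j = min(range(len(control_ps)), key=lambda k: abs(control_ps[k] - ps_t))
        matches.append(j if abs(control_ps[j] - ps_t) <= caliper else None)
    return matches


# Made-up propensity scores for teams that fired (treated) and kept (control).
treated = [0.15, 0.40, 0.41, 0.85]
control = [0.05, 0.12, 0.38, 0.45]

matches = match_with_replacement(treated, control)
print(matches)  # [1, 2, 2, None]
```

The control at index 2 is used twice (matching with replacement), and the treated team with a score of 0.85 goes unmatched because no kept coach had a similar firing probability.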

But matching alone is not sufficient for inferring causes and effects. The next step is to make sure that the matching has done its job. Specifically, matching only works if the subjects matched to one another boast similar distributions of the observed covariates.

There are several ways to analyze covariate balance, and one of the more common ones compares the standardized bias for each covariate between the treatment groups, done for both the pre- and post-matched observations. Large values of standardized bias are bad – generally, the recommended cutoff for justifiable inference is 0.25 – and reflect groups that are not similar to one another.
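Here is a minimal Python sketch of that balance check. It assumes one common definition of standardized bias, the difference in group means divided by the standard deviation in the treated group; other definitions use a pooled standard deviation. The win percentages are made up.

```python
from statistics import mean, stdev


def standardized_bias(treated, control):
    """Difference in means, scaled by the treated group's standard deviation."""
    return (mean(treated) - mean(control)) / stdev(treated)


# Made-up prior win percentages.
fired = [0.30, 0.35, 0.40]
kept_all = [0.45, 0.55, 0.60, 0.65]   # every team that kept its coach
kept_matched = [0.31, 0.35, 0.40]     # only the kept teams chosen by matching

bias_before = standardized_bias(fired, kept_all)
bias_after = standardized_bias(fired, kept_matched)
print(abs(bias_before) > 0.25, abs(bias_after) < 0.25)  # True True
```

Scaling by the treated group's standard deviation in both comparisons keeps the before and after numbers on the same footing.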

Here are the pre- and post-matched absolute standardized biases with our matched cohort.

Each dot above reflects a variable from our logistic regression model (those recommended by Harrison and Mark). For example, the standardized bias of team win percentage (abbreviated as `win_p`) was roughly 1.5 in the pre-matched set of teams; after matching, the bias dropped below 0.15. In fact, the absolute standardized bias of all variables was sufficiently close to 0 after matching. This is a good thing; it entails that within our matched subset, teams that did and did not fire their coach are similar to one another (similar winning percentages, rings, GM changes, etc).

One important thing to point out is that the actual fit of the propensity score model is less important than the balance that is achieved: in other words, I’m worried less about things like collinearity and model fit statistics than I am about how similar the subjects matched to one another are. In our example, it looks like the teams that fired their coach and the ones that didn’t who ended up in our matched cohort are sufficiently alike.

Notice that I have yet to mention any observed, team-level outcome. This is not by accident; indeed, the above steps are considered to be the design phase of causal inference, done without looking at any outcome data.

The second step of causal inference is the analysis phase, in which the outcome of interest is contrasted within the matched cohort. For our purposes, I used the team’s winning percentage in the year following the firing or keeping of a coach.

There are a few reasonable approaches to estimating the effect of coach firings on future winning percentage. One oft-recommended option is to use regression and matching together. Writes Stuart, “matching methods should not be seen in conflict with regression adjustment and in fact the two methods are complementary and best used in combination.”

With future team win percentage as my outcome, I fit a multiple linear regression model with coach firing (yes/no) and 10 other predictors as covariates; these were the same 10 used by Glickman and Chase in their model of coach firings.

As it turns out, there is no evidence that firing a coach causes future success; if anything, the association runs the other way. In our matched cohort, teams that kept their coach boasted a slightly higher winning percentage (by 3.7 percentage points) than those that fired their coach (p-value = 0.08). Notably, this estimate of 3.7 percentage points is relatively robust to model specification.

**Extrapolating, our best estimate of the causal effect of firing a coach is about -0.6 wins in the following season, but given our uncertainty, it is unclear if this finding is due to chance or if there’s some true, net loss in the year following a coach firing.**
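The conversion from winning percentage to wins is simple arithmetic, sketched here in Python. It assumes a 16-game regular season, which is an assumption on my part and does not hold for the strike-shortened seasons in the data, and it treats the 3.7% gap as 3.7 percentage points.

```python
GAMES_PER_SEASON = 16    # assumption: standard 16-game regular season
effect_win_pct = -0.037  # fired minus kept, in winning percentage

effect_wins = effect_win_pct * GAMES_PER_SEASON
print(round(effect_wins, 1))  # -0.6
```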

Hopefully this walk-through provides readers a rough introduction to how causal inference tools can be used, as well as the steps involved. You can repeat the analysis yourself using the code here, and if you are more familiar with causal tools, feel free to play around.

Some final thoughts:

- Those familiar with causal inference will notice I did not detail all of the assumptions required. One such assumption is positivity, which I think holds because it’s safe to assume that each team had a non-zero chance of firing its coach. Another is SUTVA, which I’m less confident about. As an example, it seems reasonable to argue that one team’s choice to fire its coach ties into the potential outcomes of other teams.

- This post on visualizing covariate balance is really interesting, and would have saved me several hours of thesis writing. In fact, I wish I had seen it before I started writing the above post.

- You could certainly make the case that a team’s future win percentage in the year following a coach firing is not the best outcome. I chose the one-year outcome, as once you go to more than a year, things could get a bit dicey regarding our assumptions (e.g., if a team fires coaches in two consecutive years).

- If this were a more technical paper, I’d want to look at other variables related to the choice of firing a coach. As one example, a popular post-hoc tool in causal inference is sensitivity analysis, where some of the assumptions mentioned above are put to the test.
