Unfortunately, not everyone was enamored.

While it’s tempting to deride conclusions like Pete’s, it’s also too easy a way out. And, to be honest, I share a small piece of his frustration, because there’s a lingering secret behind win probability models:

Essentially, they’re all wrong.

But win probability models can still be useful.

To examine this more deeply, I’ll compare 6 independently created win probability models using projections from Super Bowl 51. Lessons learned can help us better understand how these models operate. Additionally, I’ll provide one example of how to check a win probability model’s accuracy, and share some guidelines for how we should better disseminate win probability information.

**So, what is a win probability?**

A win probability is the likelihood that, given any time-state in the game, a certain team will win the game.

Win probabilities can be either subjective (“This game feels like a toss-up”) or objective (“My statistical model gives the Falcons a 50% chance of winning”). This post focuses on the latter type, which has become increasingly popular across sports over the last decade.

**What are some NFL win probability models?**

Here are the models that I’ll compare in this post.

*Pro Football Reference* (PFR): Stemming from research by Wayne Winston and Hal Stern, PFR’s model uses the normal approximation and expected points to quantify team chances of winning. Read more in Neil Paine’s post here.

*ESPN*: ESPN’s predictions, provided by Henry Gargiulo and Brian Burke, are derived from an ensemble of machine learning models.

*PhD Football*: An open-sourced creation of Andrew Schechtman-Rook built using Python, this model uses logistic regression to predict game outcomes.

*nflscrapR*: An R package from graduate students at Carnegie Mellon; its win probabilities stem from a generalized additive model of game outcomes.

*Lock and Nettleton*: Probabilities generated via a random forest, as done by Dennis Lock and Dan Nettleton in the Journal of Quantitative Analysis in Sports, implemented with data from Armchair Analysis.

*Gambletron*: Created by Todd Schneider, Gambletron uses real-time betting market data to impute probabilities.

Before we start, a particular thanks to PFR for this and all of their public work, Brian and Hank, Andrew, Ron and Maksim, and a student of mine (Derrick) for their help in either sharing or pulling in the data. I greatly appreciate their work and/or willingness to share. Sadly, not everyone was so helpful.* Additionally, note that the 6 models used 6 unique approaches, which demonstrate the variety of ways that people have thought about win probability.

Finally, R code – and predictions from a few models – are up on my Github page.

**How’d probabilities look in the Super Bowl?**

One interesting way to start is to visualize how each model viewed the Super Bowl. Here’s a chart of New England’s play-by-play win probabilities, using a different color for each set of predictions.**

*Super Bowl win probabilities (New England’s).*

For most of the game, there’s at least a 5% gap between New England’s lowest and highest projections, and at several points, the gap is as high as 10%.

With six unique models, it’s not surprising to see these differences, but I’d also argue that this type of variation is not an attractive property for disseminating win probability information.

**How big of a comeback was it?**

It’s obvious that by the third quarter, New England’s chances were slim. Of course, with probabilities clustered near zero, it’s a bit difficult to precisely identify differences between the models in the initial chart. So, I converted the probabilities to odds to get a better sense of how the models viewed New England’s comeback.
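The conversion itself is simple. Here’s an illustrative sketch (the post’s own code is in R; this is a Python equivalent):

```python
def odds_against(p):
    """Convert a win probability p (0 < p < 1) into odds against winning,
    i.e., the x in 'x:1 odds'."""
    return (1 - p) / p

# A 1-in-1000 chance of winning is 999:1 against (casually, "1000:1"),
# while a 4% chance is 24:1 against (casually, "25:1").
long_shot = odds_against(0.001)
modest_shot = odds_against(0.04)
```

Note how quickly odds grow as probabilities shrink toward zero, which is exactly why small disagreements between models get magnified on this scale.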

Here’s that chart, and I added dotted lines to identify the point in the game when each model gave New England its longest odds of a comeback.

*Odds of winning Super Bowl 51, relative to New England. Second half only shown.*

Here, gaps between models are more substantial, which is not surprising given that odds are not robust to small changes for probabilities near 0. At multiple points, *PFR* gave New England about a 1 in 1000 chance of winning (1000:1 odds) while projections from *Gambletron* (which is arguably serving a different purpose with its numbers) barely crossed 25:1.

Thus, the wow-factor of the Patriots comeback depends on your source. If you choose *Gambletron*, it’s a once-every-two-or-three-decades type of comeback. If you choose *PFR’s*, it’s a Super Bowl comeback we’ll only see once every millennium. From a communication perspective, this is a weakness of win probability models, and one that shows up frequently given that, for better or worse, people most often look at win probability charts after a major comeback.

Finally, it’s worth noting that the moments in the game when New England was given the longest odds of winning also differ, varying from midway through the third quarter to midway through the fourth. Indeed, your definition of how much of a comeback it was isn’t just limited by your choice of a model, but by your identification of which time in the game to start at, too.

**How about win probability added?**

Brian Burke makes an important point at The Ringer that win probability models are perhaps best used for understanding in-game decision making. Often, this is done by comparing win probability from one play to the next using a metric called win probability added (WPA).
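In code, WPA is nothing more than the difference in consecutive win probability estimates. A minimal sketch (illustrative Python; the example numbers are hypothetical):

```python
def win_probability_added(wp_before, wp_after):
    """WPA for one play: the change in the offense's win probability
    from just before the snap to just after the play ends."""
    return wp_after - wp_before

# Hypothetical illustration: a completion that lifts New England from a
# 48% to a 51% chance of winning is worth about +3% of WPA.
wpa = win_probability_added(0.48, 0.51)
```

Because WPA is a difference of two model outputs, any disagreement between models about the underlying probabilities carries straight through to disagreement about WPA.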

In the Super Bowl, leaps or drops in New England’s WPA are also somewhat dependent on model choice. As an example, New England’s second play from scrimmage, a 9-yard completion on 2nd-and-10 from Tom Brady to Julian Edelman, helped the Patriots according to *PhD Football* (+3%) but hurt the Patriots according to *nflscrapR* (-2%).

Here’s a chart that compares each pair of models’ WPA.***

*Win probability added, contrasted between 5 NFL win probability models in Super Bowl 51.*

The figures in the bottom left show pairwise scatter plots between each model’s WPA, with correlations listed in the top right. Histograms of WPA (relative to New England) are shown on the diagonal.

There’s a moderately strong link in WPA between each pair of predictions; the strongest correlation is between *ESPN* and *PhD Football* (0.85), and the weakest is between *PFR* and *nflscrapR* (0.45).

However, there’s still a decent amount of variability with respect to how each model sees the helpfulness of each play. For example, all five models agreed that a play either helped or hurt the Patriots on fewer than half (57, or 46%) of the 125 plays shown. This is another humbling aspect of win probability models — there’s uncertainty both in a team’s chances at any one point in time and in how those chances shift from one play to the next.

**So how can we know if a model is useful?**

Let’s take a break for a fun anecdote.

The typical NFL season has about 40,000 plays. Let’s imagine that you flip a fair coin 40,000 times to find the proportion of heads. We know the true probability of heads — it’s 50% — but if we use the results from our 40,000 flips, the average distance we can expect between our estimate of heads and the truth is about 0.2%. That is, we can’t predict a fair coin much better than +/- 0.2% in 40,000 trials. And if we can’t get precise probability estimates from coin tosses — which don’t have variables like the offensive team, defensive team, score, down, distance, spread, clock time, and timeouts attached to them — how can we expect our NFL win probabilities to be any more accurate?
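The 0.2% figure follows from the normal approximation to the binomial: the standard error of the proportion is sqrt(0.5 * 0.5 / 40,000) = 0.0025, and the expected absolute error is that times sqrt(2/pi). A quick check (an illustrative Python sketch, not the post’s code):

```python
import math

def expected_abs_error(p=0.5, n=40_000):
    """Expected |p_hat - p| for a binomial proportion, via the normal
    approximation: E|p_hat - p| ~ sqrt(2/pi) * sqrt(p*(1-p)/n)."""
    se = math.sqrt(p * (1 - p) / n)
    return math.sqrt(2 / math.pi) * se

# For a fair coin and 40,000 flips, this is about 0.002, or 0.2%.
err = expected_abs_error()
```

In other words, even with a full season of plays and a process whose true probability we know exactly, a fifth of a percentage point of error is baked in.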

So, whether a win probability at a certain time in the game is off by 10% or by 0.2%, it’s off. Humbly, it’s why all models are wrong.

So how can we know if a model is useful?

Well, the best way to judge projections of any outcome is to use events that have yet to take place. For football games, this means using past games (termed training data) to derive predictions for future games (test data). If the probabilities are reasonable, those predictions should match future game outcomes.

So, that’s what I did using Lock and Nettleton’s random forest model. I started by using the 2005-2015 seasons as training data. Next, I sampled 5 plays in each quarter of each game from the 2016 season to use as test data (5,340 total). Sampling plays in this manner ensures that I have the same number of plays from each game (to weight games equally) and that I haven’t overfit (there are no overlapping plays in the test and training data). It’s also how the Super Bowl projections above were made.
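A sketch of that sampling scheme (illustrative Python; the post’s actual code is in R, and the field names here are hypothetical stand-ins for real play-by-play columns):

```python
import random

def sample_test_plays(plays, n_per_quarter=5, seed=42):
    """Sample a fixed number of plays from each (game, quarter), so that
    every game contributes equally to the test set.

    `plays` is a list of dicts with 'game_id' and 'quarter' keys.
    """
    rng = random.Random(seed)
    by_group = {}
    for play in plays:
        key = (play["game_id"], play["quarter"])
        by_group.setdefault(key, []).append(play)
    sampled = []
    for group in by_group.values():
        sampled.extend(rng.sample(group, min(n_per_quarter, len(group))))
    return sampled
```

As a back-of-envelope check: 267 games in 2016 (256 regular season plus 11 postseason), times 4 quarters, times 5 plays, yields the 5,340 test plays mentioned above.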

Here’s a chart of how well the Lock & Nettleton model predictions did in 2016, aggregated by quarter. Each point averages the offensive team’s probabilities rounded to the nearest 0.05, paired with the corresponding fraction of games in that bin that the offensive team won. The closer projections are to the diagonal line, the better. If you want to see the R code for this chart (and the ones above), see my Github page.
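The binning behind a chart like this can be sketched as follows (illustrative Python; the post’s chart was built in R):

```python
def calibration_table(probs, outcomes, bin_width=0.05):
    """Round predicted probabilities to the nearest `bin_width`, then
    compare each bin's mean prediction to the observed win rate."""
    bins = {}
    for p, won in zip(probs, outcomes):
        b = round(round(p / bin_width) * bin_width, 2)
        bins.setdefault(b, []).append((p, won))
    table = {}
    for b in sorted(bins):
        pairs = bins[b]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        win_rate = sum(w for _, w in pairs) / len(pairs)
        table[b] = (mean_pred, win_rate)
    return table
```

Plotting each bin’s mean prediction against its observed win rate gives the calibration chart: a well-calibrated model’s points hug the y = x line.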

*Observed versus estimated win probability for a sample of 2016 NFL plays. Predictions derived from Lock and Nettleton’s random forest model. By and large, projections match reality, as demonstrated by the line of best fit roughly corresponding to the line y = x.*

In Lock and Nettleton’s model, results are fairly reasonable. Across most bins in most quarters, probabilities reflect reality. It’s not that projections are perfect – teams with low win probabilities in the first quarter win more often than expected, for example – but it’s difficult to identify any precise location where the model is off by more than what we’d expect due to chance alone. Third quarter probabilities, as an example, look particularly reasonable. I’d also argue that this model’s performance is more impressive given that no games from 2016 were used in its training; a model refit with 2016 data may have more readily picked up on recent changes to the game.

Charts like the above don’t ensure that our probabilities are correct, as that’s impossible. Instead, they provide warning signs if, for certain types of game situations, probabilities are off.****

**Practical recommendations**

Given the above, here is a set of recommendations for those of us creating or citing win probabilities.

- Avoid over-precision. Using too many digits (e.g., 60.51%) belies the true difficulty of predicting unrealized outcomes in sports. Round probabilities to the nearest percentage point. (Excellent example: 538).
- Embrace uncertainty. Instead of “There’s a 2% chance”, use “There’s about a 2% chance” or “About 1 in 50.”
- Take extra care when presenting surprising results. It’s difficult to believe that New England’s comeback was a once a millennium type of result, but it was often presented as such.
- Model check, and share results. This is an easy thing to do, and it’s the only way to know if predictions are close to accurate.
- Update models over time. Sports leagues are ever-evolving — as examples, NBA teams shoot more 3’s and NFL teams pass more often than ever before — and so if a model isn’t updated over time, predictions could go from wrong to really wrong.

***************************

*I also emailed numberFire and asked for their projections. The response was as follows:

*Unfortunately we will not be able to share with you our predictive model. However you can review the perks from our premium services to see how it all works and what we have to offer. If you have any further questions do not hesitate to reach out to us.*

That’s bullshit. The chart’s literally right here, with the probabilities shown when hovering over it. Those probabilities are shown to four decimal places.

**I dropped overtime plays. There’s enough extrapolation in win probabilities as it is, and extending to rarely played overtime events seems unwise. Additionally, note that there may not be perfect alignment in the charts with respect to Gambletron’s data, which works by real time (and not clock time).

***This chart only shows runs and passes, as there were too many irregularities in how each model ordered and timed special teams plays. Gambletron is not shown given that its time stamps reflect real time, and not clock time.

****One of my goals this summer will be to make Lock & Nettleton’s model more public, but I’ll need to check with the authors first. It’s a fairly reasonable model to fit in R, and it would be great to have an NFL win probability Shiny app where those unfamiliar with R could enter a game situation to get probabilities.

*Note: An earlier version of this post pointed to possible limitations of PFR’s win probability model. However, after some offseason tuning, things appear to be more readily in order.*


It’s called the replication crisis, and it’s an issue that has challenged psychology, engulfed economics, and been identified as a disease in a field full of them (medicine).

One area where replication has not been widely discussed is sports analytics, which, while more limited in scope than the disciplines listed above, takes center stage at this weekend’s Sloan Sports Analytics Conference (SSAC), with more than 3000 practitioners, fans, and professional staffers gathering in Boston.

One of the more attractive aspects of SSAC is its research paper contest, which generally features outstanding papers, provides researchers widespread press for their work, and awards a top prize of $20,000 to one submission. As a result, it was with optimism that I read that in 2017, SSAC would be doing its best to ensure the validity of its contest submissions.* Via the rules page, “research will be evaluated on but not necessarily limited to the following: novelty of research, academic rigor, and reproducibility.”

Specifically, for reproducibility, the conference asks: “Can the model and results be replicated independently?”

This is an important definition, and one that mimics the work of Prasad Patil, Roger Peng, and Jeff Leek, who recently went to extensive lengths to precisely define both reproducibility and replicability. As Patil et al. argue: research is *reproducible* if a different analyst can generate the same results using the same code and data, and research is *replicable* if a different analyst can obtain consistent estimates when recollecting the data and re-doing the analysis plan. In other words, look for data and code, and ideally you’ll see both.

So, how did the 2017 finalists fare by these definitions?

Not great.

Here’s a chart summarizing the 2017 contest.** Each paper is identified by keywords from its title, and the columns reflect the data source, whether or not the data is (or appears to be) publicly available, if code was provided, and whether or not the overall paper is, by definition, *reproducible*. Note that two papers are yet to be posted on the SSAC site.

*Summary of the 2017 Sloan research papers, including data source, if the data is publicly accessible, if code is provided, and if the paper meets the definition of reproducible.*

Of this year’s 21 listed finalists, fewer than half cite publicly available data that could be used by outsiders, as most submissions use proprietary data or do not give sufficient detail behind how the data was gathered. Even among those using public data, however, only two (the Lahman database in an MLB paper, and a Google Doc from an NHL paper) are accessible without writing one’s own computer program (note that the scrapers used to obtain the data were also not shared) or doing extensive searching. At best, five or six papers boast any chance of being *replicable*, which, sadly, is only a few more than the number of papers that don’t share any information about where their data came from.

As for code, only Adam Levin, author of a PGA Tour paper, shared some (link here). Adam also deserves credit, as his data is available from ShotLink with an application. In fact, that application is as close as we get in the SSAC contest to *reproducibility*. With its publicly shared passing-project data, Ryan and Matt’s NHL paper would appear to be the next closest. Additionally, a separate NHL paper made reference to code, but none was shown on the author’s website.

There are several consequences to the lack of openness. First, it increases the chances of mistakes. While most of these errors have likely been innocuous, there’s no way of knowing what’s real and what’s bullshit at Sloan, which means that the latter is sometimes rewarded. As one example, a 2015 presentation showed an impossible-to-be-true chart about profiting on baseball betting, capped with a question-and-answer session in which the speaker handed out free tee-shirts.*** Next, it stunts growth of the field, which is a shame because, as Kyle Wagner wrote, sports analytics has been stuck in the fog for a few years running. Finally, while citations aren’t the end-goal for many SSAC paper writers, the lack of reproducible research means lower chances of papers being referenced in the future.

SSAC likes to point out that it’s a pioneer in its domain. Given that the growth of sports analytics is in the best interest of everyone in attendance, I’d recommend that the conference either start enforcing one of the criteria it claims to look for, or lose the disguise that it cares about properly advancing and vetting research.

* In full disclosure, I’ll note that I was part of a paper with Greg & Ben (code and details here) that was rejected from the 2017 contest.

** If I’ve made a mistake in the table, please let me know and I’ll update. There may be links or explainers that I missed.

*** If you were making money by betting on sports, the last thing you’d do is get up on stage at a famous conference and tell anyone about it.

*Note*: Thanks to Gregory Matthews for his help with this post


Turns out, NHL players may not be the only ones who come to the defense of their own; there’s a decent chance that the league’s referees do, too.

Let’s return to January of 2016, where Calgary’s Dennis Wideman leveled linesman Don Henderson with a vicious cross-check.

Despite — or perhaps given — his reputation as a high-character player and the fact that post-concussive symptoms may have played a role in his mental state, Wideman was given a 20-game suspension, eventually returning to Calgary in March of last season.

But losing Wideman for 20 games wasn’t the only way in which Calgary felt the impact of the hit. From that game on, the Flames also found themselves in the penalty box significantly more often than beforehand.

Using data from Micah McCurdy (as well as some visual inspiration), I plotted the number of taken and drawn non-matching minor penalties in all Flames games since the start of the 2014-15 season. This includes Calgary’s 129 regular season games prior to the Wideman hit and the 92 games since.

Each game is represented by a point, and the curved lines reflect local polynomial regression curves, shown separately for games before and after the hit (along with errors).

A few things stand out.

Prior to the Wideman hit, Calgary was consistently called for about one fewer penalty per game than its opponents. However, while the rate of penalties drawn by Calgary has remained fairly consistent over the last 2+ seasons (shown in grey), after Wideman’s hit there’s an immediate bump in penalties taken by Calgary (red). Comparing games pre and post-hit, the Flames jumped from 2.1 to 3.3 non-matching minors per game. That’s … substantial, and a practically significant increase.

As additional evidence, note that prior to the Wideman hit, roughly 1 in 10 Calgary games included no taken penalties. In the 92 games after the hit, the Flames had only one such game. Moreover, the jump in Calgary’s penalties corresponded with a *drop* in the league-wide infraction rate.

In addition to the comparison of the curves above, we can assess the significance of the Flames’ increase using the Poisson distribution. Initially linked to hockey more than a decade ago by Alan Ryder, the Poisson distribution is appropriate for penalty outcomes given the fixed amount of time in each game and the discrete counts. Sure enough, the 55% rate increase is statistically significant when comparing mean penalties pre and post-hit, and it is quite unlikely that the difference could be accounted for by chance alone. For those scoring at home, the p-value is less than 0.0001, and the 95% confidence interval for the rate increase goes from 19% to 106%.
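A sketch of that comparison (illustrative Python; the original analysis was presumably run in R, e.g. via something like `poisson.test`, so this log-scale normal approximation won’t reproduce the exact interval quoted above; the penalty totals below are also reconstructed from the per-game rates, not the true counts):

```python
import math

def poisson_rate_ratio_ci(count1, games1, count2, games2, z=1.96):
    """Approximate 95% CI for the ratio of two Poisson rates
    (penalties per game), via the usual log-scale normal approximation."""
    rate1 = count1 / games1
    rate2 = count2 / games2
    ratio = rate2 / rate1
    se_log = math.sqrt(1 / count1 + 1 / count2)
    return ratio, ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)

# Approximate totals reconstructed from the per-game rates in the text
# (about 2.1 * 129 pre-hit and 3.3 * 92 post-hit non-matching minors).
ratio, lo, hi = poisson_rate_ratio_ci(271, 129, 304, 92)
```

The key output is whether the lower bound of the interval sits above 1: if it does, a rate increase of this size is unlikely to be chance alone.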

Assuming we can rule out luck (or bad-luck) of the draw, what does this suggest?

- Officials are implicitly making more calls against Calgary to get revenge. This is more feasible given how much subjectivity is involved in several NHL violations. We already know that refs are prone to make-up calls, and that they base penalty decisions on other silly factors – so we shouldn’t be surprised that they’d take a measure of revenge, either. Wideman’s hit was egregious, and refs may be punishing his team for it.
- Another variable is responsible for the jump, but we don’t know what that variable is. As a related sports example, a few years back, an NFL analyst argued that time-varying Patriots fumble rates were a sign that New England was cheating. What was missing in the initial analysis is that there were several *other* reasons (e.g., more kneel downs, red zone plays, plays with the lead) driving the Pats’ low rates. Indeed, part of the reason why the Patriots didn’t fumble is because they were running plays that generally did not lead to fumbles. Could we be missing a similar confounding variable here, one that is artificially responsible for the increased penalties? Maybe. I’m open to ideas. In this respect, it’s particularly interesting that Calgary’s drawn penalties have stayed the same. The Flames don’t appear to be playing a more aggressive game since January of last year.
- Calgary wasn’t the same team after the Wideman suspension. This would stand if the jump in penalties matched the length of the Wideman suspension. In other words, perhaps Wideman’s replacements were aggressive players. However, Wideman was suspended 20 games, and the spike in penalty calls has remained much longer.

As one additional sign that the first explanation is responsible for part of Calgary’s jump in taken penalties, it’s worth revisiting the chart. If you look at the last month of play, Calgary’s taken penalty average has dipped.

At more than one extra penalty a game across more than a full season of play, it’s easy, albeit unsafe, to extrapolate and estimate that Calgary’s jump in penalties was worth about 20 goals against. This is an incredible total. Even if only part of Calgary’s increase in penalties was due to a revenge factor, the biggest impact of Wideman’s hit on the Flames wasn’t felt in his suspension, but in the penalty box.

*Postscript: A loyal reader asked me to compare the rates of all teams, both pre and post Wideman hit. Here’s that chart, using data from the nhlscrapr package in R.*

*We can also look at all team-seasons across several years to get a distribution of changes in penalty rates before and after the date of Wideman’s hit (roughly halfway through the season, on January 27). Here’s a histogram of those differences. No one’s near Calgary in 15-16.*


In the past decade, sports analytics moved from the fringes of popular consciousness to the mainstream. The typical media narrative tells us that data is changing the game. To some extent, that’s true. The majority of professional teams in the five major sports leagues have at least one person on staff or on retainer tasked with delving into details and applying numbers to performance, and nearly all NBA, NHL and MLB franchises have sent at least one representative to the Sloan Sports Analytics Conference.

Noah and I wanted to find out more details about the job, the lifestyle and how analytics are being used, so we developed an informal survey and asked people who work or had worked on the sports analytics staffs of professional teams to participate. We used a combination of social media and personal email to contact staffers who we knew worked with teams or who were mentioned in ESPN’s analytics feature. A total of 61 respondents answered questions anonymously. A pdf of the survey can be found here. We also interviewed a half-dozen by phone, either on or off the record, to get anecdotes about what it’s like to be part of a professional franchise.

Some of the responses were predictable. Our survey wasn’t perfect, and our sample isn’t necessarily representative of the industry, but the respondents were 95 percent male and 95 percent white, and 92 percent fell into the 19-to-45 age group. Additionally, respondents from Major League Baseball teams report the highest average number of full-time staffers working for their teams – 3.6 – compared with roughly two apiece per team in the NFL, NHL, and NBA. (These numbers could be skewed because teams without any current or former analytics staffers would not be able to respond to the survey.) That makes sense, considering that everyone we spoke with believes that baseball teams are generally the furthest along in the development of their analytics departments. But professional teams across all major sports are increasingly investing in their analytics departments, slowly but surely adding to their budgets as executives place more trust in numbers.

Respondent breakdown:

NBA 18

MLB 16

NHL 7

NFL 6

Professional soccer 5

Other/multisport 4

(Five respondents left “sport” blank.)

**For the love of the game**

Working for a team isn’t a 9-to-5 job. The hours are long, with analytics staffers reporting that they work anywhere from an average of 53 hours per week in soccer to 66 in the NHL, with MLB, the NFL and the NBA averaging 60. (One MLB staffer reported working 95 hours per week.) “There are no holidays,” Bill Petti, a consultant for a number of MLB teams, said about the life of an analytics staffer. “You’re working nights. You’re sitting in the office at 10 p.m. in case somebody has a question.”

Aaron Barzilai, who worked as director of analytics at the Philadelphia 76ers until last February, agreed. “Salaries are depressed. You can’t be working for a team if you don’t love it because you need to be getting some psychic benefit from working for a team.”

Salaries are decent and occasionally well above those of the typical white-collar worker. According to our survey, medians ranged from $75,000 (NFL) to $100,000 (MLB) per year. Three respondents said their annual wages were greater than $200,000. For most, it’s not a bad living, but consider that nearly anyone with the skills to land one of the few analytics jobs on a professional sports team could make significantly more at Google, Facebook, Microsoft, or dozens of other firms. This leads to lots of turnover, especially at junior positions, which are lower paid and offer little, if any, opportunity to advance into a more senior role, because those jobs rarely become available. Many recent college graduates work for teams for a few years before moving on to tech firms or other companies where they can earn more for their skill set.

**Determining value**

The people we spoke with said that teams undervalue their analytics staff and invest accordingly. Any reasonable employee would say that their department deserves more money and more resources, but it’s not unreasonable to think that more computing power or another data set could produce results that are cheap by comparison. “They are spending hundreds of millions on players, tens of millions on coaches and staff, but $10,000 is a large expenditure to get a computer or some data,” said a sports analytics expert who has worked with NBA teams. “It’s ridiculous. It’s two different budgets.” (For what it’s worth, one study found that the average price of an MLB win is $1,016,674 in player salary, an NBA win is $1,572,768, and an NFL win is $11,878,369.)

As some teams mature and develop systems to handle routine reporting like data gathering, they may be able to think about building teams to handle some of the other stuff. Those that don’t will find themselves hitting a wall. In the past, teams could get away with having Excel, a computer and a staffer or two. Advances in sports analytics overall mean that groundbreaking work requires increasing talent levels, computing power and time to experiment.

One NHL analyst told us about his dream staff:

*“You need a couple different people unless you are a one-man team willing to put in 20-hour days. Just the data handling alone — the NHL is pretty archaic in their data to begin with — is one full-time person. If you wanted to expand from coaching and tactical to GM trades and to the draft/strategic long term, you need one data person, two or three analytical people whose job is formulating and communicating analysis, and then I would have two or three developers working on dashboards and tool sets. If you’re going to give a GM a sheet of paper with a recommendation, he may or may not pay attention. But if you give him a tool that he can play with, then it’s not your idea. It’s his idea.”*

A staff of one wouldn’t have the resources to develop and build that tool.

But rather than hire for a number of positions, many teams still seek unicorns. Teams want one person who can fill all of those roles and who also has the skills to communicate results to others. This Marlins job posting asks for an intern with scripting ability, database management skills, and statistical proficiency. Those skills sometimes overlap, but an actual combination of all three is rare, and most people who have it can make a substantial amount of money elsewhere. And keep in mind, baseball is *ahead* of most other sports. A college student with this background can get paid $7,200 or more a month, with housing, to work at Microsoft or Facebook, or make $12.50 an hour for the Marlins.

**Nothing matters if no one is listening**

Finally, analytics can only be effective if the decision makers use what they are given. “There seems to be too much focus on results, and not enough focus on the quality of the process,” one respondent said.

Some analysts expressed concern that teams didn’t pay attention to their work.

“If the GM or president of basketball operations doesn’t want to read the results, it doesn’t really matter how talented your Ph.D. in the basement is,” said Barzilai. “You can have organizations that are using analytics well even if they aren’t doing cutting-edge analytics just by relying on what people might think of as fundamental analytics or the stuff that was coming out five years ago or stuff that is public.”

Another added that it’s frustrating: “When the numbers are so overwhelmingly in favor of one decision and it doesn’t happen due to someone’s feelings about public perception or a ‘well, it’s always been this way’ attitude.”

Consider The New York Times’s Fourth Down Bot, a simple formula that tells readers when teams should go for it on fourth down. The bot believes that coaches are too conservative and that they would universally benefit from going for it more often. If a coaching staff listened to the bot, they’d benefit in the long term.

Generally, the decision makers in the front office are more receptive than others to input from the analytics staffers. In the NBA, NFL and NHL, at least 50 percent report weekly correspondence with the general manager, while just 20 percent in MLB do. Just 10 percent of the overall sample says they have weekly correspondence with players.

Team officials say they are continually working to refine the processes, to incorporate all the information they receive. “We don’t see ourselves as having an ‘analytics team’ or ‘process of incorporating analytics,’” an assistant general manager of an NHL team said. “We look at the best information we have when making decisions, and everything’s assimilated similarly whether it’s one scout’s eye test or another (or even the same) scout’s or an analyst’s tracking data or historical comparisons that some might call ‘analytics.’”

Some teams are better than others at applying what they learn. According to our respondents, the Spurs (NBA), Maple Leafs (NHL), Dolphins and Browns (NFL) are leading the analytics push in the less-than-quick-to-adapt leagues, with the Browns the most open about their ambitions. Said one: “A team like Toronto is doing it by the book in terms of how analytics should be impacting teams. They are … eating everyone’s lunch. It will pay off in the next two to five years and teams will start to say wow, look at their talent pool. Teams will start copying them.”

Despite analytics’ increasing role and media attention, it’s still early days. There are algorithms to write, data to dissect and knowledge to create. We asked respondents to say what percentage of the most important questions in their sport have been answered. The results, averaged by sport:

MLB – 56 percent

NBA – 38 percent

NHL – 32 percent

NFL – 31 percent

Soccer – 17 percent

Sports analytics developed a great deal in the past few decades, but there’s still plenty more to discover.


The photos begin after page 177. While perfect for putting names to faces, the insert’s location also means that if you read too fast, you’ve missed my favorite part of the book.

In the pages before and after the pictures, Lindbergh and Miller link their situation in running the Stompers’ season to quotes made by Huston Street, the Angels closer who was asked about pitching in non-save situations.

“I’ll retire if [pitching in non-save situations] ever happens,” Street is quoted as saying. “It’s a ridiculous idea, it really is.”

The quote hits home for Lindbergh and Miller, who, until that point in the season, had been similarly using their closer Sean Conroy in only save situations. Although Lindbergh and Miller had wanted to bring Conroy in during any high leverage spot, the team’s manager, Feh Lentini, was to that point resistant. Lindbergh, author of this particular chapter, writes [emphasis mine]:

*That last comment really rankles, because it cuts to the heart of why we’ve come to Sonoma: to put “on paper” ideas like this into practice. Thus far, it seems as if those who side with Street are right. Not because the idea of less restrictive roles doesn’t work — it did work, before league bullpens became hyper-specialized — but because everyone is so convinced it wouldn’t work that they aren’t even willing to try it. The idea is disqualified because it can’t pass a test that no one will allow it to take.*

Street’s comments weigh on Lindbergh for days, leading to friction with Feh, eventually ending with the manager’s firing halfway through the Stompers season.

Altogether, Lindbergh and Miller’s book is worth reading. It’s the game all of us stats-nerds wish we had the chance to play – actually calling the shots, instead of just commenting about them. And the authors make the characters and events personal. You’re rooting for Conroy, you’re rooting for a random infield shift to work, you’re rooting for Christopher Long’s spreadsheet to spit out the best possible players, and you’re rooting for a Stompers title, all things you otherwise wouldn’t have known existed.

But it was Lindbergh’s summary of Street’s comments – and the personal and oft-tempestuous back and forth between baseball tradition and analytics strategy — which has stayed with me months after reading their book.

Indeed, statistical findings have changed the way in which teams across baseball, and to a lesser extent other sports, have assessed free-agents, implemented game-day strategy, and drafted future players. Stat-heads are also a relative bargain, as Rob and Ben suggested here, boasting strong ties to future improvement.

It’s not that one can guarantee that implementing analytical strategies will lead to success, for the same reason that Lindbergh and Miller couldn’t guarantee that every infield shift would work to the Stompers’ advantage. Moreover, not every statistical idea once assumed true will end up being correct. But it’s the actuality of implementing a test – the trying of something to see whether or not it will work, and the ability to live either way with the consequences – that eats at us in sports analytics on a near daily basis. Don’t disqualify an idea because you’re not willing to try it.

And as a result, it’s when the most basic of tests can’t be taken that frustration boils over.

Statistics can beat the smartest of us in chess and poker, know which friend you’ll connect with on Facebook, predict what you’ll buy on Amazon, and finish what you are searching for on Google. Implemented properly, it can also allow sports teams to make better decisions across nearly all facets of the game.

It just needs the ability to take the test.

*You can read more about Ben and Sam’s book here, or buy it on Amazon here. *


The 2016 NFL regular season has ended, and with it has come the usual coaching carousel in which many franchises have opted to fire their head coach.

As of January 2nd, six of the league’s 32 teams have openings, with five of those coming by way of a fired predecessor (Denver’s retiring Gary Kubiak being the lone exception). But it’s not like 2016 is any type of outlier; roughly 4 coaches per year have been canned since the early 1980s.

What’s interesting, though, is that despite the frequent, franchise-altering decisions made across the league, it’s mostly unknown whether or not this choice benefits long-term franchise prospects. (*Postscript: Today, Brian Burke looks at the identical question here, finding similar answers to what I find below.*) As one exception to the rule, a study in soccer found that sacking a manager offered no tangible benefit to the future performance of a club.

So, does firing a coach cause teams to improve?

The point of this blog post will be to look back at past firings, and to use some standard causal inference tools to help us identify if the choice of whether or not to fire a coach has been a helpful one.

Estimating causes and effects when it comes to coach firings, unfortunately, is in no way straightforward.

The easiest strategy would be to compare the performance of franchises who fired their coaches in the seasons pre and post firings. For example, since 1982, the 130 teams who have fired their coach (using end-of-season firings) boasted an average improvement in winning percentage of 0.10, or the equivalent of about 1.6 games in a 16 game season. That’s a notable and statistically significant improvement.

Of course, that simple strategy is also a misleading one. The teams who got rid of their head coach only averaged about 5 wins per season prior to the firing, so on account of reversion towards the mean, we would’ve expected most of these teams to improve anyway.

We can and should do better.

Let’s introduce some causal inference lingo.

In an ideal world we’d observe two outcomes, (i) the future performance of a team that fired its coach and (ii) the future performance of that same team that kept its coach. These are termed potential outcomes, and if we knew both potential outcomes, it would of course be easy to pick an optimal strategy.

Alas, short of building a time machine, knowing both potential outcomes is infeasible, and we’re only left with knowing the path chosen by each franchise. This is what’s known as the fundamental problem of causal inference; we want to be able to contrast an observed outcome with something that can’t be observed.

As it turns out, this also makes causal inference a missing data problem – the missing data is the missing potential outcome. In our case, for a team that fired its coach, the missing outcome is the path that would have been observed had that team kept its coach. Likewise, for a team that kept its coach, the missing outcome is what would’ve occurred had the coach been canned.

Causal inference tools, initially stemming from Jerzy Neyman’s work in the 1920’s with randomized designs, have become quite popular for estimating these missing potential outcomes. Under certain – but important – assumptions, if we can estimate the missing potential outcomes, we can likewise estimate the causes and effects of a treatment, including those from observational data.

The most popular causal tools are individual or full matching, subclassification, and weighting, each of which has its own strengths and weaknesses. In the sections below, I’ll overview how to use 1:1 matching with a data set of NFL coach firings.

If you are in search of a broader look at causal inference tools, I’d start with Elizabeth Stuart’s excellent review in Statistical Science.

The data I’m using comes courtesy of Harrison Chase and Kurt Bullard, former and current members of the Harvard Sports Analytics Club. Along with Harvard professor Mark Glickman, Harrison helped write an article on coaching turnover in sports, published recently in Significance Magazine. In addition to their data, their model assessing when teams fire their coaches was the impetus behind this post.

Harrison and Mark used a combination of logistic regression and classification trees to fit a model of coach firing (Yes/No) as a function of several team-level covariates. Their final model includes, but is not limited to, each team’s past win percentage, divisional win percentage, the coach’s experience, strength of schedule in the prior season, number of rings that the coach averaged, and whether or not the team also experienced a GM change, chosen from roughly 25 candidate covariates.

Using their final variables, I used logistic regression to model each coach firing decision between 1982 and 2015. Here are those fitted probabilities from that model, separated by the teams that did and did not fire their coach. Points are jittered to account for overlap.

Altogether, the chart isn’t surprising. Most teams in most years aren’t firing their coach, and these teams are shown in the cluster of points in the top left of the graph. Meanwhile, teams that fire their coach tend to have predicted probabilities evenly spaced between 0 and 0.9.

At this point, we know what we suspected to begin with; the teams that fired their coach are, by and large, different from those that did not. This is a problem for most statistical tools. Basic comparisons like *t*-tests wouldn’t be able to account for these baseline differences, and even regression adjustment would be prone to bias given that the two groups (teams that fired their coaches and those that didn’t) are different from one another on several of the covariates that we would want to use in a model. Moreover, regression would be sensitive to model choice, and like most applications of statistics, the *true* model specification is unknown.

Here’s where causal inference comes in.

The probabilities depicted above are examples of propensity scores, defined as the conditional probability of receiving a treatment (in our case, of a team firing its coach). A nice property of propensity scores is that if two teams have the same propensity score, they also have, in expectation, the same distribution of observed covariates. This is really important. More technically, the distribution of covariates, conditional on the propensity score, is independent of whether or not a team chose to fire its coach.
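To make the idea concrete, here’s a rough Python sketch of propensity score estimation on made-up data. The covariates, sample size, and data-generating story are all hypothetical, and the hand-rolled gradient descent simply stands in for the logistic regression routine any stats package provides:

```python
import math
import random

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Fit a logistic regression by gradient descent; w[0] is the intercept."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(iters):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            grad[0] += p - yi
            for j, xj in enumerate(xi):
                grad[j + 1] += (p - yi) * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def propensity(w, xi):
    """Fitted probability of treatment (here: of a team firing its coach)."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

random.seed(1)
X, y = [], []
for _ in range(400):
    win_p = random.uniform(0.1, 0.9)      # hypothetical prior win percentage
    div_win_p = random.uniform(0.0, 1.0)  # hypothetical divisional win percentage
    # Assumed story: worse teams are more likely to fire their coach.
    p_fire = 1.0 / (1.0 + math.exp(-(2.0 - 6.0 * win_p)))
    X.append((win_p, div_win_p))
    y.append(1 if random.random() < p_fire else 0)

w = fit_logistic(X, y)
scores = [propensity(w, xi) for xi in X]  # one propensity score per team-season
```

With data generated this way, teams that fired their coach should, on average, carry higher fitted propensity scores than teams that didn’t.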

The next critical part of propensity scores ties back to our potential outcome notation from earlier. Let’s assume that, conditional on the propensity score, the distribution of the potential outcomes is independent of treatment assignment (here, whether or not a team fired its coach), an assumption known as unconfoundedness. In other words, if I can find two teams with the same coach firing probability, where only one team actually fired its coach, the difference in those teams’ outcomes is an unbiased, unit-level estimate of the effect of firing a coach. Moreover, I can aggregate those differences across groups (say, every team that fired its coach) to provide an estimate of the causal effect of firing a coach (in this case, the effect of firing a coach among teams that actually fired their coach).

These properties of propensity scores have made them widely applicable in fields like economics and government. There haven’t been many applications to sports, however, much to my chagrin.

The propensity score allows us to estimate the missing potential outcome that we don’t observe.

One way of doing this is to use matching, in which subjects receiving the treatment (those that fired a coach) are matched to those that didn’t. Using the `Matching` package in R, I matched teams that fired their coach to those that didn’t. Here’s the same plot as above, only now I use different colors (and shadings) to reflect observations that were and were not matched.

A few things to point out.

First, I used 1:1 matching with replacement, meaning that each coach who was fired (bottom row) was matched to one who wasn’t (top row), but it was possible for kept coaches to be matched to more than one fired coach. Second, the set of coaches with a high probability of being fired who were actually fired (bottom right, in red) ended up not being part of my matched cohort. By and large, this is a good thing; there was no kept coach with a corresponding probability of being fired, and inference to this set of coaches would require extrapolation.
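The post uses the `Matching` package in R, but the core of 1:1 nearest-neighbor matching with replacement is simple enough to sketch directly. Here’s a minimal Python version; the propensity scores below are made up purely for illustration:

```python
def match_with_replacement(treated_scores, control_scores):
    """For each treated unit, return the index of the control unit with the
    closest propensity score; controls may be matched more than once."""
    matches = []
    for ps in treated_scores:
        nearest = min(range(len(control_scores)),
                      key=lambda j: abs(control_scores[j] - ps))
        matches.append(nearest)
    return matches

# Hypothetical propensity scores for fired-coach (treated) and kept-coach (control) teams:
fired_ps = [0.80, 0.55, 0.30]
kept_ps = [0.10, 0.33, 0.52, 0.78]
matched = match_with_replacement(fired_ps, kept_ps)
print(matched)  # -> [3, 2, 1]: each fired coach paired with its nearest kept coach
```

Real implementations add refinements (calipers, ties, exact matching on some covariates), but the nearest-neighbor idea is the heart of it.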

But matching alone is not sufficient for inferring causes and effects. The next step is to make sure that the matching has done its job. Specifically, matching only works if the subjects matched to one another boast similar distributions of the observed covariates.

There are several ways to analyze covariate balance, and one of the more common ones compares the standardized bias for each covariate between the treatment groups, done for both the pre- and post-matched observations. Large values of standardized bias are bad – generally, the recommended cutoff for justifiable inference is 0.25 – and reflect groups that are not similar to one another.
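The standardized bias itself is easy to compute: the difference in group means, scaled by a standard deviation. A short Python sketch, using hypothetical prior win percentages for the two groups (note that scaling by the pooled standard deviation is one common convention; some authors scale by the treated group’s standard deviation instead):

```python
import math

def standardized_bias(x_treated, x_control):
    """Absolute difference in group means, scaled by the pooled standard deviation."""
    mt = sum(x_treated) / len(x_treated)
    mc = sum(x_control) / len(x_control)
    vt = sum((x - mt) ** 2 for x in x_treated) / (len(x_treated) - 1)
    vc = sum((x - mc) ** 2 for x in x_control) / (len(x_control) - 1)
    return abs(mt - mc) / math.sqrt((vt + vc) / 2)

# Hypothetical prior win percentages, before matching:
fired = [0.25, 0.31, 0.38, 0.44, 0.25]
kept = [0.50, 0.56, 0.63, 0.69, 0.47]
bias = standardized_bias(fired, kept)
```

With these made-up numbers the bias lands far above the 0.25 cutoff, which is exactly the pre-matching imbalance that motivates matching in the first place.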

Here are the pre- and post-matched absolute standardized biases for our matched cohort.

Each dot above reflects a variable from our logistic regression model (those recommended by Harrison and Mark). For example, the standardized bias of team win percentage (abbreviated as `win_p`) was roughly 1.5 in the pre-matched set of teams; after matching, the bias dropped below 0.15. In fact, the absolute standardized bias of all variables was sufficiently close to 0 after matching. This is a good thing; it entails that within our matched subset, teams that did and did not fire their coach are similar to one another (similar winning percentages, rings, GM changes, etc).

One important thing to point out is that the actual fit of the propensity score model is less important than the balance that is achieved: in other words, I’m worried less about things like collinearity and model fit statistics than I am about how similar the subjects matched to one another are. In our example, it looks like the teams that fired their coach and the ones that didn’t who ended up in our matched cohort are sufficiently alike.

Notice that I have yet to mention any observed, team-level outcome. This is not by accident; indeed, the above steps are considered the design phase of causal inference, done without looking at any outcome data.

The second step of causal inference is the analysis phase, in which the outcome of interest is contrasted within the matched cohort. For our purposes, I used the team’s winning percentage in the year following the firing or keeping of a coach.

There are a few reasonable approaches to estimating the effect of coach firings on future winning percentage. One oft-recommended option is to use regression and matching together. Writes Stuart, “matching methods should not be seen in conflict with regression adjustment and in fact the two methods are complementary and best used in combination.”

With future team win percentage as my outcome, I fit a linear model with coach firing (yes/no) and 10 other predictors as covariates; these were the same 10 used by Glickman and Chase in their model of coach firings.

Turns out, not only is there no evidence that firing a coach causes future success; if anything, the association runs the other way. In our matched cohort, teams that kept their coach boasted a slightly higher (3.7%) winning percentage than those that fired their coach (p-value = 0.08). Notably, this estimate of 3.7% is relatively robust to model specification.

**Extrapolating, our best estimate of the causal effect of firing a coach is about -0.6 wins in the following season, but given our uncertainty, it is unclear if this finding is due to chance or if there’s some true, net loss in the year following a coach firing. **

Hopefully this walk-through provides readers a rough introduction to how causal inference tools can be used, as well as the steps involved. You can repeat the analysis yourself using the code here, and if you are more familiar with causal tools, feel free to play around.

Some final thoughts:

- Those familiar with causal inference will notice I did not detail all of the assumptions required. One such assumption is positivity, which I think holds because it’s safe to assume that each team had a non-zero chance of firing its coach. Another is SUTVA, which I’m less confident about. As an example, it seems reasonable to argue that one team’s choice to fire its coach ties into the potential outcomes of other teams.

- This post on visualizing covariate balance is really interesting, and would have saved me several hours of thesis writing. In fact, I wish I had seen it before I started writing the above post.

- You could certainly make the case that a team’s future win percentage in the year following a coach firing is not the best outcome. I chose the one-year outcome, as once you go to more than a year, things could get a bit dicey regarding our assumptions (i.e., if a team fires coaches in two consecutive years).

- If this were a more technical paper, I’d want to look at other variables related to the choice of firing a coach. As one example, a popular post-hoc tool in causal inference is sensitivity analysis, where some of the assumptions mentioned above are put to the test.


Lately, there has been some discussion about choosing between the extra point kick and the 2-point conversion, as well as the criteria NFL coaches should use in different situations when deciding plays. The most common argument I read is “this play has more expected points so it’s better in a long run.” While expected points give us some information about the value of our choice, I’ll point out that we should try to compare how our choices affect win probability because that is the ultimate outcome.

So, let’s play a game where we have two conversion options and they are the only way to score. For the sake of simplicity, assume we have to choose before the game which conversion type we are going to use. At the end, we can compare which conversion strategy leads to more points more often.

Let’s say that the number of conversion attempts per game, *n*, is between 1 and 6, and that both teams will have same amount of conversion attempts per game. Here are the conversion options with known probabilities:

1-point conversion: success probability p1 = 94.5%, expected points = 0.945

2-point conversion: success probability p2 = 47.5%, expected points = 0.95

*n* = conversion attempts per game, with *n* in {1, 2, 3, 4, 5, 6}

Which one should teams use and why?

The easy answer is that based on expected points criteria, we should always choose the 2-point conversion, as 0.95 > 0.945.

But let’s see what happens when we compare it to the lower expected points choice (the extra point) using a more technical approach.

Let X be a binomial random variable with parameters (*n*, p1 = 0.945), and let Y be a binomial random variable with parameters (*n*, p2 = 0.475). Our interest lies in the difference Z = X − 2Y, a random variable that reflects the difference in point totals between teams taking each strategy.

Now the expected value of Z is negative, which still indicates that 2-point conversion choice is better in expectation. However, we *should* be interested in the probability of Z being positive versus negative with different *n* values. In other words, because we are interested in predicting which team will win more often, we are more interested in P(Z > 0) and P(Z < 0).

As it turns out, whether or not a team should choose the 2-point conversion (e.g., whether or not P(Z > 0) > P(Z < 0)) actually varies by *n*.

For *n* = 1, the 1-point strategy wins, with the 1-point team winning 49.6% of the time, relative to 47.5% of the time for the 2-point team. At *n* = 2, however, it’s actually reversed, with the 2-point strategy preferred (27.9% versus 27.5%).
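These percentages can be verified by enumerating the joint distribution of the two binomials. A short Python sketch:

```python
from math import comb

def pmf(n, p, k):
    """Binomial probability of exactly k successes in n attempts."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def win_probs(n, p1=0.945, p2=0.475):
    """Return P(1-point team wins), P(tie), P(2-point team wins) when
    X ~ Bin(n, p1) scores 1 point per success and Y ~ Bin(n, p2) scores 2."""
    one, tie, two = 0.0, 0.0, 0.0
    for x in range(n + 1):
        for y in range(n + 1):
            pr = pmf(n, p1, x) * pmf(n, p2, y)
            if x > 2 * y:      # Z = X - 2Y > 0: 1-point team wins
                one += pr
            elif x < 2 * y:    # Z < 0: 2-point team wins
                two += pr
            else:              # Z = 0: tie
                tie += pr
    return one, tie, two

for n in range(1, 7):
    one, tie, two = win_probs(n)
    print(n, round(one, 3), round(tie, 3), round(two, 3))
```

Running this reproduces the numbers above: 49.6% versus 47.5% at *n* = 1, and 27.5% versus 27.9% at *n* = 2.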

Here is a chart comparing these two strategies for different *n*. The area in red depicts the winning percentage for a team always taking the 1-point strategy, green is the fraction of tie games, and blue is the winning percentage for the team taking the 2-point strategy.

Indeed, the correct strategy actually depends on the number of conversion attempts we get per game. The 1-point team wins out for *n* = 1, 3, and 5, while the 2-point team wins out for *n* = 2, 4, and 6.

Interestingly, if the expected values are made identical, it turns out that the 1-point strategy dominates under these rules and assumptions. One might argue that I conveniently chose my numbers so that the option with the smallest expected points would have the highest win percentage, but my main point was to show that an edge in expected points does not automatically lead to more wins in the long run, as some people seem to believe. In reality, the extra point kick and the 2-point conversion probably have very close expected point values, and this is how they should be compared if we knew the exact conversion probabilities, which of course we don’t.

The conversion probabilities chosen above for the 1-point and 2-point attempts are actually fairly similar to the estimates that we have for the extra point kick and 2-point conversion in the NFL today, so I would argue that kicking the extra point is not so bad after all, even with the slight expected points disadvantage it might have. But if the 2-point conversion rate starts to get closer to 50%, which for some teams it might already have, that becomes the better strategy.

Of course, it’s always easy to compare these things with exact numbers, but in reality there is a lot of uncertainty about these conversion rates, and they depend on various factors which are hard to measure precisely. As Michael (Lopez) pointed out to me, that uncertainty makes these strategies basically coin flips, which just emphasizes the importance of trying to choose plays that maximize our win probability given the score state of the game. We should always try to model how our choices affect win probability, not just look at raw expected points, which might sometimes lead to wrong choices.

Michael also linked me to this great article by Mark Taylor, where this same concept is discussed in the context of a soccer xG model:

http://thepowerofgoals.blogspot.fi/2014/02/twelve-shots-good-two-shots-better.html

*Juho Jokinen is a former pro ice hockey player from Finland and a current math/statistics student (BSc math and MSc statistics) at the University of Oulu, Finland. This is his second season following the NFL, as football is a marginal sport in Finland. Follow him on Twitter @jokinen_juho.*


This increased awareness has led to an exorbitant number of players being rested by their teams, as shown in Baxter’s tweet below.

With several star players now spending a few games a year on the bench, this begs the question: Is the NBA’s regular season too long?

Relative to the NHL, NFL, and MLB, the answer is a resounding yes.

**********************

There is no obvious mechanism for finding an ideal schedule length or for comparing the schedule lengths of different sports. In one notable example from 2007, Phil and some commenters used standard deviations of team win percentages in an informal back and forth to suggest that 33 NBA games was the rough equivalent of 162 baseball games. That conversation grew out of a few economics papers, which linked schedule length to issues of competitive balance.*

Ultimately – and putting business considerations aside – a season is too long if adding more games does little to distinguish measurements of team strength. Of course, if those additional games were to change our perceptions of team ability, then one could argue that a season was too short.

Generally, team strength can be looked at by using won-loss percentage as a proxy.** And so one simplified approach to looking at season length would compare a team’s performance at any given point in a season to their win percentage at season’s end. So that’s exactly what I did, using data from the four North American professional sporting leagues.

The chart below shows the R-squared value comparing league-wide won-loss percentages at each point in a season to eventual won-loss percentages at season’s end.*** I used the last decade of data, excluding the current NFL season and the NBA/NHL lockout years.

Instead of using game number on the x-axis (which would vary by sport), I used percent of the season. For example, the 50% mark corresponds to 8, 41, 41, and 81 games for each team in the NFL, NBA, NHL, and MLB, respectively. The slow convergence to 1 is expected, as a team’s win percentage will more closely correspond to its final win percentage as the season progresses.

In any case, the graph identifies similarities in the NHL, NFL, and MLB. For each league, across most points in the season (based on percentage of the season played), win percentages track eventual year-long performance to a similar degree. One benefit of R-squared is that it’s interpretable: above, it reflects the fraction of variability in year-end win percentage explained by win percentage at each point prior. As an example, roughly 75% of the variability in season-long win percentage is explained by the first 55% of the season in the NFL, NHL, and MLB.

The NBA curve, meanwhile, stands out, rising quickly above the other leagues. We hit that 75% mark in explaining season-long win percentage by about the 25% mark on the x-axis, for example, reflecting about 20 games played. As a reference point, we’ve already passed that point in the 2016-17 season. Alternatively, within just a dozen or so games, we can explain about 50% of season-end variability in win percentage.

Altogether, if you are okay with year-end win-percentage as your measurement of team strength, the NFL’s 16-game schedule, the NHL’s 82-game schedule, and MLB’s 162 game schedule roughly match up in terms of equitable season length. The NBA’s season, meanwhile, reveals far more information at relatively earlier points in time.
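One way to build intuition for these curves is simulation. Here’s a rough Python sketch that simulates a single league-season of coin-flip games (the team-strength spread is an assumption, not an estimate from data) and traces out the R-squared curve described above:

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def season_r2_curve(n_teams=30, games=82, seed=7):
    """Simulate one season of Bernoulli game outcomes; return, for each game
    number g, the R-squared of win pct through g games vs. final win pct."""
    rng = random.Random(seed)
    strengths = [rng.uniform(0.3, 0.7) for _ in range(n_teams)]  # assumed spread
    results = [[1 if rng.random() < s else 0 for _ in range(games)]
               for s in strengths]
    finals = [sum(r) / games for r in results]
    curve = []
    for g in range(1, games + 1):
        partial = [sum(r[:g]) / g for r in results]
        curve.append(r_squared(partial, finals))
    return curve

curve = season_r2_curve()
```

By construction the curve reaches 1 at the final game, and rerunning with different season lengths (e.g., `games=16` vs. `games=162`) gives a feel for how quickly early-season records lock in the final standings.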

**********************

At what season length would the NBA be comparable to other leagues?

One way to consider this option is to sample smaller numbers of NBA games, pretend that sample represents the full season, and repeat the same analysis above. Turns out, 20 games yielded patterns consistent with those found in the other three leagues. Here’s the chart:

Using samples of 20 games, the R-squared path over the course of the season in the NBA roughly matched those from the other three leagues. In other words, the NBA could lose over three-fourths of its season and it still wouldn’t have a relatively shorter season than the other three leagues.

**********************

A few postscripts worth mentioning:

-These were easy curves to make, so much so that I worry someone has already made them. If that’s the case, please forward so I can cite appropriately.

-Given that won-loss percentages in the NBA would shrink towards 0.500 the more and more teams rest star players, the method above could actually underestimate how much the NBA stands out.

-If I had more time, I’d bootstrap for standard errors. Welcome to the end of a semester of teaching.

-This is but a brief tangent from a longer project that I am working on with Ben and Greg. Stay tuned for more – and hopefully better – ways of making these types of comparisons.

**********************

Footnotes:

*See work from Rodney Fort, David Berri, and Brad Humphreys, among others. I also liked this paper from Julian Wolfson and Joe Koopmeiners, which looks at similar issues using more complex models.

**There are several reasons that win-loss percentage is flawed, but it’s the simplest metric for purposes of a blog post. Among other reasons, won-loss percentage is impacted by unbalanced schedules (like playing in an easy division or a tough division) and which teams you end up playing at home. Thus, outside forces can impact won-loss percentage and skew our findings in unknown directions.

***R-squared’s not great, either. It can be unduly impacted by one or two observations, for example. However, given that the fit between current win percentage and end-of-year win percentage is likely fairly linear, I’m hopeful that this issue is not a problem. One alternative approach would use won-loss percentage in a predictive model (e.g., predict the team with the higher win percentage would win). Perhaps for another day.


In that regard, I figured it was worth a quick investigation. In this post, I’ll suggest that the link between one play call and the next, at least early in a game, is a bit stronger than I thought it would be.

*******************

The importance of a run-pass balance is a common football narrative. And because coaches want to appear balanced between the run and the pass by the end of a game, they may also feel the need to appear balanced between the run and the pass in small samples of plays. If a coach calls three run plays in a row, he may fear looking *too* committed to the run-game, or, even worse, *too* predictable for the defense.

Of course, it’s not just football. If it exists, an evening up of play types would reflect more general human misconceptions rooted in probability. It’s why when we play rock-paper-scissors, we rarely use the same throw three times in a row. If you aren’t gonna throw rock after throwing rock-rock in rock-paper-scissors, you probably aren’t gonna run after calling run-run during a football game. And a similar bias also impacts sport officials. In the NHL, for example, referees calling violations on one team are more likely to call the next penalty on that team’s opponent, no matter the game’s score. Just like coaches want to appear balanced, so too do referees.

While a large-scale predictive model of opponent play calls would be one of the first things I would do as an NFL team analyst (see this example or this one), it may not be the most straightforward way to look at whether or not coaches even up play calls. In particular, decisions made as the game progresses are particularly tied to the score. And from my perspective, although the approaches shown in the links above include a term to test for an autocorrelation of play calls, the exact effect remains unknown.

To reduce the impact of other play and game characteristics, I’ll start as simple as possible, by only looking at a team’s first few offensive plays in a game.

*******************

Per usual, I’ll use the play-by-play data provided by Armchair Analysis, which includes each play from 2000-2015. To limit the effect of field position, I only included drives that started between the 10-yard lines, and I dropped penalties to focus on the remaining runs and passes.

Here’s a chart of run percentages on each team’s second play, varied by the play-type of the first play. The error bars account for our uncertainty in each probability estimate.

Teams run more often after they pass, and they do so significantly more often – an absolute difference of about 12%. On a relative scale, teams are about 25% more likely to run when their first play was a pass.
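For readers curious how a difference like this gets an error bar, here’s a Python sketch using a normal-approximation confidence interval for a difference in proportions. The counts below are hypothetical, chosen only to mirror the roughly 12% absolute gap; they are not the Armchair Analysis totals:

```python
import math

def diff_in_proportions(k1, n1, k2, n2, z=1.96):
    """Difference p1 - p2 with a normal-approximation 95% confidence interval."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, (diff - z * se, diff + z * se)

# Hypothetical counts: runs on the second play after a first-play pass vs. after a run.
diff, (lo, hi) = diff_in_proportions(590, 1000, 470, 1000)
print(round(diff, 2), round(lo, 3), round(hi, 3))
```

With samples of this size, the entire interval sits above zero, which is the sense in which a gap like 12% is “significantly” different from no gap.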

That said, savvy readers may have noticed that if rushes and passes tended to produce different types of second plays (e.g., different yards to go), such a comparison wouldn’t make sense.

But we can look closer. Here’s the same chart above, faceted by the down and distance of the second play (2nd & short: 3 yards or less, 2nd & medium: 4-6 yards, 2nd & long: 7 yards or more).

For 2nd & shorts (bottom right), there’s no obvious difference in the likelihood of running based on the initial play call. Teams tend to run the ball here.

Among the other play types, and in particular on 2nd & medium and 2nd & long, there remains a significant difference in how an offense calls its plays given what it just called. On 2nd & long, for example, teams rushed 44% more often (an absolute difference of 19 percentage points) after passing on first down. That’s an enormous effect.

Of course, there may be other factors at play. Perhaps teams that fail with one play type feel the need to try the other on the second play. But if you’re varying your play calls based on the first play of the game (literally, that’s the only play on the x-axis), that’s a whole other issue worth writing about.

*******************

But we can also go beyond just the game’s first two plays. Here’s a histogram of the number of rush attempts among the first four offensive plays for each team in each game. The red bars reflect what we’d expect if teams picked four play types (runs and passes) out of a hat (using a run probability of 49%); the black bars reflect what we actually see in the data.

The higher black bar in the middle highlights that in the first four plays of the game, coaches make more of an effort to call exactly two runs and two passes (about 46% of the time) than what we’d expect due to chance (37% of the time). Along similar lines, while we’d expect about 13 in 100 sequences of four plays to include *all* rushes or *all* passes, that only happened about 7 in 100 times in the data. Altogether, this matches our conclusion from above; coaches are a bit more balanced than we’d expect them to be if they were randomly dialing up plays.
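The “out of a hat” baseline is just a binomial distribution, so the red-bar expectations quoted above can be reproduced directly (using the same 49% run probability):

```python
from math import comb

p = 0.49   # league-wide run probability on these early plays
n = 4      # first four offensive plays of the game

def binom_pmf(k: int) -> float:
    """P(exactly k runs in n independent play calls)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

two_and_two = binom_pmf(2)                  # exactly two runs, two passes
all_same = binom_pmf(0) + binom_pmf(4)      # all passes or all runs

print(f"exactly 2 runs / 2 passes: {two_and_two:.0%}")  # about 37%
print(f"all runs or all passes:    {all_same:.0%}")     # about 13%
```

These match the chance baselines in the text: roughly 37 in 100 sequences with exactly two of each, and roughly 13 in 100 with all rushes or all passes.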

*******************

Offensive play-callers are probably better at designing plays than we give them credit for. Schemes are enormously complex, and the amount of detail that goes into a gameplan can be awe-inspiring.

But during a game, when faced with split-second (well, 40-second) decisions, it’s natural for those same play-callers to revert to predictable tendencies. The evidence above suggests that, all else being equal, runs are more likely to follow early-game passes, and passes are more likely to follow early-game runs.


What’s the optimal strategy? It’s a tough question, so I posed this on Twitter.

Roughly 50% of my respondents (overall, an analytics-friendly crowd) answered that, yes, teams should go for two, with the remaining voters equally split between “No” and “It depends.”

In this post, I’ll suggest that, at least empirically, it hasn’t made a ton of difference one way or the other.

********

In considering the optimal two-point strategy with a seven-point lead, we can start by looking at how often teams have come back when trailing by seven, eight, or nine points. While there are hundreds of games where teams have scored and kicked an extra-point to build exactly a seven-point lead late in the game, it’s a bit dicier to find examples of teams scoring and taking a seven-point lead *before* kicking the extra point. Using Armchair Analysis’ data, for example, there were just 88 such examples between 2000 and 2015.

So instead of looking at only those 88 games, I expanded the analysis to include any game where a team took possession in the final eight minutes of the fourth quarter, between 10 and 40 yards from its own goal, while down 7, 8, or 9 points. In essence, this adds about 1,300 contests (roughly 1,400 total) that should be equivalent to a team trailing late in the game after just giving up a touchdown.
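That expanded sample reduces to a simple drive-level filter. A minimal sketch of the criteria is below; the function’s arguments and units are hypothetical stand-ins, not the actual Armchair Analysis schema.

```python
def qualifies(seconds_left: int, yards_from_own_goal: int, deficit: int) -> bool:
    """Flag drives in the expanded sample: a possession starting in the
    final eight minutes of the 4th quarter, between the team's own 10-
    and 40-yard lines, while trailing by 7, 8, or 9 points.
    (Argument names/units are hypothetical, not the real data schema.)"""
    return (seconds_left <= 8 * 60
            and 10 <= yards_from_own_goal <= 40
            and deficit in (7, 8, 9))

# A drive starting with 5:30 left, at the own 25, down 8 -> included
print(qualifies(330, 25, 8))    # True
# Same spot but down 10 -> excluded (not a one-score-plus-two game)
print(qualifies(330, 25, 10))   # False
```

The same predicate, applied to the full play-by-play data, is what yields the roughly 1,400 qualifying possessions.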

Here’s how the games eventually played out. The chart below shows the fraction of times that the winning team held on, depending on the size of their lead. The size of each dot is proportional to the number of games with teams in those situations. I also used two colors to vary when the offensive team started its possession.

Teams ahead by seven points have won about 86% of the time when the trailing team starts a possession with between 4 and 8 minutes left, a number that jumps to 89% when up eight and 94% when up nine. This makes sense: the larger the lead, the more likely the win.

And there’s a similar increase for teams getting the ball in the final four minutes of a contest (shown in red). In fact, in the 94 games when a team has started a defensive possession with fewer than 4 minutes left when ahead by exactly nine points, they’ve won all 94 times. That isn’t to say that teams can’t lose when ahead by this margin – they’ve lost when up by 10, for example – but it’s quite unlikely. A two-possession lead late in the game is really hard to overcome.

********

We can use the probabilities above to outline a strategy of whether or not to attempt the two-point conversion.

**For teams scoring with between 4 and 8 minutes left, we are left with the following calculation:**

**Go for two (assuming a 50% chance of a successful conversion): **

50% chance to get a 94% chance of a win + 50% chance to get an 86% chance of a win = *Win 90% of the time*

**Kick:**

*Win 89% of the time.*

Using these numbers, there’s a *slight* advantage to going for the two-possession lead by attempting the two-point conversion. Given the associated errors that come with these probabilities (the margins of error in the graph, for example, are about 4 percentage points), this difference is not statistically meaningful.

**For teams scoring with between 0 and 4 minutes left, we use the following calculation:**

**Go for two: **

50% chance at a 99% chance of a win (best guess) + 50% chance at an 89% chance of a win = *Win 94% of the time*

**Kick: **

*Win 93% of the time.*

Again, very little difference, and not a statistically meaningful one.
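Both comparisons reduce to the same expected-value arithmetic, sketched here with the empirical win rates from the charts above:

```python
def go_for_two(win_if_make: float, win_if_miss: float,
               p_convert: float = 0.50) -> float:
    """Expected win probability when attempting the two-point conversion,
    assuming a 50% conversion rate (as in the text)."""
    return p_convert * win_if_make + (1 - p_convert) * win_if_miss

# 4-8 minutes left: up 9 -> 94% win rate, up 7 -> 86%; kicking (up 8) -> 89%
early = go_for_two(0.94, 0.86)   # 0.90

# Under 4 minutes: up 9 -> 99% (best guess), up 7 -> 89%; kicking -> 93%
late = go_for_two(0.99, 0.89)    # 0.94

print(f"4-8 min left: go for two {early:.0%} vs. kick 89%")
print(f"0-4 min left: go for two {late:.0%} vs. kick 93%")
```

In both windows the gap between strategies is about one percentage point, well inside the margins of error noted above.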

Altogether, there’s little empirical evidence to suggest that teams should attempt the two-point conversion late in the game when up seven. While there may be a slight advantage to the more aggressive strategy, it does not appear to be an overwhelming one. Relative to more common scenarios that coaches often screw up – like punting on 4th and 1 near midfield – the decision to attempt a late-game conversion appears to be a minor one.

********

Extra points:

-Some readers may have noticed that the recent increase in extra point distance should be part of the discussion. That may be true. However, while it’s now more likely than before that the leading team misses an extra point that would give it an eight-point lead, it’s also more likely that the trailing team misses a game-tying kick if it were to score when down seven.

-I’ve seen frequent suggestions that teams should vary their decisions based on the caliber of their defense. As one example:

This is fair, but two things to keep in mind. First, when a strong defensive team like Denver goes for two, the benefit of the two-possession lead looms even larger! No way the Chiefs score on *two* drives last night.

Second, team strength probably doesn’t matter as much as you think. As part of work I did last year for SI.com, I looked at both the game’s point spread and team offensive and defensive efficiency metrics from Football Outsiders as they related to two-point success. While the game’s point spread was a significant predictor (favored teams converted more often), neither the offensive team’s strength alone, nor the defensive team’s strength alone, factored into two-point success. Team-specific probabilities of successful conversions were almost always between 40 and 60 percent, with most of those differences accounted for by the game’s point spread.

-I split game minute into two categories above: 0-4 minutes left and 4-8 minutes left. I tried similar splits and they told a similar story.

-It’s worth noting that simply splitting games by deficit alone would be troublesome if there were differences in the team strength among those leading by 7, 8, or 9 points (e.g., if the Patriots and Seahawks always led by 9 points). Judging by the game’s point spread, however, this didn’t seem to be the case. The teams leading late by 7, 8, and 9 points were relatively similar in terms of team strength.

