In the past decade, sports analytics moved from the fringes of popular consciousness to the mainstream. The typical media narrative tells us that data is changing the game. To some extent, that’s true. The majority of professional teams in the five major sports leagues have at least one person on staff or on retainer tasked with delving into details and applying numbers to performance, and nearly all NBA, NHL and MLB franchises have sent at least one representative to the Sloan Sports Analytics Conference.

Noah and I wanted to find out more details about the job, the lifestyle and how analytics are being used, so we developed an informal survey and asked people who work or had worked on the sports analytics staffs of professional teams to participate. We used a combination of social media and personal email to contact staffers who we knew worked with teams or who were mentioned in ESPN’s analytics feature. A total of 61 respondents answered questions anonymously. A pdf of the survey can be found here. We also interviewed a half-dozen by phone, either on or off the record, to get anecdotes about what it’s like to be part of a professional franchise.

Some of the responses were predictable. Our survey wasn’t perfect, and our sample isn’t necessarily representative of the industry, but the respondents were 95 percent male and 95 percent white, and 92 percent fell into the 19-to-45 age group. Additionally, respondents from Major League Baseball teams report the highest average number of full-time staffers working for their teams – 3.6 – compared with roughly two apiece per team in the NFL, NHL, and NBA. (These numbers could be skewed because teams without any current or former analytics staffers would not be able to respond to the survey.) That makes sense, considering that everyone we spoke with believes that baseball teams are generally the furthest along in the development of their analytics departments. But professional teams across all major sports are increasingly investing in their analytics departments, slowly but surely adding to their budgets as executives place more trust in numbers.

Respondent breakdown:

NBA 18

MLB 16

NHL 7

NFL 6

Professional soccer 5

Other/multisport 4

(Five respondents left “sport” blank.)

**For the love of the game**

Working for a team isn’t a 9-to-5 job. The hours are long, with analytics staffers reporting that they work anywhere from an average of 53 hours per week in soccer to 66 in the NHL, with MLB, the NFL and the NBA averaging 60. (One MLB staffer reported working 95 hours per week.) “There are no holidays,” Bill Petti, a consultant for a number of MLB teams, said about the life of an analytics staffer. “You’re working nights. You’re sitting in the office at 10 p.m. in case somebody has a question.”

Aaron Barzilai, who worked as director of analytics at the Philadelphia 76ers until last February, agreed. “Salaries are depressed. You can’t be working for a team if you don’t love it because you need to be getting some psychic benefit from working for a team.”

Salaries are decent and occasionally well above those of the typical white-collar worker. According to our survey, medians ranged from $75,000 (NFL) to $100,000 (MLB) per year. Three respondents said their annual wages were greater than $200,000. For most, it’s not a bad living, but consider that nearly anyone with the skills to land one of the few analytics jobs on a professional sports team could make significantly more at Google, Facebook, Microsoft, or dozens of other firms. This leads to lots of turnover, especially in junior positions, which pay less and offer little, if any, opportunity to advance into a more senior role because those jobs rarely open up. Many recent college graduates work for teams for a few years before moving on to tech firms or other companies where they can earn more for their skill set.

**Determining value**

The people we spoke with said that teams undervalue their analytics staff and invest accordingly. Any reasonable employee would say that their department deserves more money and resources, but it’s not unreasonable to think that more computing power or another data set could produce results that are cheap by comparison. “They are spending hundreds of millions on players, tens of millions on coaches and staff, but $10,000 is a large expenditure to get a computer or some data,” said a sports analytics expert who’s worked with NBA teams. “It’s ridiculous. It’s two different budgets.” (For what it’s worth, one study found that the average price of an MLB win is $1,016,674 in player salary, an NBA win is $1,572,768, and an NFL win is $11,878,369.)

As some teams mature and develop systems to handle routine work like data gathering and reporting, they may be able to think about building staffs to tackle more ambitious projects. Those that don’t will find themselves hitting a wall. In the past, teams could get away with Excel, a computer and a staffer or two. Advances in sports analytics mean that groundbreaking work now requires more talent, more computing power and more time to experiment.

One NHL analyst described his dream staff:

*“You need a couple different people unless you are a one-man team willing to put in 20-hour days. Just the data handling alone — the NHL is pretty archaic in their data to begin with — is one full-time person. If you wanted to expand from coaching and tactical to GM trades and to the draft/strategic long term, you need one data person, two or three analytical people whose job is formulating and communicating analysis, and then I would have two or three developers working on dashboards and tool sets. If you’re going to give a GM a sheet of paper with a recommendation, he may or may not pay attention. But if you give him a tool that he can play with, then it’s not your idea. It’s his idea.”*

A staff of one wouldn’t have the resources to develop and build that tool.

But rather than hire for a number of positions, many teams still seek unicorns: one person who can fill all of those roles and who also has the skills to communicate the results to others. This Marlins job posting asks for an intern with scripting ability, database management skills and statistical proficiency. Those skills sometimes overlap, but an actual combination of all three is rare, and most people with such skills can make a substantial amount of money elsewhere. And keep in mind, baseball is *ahead* of most other sports. A college student with this background can get paid $7,200 or more a month, plus housing, to work at Microsoft or Facebook, or make $12.50 an hour for the Marlins.

**Nothing matters if no one is listening**

Finally, analytics can only be effective if the decision makers use what they are given. “There seems to be too much focus on results, and not enough focus on the quality of the process,” one respondent said.

Some analysts expressed concern that teams didn’t pay attention to their work.

“If the GM or president of basketball operations doesn’t want to read the results, it doesn’t really matter how talented your Ph.D. in the basement is,” said Barzilai. “You can have organizations that are using analytics well even if they aren’t doing cutting-edge analytics just by relying on what people might think of as fundamental analytics or the stuff that was coming out five years ago or stuff that is public.”

Another added that it’s frustrating: “When the numbers are so overwhelmingly in favor of one decision and it doesn’t happen due to someone’s feelings about public perception or a ‘well, it’s always been this way’ attitude.”

Consider The New York Times’s Fourth Down Bot, a simple formula that tells readers when teams should go for it on fourth down. The bot believes that coaches are too conservative and that they would almost universally benefit from going for it more often. If a coaching staff listened to the bot, they’d benefit in the long term.

Generally, the decision makers in the front office are more receptive than others to input from the analytics staffers. In the NBA, NFL and NHL, at least 50 percent report weekly correspondence with the general manager, while just 20 percent in MLB do. Just 10 percent of the overall sample says they have weekly correspondence with players.

Team officials say they are continually working to refine the processes, to incorporate all the information they receive. “We don’t see ourselves as having an ‘analytics team’ or ‘process of incorporating analytics,’” an assistant general manager of an NHL team said. “We look at the best information we have when making decisions, and everything’s assimilated similarly whether it’s one scout’s eye test or another (or even the same) scout’s or an analyst’s tracking data or historical comparisons that some might call ‘analytics.’”

Some teams are better than others at applying what they learn. According to our respondents, the Spurs (NBA), Maple Leafs (NHL), Dolphins and Browns (NFL) are leading the analytics push in the less-than-quick-to-adapt leagues, with the Browns the most open about their ambitions. Said one: “A team like Toronto is doing it by the book in terms of how analytics should be impacting teams. They are … eating everyone’s lunch. It will pay off in the next two to five years and teams will start to say wow, look at their talent pool. Teams will start copying them.”

Despite analytics’ increasing role and media attention, it’s still early days. There are algorithms to write, data to dissect and knowledge to create. We asked respondents to say what percentage of the most important questions in their sport have been answered. The results, averaged by sport:

MLB – 56 percent

NBA – 38 percent

NHL – 32 percent

NFL – 31 percent

Soccer – 17 percent

Sports analytics has developed a great deal over the past few decades, but there’s still plenty more to discover.


The photos begin after page 177. The insert is perfect for putting names to faces, but its location also means that if you read too fast, you’ve missed my favorite part of the book.

In the pages before and after the pictures, Lindbergh and Miller link their situation in running the Stompers’ season to quotes made by Huston Street, the Angels closer who was asked about pitching in non-save situations.

“I’ll retire if [pitching in non-save situations] ever happens,” Street is quoted as saying. “It’s a ridiculous idea, it really is.”

The quote hits home for Lindbergh and Miller, who, until that point in the season, had similarly been using their closer, Sean Conroy, only in save situations. Although Lindbergh and Miller had wanted to bring Conroy in during any high-leverage spot, the team’s manager, Feh Lentini, was to that point resistant. Lindbergh, author of this particular chapter, writes [emphasis mine]:

*That last comment really rankles, because it cuts to the heart of why we’ve come to Sonoma: to put “on paper” ideas like this into practice. Thus far, it seems as if those who side with Street are right. Not because the idea of less restrictive roles doesn’t work — it did work, before league bullpens became hyper-specialized — but because everyone is so convinced it wouldn’t work that they aren’t even willing to try it. The idea is disqualified because it can’t pass a test that no one will allow it to take.*

Street’s comments weigh on Lindbergh for days, leading to friction with Feh and, eventually, the manager’s firing halfway through the Stompers’ season.

Altogether, Lindbergh and Miller’s book is worth reading. It’s the game all of us stats-nerds wish we had the chance to play – actually calling the shots, instead of just commenting on them. And the authors make the characters and events personal. You’re rooting for Conroy, you’re rooting for a random infield shift to work, you’re rooting for Christopher Long’s spreadsheet to spit out the best possible players, and you’re rooting for a Stompers title, all things you otherwise wouldn’t have known existed.

But it was Lindbergh’s summary of Street’s comments – and the personal and oft-tempestuous back and forth between baseball tradition and analytics strategy – that has stayed with me months after reading the book.

Indeed, statistical findings have changed the way in which teams across baseball, and to a lesser extent other sports, have assessed free agents, implemented game-day strategy, and drafted future players. Stat-heads are also a relative bargain, as Rob and Ben suggested here, boasting strong ties to future improvement.

It’s not that one can guarantee that implementing analytical strategies will lead to success, for the same reason that Lindbergh and Miller couldn’t guarantee that every infield shift would work to the Stompers’ advantage. Moreover, not every statistical idea once assumed to be true will end up being correct. But it’s the actuality of implementing a test – trying something to see whether or not it will work, and being able to live either way with the consequences – that eats at us in sports analytics on a near daily basis. Don’t disqualify an idea because you’re not willing to try it.

And as a result, it’s when the most basic of tests can’t be taken that frustration boils over.

Statistics can beat the smartest of us at chess and poker, know which friend you’ll connect with on Facebook, predict what you’ll buy on Amazon, and finish what you’re searching for on Google. If implemented properly, it will also allow sports teams to make better decisions across nearly all facets of the game.

It just needs the ability to take the test.

*You can read more about Ben and Sam’s book here, or buy it on Amazon here.*


The 2016 NFL regular season has ended, and with it has come the usual coaching carousel in which many franchises have opted to fire their head coach.

As of January 2, six of the league’s 32 teams have openings, with five of those coming by way of a fired predecessor (Denver’s retiring Gary Kubiak being the lone exception). But it’s not like 2016 is any type of outlier; roughly four coaches per year have been canned since the early 1980s.

What’s interesting, though, is that despite the frequent, franchise-altering decisions made across the league, it’s mostly unknown whether or not this choice benefits long-term franchise prospects. (*Postscript: Today, Brian Burke looks at the identical question here, finding answers similar to what I find below.*) As one exception to the rule, one study found that sacking a manager in soccer offered no tangible benefit to the future performance of a club.

So, does firing a coach cause teams to improve?

In this post, I’ll look back at past firings and use some standard causal inference tools to help identify whether the choice to fire a coach has been a helpful one.

Estimating causes and effects when it comes to coach firings, unfortunately, is in no way straightforward.

The easiest strategy would be to compare the performance of franchises that fired their coaches in the seasons before and after the firings. For example, since 1982, the 130 teams that fired their coach (using end-of-season firings) boasted an average improvement in winning percentage of 0.10, or the equivalent of about 1.6 games in a 16-game season. That’s a notable and statistically significant improvement.

Of course, that simple strategy is also a misleading one. The teams that got rid of their head coach only averaged about 5 wins per season prior to the firing, so on account of reversion toward the mean, we would’ve expected most of these teams to improve anyway.

We can and should do better.

Let’s introduce some causal inference lingo.

In an ideal world we’d observe two outcomes: (i) the future performance of a team that fired its coach and (ii) the future performance of that same team had it kept its coach. These are termed potential outcomes, and if we knew both potential outcomes, it would of course be easy to pick an optimal strategy.

Alas, short of building a time machine, knowing both potential outcomes is infeasible, and we’re only left with knowing the path chosen by each franchise. This is what’s known as the fundamental problem of causal inference; we want to be able to contrast an observed outcome with something that can’t be observed.

As it turns out, this also makes causal inference a missing data problem – the missing data is the missing potential outcome. In our case, for a team that fired its coach, the missing outcome is the path that would have been observed had that team kept its coach. Likewise, for a team that kept its coach, the missing outcome is what would’ve occurred had the coach been canned.

Causal inference tools, initially stemming from Jerzy Neyman’s work with randomized designs in the 1920s, have become quite popular for estimating these missing potential outcomes. Under certain – but important – assumptions, if we can estimate the missing potential outcomes, we can likewise estimate the causes and effects of a treatment, including from observational data.

The most popular causal tools are individual or full matching, subclassification, and weighting, each of which has its own strengths and weaknesses. In the sections below, I’ll overview how to use 1:1 matching with a data set of NFL coach firings.

If you are in search of a broader look at causal inference tools, I’d start with Elizabeth Stuart’s excellent review in Statistical Science.

The data I’m using comes courtesy of Harrison Chase and Kurt Bullard, former and current members of the Harvard Sports Analytics Club. Along with Harvard professor Mark Glickman, Harrison helped write an article on coaching turnover in sports, published recently in Significance Magazine. In addition to their data, their model assessing when teams fire their coaches was the impetus behind this post.

Harrison and Mark used a combination of logistic regression and classification trees to fit a model of coach firing (Yes/No) as a function of several team-level covariates. Their final model includes, but is not limited to, each team’s past win percentage, divisional win percentage, the coach’s experience, strength of schedule in the prior season, the number of rings that the coach averaged, and whether or not the team also experienced a GM change, chosen from roughly 25 candidate covariates.

Using their final variables, I used logistic regression to model each coach firing decision between 1982 and 2015. Here are those fitted probabilities from that model, separated by the teams that did and did not fire their coach. Points are jittered to account for overlap.
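
As a rough sketch of how that step might look in R (the data frame `coach_df` and the covariate names below are placeholders, not the actual columns from the Chase and Bullard data set):

```r
# Estimate each team-season's probability of firing its coach (the propensity
# score) with logistic regression. Covariate names are illustrative only.
ps_model <- glm(fired ~ win_p + div_win_p + coach_exp + sos_prev + gm_change,
                data = coach_df, family = binomial)

# Fitted probabilities serve as estimated propensity scores
coach_df$pscore <- fitted(ps_model)
```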

Altogether, the chart isn’t surprising. Most teams in most years aren’t firing their coach, and these teams are shown in the cluster of points in the top left of the graph. Meanwhile, teams that fire their coach tend to have predicted probabilities evenly spaced between 0 and 0.9.

At this point, we know what we suspected to begin with; the teams that fired their coach are, by and large, different from those that did not. This is a problem for most statistical tools. Basic comparisons like *t*-tests wouldn’t be able to account for these baseline differences, and even regression adjustment would be prone to bias given that the two groups (teams that fired their coaches and those that didn’t) differ from one another on several of the covariates that we would want to use in a model. Moreover, regression would be sensitive to model choice, and like most applications of statistics, the *true* model specification is unknown.

Here’s where causal inference comes in.

The probabilities depicted above are examples of propensity scores, defined as the conditional probability of receiving a treatment (in our case, of a team firing its coach). A nice property of propensity scores is that if two teams have the same propensity score, they also have, in expectation, the same distribution of observed covariates. This is really important. More technically, the distribution of covariates, conditional on the propensity score, is independent of whether or not a team chose to fire its coach.

The next critical part of propensity scores ties back to our potential outcome notation from earlier. Let’s assume that, conditional on the propensity score, the set of potential outcomes is independent of whether or not a team fired its coach, an assumption known as unconfoundedness. In other words, if I can find two teams with the same coach-firing probability, where only one team actually fired its coach, the difference in those teams’ outcomes is an unbiased, unit-level estimate of the effect of firing a coach. Moreover, I can aggregate those differences across groups (say, every team that fired its coach) to provide an estimate of the causal effect of firing a coach (in this case, the benefit of firing a coach among teams that actually fired their coach).

These properties of propensity scores have made them widely applicable in fields like economics and government. There haven’t been many applications to sports, however, much to my chagrin.

The propensity score allows us to estimate the missing potential outcome that we don’t observe.

One way of doing this is to use matching, in which subjects receiving the treatment (those that fired a coach) are matched to those that didn’t. Using the `Matching` package in R, I matched teams that fired their coach to those that didn’t. Here’s the same plot as above, only now I use different colors (and shadings) to reflect observations that were and were not matched.
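
A minimal sketch of that matching step, continuing with the propensity scores and placeholder names from the sketch above, might look like this:

```r
# 1:1 propensity-score matching with replacement, estimating the effect
# among teams that fired their coach (ATT).
library(Matching)

m_out <- Match(Tr = coach_df$fired,   # treatment indicator: coach fired (1) or kept (0)
               X  = coach_df$pscore,  # match on the estimated propensity score
               M  = 1,                # one control per treated team-season
               replace  = TRUE,       # kept coaches may be matched more than once
               estimand = "ATT")

# Rows of the matched treated and control team-seasons
matched <- coach_df[c(m_out$index.treated, m_out$index.control), ]
```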

A few things to point out.

First, I used 1:1 matching with replacement, meaning that each coach who was fired (bottom row) was matched to one who was kept (top row), but kept coaches could be matched to more than one fired coach. Second, the set of coaches with a high probability of being fired who were actually fired (bottom right, in red) ended up not being part of my matched cohort. By and large, this is a good thing; there was no kept coach with a corresponding probability of being fired, and inference to this set of coaches would require extrapolation.

But matching alone is not sufficient for inferring causes and effects. The next step is to make sure that the matching has done its job. Specifically, matching only works if the subjects matched to one another boast similar distributions of the observed covariates.

There are several ways to analyze covariate balance, and one of the more common ones compares the standardized bias for each covariate between the treatment groups, done for both the pre- and post-matched observations. Large values of standardized bias are bad – generally, the recommended cutoff for justifiable inference is 0.25 – and reflect groups that are not similar to one another.

Here are the pre- and post-matching absolute standardized biases for our matched cohort.

Each dot above reflects a variable from our logistic regression model (those recommended by Harrison and Mark). For example, the standardized bias of team win percentage (abbreviated as `win_p`) was roughly 1.5 in the pre-matched set of teams; after matching, the bias dropped below 0.15. In fact, the absolute standardized bias of all variables was sufficiently close to 0 after matching. This is a good thing; it entails that within our matched subset, teams that did and did not fire their coach are similar to one another (similar winning percentages, rings, GM changes, etc.).
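
Continuing the sketch above, the balance check itself is short; `MatchBalance()` from the same package reports standardized differences before and after matching, or you can compute them by hand (again, names are placeholders):

```r
# Balance diagnostics for the matched cohort
bal <- MatchBalance(fired ~ win_p + div_win_p + coach_exp + sos_prev + gm_change,
                    data = coach_df, match.out = m_out, nboots = 500)

# Or compute an absolute standardized bias by hand for a single covariate
std_bias <- function(x, tr) {
  abs(mean(x[tr == 1]) - mean(x[tr == 0])) / sd(x[tr == 1])
}
std_bias(coach_df$win_p, coach_df$fired)  # before matching
std_bias(matched$win_p,  matched$fired)   # after matching
```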

One important thing to point out is that the actual fit of the propensity score model is less important than the balance that is achieved: in other words, I’m worried less about things like collinearity and model fit statistics than I am about how similar the subjects matched to one another are. In our example, it looks like the teams that fired their coach and the ones that didn’t who ended up in our matched cohort are sufficiently alike.

Notice that I have yet to mention any observed, team-level outcome. This is not by accident; indeed, the above steps are considered the design phase of causal inference, done without looking at any outcome data.

The second step of causal inference is the analysis phase, in which the outcome of interest is contrasted within the matched cohort. For our purposes, I used the team’s winning percentage in the year following the firing or keeping of a coach.

There are a few reasonable approaches to estimating the effect of coach firings on future winning percentage. One oft-recommended option is to use regression and matching together. Writes Stuart, “matching methods should not be seen in conflict with regression adjustment and in fact the two methods are complementary and best used in combination.”

With future team win percentage as my outcome, I fit a multiple linear regression model with coach firing (yes/no) and 10 other predictors as covariates; these were the same 10 used by Glickman and Chase in their model of coach firings.
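
As a sketch of that analysis-phase model, continuing with a subset of placeholder covariates from above (and noting that, with 1:1 matching with replacement, the duplicated control rows in `matched` act as implicit weights):

```r
# Regression adjustment within the matched cohort: next season's win
# percentage regressed on the firing indicator plus the same covariates.
fit <- lm(next_win_p ~ fired + win_p + div_win_p + coach_exp + sos_prev + gm_change,
          data = matched)

summary(fit)$coefficients["fired", ]  # estimated effect of firing the coach
```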

Turns out, not only is there no evidence that coach firing causes future success, if anything, it’s an inverse association. In our matched cohort, teams that kept their coach boasted a slightly higher (3.7%) winning percentage than those that fired their coach (p-value = 0.08). Notably, this estimate of 3.7% is relatively robust to model specification.

**Extrapolating, our best estimate of the causal effect of firing a coach is about -0.6 wins in the following season, but given our uncertainty, it is unclear if this finding is due to chance or if there’s some true, net loss in the year following a coach firing.**

Hopefully this walk-through provides readers a rough introduction to how causal inference tools can be used, as well as the steps involved. You can repeat the analysis yourself using the code here, and if you are more familiar with causal tools, feel free to play around.

Some final thoughts:

- Those familiar with causal inference will notice I did not detail all of the assumptions required. One such assumption is positivity, which I think holds because it’s safe to assume that each team had a non-zero chance of firing its coach. Another is SUTVA, which I’m less confident about. As an example, it seems reasonable to argue that one team’s choice to fire its coach ties into the potential outcomes of other teams.

- This post on visualizing covariate balance is really interesting, and would have saved me several hours of thesis writing. In fact, I wish I had seen it before I started writing the above post.

- You could certainly make the case that a team’s future win percentage in the year following a coach firing is not the best outcome. I chose the one-year outcome, as once you go to more than a year, things could get a bit dicey regarding our assumptions (i.e., if a team fires coaches in two consecutive years).

- If this were a more technical paper, I’d want to look at other variables related to the choice of firing a coach. As one example, a popular post-hoc tool in causal inference is sensitivity analysis, where some of the assumptions mentioned above are put to the test.


Lately, there has been some discussion about choosing between the extra point kick and the 2-point conversion, as well as the criteria NFL coaches should use in different situations when deciding plays. The most common argument I read is “this play has more expected points so it’s better in the long run.” While expected points give us some information about the value of our choice, I’ll point out that we should try to compare how our choices affect win probability, because that is the ultimate outcome.

So, let’s play a game where we have two conversion options and they are the only way to score. For the sake of simplicity, assume we have to choose before the game which conversion type we are going to use. At the end, we can compare which conversion strategy leads to more points more often.

Let’s say that the number of conversion attempts per game, *n*, is between 1 and 6, and that both teams will have the same number of conversion attempts per game. Here are the conversion options with known probabilities:

1-point: p1 = 94.5%, expected points 0.945

2-point: p2 = 47.5%, expected points 0.95

*n* = conversion attempts per game, with *n* in {1, 2, 3, 4, 5, 6}

Which one should teams use and why?

The easy answer is that based on expected points criteria, we should always choose the 2-point conversion, as 0.95 > 0.945.

But let’s see what happens when we compare it to the lower expected points choice (the extra point) using a more technical approach.

Let X be a binomial random variable with parameters (*n*, p1 = 0.945), and let Y be a binomial random variable with parameters (*n*, p2 = 0.475). Our interest lies in the difference Z = X - 2Y, which reflects the difference in point totals between teams taking each strategy.

Now, the expected value of Z is negative, which still indicates that the 2-point conversion is better in expectation. However, we *should* be interested in the probability of Z being positive versus negative for different values of *n*. In other words, because we are interested in predicting which team will win more often, we are more interested in P(Z > 0) and P(Z < 0).

As it turns out, whether or not a team should choose the 2-point conversion (e.g., whether or not P(Z > 0) > P(Z < 0)) actually varies by *n*.

For *n* = 1, the 1-point strategy wins, with the 1-point team winning 49.6% of the time, compared with 47.5% of the time for the 2-point team. At *n* = 2, however, it’s actually reversed, with the 2-point strategy being preferable (27.9% versus 27.5%).
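
These probabilities are easy to compute exactly; here is a short R sketch that enumerates every (X, Y) combination for a given *n*:

```r
# P(1-point team wins), P(tie), and P(2-point team wins) for Z = X - 2Y,
# where X ~ Binomial(n, 0.945) and Y ~ Binomial(n, 0.475).
p1 <- 0.945
p2 <- 0.475

compare_strategies <- function(n) {
  x <- 0:n
  y <- 0:n
  joint <- outer(dbinom(x, n, p1), dbinom(y, n, p2))  # P(X = x, Y = y)
  z <- outer(x, 2 * y, "-")                           # point differential X - 2Y
  c(one_point_wins = sum(joint[z > 0]),
    tie            = sum(joint[z == 0]),
    two_point_wins = sum(joint[z < 0]))
}

round(sapply(1:6, compare_strategies), 3)  # reproduces the 49.6% vs. 47.5% split at n = 1
```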

Here is a chart comparing the two strategies across different values of *n*. The area in red depicts the winning percentage for a team always taking the 1-point strategy, green is the fraction of tie games, and blue is the winning percentage for the team taking the two-point strategy.

Indeed, the correct strategy actually depends on the number of conversion attempts we get per game. The 1-point team wins out for *n* = 1, 3, and 5, while the 2-point team wins out for *n* = 2, 4, and 6.

Interestingly, if the expected values are identical, it turns out that the 1-point strategy dominates all other strategies under these rules and assumptions. One might argue that I conveniently chose my numbers so that the option with the smallest expected points would have the highest win percentage, but my main point was to show that having an expected points edge does not automatically lead to more wins in the long run, as some people seem to believe. And in reality, an extra point kick and a 2-point conversion probably have very similar expected points values, and this is how they should be compared if we knew the exact conversion probabilities, which of course we don’t.

The conversion probabilities chosen above for the 1-point and 2-point attempts are actually fairly similar to the estimates we have for the extra point kick and the 2-point conversion in the NFL today, so I would argue that kicking the extra point is not so bad after all, even with the slight expected points disadvantage it might have. But if the 2-point conversion rate starts to get closer to 50%, which for some teams it might already have, that becomes the better strategy.

Of course, it’s always easy to compare these things with exact numbers, but in reality there is a lot of uncertainty about these conversion rates, and they depend on various factors that are hard to measure precisely. As Michael (Lopez) pointed out to me, that uncertainty makes these strategies basically coin flips, which just emphasizes the importance of trying to choose the plays that maximize our win probability given the score state of the game. We should always try to model how our choices affect win probability and not just look at raw expected points, which might sometimes lead to wrong choices.

Michael also sent me this great article by Mark Taylor, in which the same concept is discussed in the context of a soccer xG model; you can find it here:

http://thepowerofgoals.blogspot.fi/2014/02/twelve-shots-good-two-shots-better.html

*Juho Jokinen is a former pro ice hockey player from Finland and a current math/statistics student (BSc math and MSc statistics) at the University of Oulu, Finland. This is his second season following the NFL, as football is a marginal sport in Finland. Follow him on Twitter @jokinen_juho.*


This increased awareness has led to an exorbitant number of players being rested by their teams, as shown in Baxter’s tweet below.

With several star players now spending a few games a year on the bench, this raises the question: Is the NBA’s regular season too long?

Relative to the NHL, NFL, and MLB, the answer is a resounding yes.

**********************

There is no obvious mechanism for finding an ideal schedule length or for comparing the schedule lengths of different sports. In one notable example from 2007, Phil and some commenters used standard deviations of team win percentages in an informal back and forth to suggest that 33 NBA games was the rough equivalent of 162 baseball games. That conversation grew out of a few economics papers, which linked schedule length to issues of competitive balance.*

Ultimately – and putting business considerations aside – a season is too long if adding more games does little to distinguish measurements of team strength. Of course, if those additional games were to change our perceptions of team ability, then one could argue that a season was too short.

Generally, team strength can be measured by using won-loss percentage as a proxy.** And so one simplified approach to looking at season length would compare a team’s performance at any given point in a season to its win percentage at season’s end. So that’s exactly what I did, using data from the four North American professional sports leagues.

The chart below shows the R-squared value comparing league-wide won-loss percentages at each point in a season to eventual won-loss percentages at season’s end.*** I used the last decade of data, excluding the current NFL season and the NBA/NHL lockout years.

Instead of using game number on the x-axis (which would vary by sport), I used percent of the season. For example, the 50 percent mark corresponds to 8, 41, 41, and 81 games for each team in the NFL, NBA, NHL, and MLB, respectively. The slow convergence to 1 is expected, as a team’s win percentage will more closely correspond to its final win percentage as the season progresses.
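
For the curious, here is a rough sketch of the calculation for a single league, assuming a data frame `games` with one row per team-game and columns `team`, `season`, `game_no`, and `win` (the names are illustrative, not the actual data):

```r
library(dplyr)

r2_by_game <- games %>%
  arrange(team, season, game_no) %>%
  group_by(team, season) %>%
  mutate(win_p_so_far = cummean(win),   # win percentage through each game
         win_p_final  = mean(win)) %>%  # end-of-season win percentage
  ungroup() %>%
  group_by(game_no) %>%
  summarise(r_squared = cor(win_p_so_far, win_p_final)^2)

# game_no can then be rescaled to percent of the season for cross-league plots
```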

In any case, the graph identifies similarities in the NHL, NFL, and MLB. For each league, across most points in the season (based on percentage of the season played), teams’ records to date track their eventual year-long performance to a similar degree. One benefit of R-squared is that it’s interpretable: above, it reflects the fraction of variability in year-end win percentage explained by win percentage at each earlier point. As an example, roughly 75% of the variability in season-long win percentage can be explained through the first 55% of the season in the NFL, NHL, and MLB.

The NBA curve, meanwhile, stands out, rising quickly above the other leagues. We hit that 75% mark in explaining season-long win percentage by about the 25% mark on the x-axis, for example, which reflects about 20 games played. As a reference point, we’ve already passed that point in the 2016-17 season. Alternatively, within just a dozen or so games, we can explain about 50% of the season-end variability in win percentage.

Altogether, if you are okay with year-end win percentage as your measurement of team strength, the NFL’s 16-game schedule, the NHL’s 82-game schedule, and MLB’s 162-game schedule roughly match up in terms of equitable season length. The NBA’s season, meanwhile, reveals far more information at relatively earlier points in time.

**********************

At what season length would the NBA be comparable to other leagues?

One way to consider this option is to sample smaller numbers of NBA games, pretend that sample represents the full season, and repeat the same analysis above. Turns out, 20 games yielded patterns consistent with those found in the other three leagues. Here’s the chart:

Using samples of 20 games, the R-squared path over the course of the season in the NBA roughly matched those from the other three leagues. In other words, the NBA could lose over three-fourths of its season and it still wouldn’t have a relatively shorter season than the other three leagues.
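
A sketch of that resampling idea, reusing the pipeline from the earlier sketch (again with placeholder names):

```r
# Draw 20 games per NBA team-season, keep them in chronological order, and
# treat the sample as a "full season" before rerunning the R-squared curve.
set.seed(1)

nba_sampled <- nba_games %>%
  group_by(team, season) %>%
  slice_sample(n = 20) %>%
  arrange(game_no, .by_group = TRUE) %>%
  mutate(game_no = row_number()) %>%  # renumber games 1 through 20
  ungroup()

# feed nba_sampled through the r2_by_game calculation above (and repeat the
# sampling a few times to smooth out noise)
```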

**********************

A few postscripts worth mentioning:

-These were easy curves to make, so much so that I worry someone has already made them. If that’s the case, please forward so I can cite appropriately.

-Given that won-loss percentages in the NBA would shrink toward 0.500 as more and more teams rest star players, the method above could actually underestimate how much the NBA stands out.

-If I had more time, I’d bootstrap for standard errors. Welcome to the end of a semester of teaching.

-This is but a brief tangent from a longer project that I am working on with Ben and Greg. Stay tuned for more – and hopefully better – ways of making these types of comparisons.

**********************

Footnotes:

*See work from Rodney Fort, David Berri, and Brad Humphreys, among others. I also liked this paper from Julian Wolfson and Joe Koopmeiners, which looks at similar issues using more complex models.

**There are several reasons that win-loss percentage is flawed, but it’s the simplest metric for purposes of a blog post. Among other reasons, won-loss percentage is impacted by unbalanced schedules (like playing in an easy division or a tough division) and which teams you end up playing at home. Thus, outside forces can impact won-loss percentage and skew our findings in unknown directions.

***R-squared’s not great, either. It can be unduly impacted by one or two observations, for example. However, given that the fit between current win percentage and end-of-year win percentage is likely fairly linear, I’m hopeful that this issue is not a problem. One alternative approach would use won-loss percentage in a predictive model (e.g., predict the team with the higher win percentage would win). Perhaps for another day.


In that regard, I figured it was worth a quick investigation. In this post, I’ll suggest that the link between one play call and the next, at least early in a game, is a bit stronger than I thought it would be.

*******************

The importance of a run-pass balance is a common football narrative. And because coaches want to appear balanced between the run and the pass by the end of a game, they may also feel the need to appear balanced between the run and the pass in small samples of plays. If a coach calls three run plays in a row, he may fear looking *too* committed to the run-game, or, even worse, *too* predictable for the defense.

Of course, it’s not just football. If it exists, an evening up of play types would reflect more general human misconceptions rooted in probability. It’s why when we play rock-paper-scissors, we rarely use the same throw three times in a row. If you aren’t gonna throw rock after throwing rock-rock in rock-paper-scissors, you probably aren’t gonna run after calling run-run during a football game. A similar bias also impacts sports officials. In the NHL, for example, referees calling violations on one team are more likely to call the next penalty on that team’s opponent, no matter the game’s score. Just like coaches want to appear balanced, so too do referees.

While a large-scale predictive model of opponent play calls would be one of the first things I would build as an NFL team analyst (see this example or this one), it may not be the most straightforward way to look at whether or not coaches even up play calls. In particular, decisions made as the game progresses are closely tied to the score. And from my perspective, although the approaches shown in the links above include a term to test for an autocorrelation of play calls, the exact effect remains unknown.

To reduce the impact of other play and game characteristics, I’ll start as simple as possible, by only looking at a team’s first few offensive plays in a game.

*******************

Per usual, I’ll use the play-by-play data provided by Armchair Analysis, which includes each play from 2000-2015. To limit the effect of field position, I only included drives that started between the 10-yard lines, and I dropped penalties to focus on the remaining runs and passes.

Here’s a chart of run percentages on each team’s second play, varied by the play-type of the first play. The error bars account for our uncertainty in each probability estimate.

Teams run significantly more often after they pass – an absolute difference of about 12%. On a relative scale, teams are about 25% more likely to run when their first play was a pass.
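
For those who want to reproduce the comparison, a minimal sketch, assuming a data frame `plays2` with one row per second offensive play, a `first_play` column ("run" or "pass"), and a `second_run` indicator (the names are illustrative):

```r
library(dplyr)

plays2 %>%
  group_by(first_play) %>%
  summarise(n        = n(),
            run_rate = mean(second_run),
            se       = sqrt(run_rate * (1 - run_rate) / n))  # for the error bars
```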

That said, savvy readers may have picked up on the fact that if rushes and passes were to result in different types of second plays (e.g., different yards to go), such a comparison wouldn’t make sense.

But we can look closer. Here’s the same chart above, faceted by the down and distance of the second play (2nd & short: 3 yards or less, 2nd & medium: 4-6 yards, 2nd & long: 7 yards or more).

For 2nd & shorts (bottom right), there’s no obvious difference in the likelihood of running based on the initial play call. Teams tend to run the ball here.

Among the other play types – in particular, on 2nd & medium and 2nd & long – there remains a significant difference in how an offense calls its plays given what it just called. On 2nd & long, for example, teams rushed 44% more often (an absolute difference of 19%) after passing on first down. That’s an enormous effect.

Of course, there may be other things at play. Perhaps teams failing at one play type (rush or pass) feel the need to try another play type (pass or rush) on the second play. But if you’re feeling the need to vary your play calls based on the first play of the game (literally, that’s the only play on the x-axis), that’s a whole other issue to write about.

*******************

But we can also go just beyond the game’s first two plays. Here’s a histogram of the number of rush attempts using the first four offensive plays for each team in each game. The red bars reflect what we’d expect if teams were to pick four play types (runs and passes) out of a hat (using a run type probability of 49%); the black bars reflect what we see in the data.

The higher black bar in the middle highlights that in the first four plays of the game, coaches make more of an effort to call exactly two runs and two passes (about 46% of the time) than what we’d expect due to chance (37% of the time). Along similar lines, while we’d expect about 13 in 100 sequences of four plays to include *all* rushes or *all* passes, that only happened about 7 in 100 times in the data. Altogether, this matches our conclusion from above; coaches are a bit more balanced than we’d expect them to be if they were randomly dialing up plays.
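
The binomial benchmark takes only a couple of lines; here is a sketch in which `first4` is a placeholder vector counting the runs (0 through 4) in each team-game’s first four plays:

```r
expected <- dbinom(0:4, size = 4, prob = 0.49)               # picking plays out of a hat
observed <- prop.table(table(factor(first4, levels = 0:4)))  # what the data show

round(rbind(expected, observed), 3)
```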

*******************

Offensive play-callers are probably better at designing plays than we give them credit for. Schemes are enormously complex, and the amount of detail that goes into a game plan can be awe-inspiring.

But during a game, when faced with split-second (well, 40-second) decisions, it’s natural for those same play-callers to revert to predictable tendencies. Based on the evidence above, all else being equal, runs are more likely to follow early-game passes, and passes are more likely to follow early-game runs.


What’s the optimal strategy? It’s a tough question, so I posed this on Twitter.

Roughly 50% of my respondents (overall, a more analytic-friendly crowd) answered that, yes, teams should go for two, with the remaining voters equally split between “No” and “It depends.”

In this post, I’ll suggest that, at least empirically, it hasn’t made a ton of difference one way or the other.

********

In considering the optimal two-point strategy with a seven-point lead, we can start by looking at how often teams have come back when trailing by seven, eight, or nine points. While there are hundreds of games where teams have scored and kicked an extra-point to build exactly a seven-point lead late in the game, it’s a bit dicier to find examples of teams scoring and taking a seven-point lead *before* kicking the extra point. Using Armchair Analysis’ data, for example, there were just 88 such examples between 2000 and 2015.

So instead of looking at those 88 games, I expanded the analysis to include any game where a team took possession in the final eight minutes of the fourth quarter between 10 and 40 yards from their own goal when down either 7, 8, or 9 points. In essence, this adds about 1300 contests (so 1400 total) that should be equivalent to a team trailing late in the game having just given up a touchdown.

Here’s how the games eventually played out. The chart below shows the fraction of times that the winning team held on, depending on the size of their lead. The size of each dot is proportional to the number of games with teams in those situations. I also used two colors to vary when the offensive team started its possession.

Teams ahead by seven points have won about 86% of games when starting a possession on defense with between 4 and 8 minutes left, a number that jumps to 89% when up eight and 94% when up nine points. This makes sense. If you have a larger lead, you are more likely to win.

And there’s a similar increase for teams getting the ball in the final four minutes of a contest (shown in red). In fact, in the 94 games in which a team started a defensive possession with fewer than 4 minutes left while ahead by exactly nine points, it won all 94 times. That isn’t to say that teams can’t lose when ahead by this margin – they’ve lost when up by 10, for example – but it’s quite unlikely. A two-possession lead late in the game is really hard to overcome.

********

We can use the probabilities above to outline a strategy of whether or not to attempt the two-point conversion.

**For teams scoring with between 4 and 8 minutes left, we are left with the following calculation:**

**Go for two (assuming a 50% chance of a successful conversion): **

50% chance to get a 94% chance of a win + 50% chance to get an 86% chance of a win = *Win 90% of the time*

**Kick:**

*Win 89% of the time.*

Using these numbers, there’s a *slight* advantage to going for the two-possession lead by attempting the two-point conversion. Given the associated errors that come with these probabilities (the margins of error in the graph, for example, are about 4%), this difference is not statistically meaningful.

**For teams scoring with between 0 and 4 minutes left, we use the following calculation:**

**Go for two: **

50% chance at a 99% chance of a win (best guess) + 50% chance at an 89% chance of a win = *Win 94% of the time*

**Kick: **

*Win 93% of the time.*

Again, very little difference, and not a statistically meaningful one.

Altogether, there’s little empirical evidence to suggest that teams should attempt the two-point conversion late in the game when up seven. While there may be a slight advantage to the more aggressive strategy, it does not appear to be an overwhelming one. Relative to more common scenarios that coaches often screw up – like punting on 4th and 1 near midfield – the decision to attempt a late-game conversion appears to be a minor one.

********

Extra points:

-Some readers may have identified that the recent increase in extra point distance should be part of the discussion. That may be true. However, while it’s now more likely than before that the leading team misses an extra point that would give it an eight-point lead, it’s also more likely that the trailing team misses a game-tying chance if it were to score when down seven.

-I’ve seen frequent suggestions that teams should vary their decisions based on the caliber of their defense. As one example:

This is fair, but two things to keep in mind. First, when a strong defensive team like Denver goes for two, the benefit of the two-possession lead looms even larger! No way the Chiefs score on *two* drives last night.

Second, team strength probably doesn’t matter as much as you think. As part of work I did last year for SI.com, I looked at both the game’s point spread and team offensive and defensive efficiency metrics from Football Outsiders as they related to two-point success. While the game’s point spread was a significant predictor (favored teams converted more often), neither the offensive team’s strength alone nor the defensive team’s strength alone factored into two-point success. Team-specific probabilities of successful conversions were almost always between 40 and 60 percent, with most of those differences accounted for by the game’s point spread.

-I split game minute into two categories above: 0-4 minutes left and 4-8 minutes left. I tried similar splits and they told a similar story.

-It’s worth noting that simply splitting games by deficit alone would be troublesome if there were differences in the team strength among those leading by 7, 8, or 9 points (e.g., if the Patriots and Seahawks always led by 9 points). Judging by the game’s point spread, however, this didn’t seem to be the case. The teams leading late by 7, 8, and 9 points were relatively similar in terms of team strength.

-Extra extra point:


-Michael Schuckers gave a talk summarizing the state of goalie research, including material that he’s working on for an upcoming book chapter. For the unfamiliar, the most common metric in evaluating NHL goaltenders is save percentage, which is limited in part because different goaltenders face different distributions of shots over the course of a season. Indeed, you could even have a Simpson’s Paradox scenario, where Goalie A is better at saving each type of shot than Goalie B, but Goalie B still ends up with a better save percentage overall. This upcoming book chapter will be a must-read.

-Schuckers also pushed for those in the audience to do whatever is necessary to get the NHL to share its tracking data. IMO, this is a no-brainer. The world of basketball is better for the brief look into this rich information that the NBA shared during the 2014-15 season and parts of the 2015-16 one. See, among other examples, this excellent tutorial on how to scrape and analyze player movements. The NHL’s lagging behind, and given the well-known flaws that the league has with scorer/rink biases, the potential is there for public analysts to answer some excellent questions and help grow the game.

-Rob Vollman gave a talk on roster construction, providing a glimpse into how rules of the CBA dictate who and what players are reasonable values. You can buy Rob’s book here, which presents this and other analytically driven research. The takeaway linking Rob’s and Schuckers’ talks: don’t give goalies massive contracts, as there’s too good of a chance they won’t be worth it.

-I gave a talk on how to use R for reproducible hockey research. Slides and code here (note: download the pdf of the slides if you are looking for links). There are very few hockey researchers who share both their code and data. It’d be better for everyone involved if we can change this.

-Cole Anderson presented work on an ELO-based player comparison tool, in which the hockey public can rank players. This makes sense, particularly given that traditional player rankings (say, a scale of 1-10) can lead to ambiguous numbers (like 7.7). Cole’s work appears similar to the surveys that 538 has used to, for example, rank James Bond villains or pick summer Olympic sports. Cole’s code is also in R, and available for your perusal here. Hope to see more out of this project.

-Eric Cantor gave a talk on roster construction, examining the Tampa Bay Lightning’s experience with seven active defensemen in place of the usual six. Eric’s evidence suggests that Tampa performed slightly better with the non-traditional construction.

-Ryan Davenport and Edwin Niederberger looked at shot locations from world tournaments, including the recent World Cup and the Olympic seasons. Relative to the rest of the world, the USA’s backline looks particularly ineffective offensively.

-Brian Carothers and Joseph Nelson gave two talks. First, the pair looked at quantifying defensemen given their hit and blocked shot totals, the slides of which are found here. Second, the pair led a Python workshop, the overview of which is linked here and the code of which is found here. Python and R are both free and powerful. You should learn (at least) one of them.

-Rob was asked how many teams are doing appropriate due diligence with respect to analytics. He guessed six, with most teams “nowhere close.”

-Billy Jaffe (New England Sports Network), Neil Abbott (player agent), and Ron Rolston (coach) led a panel, moderated by Babson’s Rick Cleary, which summarized their perspectives on how analytics have changed the game. Perhaps unsurprisingly, their biggest take-home is that these practitioners don’t care about what model you may have chosen or what (statistical) tools you needed to employ; they just want immediate, actionable, and simplified recommendations. This raises a question that wasn’t asked: what happens when those suggestions don’t match their prior viewpoints?

-Although each panelist likely loses more hockey knowledge in their sleep than I’ll ever learn, there was a bit too much selecting on the dependent variable for my taste. In other words, because Team X and Team Y have recently won the Stanley Cup, this is how all teams need to win the Stanley Cup. Hockey’s *way* too random for that.

-Perhaps because the conference was held in Boston, the Bruins’ 2011 Cup-winning team was held in particularly high regard. Two of the panelists, for example, praised the Bruins’ winning culture and development of a high-character locker room as drivers behind their success. Of course, if that Boston team was so good at hockey, why did it need seven games – and an overtime – just to get out of the first round? If Montreal had won that first-round series instead of Boston, would Boston still have had a winning culture and a high-character locker room?

-I missed a few other talks, but if those researchers or anyone else wants to share materials, please send them along! And many thanks to Luke, Rick, George, Michael, Rob, and the rest of the organizers for their great work.


However, there’s no evidence that any of this work has made a difference as far as team behavior goes. For example, in close games (two possessions or less) prior to the fourth quarter, teams went for it 6.4% of the time in 2015, nearly identical to the 6.5% in 2000. In fact, that number was as low as 4.8% in 2011.

But perhaps just as many of us analytically inclined fans were ready to give up, there seemed to be a few more aggressive plays in week 1 of the 2016 season, highlighted by Antonio Brown’s fourth-down touchdown grab on Monday night.

League-wide, was this a meaningful uptick in aggressiveness?

For now, the answer is a slightly unsatisfying *maybe*.

Using Armchair Analysis’ excellent data, I grabbed every fourth down play since the 2000 season. We’re interested in whether or not teams playing close games (two possessions or less) went for it on fourth down, defined as either attempting a rush or a pass. I filtered out fourth quarter plays, as decisions later in the game are too often dictated by game situation.

In week 1 of 2016, teams went for it on 4th down on 18 of a possible 160 attempts (11.2%). That’s the highest such percentage for week 1 games since 2000, and it’s not particularly close.

The following chart shows the weekly 4th down attempt percentage in these situations. I included a separate point for each season to give a sense of the season-to-season and the week-to-week variability. The smooth blue line reflects the trend, and the grey area reflects our uncertainty in the average fourth down attempt rate.
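
A chart along these lines can be sketched with ggplot2, assuming a data frame `weekly` with one row per season-week and a `go_rate` column (illustrative names, not the actual data):

```r
library(ggplot2)

ggplot(weekly, aes(x = week, y = go_rate)) +
  geom_point(alpha = 0.4) +          # one point per season-week
  geom_smooth(method = "loess") +    # smooth trend with an uncertainty band
  labs(x = "Week of season", y = "4th-down attempt rate in close games")
```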

The aggressiveness observed in week 1 of 2016 (the red point in the top left) is only exceeded on a week-level basis by 7 weeks total since 2000. Interestingly, the chart also points to the possibility that coaches are slightly more aggressive later in the year, as shown by the increasing trend by week of the season.

It’s safe to say that, given what is traditionally less-aggressive behavior during the early parts of the season, week 1 of 2016 stands out as unusual. That said, there are several caveats. This analysis doesn’t take into account other factors that undoubtedly are linked to fourth-down calls, including field position and opponent characteristics. For example, it’s certainly feasible that there just happened to be more 4th-down attempts in 4th-down friendly spots during week 1.

In any case, it’s certainly worth monitoring as the season progresses.


“I wouldn’t be surprised to see a run,” said ABC announcer Todd Blackledge, opining on what type of play the Longhorns should try. “However, I will say that second down is the down to throw if you want to throw.”

Blackledge’s comment is classic football-think, behavior that’s been suggested for decades with no known empirical basis. However, thanks to the supple data of Armchair Analysis, it’s the type of behavior that’s easy to check and quite possibly validate (at least using NFL data). Thus, the two questions I’ll attempt to answer are:

First, how do coaches call plays in goal-to-go situations? Second, how *should* they call plays in goal-to-go situations?

*******************************

Armchair’s database contains each NFL play since 2000, which I filtered down to goal-to-go plays that occurred on first through third downs. I also cut out fourth quarter plays, so as to worry less about the effects of varying late-game behavior.

How do coaches call plays? Here’s a barchart showing the percentage of plays which are passes, separated by down and distance (*Note*: called passes include sacks).

With one yard to go, roughly one in four plays are passes, which is roughly the same on first, second, and third downs. Across other distances, coaches are fairly consistent in their desire to run on first down and throw on third down, with second down decisions roughly a 50-50 split.

Interestingly, there is no noticeable spike in teams calling pass plays on second down, at least not relative to their behavior at other downs and distances. So, while football coaches love to talk about throwing the ball on second and goal, they aren’t necessarily acting that way.

*******************************

Perhaps the more interesting question is how coaches *should* call their plays in goal-to-go situations. The answer is not straightforward. For example, passing plays are more likely to yield touchdowns at longer distances, but they’re also more likely to yield negative plays and plays of no gain.

One possibility is to consider the drive’s eventual point total as the outcome and to work from there. Using the same set of plays, I used the drive’s result (categorized as a touchdown, field goal, or neither) to estimate the average point total given each play call at the various down and distances.
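
As a rough sketch of that calculation, assume a play-level data frame `gtg` with columns `down`, `dist`, `play_type`, and `drive_points` (7, 3, or 0 depending on the drive’s result); the names are illustrative:

```r
library(dplyr)

gtg %>%
  group_by(down, dist, play_type) %>%
  summarise(plays      = n(),
            exp_points = mean(drive_points),
            .groups    = "drop")
```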

Here’s a chart of expected points, separated by runs and passes.

Each point in the chart reflects the expected point total given a run or a pass at a certain down (1st, 2nd, or 3rd) and distance (x-axis). The size of the circle is proportional to the number of plays called at each point.

Interestingly, across most downs and distances, *running* plays offer slightly more of a return than passing plays. The differences aren’t overwhelming, but they are consistent, generally in the neighborhood of one or two tenths of an expected point. Notably, there’s no evidence to back up any claim that teams should pass the ball on second down – if anything, it’s the opposite.

These results are somewhat surprising given Kovash and Levitt’s seminal paper on NFL team behavior, which implies that teams should pass more than they currently do. One possibility is that the shorter field limits teams’ ability to throw the ball. Anecdotally, and as an example, teams seem to call way too many fade patterns.

*******************************

All together, what did we learn?

First, there is no obvious truth to the theory that teams *are* passing more on second down in goal-to-go situations than we would expect. Second, there’s no evidence for the theory that they *should* be passing more often than they already are.

If anything, there’s a drop in efficiency on passing plays in goal-to-go situations, which may be showing up in the form of fewer expected points. However, I’m cautious about reading too much into this conclusion, given the nature of this analysis (it’s aggregated, and doesn’t account for game- and play-specific factors) and the inherent difficulty in categorizing a team’s decision to run or pass.
