StatsbyLopez

On the thank you’s we never get to say

statsbylopez — Thu, 05 Dec 2019 18:33:11 +0000

When I was 6, he was my babysitter who made peanut butter and jelly sandwiches and I cried because there was too much jelly.

When I was 10, he recruited me to be a water boy for the football team, where he bribed me with pizza that had any topping I wanted and let me eat as much as I wanted (it, um, showed).

When I was 12, he strung my lacrosse stick. He strung everyone’s lacrosse stick.

When I was 14, he taught me how to bench, squat (the dude could squat three plates routinely!), and hang clean. Proper technique was a must, or you wouldn’t be invited back.

When I was 16, he was at school before any teachers were, opening up the weight room, training with us, spotting us, driving us to become better athletes. I’m not even sure he was getting paid — he did it for us. But he also stayed for every game, meaning that he was, quite literally, the first one in and the last one out of the high school parking lot.

When I was 17, he bought us shirts and hats that we proudly wore around school (Deter-Mina-Tion). He took pictures with us at prom.

When I was 18 and my two kneecaps popped out, he brought a tackling pad into the weight room so that I could rehab. When a sophomore couldn’t hold the pad, he did it himself. When I had one last lacrosse season left, he taped up two kneebraces every spring practice and game. I looked like Megatron.

When I was 19, I realized he was a better athletic trainer than what most colleges have.

When I was 22, he told me what car to buy and what tires to put on.

When I was 23, he made extra batches of chili and left it in my locker.

When I was 24, he told me about his new girlfriend. We all heard the name Shelly when her real name was Cheri and gosh it was cute. Ando and Cheri invited the old lifting club to their wedding.

When I started coaching, he told me what plays to call and what cornerback on the opposing team couldn’t cover man to man.

When I was 30, he came to my wedding. When Erin and I had our first daughter, he or Cheri dropped off a bin of clothes every few months that his own daughter had grown out of. For basically eight years straight it was like Christmas in July.

When it was anyone’s birthday, he’d always chime in. And the poor guy, even with less than a week to live, was still sending well wishes to others on Facebook.

With Yoshitaka Ando passing suddenly from cancer earlier this week, the first feeling I had was one of regret. Guilt that, 38 years later, I realize I hardly did anything for him. I didn’t make him any food, string any of his sticks, or help rehab any of his injuries. I didn’t say thank you enough, didn’t return enough favors, didn’t wish him happy birthday. That it took me so long to realize how lucky I was to have had him around. That, in retrospect, he spent his entire life giving to others, and I just took it for granted.

And yet, with Ando, that was the thing. He needed nothing in return.

Pains me that this is too late, but thanks, Ando. For the sandwiches, the pizza, the lacrosse sticks, the early morning weight room time, the shirts, the hat, the lax shorts that I still wear, the tape, the rehab, the chili, the car advice, the girl clothes, and the friendship. I’m a better athlete because of you, and a better person. We all are.

And I don’t know what the next time around will be, but one thing is for sure, I won’t wait so long to say thanks.

Note: You can donate to Ando’s Family Fund here: https://www.gofundme.com/f/ando-family-educational-fund

The blog has moved!

statsbylopez — Tue, 20 Mar 2018 22:58:07 +0000

To make it easier to share code and figures, I’ve started an RMarkdown blog, which you will find at http://statsbylopez.netlify.com/. All new blog posts will be shared at this new site.

I’m going to keep the WordPress site active for the time being, so past articles aren’t going anywhere. In the meantime, thanks for four years of reading and fun! Hopefully the next site will be a success.

Triggered – a look at the LSU judge sentencing paper

statsbylopez — Wed, 13 Sep 2017 18:18:21 +0000

One of the reasons I love the application of statistics to sports is the unique ways in which sports can help us better understand human behavior.

I was thus excited to read** a working paper out of LSU’s Ozkan Eren and Naci Mocan that looked at how the sentencing of Louisiana judges in juvenile court varied given the performance of the state’s favorite football team, the LSU Tigers. The paper can be read here – alternatively, check out SB Nation’s summary here.

Using regression-based approaches on court decisions between 1996 and 2012, the authors write:

We show that upset losses of the LSU football team increase disposition (sentence) length imposed by judges, and that this effect persists throughout the work week following a Saturday game. On the other hand, losses of games that were expected to be close contests ex-ante, as well as upset wins have no impact. We also find that judges’ reaction, triggered by an upset loss, is more pronounced after more important games (when LSU was ranked in top-10).

If true, such findings would and should have implications for our judicial systems. It would suggest that entities that are tasked with impartial behavior (judges) let their emotions get the best of them, even when based on a college football game.

To the best of my knowledge, this paper has not formally passed peer review, so I’ll give the authors the benefit of the doubt as far as working out any final kinks. That in mind, a few things stood out when reading that don’t pass the smell test.

1. Arbitrary cutpoints

The authors take continuous data (point spread, categorized as -4 or less, -3.5 to 3.5, or 4 or more) and turn it into categorical data (game type: expected win, close, expected loss) in order to run their statistical models. This process requires unjustifiable assumptions from a regression model standpoint, and can yield both positive and negative associations, among other issues. As an example of what can go wrong in sports when categorizing continuous data, read one here.

To the authors’ credit, they write “we also experimented with different cutoff values (e.g., -3 and 3) to describe unexpected college football game outcomes. The results remained intact.”

In my opinion, this is insufficient. There’s no intuitive football reason to group games based on point spread, at any point spread. A 3.5-point favorite and a 4-point favorite are nearly identical, particularly with respect to how judges may view their teams’ performance. Worse, LSU has historically been an excellent team. This entails that in the bin of “expected wins” lie games where LSU has been anywhere from a 60% favorite to a 99% favorite. Treating LSU hosting Georgia the same way you treat LSU hosting Chattanooga makes no sense from a football perspective — so why would you do the same in your regression model?
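To make the binning problem concrete, here is a minimal sketch that converts a point spread into an approximate win probability and compares it with the paper's three bins. The normal approximation and the 16-point standard deviation for the final margin are assumptions made for illustration; they do not come from the paper, and `favorite_win_prob` and `game_type` are hypothetical helper names.

```python
from statistics import NormalDist

# Assumption (not from the paper): the final margin is roughly normal,
# centered on the point spread, with a standard deviation of 16 points.
MARGIN_SD = 16.0


def favorite_win_prob(spread):
    """Approximate win probability; a negative spread means LSU is favored."""
    return NormalDist().cdf(-spread / MARGIN_SD)


def game_type(spread):
    """The paper's three bins, using the cutoffs of -4 and 4."""
    if spread <= -4:
        return "expected win"
    if spread >= 4:
        return "expected loss"
    return "close"


# Hypothetical spreads: a half-point change flips the bin,
# while a 31-point change inside the top bin does not.
for spread in (-3.5, -4, -35):
    print(spread, game_type(spread), round(favorite_win_prob(spread), 2))
```

Under these assumptions, a 4-point favorite and a 35-point favorite land in the same "expected win" bin despite very different win probabilities, while a 3.5-point favorite lands in a different bin than a nearly identical 4-point favorite.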

There are more appropriate models to handle the relationships the authors are trying to uncover. Fortunately, the authors give us a taste of one such approach. Unfortunately, they then use

2. 90% confidence intervals

The LSU judging paper is filled with tables of regression coefficients, nearly all of which stem from the above categorization of point spread.

In the final section, the authors nicely provide a model that does not categorize point spread, and instead uses a third-order (cubic) polynomial term for point spread. Although not perfect (Ex: how do we know the association is cubic and not something else?), at least this model can allow us to explore how point-spread is linked to sentencing across a wider range of point-spreads. Here’s the corresponding chart.

This figure is a giant red flag.

First, the authors write, “The effect of a loss on disposition length set by the judges is decreasing in the spread.” Sure, the line is decreasing in spread, but that decrease is not significant. If it were, you’d see much tighter error bounds and/or a steeper slope.

Second, across nearly all point-spreads shown, there is no significant evidence of an increase in disposition length when LSU loses. This is shown by the lower red line overlapping with 0. Statistically, the win versus loss comparison is indistinguishable from noise for most of the chart (the authors admit as much).

Third, in showing this chart, the authors unknowingly call into question their categorization of point spread. Above, there are no obvious changes in sentencing length around -4 or 4, which while not surprising, does highlight that perhaps such grouping was not sound to begin with.

Finally, and perhaps most disappointingly, the authors use 90% intervals. Had they used the standard (though admittedly not perfect) significance cutoff of 5%, the corresponding 95% intervals would be about 20% wider. With that extra width, the confidence intervals would overlap with 0 throughout the entire figure, and the entire basis of their findings — that sentences are longer after upset losses — would no longer hold.
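The "about 20% wider" figure can be checked with a short calculation. This sketch assumes large-sample, normal-based intervals, where the half-width is a critical value times the standard error; the paper's exact interval construction may differ, and `z_critical` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_critical(level):
    """Two-sided standard normal critical value for a level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + level / 2)


# With the same standard error, interval width scales with the critical value.
ratio = z_critical(0.95) / z_critical(0.90)
print(f"95% intervals are {100 * (ratio - 1):.0f}% wider than 90% intervals")
```

The ratio is roughly 1.96 / 1.645, or about 1.19, so every interval in the chart would stretch by nearly a fifth in both directions.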

It’s certainly possible that losses lead to harsher sentences, and I applaud the authors for an intuitive idea, but, for now, evidence appears limited at best.

**Note: I first read the article and started this blog post about a year ago.

Evaluating sports predictions against the market

statsbylopez — Wed, 14 Jun 2017 18:39:49 +0000

A few friends have been working on an algorithm for predicting baseball game outcomes. Roughly, the model uses player level projections to simulate baseball events, a process that requires substantive MLB and web-scraping knowledge.

Although the full operation is fascinating, this post will primarily focus on the evaluation of the predictions. The particular model in question has had a decent start to the summer. So how can we judge the accuracy of these picks? And what does that tell us about the feasibility of betting on sports?

While much of this post will seem straightforward, answering these questions gave me an increased appreciation for the variability in sporting outcomes with respect to gambling. I’ve posted the code here, in case anyone else is interested in using a similar process with their own projections.

*********

First, some background. The data consists of 659 picks made versus the game’s opening money line since the start of the 2017 season. Each pick is based on a model-estimated probability for each team in each game, which is then compared to that team’s market probability. There have been about 950 MLB games thus far, which means that the model has taken a team in about 7 of every 10 contests. On the remaining games, probabilities for each team are too close to the market’s price to have an edge. Those games were dropped from the data.

The data also contain the observed differences between the model estimated probability and implied probability, relative investments (made assuming an equal balance prior to all games), the amount to be won or lost depending on the game’s result, the actual game results (win or lose), closing money line prices, and the difference in implied team probabilities between the opening and closing odds. Note that bets are made on “units” – this could be dollars, pistachio shells, or whatever your mind can imagine. Generally, higher units are placed on bigger edges; the average unit per pick is about 0.60. Note that the highest unit is capped at 1.0, which is done given the non-zero chance that probabilities are off on account of lineup or pitching changes.

Next, some summary statistics. While a nearly identical number of picks have backed the away team as have backed the home team (51% to 49%), nearly twice as many underdogs have been backed compared to favorites (64% to 36%). Altogether, the model is up about 27 units thus far, which roughly reflects about a 7% return on investment. Game results have been most kind towards backing the home team (+24.5 units) compared to the visiting team (+2.5 units), with underdogs slightly more profitable than favorites (+19.9 to +7.1 units). While a deeper investigation could look into if these differences are meaningful, that’s not a primary goal.

*********

One thing I picked up quickly is how variable things can appear over small periods of time. Here’s the cumulative profit from day one of the season (shown in red). In the background are 200 simulated season-to-date profits, done using the given market implied probabilities as the true probabilities for each team.

Within any given week (say, 75 picks), profits could vary by as much as 15 or so units. And at certain time points (say, between picks 100 and 210), all appears lost, with picks going into a deep dive. Even for me, as someone whose job entails having a decent understanding of randomness, it’s tempting to look for patterns in the red line, even though none likely exist.

Relative to random season outcomes simulated using the opening market probabilities, model picks currently stand in the 96th percentile. That is, only about 4% of sequences using random game outcomes would be doing this well if the opening market probabilities reflected the true probabilities. And note the center of the above sequences: roughly -10 units, which accounts for vig taken in by betting markets.
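The simulation described above can be sketched roughly as follows. This is not the posted code: the picks are made up, the function names are hypothetical, and the sketch assumes American money lines with the vig removed by normalizing the two sides of each game so that their probabilities sum to 1.

```python
import random


def implied_prob(moneyline):
    """Raw implied probability from an American money line (vig included)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def win_payout(moneyline, stake):
    """Profit on a winning bet of `stake` units at an American money line."""
    if moneyline < 0:
        return stake * 100 / -moneyline
    return stake * moneyline / 100


def simulate_profit(picks, rng):
    """One simulated profit total; each pick is (line, other_line, stake)."""
    profit = 0.0
    for line, other_line, stake in picks:
        # Assumption: the market's "true" probability is the implied
        # probability after normalizing both sides to sum to 1.
        p_raw, q_raw = implied_prob(line), implied_prob(other_line)
        p_true = p_raw / (p_raw + q_raw)
        if rng.random() < p_true:
            profit += win_payout(line, stake)
        else:
            profit -= stake
    return profit


def percentile_of(observed, simulated):
    """Share of simulated profits at or below the observed profit."""
    return sum(s <= observed for s in simulated) / len(simulated)


# Made-up picks for illustration: (picked side, other side, units staked).
rng = random.Random(2017)
picks = [(130, -140, 0.6), (-120, 110, 0.5), (150, -165, 0.8)] * 220
sims = [simulate_profit(picks, rng) for _ in range(200)]
print(percentile_of(27.0, sims))
```

Because the bettor is paid at the posted price while outcomes follow the vig-free probabilities, each simulated bet has a slightly negative expected value, which is why the simulated sequences center below zero.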

*********

In addition to the chart above, I made a similar one (not shown) with one important difference; instead of market-implied prices as the truth, I used the model-generated probabilities. In expectation, this simulation will yield positive profits. But in what was a total shocker for me, it was still reasonable – it happened about 5% of the time – for such a model to turn a negative profit through 650 picks. That is, even with known, better than market probabilities for each game outcome, it’s still feasible to lose money across 650 games. First thoughts that went through my mind:

-650 games is three NFL seasons’ worth. That is, an NFL bettor taking every game could have three straight losing seasons while still having better than market odds for each of his or her picks.

-Related: I could not be a professional gambler.
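A simplified version of this result can be worked out exactly rather than by simulation. The sketch below swaps in a binomial calculation and assumes every bet is one unit at even money with a true win probability of 53%; both numbers are hypothetical stand-ins, since the model's actual edges and prices vary by game.

```python
from math import comb


def prob_losing(n_bets, p_win):
    """P(net profit < 0) over n_bets one-unit, even-money bets."""
    # Profit equals wins minus losses, so it is negative
    # exactly when fewer than half of the bets win.
    return sum(
        comb(n_bets, k) * p_win**k * (1 - p_win) ** (n_bets - k)
        for k in range(n_bets + 1)
        if 2 * k < n_bets
    )


# Hypothetical bettor: 650 bets, each with a 3-point edge on a fair coin flip.
print(f"{prob_losing(650, 0.53):.3f}")
```

Even with that edge on every bet, the chance of being in the red after 650 bets is a few percent, in the same neighborhood as the simulated figure above.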

*********

I thought it would be interesting to take a look at which team the model has picked most often (both for and against). Here’s that plot. On the x-axis is the total investment made, either for (on the left) or against (on the right) each team, and the y-axis is the season-to-date profit.

This particular model continues to back the Padres and Mets at most opportunities, while picking against the Red Sox. Altogether, those picks have mostly broken even.

Meanwhile, the model has had some success taking the Rockies, White Sox, and Rays, while likewise performing well when fading the Indians, Giants, and Blue Jays. Picking the Phillies has not been so fruitful, nor has picking against the Diamondbacks.

*********

Our final check looks at how the model has done relative to line movement. If the model can “predict” the direction where prices will go in the moments leading up to the game, that would generally be a good thing. From what I’ve been told, closing market prices are generally more efficient than opening numbers.

Here’s a histogram showing line movement (on the probability scale). Positive changes reflect movement in the direction of the model’s chosen team.

Among the picks to date, about 1 in 20 opening lines precisely match closing lines. A tick under 58% of games have moved in the direction of the model’s team, while about 37% have moved against.

Across all contests, the average price has moved about 0.6% in the direction of the model’s chosen team. While this seems like a small number, across several hundred games, that type of advantage would seemingly add up.

There’s also a decent link between the model’s projected edge for a team and the likelihood of movement in the direction of that team. The average game moved 0.25% among games with smaller-sized edges, 0.5% on games with medium-sized edges, and a full 1.0% on games with the largest edges (putting about 200 games in each of these categories).

*********

Assorted final notes:

-Log-loss is a proper scoring rule for binary outcomes, but it is less evident how log-loss can precisely evaluate this model, given that some picks are made with more of an edge than others (perhaps a weighted log-loss?). Additionally, there’s no immediate interpretability to log-loss. In any case, the average log-loss is -0.6845 for the market implied probabilities and -0.6836 for the model estimated probabilities (closer to 0 is better).

-It is tempting to tie team allocations (as far as supporting or fading) to changes to the game that have been seen this summer. This includes the supposed juiced ball and increases to HR/FB ratio. Something to keep an eye on.

-How do others evaluate picks, either their own or from others? My prior is to trust the market until proven otherwise, and that’s a very strong prior.
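For reference, the log-loss comparison in the first note can be sketched as below. The sign convention follows the numbers quoted there (an average log-likelihood, so values are negative and closer to 0 is better); the function name and the example inputs are hypothetical.

```python
from math import log


def mean_log_score(probs, outcomes):
    """Average log-likelihood of binary outcomes; closer to 0 is better."""
    total = 0.0
    for p, won in zip(probs, outcomes):
        # Score the probability that was assigned to what actually happened.
        total += log(p) if won else log(1 - p)
    return total / len(probs)


# Made-up example: probabilities for the picked team, and whether it won.
probs = [0.55, 0.48, 0.61, 0.52]
outcomes = [1, 0, 1, 1]
print(round(mean_log_score(probs, outcomes), 4))
```

A forecast of 50% on every game scores log(0.5), or about -0.693, which gives a sense of how close both -0.6845 and -0.6836 sit to a coin flip.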

Evaluating the evaluators

statsbylopez — Tue, 25 Apr 2017 14:57:29 +0000

Thursday’s NFL draft marks the culmination of several months — or perhaps years — of work.

The amount of preparation that team scouts and analysts put in is overwhelming. This includes sleeping at the office, 20-hour work days, and hours upon hours of poring over film and interviewing players and their coaches. With teams wanting to learn just about everything there is to know about a player, no stone is left unturned.

Additionally, the effort that teams place on evaluating players has grown by leaps and bounds over the past half century. In the 1970s, for example, Washington famously went a decade without a first round pick. Given the differences between now and then, one would expect that at some level, teams can better draft players now than they could decades ago.

But have teams improved at drafting?

In this post, we’ll look into the evolution of NFL drafting ability over time, and compare it to other North American Leagues.

**************************************************

Our interest lies in the link between where a player was drafted (pick number) and how well he performs. No player-level metric is perfect, but Pro Football Reference’s career approximate value (CAV) provides a decent snapshot of a player’s talent. We’ll use that as our outcome.

Not surprisingly, the distribution of CAV is strongly skewed right, with most players between 0 and 20 but a handful of stars rating above 100. Thus, we’ll prefer a non-parametric tool to a parametric one, so as to avoid making assumptions about CAV’s underlying distribution.

One possibility would be to set binary cutoffs, as Chase does here, to assess the percentage of a draft’s CAV that falls within a certain range of picks. Alternatively, rank correlations (Spearman, Kendall’s Tau) can help us assess the (hopefully monotonic) dependency between draft spot and performance while maintaining all of a draft’s information as far as which players were ranked better.

Looking back, here are the yearly Spearman rank correlations between draft position and CAV, separated by round. Values of 1 would reflect a perfectly monotonic link between draft spot and performance, while values near 0 would reflect no link. The blue line reflects possible non-linear trends over time, with the grey area reflecting our uncertainty.
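As a rough illustration of the metric, here is a dependency-free sketch of a Spearman rank correlation between draft position and CAV. The function names and the toy numbers are hypothetical, and the sign convention (pick numbers are negated so that 1 means earlier picks always performed better) is an assumption chosen to match the description above.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving ties the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def draft_rank_correlation(picks, cav):
    """Spearman correlation; 1 means earlier picks always did better."""
    # Negate pick numbers so an earlier pick counts as a higher position.
    return pearson(average_ranks([-p for p in picks]), average_ranks(cav))


# Made-up round of five picks with one surprise at pick 4.
print(draft_rank_correlation([1, 2, 3, 4, 5], [60, 35, 10, 48, 0]))
```

Because only the ranks enter the calculation, one superstar with an enormous CAV moves the result no more than any other player who finishes first in his draft class.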

There’s no evidence that in any round, NFL teams are doing a better job at selecting the best players early. In 2013, for example, the 1st round rank correlation coefficient was about -0.2, indicating a year where picking earlier in the draft was linked to worse performance. Of the first ten picks from that draft, only one has made a Pro Bowl, compared to five of the final ten selections.

Additionally, note that for rounds 4-6, there’s little evidence of a difference between rank correlation and 0, which suggests that by that point, there’s not a big benefit to picking earlier in the round.

We can also add a positional separation, which allows us both to assess whether any changes over time are tied to a specific type of player, and to account for the fact that if teams draft for positional need, that may supersede taking the best overall player available.

Per CAV, there’s been a *slight* improvement in the drafting of running backs and wide receivers, and, after a dip in the 1980s, tight ends. The latter is potentially related to how teams may be more apt to draft receiving tight ends earlier in the draft, with less of an emphasis on blocking tight ends. Receiving tight ends may be easier to evaluate, for example, or may score higher on CAV.

For most positions, though, the link between positional CAV rank and draft position is as noisy as it was four decades ago. Interestingly, there does not appear to be any one specific position where teams are better at identifying talent.

Note that in using CAV, I was able to chart rank correlations all the way through the 2016 draft. However, if anything, this likely overestimates the recent link between draft position and performance – teams are more likely to give their earlier picks playing time in their first few years. Once lower drafted players have more time to establish themselves, we would expect the link between draft position and performance to lessen, which could lower the recent scores.

**************************************************

While it’s easy to pick at the NFL’s inability to noticeably improve player evaluation over time, it’d be more telling if we could find that other professional leagues have gotten better over time.

Using the same metrics described here, I charted the link between pick number and player performance in MLB, the NBA, NFL, and NHL. I focused on each league’s first 60 choices (64 for the NFL), which matches the current length of the NBA draft.

A few things stand out.

First, the NBA bests all other leagues as far as overall performance, which isn’t surprising given the steepness of its draft curve and the differences between the importance of the best players relative to the league average.

Second, over time, the link between performance and draft position has grown stronger in both MLB and the NHL. While improved drafting ability is one possibility in both sports, in MLB, changes to the draft structure may also be responsible. Specifically, big market teams are no longer allowed to award big bonuses to players later in rounds, which could have been pulling correlations closer to 0.

In the NBA, after an early possible spike, there doesn’t seem to be any improvement over time. However, given that the NBA is already starting with rank correlations closer to 1, there’s also less room for improvement.

Altogether, it’s certainly feasible that the NHL and MLB have gotten better at drafting, while the NBA may have already reached its peak. In the NFL, meanwhile, drafting ability has either reached its limited peak, or involves so much noise that it’s difficult to identify a substantial, league-wide improvement.

**************************************************

Postscripts:

-I dropped specialists in the NFL for the position level chart, given that so few are drafted each year.

-For those of you who gave Chase’s article a read, there was a tangible difference between using rank correlation (as I did above) and the traditional correlation coefficient (Pearson’s). With the latter, there does appear to be an improvement over time, potentially driven by a few outlying observations drafted early.

-There may be other reasons for the NHL’s apparent improvement over time besides an improvement in player evaluation, or it could be tied to my choice of player outcome (games played). Separating by rounds, the greatest efficiency improvement appeared to be in round 1.

-Code is available here. This includes the scraping code, so it could take a few minutes. Feel free to play around.

Nuts and bolts, the Metro Division got (a little bit) screwed

statsbylopez — Wed, 12 Apr 2017 16:18:10 +0000

Here’s a look at the NHL’s final regular season standings in the Eastern Conference from 2016-17. As a reminder, eight teams in each conference make the playoffs.

Tiebreakers and divisional qualification rules notwithstanding, both the Islanders and Lightning finished a point out of the playoffs. That’s a difference between what would likely be at least a 1 in 25 chance at a Stanley Cup and at least two games of home playoff revenue, or an early start to golf season. That point difference was immense.

But there’s a problem with using the standings above – the points aren’t equivalent. Specifically, I’ll argue in this post that the 94 points from the Islanders is, all else equal, likely more impressive than the 95 points for the Leafs, given the caliber of each team’s schedule.

The NHL’s unbalanced schedule.

First, some background. NHL teams play intra-division opponents either four or five times, inter-division/intra-conference opponents three times, and all inter-conference opponents two times.

This is a small but notable difference. The Islanders play in the NHL’s Metro Division, one stacked this year with two really good teams (Pittsburgh, Columbus) and one of the best teams in the last decade (Washington). Moreover, the Islanders faced the unenviable task of being one of two teams this season to face the Capitals five times (NYI added five games against Carolina, too). Meanwhile, the Leafs faced each Metro team only three times apiece, while adding five-game sets against the Florida Panthers and Montreal Canadiens.

Does that make a difference? Surely it does.

Here’s a chart showing the estimated impact of the NHL’s unbalanced, division and conference-loaded schedule. Each team is shown on the x-axis, and the y-axis corresponds to the net benefit (or loss) in standings points, in expectation, comparing the NHL’s unbalanced schedule to one in which opponents are randomly assigned (and allowing for the fact that teams cannot play themselves). The plot is faceted by division.

Points added or lost, comparing the NHL’s unbalanced schedule (more divisional and conference games) to a balanced one

The differences are small (note the y-axis), but they are notable and follow our intuition. The Islanders’ schedule difficulty likely cost the team about 1.3 points, on average, relative to a league-average schedule. Meanwhile, the Leafs’ schedule was worth somewhere around +0.6 points. That difference, of course, is larger than the one-point gap that we observed in the standings. If each franchise had played a balanced schedule, ignoring all other information, we’d have expected the Islanders to finish a nose ahead of Toronto.

On a division level, all Metro teams faced a more difficult than average schedule, led by the Devils, who faced the Rangers and Penguins five times apiece. Meanwhile, nearly all Western Conference teams benefitted from being in the same conference as the Colorado Avalanche, Vancouver Canucks, and Arizona Coyotes (and from not being in the same conference as the Capitals). In particular, it was a good year to be in the Pacific – the contrast between that division and the Metro division is startling.

We know already that the NHL’s divisional format for the playoff qualification has put an unfair burden on teams in top divisions. It appears that the scheduling format, to a far lesser degree, only works to make that burden more difficult to overcome.

Final notes

-Results stem from 100,000 simulated season point totals in which opponents were generated at random, relative to simulated season point totals using this year’s actual schedule. As a result, they may not reflect the true, actual differences in schedule difficulty. I used such a large number of simulations because at smaller numbers, there was a bit too much inconsistency in the resulting charts for my preference. Additionally, the goal (for now) here is just the typical difference in points. Across simulations, it’s not uncommon for teams to jump by as many as 10 points in one direction or the other.

-The Islanders finished as a top-8 team in the conference (what I’ll call making the playoffs) about 4% more often (48%, versus 44%) during iterations when I used random scheduling, compared to the current version. Not a big difference, but roughly what I’d expect.

-In 100,000 simulated seasons with the current schedule, the Avalanche made the playoffs 8 times, and the Capitals missed the playoffs 278 times.

-Code is here.
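The idea behind the schedule comparison can be illustrated with a small expected-value calculation. This is a simplified stand-in for the simulations described above, not the posted code: it assumes known team strengths, a Bradley-Terry style win probability with no home advantage, two points per win with no overtime loser point, and a "balanced" schedule that spreads the same number of games evenly across opponents. All names and numbers are hypothetical.

```python
def win_prob(strength_a, strength_b):
    """Bradley-Terry style probability that team A beats team B."""
    return strength_a / (strength_a + strength_b)


def expected_points(team, schedule, strengths):
    """Expected points at 2 per win, given {opponent: games played}."""
    return sum(
        2 * games * win_prob(strengths[team], strengths[opp])
        for opp, games in schedule.items()
    )


def schedule_effect(team, actual, strengths):
    """Expected points under the actual schedule minus a balanced one."""
    total_games = sum(actual.values())
    opponents = [t for t in strengths if t != team]
    # Balanced: the same number of games, split evenly across opponents.
    balanced = {opp: total_games / len(opponents) for opp in opponents}
    return (expected_points(team, actual, strengths)
            - expected_points(team, balanced, strengths))


# Toy league: team "A" sees the strong team five times and the weak one three.
strengths = {"A": 1.0, "Strong": 2.0, "Weak": 0.5}
effect = schedule_effect("A", {"Strong": 5, "Weak": 3}, strengths)
print(round(effect, 2))
```

A negative value means the actual schedule costs the team points in expectation, which is the direction of the Islanders' estimate in the post.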

Towards an understanding of the NHL’s final standings

statsbylopez — Mon, 10 Apr 2017 13:33:01 +0000

In a recent paper, Gregory Matthews, Ben Baumer, and I looked at the role of randomness in professional sports outcomes. Perhaps unsurprisingly, we identified that NHL and MLB games tend to be closest to a coin flip, with the worst teams capable of beating the best. Meanwhile, there are larger gaps in talent between franchises in each of the NBA and NFL. Although our focus was on individual game outcomes, we laid a bit of groundwork for related work with respect to between-league and between-team comparisons.

In particular, I was left curious as to the end impact of the NHL’s randomness. If most game outcomes are near coin flips, what impact does that have on season outcomes? In this post, I’ll reflect back on the NHL’s final standings, with the goal of better understanding the underlying differences in team strengths, and what that means about where teams finish.

Why final standings?

Sports leagues use final regular season standings to both determine playoff eligibility and provide a rough sense of where other teams will pick in future player drafts. Indeed, there’s no simpler mechanism by which we judge a team’s regular season success than by where it finished in the standings. So when the Washington Capitals finish the season with 118 points, we are left to assume that the Capitals are a 118 point team.

However, standings are a function of several hard-to-define inputs, including team talent, schedule difficulty, timing, injuries, and, of course, luck. And so while the Capitals finished with 118 points, it’d also be exceedingly unlikely for the Capitals to finish on that exact 118 number if we were to somehow replay the regular season again under an identical set of circumstances.

But how many points could that 118-point Capitals team finish with? And what meaningful differences between teams can we extract from where they finish in the standings?

Resampling a season

Even though the NHL only plays each season one time, it doesn’t mean that we also face the same restriction. Indeed, perhaps the only way to revisit differences in league standings that could have been observed would be to replay games. If we can assume that the probability of each game’s outcome is known — admittedly, an unjustifiable assumption — then the resampling of each season is quite straightforward. Given each game’s probability, we can simulate each regular season contest, impute the corresponding final standings, and repeat this process several times.
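A minimal sketch of that resampling loop is shown here, assuming each game comes with a known home-win probability. The function names, the three-team schedule, and the scoring rule (two points per win, ignoring the NHL's overtime loser point) are simplifications introduced for illustration.

```python
import random
from collections import Counter


def replay_season(games, rng):
    """Simulate one season; each game is (home, away, p_home_win)."""
    points = Counter()
    for home, away, p_home in games:
        winner = home if rng.random() < p_home else away
        points[winner] += 2  # simplification: no overtime loser point
    return points


def replay_many(games, n_seasons, seed=0):
    """Replay the same schedule n_seasons times with a reproducible seed."""
    rng = random.Random(seed)
    return [replay_season(games, rng) for _ in range(n_seasons)]


# Hypothetical schedule with made-up probabilities, repeated 20 times.
games = [("WSH", "NYI", 0.62), ("NYI", "TOR", 0.55), ("TOR", "WSH", 0.45)] * 20
seasons = replay_many(games, n_seasons=1000)
wsh_points = sorted(season["WSH"] for season in seasons)
print(wsh_points[50], wsh_points[949])  # rough 5th and 95th percentiles
```

The spread between those two percentiles is the kind of season-to-season swing that the density plots in the post summarize for each team.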

Here’s what the 2016-17 NHL season would look like when replayed 1000 times (assumptions are provided at the end). The chart below shows imputed point totals for each team, provided in overlaid, Joy Division inspired density plots.

Simulated number of points in 1000 replayed seasons (2016-17)

What do we learn?

To start, that 118-point Capitals team could have easily been a 108-point Capitals team or a 128-point Capitals team. In fact, for all teams, swings of 10 points in either direction are not that surprising. For the Capitals, those 10 points could be the difference between being the top-seed in the Metro Division and finishing as the third seed. For several other franchises, 10 points is the difference between making the playoffs and staying home.

Additionally, while the standings tell us that the Capitals were the league’s best team in 2016-17, there’s enough overlap between Washington and several other teams, including Chicago, Minnesota, Pittsburgh, and even Montreal, that it’s insufficient to use standings alone to justify arguing that Washington is the best team. Given the standings and the format of the league’s schedule, we know that the Caps were better than Vancouver and that they were probably better than Toronto. However, we’d hardly have any idea if they were better than Chicago when looking at the standings alone.

Postscripts

Here are the assumptions I used to replay the 16-17 NHL season.

-Team strengths were estimated using a Bradley-Terry model with a fixed home advantage. While this posits that wins alone are the best way to analyze hockey teams, I’m okay with that for this exercise, as it means that our imputed standings will roughly be centered around this year’s observed standings. Game-level probabilities were extracted using each team’s estimated team strength, while providing a fixed advantage to the home team.

-If anything, I’m likely underestimating the amount of randomness in league standings. In estimating probabilities, I assumed that team strengths, as estimated using the Bradley-Terry model, were known. Of course, that’s not the case, and a more detailed imputation would account for the uncertainty in these parameter estimates.

-I assumed that overtime outcomes are random, with OT occurring in 24% of games. This rate matches the fraction of 16-17 contests that have gone to OT. Note that OT outcomes are not entirely random (see here), but they are probably close enough for our purposes here. Recall that the NHL’s scoring system awards a point to the overtime loser, so this is our way of accounting for that.

-Stay tuned for a future post, in which I’ll look at the role of the NHL’s unbalanced schedule in determining final standings. I’ll also share the code at that point.
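For readers who want to see how the assumptions above fit together, here is a rough Python sketch. It is one possible reading, not the code used for the post: the function names are hypothetical, the Bradley-Terry model is written on the logit scale, and the overtime rule (winner drawn from the model, loser given a point in 24% of games) is an assumption about how the random-overtime step could be implemented.

```python
import math
import random

OT_RATE = 0.24  # share of games reaching overtime, as stated in the text

def home_win_prob(strength_home, strength_away, home_adv):
    """Bradley-Terry on the logit scale with a fixed home advantage."""
    return 1.0 / (1.0 + math.exp(-(home_adv + strength_home - strength_away)))

def simulate_points(p_home, rng):
    """Return (home_points, away_points) for one simulated game.

    Assumption: the winner is drawn from the model, and the loser
    independently earns one point with probability OT_RATE.
    """
    home_wins = rng.random() < p_home
    loser_points = 1 if rng.random() < OT_RATE else 0
    return (2, loser_points) if home_wins else (loser_points, 2)

rng = random.Random(1)
p = home_win_prob(strength_home=0.3, strength_away=0.1, home_adv=0.2)
results = [simulate_points(p, rng) for _ in range(1000)]
```

The strengths and home advantage in the usage lines are placeholders. In practice they would come from fitting the model to the season’s results.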

All win probability models are wrong — Some are useful

statsbylopez — Wed, 08 Mar 2017 18:57:18 +0000

As in the moments following the 2016 US election, win probabilities took center stage in public discourse after New England’s comeback victory in the Super Bowl over Atlanta.

Unfortunately, not everyone was enamored.

After the election and this game, it's probably time the "win probability" folks take a little break. https://t.co/SfeTiqz33O

— Pete Abraham (@PeteAbe) February 6, 2017

While it’s tempting to deride conclusions like Pete’s, it’s also too easy of a way out. And, to be honest, I share a small piece of his frustration, because there’s a lingering secret behind win probability models:

Essentially, they’re all wrong.

But win probability models can still be useful.

To examine more deeply, I’ll compare 6 independently created win probability models using projections from Super Bowl 51. Lessons learned can help us better understand how these models operate. Additionally, I’ll provide one example of how to check a win probability model’s accuracy, and share some guidelines for how we should better disseminate win probability information.

So, what is a win probability?

A win probability is the likelihood that, given any time-state in the game, a certain team will win the game.

Win probabilities can be either subjective (“This game feels like a toss-up”) or objective (“My statistical model gives the Falcons a 50% chance of winning”). This post focuses on the latter type, which has become increasingly popular across sports over the last decade.

What are some NFL win probability models?

Here are the models that I’ll compare in this post.

Pro Football Reference (PFR): Stemming from research by Wayne Winston and Hal Stern, PFR’s model uses the normal approximation and expected points to quantify team chances of winning. Read more in Neil Paine’s post here.

ESPN: ESPN’s predictions, provided by Henry Gargiulo and Brian Burke, are derived from an ensemble of machine learning models.

PhD Football: An open-sourced creation of Andrew Schechtman-Rook built using Python, this model uses logistic regression to predict game outcomes.

nflscrapR: An R package from graduate students at Carnegie Mellon, win probabilities stem from a generalized additive model of game outcomes.

Lock and Nettleton: Probabilities generated via a random forest, as done by Dennis Lock and Dan Nettleton in the Journal of Quantitative Analysis in Sports, implemented with data from Armchair Analysis.

Gambletron: Created by Todd Schneider, Gambletron uses real time betting market data to impute probabilities.

Before we start, a particular thanks to PFR for this and all of their public work, Brian and Hank, Andrew, Ron and Maksim, and a student of mine (Derrick) for their help in either sharing or pulling in the data. I greatly appreciate their work and/or willingness to share. Sadly, not everyone was so helpful.* Additionally, note that the 6 models used 6 unique approaches, which demonstrate the variety of ways that people have thought about win probability.

Finally, R code – and predictions from a few models – are up on my Github page.

How’d probabilities look in the Super Bowl?

One interesting way to start is to visualize how each model viewed the Super Bowl. Here’s a chart of New England’s play-by-play win probabilities, using a different color for each set of predictions.**

Super Bowl win probabilities (New England’s).

For most of the game, there’s at least a 5% gap between New England’s lowest and highest projections, and at several points, the gap is as high as 10%.

With six unique models, it’s not surprising to see these differences, but I’d also argue that this type of variation is not an attractive property for disseminating win probability information.

How big of a comeback was it?

It’s obvious that by the third quarter, New England’s chances were slim. Of course, with probabilities clustered near zero, it’s a bit difficult to precisely identify differences between the models in the initial chart. So, I converted the probabilities to odds to get a better sense of how the models viewed New England’s comeback.

Here’s that chart, and I added dotted lines to identify the point in the game when each model gave New England its longest odds of a comeback.

Odds of winning Super Bowl 51, relative to New England. Second half only shown.

Here, gaps between models are more substantial, which is not surprising given that odds are not robust to small changes for probabilities near 0. At multiple points, PFR gave New England about a 1 in 1000 chance of winning (1000:1 odds) while projections from Gambletron (which is arguably serving a different purpose with its numbers) barely crossed 25:1.
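To see why odds exaggerate differences near zero, here is a small sketch of the probability-to-odds conversion. The helper name `odds_against` is mine, and the probabilities in the loop are illustrative values rather than model output.

```python
def odds_against(p):
    """Odds against an event with probability p, i.e. the X in 'X:1'."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return (1 - p) / p

for p in (0.50, 0.04, 0.01, 0.001):
    print(f"p = {p:.3f} -> about {odds_against(p):.0f}:1 against")
```

Moving from a 1% chance to a 0.1% chance is a change of less than one percentage point, yet the odds jump from roughly 99:1 to roughly 999:1.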

Thus, the wow-factor of the Patriots comeback depends on your source. If you choose Gambletron, it’s a one Super Bowl every two or three decades type of comeback. If you choose PFR’s, it’s a Super Bowl comeback we’ll only see once every millennium. From a communication perspective, this is a weakness of win probability models, and one that shows up frequently given that, for better or worse, people most often look at win probability charts after a major comeback.

Finally, it’s worth noting that the moments in the game when New England was given the longest odds of winning also differ, varying from midway through the third quarter to midway through the fourth. Indeed, your definition of how much of a comeback it was isn’t just limited by your choice of a model, but by your identification of which time in the game to start at, too.

How about win probability added?

Brian Burke makes an important point on the Ringer that win probability models are perhaps best used for understanding in-game decision making. Often, this is done by comparing win probability from one play to the next using a metric called win probability added (WPA).

In the Super Bowl, leaps or drops in New England’s WPA are also somewhat dependent on model choice. As an example, New England’s second play from scrimmage, a 9 yard completion on 2nd-10 from Tom Brady to Julian Edelman, helped the Patriots according to PhD Football (+3%) but hurt the Patriots according to nflscrapR (-2%).
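As a quick illustration of the metric, WPA for a play is simply the win probability after the play minus the win probability before it. The sketch below uses a made-up sequence of probabilities and a hypothetical helper name.

```python
def win_probability_added(win_probs):
    """WPA for each play: win probability after the play minus before it."""
    return [after - before for before, after in zip(win_probs, win_probs[1:])]

# Hypothetical win probabilities for one team, one value per game state.
wp = [0.50, 0.53, 0.51, 0.58]
wpa = win_probability_added(wp)
```

The per-play values sum to the overall change in win probability, so two models that disagree about a single play are really disagreeing about the game state on either side of it.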

Here’s a chart that compares each pair of models’ WPA.***

Win probability added, contrasted between 5 NFL win probability models in Super Bowl 51.

The figures in the bottom left show pairwise scatter plots of the models’ WPA, with correlations listed in the top right. Histograms of WPA (relative to New England) are shown on the diagonal.

There’s a moderately strong link in WPA between each pair of predictions; the strongest correlation coefficient is between ESPN and PhD Football (0.85), with the weakest between PFR and nflscrapR (0.45).

However, there’s still a decent amount of variability with respect to how each model sees the helpfulness of each play. For example, of the 125 plays shown, on fewer than half (57, or 46%) did all five models agree that the outcome either helped or hurt the Patriots. This is another humbling aspect of win probability models: there’s uncertainty not only in team chances at any one point in time, but also from one play to the next.
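The agreement count can be sketched as follows, assuming each model’s WPA values are aligned play by play. The function name and the toy numbers are hypothetical, and counting an exact zero as disagreement is my own choice.

```python
def share_all_agree(wpa_by_model):
    """Fraction of plays on which every model gives WPA the same sign.

    wpa_by_model: dict of model name -> list of per-play WPA, aligned by play.
    A play where any model reports exactly zero counts as disagreement.
    """
    plays = list(zip(*wpa_by_model.values()))
    agree = sum(
        all(x > 0 for x in play) or all(x < 0 for x in play) for play in plays
    )
    return agree / len(plays)

toy = {
    "model_a": [0.03, -0.02, 0.01, -0.04],
    "model_b": [0.01, 0.02, 0.02, -0.01],
}
print(share_all_agree(toy))  # 0.75
```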

So how can we know if a model is useful?

Let’s take a break for a fun anecdote.

The typical NFL season has about 40,000 plays. Let’s imagine that you flip a fair coin 40,000 times to find the proportion of heads. We know the true probability of heads — it’s 50% — but if we use the results from our 40,000 flips, the average distance we can expect between our estimate of heads and the truth is about 0.2%. That is, we can’t predict a fair coin much better than +/- 0.2% in 40,000 trials. And if we can’t get precise probability estimates from coin tosses — which don’t have variables like the offensive team, defensive team, score, down, distance, spread, clock time, and timeouts attached to them — how can we expect our NFL win probabilities to be any more accurate?

So, whether a win probability at a certain time in the game is off by 10% or 0.2%, it’s off. Humbly, it’s why all models are wrong.
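The 0.2% figure in the coin example can be checked directly. The sketch below computes the average absolute error of a sample proportion using the normal approximation, and then confirms it by simulation; the function names are mine.

```python
import math
import random

def expected_abs_error(n, p=0.5):
    """Normal-approximation mean absolute error of a sample proportion."""
    return math.sqrt(2 / math.pi) * math.sqrt(p * (1 - p) / n)

def simulated_abs_error(n, reps=2000, seed=0):
    """Average |proportion of heads - 0.5| over reps runs of n fair flips."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        heads = bin(rng.getrandbits(n)).count("1")  # n fair bits at once
        total += abs(heads / n - 0.5)
    return total / reps

print(round(100 * expected_abs_error(40_000), 2))  # 0.2 (percent)
```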

So how can we know if a model is useful?

Well, the best way to judge projections of any outcome is to use events that are yet to take place. For football games, this corresponds to using past games (termed training data) to derive predictions for future games (test data). If the probabilities are reasonable, those predictions should match future game outcomes.

So, that’s what I did using Lock and Nettleton’s random forest model. I started by using the 2005-2015 seasons as training data. Next, I sampled 5 plays in each quarter in each game from the 2016 season to use as test data (5340 total). Sampling plays in this manner ensures that I have the same number of plays from each game (to weigh games equally) and that I haven’t overfit (there are no overlapping plays in the test and training data). It’s also how the Super Bowl projections above were made.
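A sampling step like this one might look as follows in Python. It is only a sketch: the field names `game_id` and `quarter`, the function name, and the toy play list are assumptions, not the structure of the actual data.

```python
import random
from collections import defaultdict

def sample_test_plays(plays, per_quarter=5, seed=0):
    """Sample up to per_quarter plays from each (game, quarter) group.

    plays: list of dicts with hypothetical keys 'game_id' and 'quarter'.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for play in plays:
        groups[(play["game_id"], play["quarter"])].append(play)
    sampled = []
    for key in sorted(groups):
        group = groups[key]
        sampled.extend(rng.sample(group, min(per_quarter, len(group))))
    return sampled

# Toy data: 2 games, 4 quarters each, 30 plays per quarter.
toy_plays = [
    {"game_id": g, "quarter": q, "play": i}
    for g in (1, 2) for q in (1, 2, 3, 4) for i in range(30)
]
test_set = sample_test_plays(toy_plays)
```

With 5 plays in each of 4 quarters, every game contributes 20 plays, so a test set of 5340 plays corresponds to 267 games.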

Here’s a chart of how well the Lock & Nettleton model predictions did in 2016, aggregated by quarter. I included points that average offensive team probabilities to the nearest 0.05, as well as the corresponding fraction of games in that bin when the offensive team won. The closer projections are to the diagonal line, the better. If you want to see the R code for this chart (and the ones above), see my Github page.

Observed versus estimated win probability for a sample of 2016 NFL plays. Predictions derived from Lock and Nettleton’s random forest model. By and large, projections match reality, as demonstrated by the line of best fit roughly corresponding to the line y = x.

In Lock and Nettleton’s model, results are fairly reasonable. Across most bins in most quarters, probabilities reflect reality. It’s not that projections are perfect – teams with low win probabilities in the first quarter win more often than expected, for example – but it’s difficult to identify any precise location where the model is off by more than what we’d expect due to chance alone. Third quarter probabilities, as an example, look particularly reasonable. I’d also argue that this model’s performance is more impressive given that no games from 2016 were used to fit it, even though including them may have helped the model more reasonably pick up on recent changes to the game.
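For anyone who wants to run the same kind of check on their own predictions, here is a minimal Python sketch of the binning step. The function name and toy data are hypothetical, and it assumes predictions are rounded to the nearest 0.05 before the observed win rate is computed in each bin.

```python
from collections import defaultdict

def calibration_table(probs, outcomes, width=0.05):
    """Round each prediction to the nearest multiple of width, then
    compare each bin's probability with the observed win rate.

    Returns a list of (bin_probability, observed_win_rate, n_plays).
    """
    bins = defaultdict(list)
    for p, won in zip(probs, outcomes):
        bins[round(p / width)].append(won)
    return [
        (idx * width, sum(wins) / len(wins), len(wins))
        for idx, wins in sorted(bins.items())
    ]

# Made-up predictions and outcomes (1 = offensive team won).
toy_probs = [0.21, 0.19, 0.22, 0.78, 0.81]
toy_wins = [0, 0, 1, 1, 1]
table = calibration_table(toy_probs, toy_wins)
```

Plotting the first column against the second, with the line y = x for reference, gives a chart of the same form as the one above.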

Charts like the above don’t ensure that our probabilities are correct, as that’s impossible. Instead, they are there to provide warning signs if, for certain types of game situations, probabilities were off.****

Practical recommendations

Given the above, here is a set of recommendations for those of us creating or citing win probabilities.

Avoid over-precision. Using too many digits (e.g., 60.51%) belies the true difficulty of predicting unrealized outcomes in sports. Round probabilities to the nearest percentage point. (Excellent example: 538).
Embrace uncertainty. Instead of “There’s a 2% chance”, use “There’s about a 2% chance” or “About 1 in 50.”
Take extra care when presenting surprising results. It’s difficult to believe that New England’s comeback was a once a millennium type of result, but it was often presented as such.
Model check, and share results. This is an easy thing to do, and it’s the only way to know if predictions are close to accurate.
Update models over time. Sports leagues are ever-evolving (as examples, NBA teams shoot more 3’s and NFL teams pass more often than ever before), and so if a model isn’t updated over time, predictions could go from wrong to really wrong.

***************************

*I also emailed numberFire and asked for their projections. The response was as follows:

Unfortunately we will not be able to share with you our predictive model. However you can review the perks from our premium services to see how it all works and what we have to offer. If you have any further questions do not hesitate to reach out to us.

That’s bullshit. The chart’s literally right here, with the probabilities shown when hovering over. Those probabilities are shown to four decimal places.

**I dropped overtime plays. There’s enough extrapolation in win probabilities as it is, and extending to rarely played overtime events seems unwise. Additionally, note that there may not be perfect alignment in the charts with respect to Gambletron’s data, which works by real time (and not clock time).

***This chart only shows runs and passes, as there were too many irregularities in how each model ordered and timed special teams plays. Gambletron is not shown given that its time stamps reflect real time, and not clock time.

****One of my goals this summer will be to make Lock & Nettleton’s model more public, but I’ll need to check with the authors first. It’s a fairly reasonable model to fit in R, and it would be great to have an NFL win probability Shiny app where those unfamiliar with R could enter in constants to get probabilities.

Note: An earlier version of this post pointed to possible limitations of PFR’s win probability model. However, after some offseason tuning, things appear to be more readily in order.

In Sloan’s paper contest, irreproducibility takes the stage

statsbylopez — Fri, 03 Mar 2017 02:10:28 +0000

Over the last few years, researchers across fields have uncovered something that’s simultaneously humbling, frustrating, and scary: most research doesn’t hold up in subsequent analysis.

It’s called the replication crisis, and it’s an issue that has challenged psychology, engulfed economics, and been identified as a disease in a field full of them (medicine).

One area where replication has not been widely discussed is sports analytics, which, while more limited in scope than the disciplines listed above, takes center stage at this weekend’s Sloan Sports Analytics Conference (SSAC), with more than 3000 practitioners, fans, and professional staffers gathering in Boston.

One of the more attractive aspects of SSAC is its research paper contest, which generally features outstanding papers, provides researchers widespread press for their work, and awards a top prize of $20000 to one submission. As a result, it was with optimism that I read that in 2017, SSAC would be doing its best to ensure the validity of its contest submissions.* Via the rules page, “research will be evaluated on but not necessarily limited to the following: novelty of research, academic rigor, and reproducibility.”

Specifically, for reproducibility, the conference asks: “Can the model and results be replicated independently?”

This is an important definition, and one that mimics the work of Prasad Patil, Roger Peng, and Jeff Leek, who recently went to extensive lengths to precisely define both reproducibility and replicability. Patil et al. argue that research is reproducible if a different analyst can generate the same results using the same code and data, and research is replicable if a different analyst can obtain consistent estimates when recollecting the data and re-doing the analysis plan. In other words, look for data and code, and ideally you’ll see both.

So, how did the 2017 finalists fare by these definitions?

Not great.

Here’s a chart summarizing the 2017 contest.** Each paper is identified by keywords from its title, and the columns reflect the data source, whether or not the data is (or appears to be) publicly available, if code was provided, and whether or not the overall paper is, by definition, reproducible. Note that two papers are yet to be posted on the SSAC site.

Summary of the 2017 Sloan research papers, including data source, if the data is publicly accessible, if code is provided, and if the paper meets the definition of reproducible

Of this year’s 21 listed finalists, fewer than half cite publicly available data that could be used by outsiders, as most submissions use proprietary data or do not give sufficient detail behind how the data was gathered. Even among those obtaining public data, however, only two (the Lahman database in an MLB paper, and a Google Doc from an NHL paper) are accessible without writing one’s own computer program (note that the scrapers to obtain the data were also not shared) or doing extensive searching. At best, five or six papers boast any chance of being replicable, which, sadly, is only a few more than the number of papers that don’t share any information about where their data came from.

As for code, only Adam Levin, writer of a PGA tour paper, shared some (link here). Adam also deserves credit as his data is available from ShotLink with an application. In fact, that application is as close as we get in the SSAC contest to reproducibility. With publicly shared passing project data, Ryan and Matt’s NHL paper would appear to be the next closest. Additionally, a separate NHL paper made reference to code, but none was shown on the author’s website.

There are several consequences to the lack of openness. First, it increases the chances of mistakes. While most of these errors have likely been innocuous, there’s no way of knowing what’s real and what’s bullshit at Sloan, which means that the latter is sometimes rewarded. As one example, a 2015 presentation showed an impossible-to-be-true chart about profiting on baseball betting, capped with a question-and-answer session in which the speaker handed out free tee-shirts.*** Next, it stunts growth of the field, which is a shame because, as Kyle Wagner wrote, sports analytics has been stuck in the fog for a few years running. Finally, while citations aren’t the end-goal for many SSAC paper writers, the lack of reproducible research means lower chances of papers being referenced in the future.

SSAC likes to point out that it’s a pioneer in its domain. Given that the growth of sports analytics is in the best interest of everyone in attendance, I’d recommend that the conference either start enforcing one of the criteria it claims to look for, or lose the disguise that it cares about properly advancing and vetting research.

* In full disclosure, I’ll note that I was part of a paper with Greg & Ben (code and details here) that was rejected from the 2017 contest.

** If I’ve made a mistake in the table, please let me know and I’ll update. There may be links or explainers that I missed.

*** If you were making money by betting on sports, the last thing you’d do is get up on stage at a famous conference and tell anyone about it.

Note: Thanks to Gregory Matthews for his help with this post

Refs may respond to player aggression, too

statsbylopez — Fri, 24 Feb 2017 15:52:03 +0000

Hockey die-hards like the sport’s reputation that the players police themselves. That is, teammates have one another’s backs, and acts of malice against one team will likely result in retribution against the initial aggressor.

Turns out, NHL players may not be the only ones who come to the defense of their own; there’s a decent chance that its referees do, too.

Let’s return to January of 2016, when Calgary’s Dennis Wideman leveled linesman Don Henderson with a vicious cross-check.

Despite — or perhaps given — his reputation as a high character player and the fact that post-concussive symptoms may have played a role in his mental state, Wideman was given a 20-game suspension, eventually returning to Calgary in March of last season.

But losing Wideman for 20 games wasn’t the only way in which Calgary felt the impact of the Wideman hit. From that game on, the Flames also found themselves in the penalty box significantly more often than beforehand.

Using data from Micah McCurdy (as well as some visual inspiration), I plotted the number of taken and drawn non-matching minor penalties in all Flames games since the start of the 2014-15 season. This includes Calgary’s 129 regular season games prior to and the 92 games since the Wideman hit.

Each game is represented by a point, and the curved lines reflect local polynomial regression curves, shown separately for games before and after the hit (along with errors).

A few things stand out.

Prior to the Wideman hit, Calgary was consistently called for about one fewer penalty per game than opponents. However, while the rate of penalties drawn by Calgary has remained fairly consistent over the last 2+ seasons (shown in grey), after Wideman’s hit, there’s an immediate bump in penalties taken by Calgary (red). Comparing games pre and post-hit, the Flames jumped from 2.1 to 3.3 non-matching minors per-game. That’s … substantial, and a practically significant increase.

As additional evidence, we note that prior to the Wideman hit, roughly 1 in 10 Calgary games included no taken penalties. In the 92 games after the Wideman hit, the Flames only had one such game. Moreover, the jump in Calgary’s penalties corresponded with a drop in the league-wide infraction rate.

In addition to the comparison of the curves above, we can assess the significance of the Flames’ increase using the Poisson distribution. Initially linked to hockey more than a decade ago by Alan Ryder, the Poisson distribution is appropriate for penalty outcomes given the fixed amount of time in each game and the discrete counts. Sure enough, the 55% rate increase is statistically significant when comparing mean penalties pre and post-hit, and it is quite unlikely that the difference could be accounted for by chance alone. For those scoring at home, the p-value is less than 0.0001, and the 95% confidence interval for the rate increase goes from 19% to 106%.
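As a rough illustration of this kind of comparison, the sketch below computes a Poisson rate ratio with a Wald interval on the log scale. It is not the analysis behind the numbers above: the penalty totals are back-calculated from the rounded per-game rates and game counts in this post, the function name is mine, and the resulting interval should not be expected to match the one reported.

```python
import math

def poisson_rate_ratio(count_a, exposure_a, count_b, exposure_b, z=1.96):
    """Ratio of two Poisson rates (b over a) with a Wald interval on the log scale."""
    ratio = (count_b / exposure_b) / (count_a / exposure_a)
    se_log = math.sqrt(1 / count_a + 1 / count_b)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    return ratio, lower, upper

# Assumed totals: rounded per-game rates from the text times games played.
pre_games, post_games = 129, 92
pre_count = round(2.1 * pre_games)    # 271 penalties (approximate)
post_count = round(3.3 * post_games)  # 304 penalties (approximate)

ratio, lower, upper = poisson_rate_ratio(pre_count, pre_games, post_count, post_games)
print(f"rate ratio {ratio:.2f}, interval {lower:.2f} to {upper:.2f}")
```

An interval that excludes 1 points the same way as the result above, although a model fit to game-level data would be the more careful approach.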

Assuming we can rule out luck (or bad-luck) of the draw, what does this suggest?

1. Officials are implicitly making more calls against Calgary to get revenge. This is more feasible given how much subjectivity is involved in several NHL violations. We already know that refs are prone to make-up calls, and that they base penalty decisions on other silly factors – so we shouldn’t be surprised that they’d take a measure of revenge, either. Wideman’s hit was egregious, and refs may be punishing his team for it.
2. Another variable is responsible for the jump, but we don’t know what that variable is. As a related sports example, a few years back, an NFL analyst argued that time-varying Patriots fumble rates were a sign that New England was cheating. What was missing in the initial analysis is that there were several other reasons (e.g., more kneel downs, red zone plays, plays with the lead) driving the Pats’ low rates. Indeed, part of the reason why the Patriots didn’t fumble is because they were running plays that generally did not lead to fumbles. Could we be missing a similar confounding variable here, one that is artificially responsible for the increased penalties? Maybe. I’m open to ideas. In this respect, it’s particularly interesting that Calgary’s drawn penalties have stayed the same. The Flames don’t appear to be playing a more aggressive game since January of last year.
3. Calgary wasn’t the same team after the Wideman suspension. This would stand if the jump in penalties matched the length of the Wideman suspension. In other words, perhaps Wideman’s replacements were aggressive players. However, Wideman was suspended 20 games, and the spike in penalty calls has lasted much longer.

As one additional sign that (1) is responsible for part of Calgary’s jump in taken penalties, it’s worth revisiting the chart. If you look at the last month of play, Calgary’s taken penalty average has dipped.

At more than one penalty a game across more than a full season of play, it’s easy, albeit unsafe, to extrapolate and estimate that Calgary’s jump in penalties was worth about 20 goals against. This is an incredible total. Even if only part of Calgary’s increase in penalties was due to a revenge factor, the biggest impact of Wideman’s hit on the Flames wasn’t felt in his suspension, but in the penalty box.

Postscript: A loyal reader asked me to compare the rates of all teams, both pre and post Wideman hit. Here’s that chart, using data from the nhlscrapr package in R.

We can also look at all team-seasons across several years to get a distribution of changes in penalty rates before and after when Wideman’s hit occurred (roughly halfway through the season, on January 27). Here’s a histogram of those differences. No one’s near Calgary in 15-16.