Sunday is a day for relaxing. Just ask the NHL

A few weeks back, folks in the Harvard Sports Analysis Collective (HSAC) looked at the shooting rates of NBA players based on whether or not the game fell on a Sunday (link).  While the evidence was mostly inconclusive in the HSAC study, I thought it was a good idea to check whether similar results exist during Sunday NHL games.

Piggybacking on recent work from Carnegie Mellon’s and War-on-Ice’s Sam Ventura, I calculated the expected goals in each NHL contest between the start of the 2005 season and February 1, 2015, using a logistic regression model based on the type of shot, shot location, and shot distance. I then broke these rates down by the game’s minute, the score, and the day of the week on which the game was played. Note that I’m using expected goals because Sam and other folks have shown them to be as predictive of future goals as, if not more predictive than, traditional statistics like goals or shots.
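Sam’s actual model isn’t reproduced here, but the mechanics are easy to sketch: a fitted logistic regression assigns every shot a goal probability, and a game’s expected goals are just the sum of those shot-level probabilities. Here’s a minimal Python sketch with invented coefficients and shots (none of these numbers come from the real model):

```python
import math

# Invented logistic-regression coefficients: intercept, wrist-shot indicator,
# slot-location indicator, and shot distance (feet). Illustration only.
COEF = {"intercept": -1.5, "wrist": 0.2, "slot": 0.9, "dist": -0.04}

def shot_probability(shot):
    """Goal probability for one shot under the toy model."""
    z = (COEF["intercept"]
         + COEF["wrist"] * shot["wrist"]
         + COEF["slot"] * shot["slot"]
         + COEF["dist"] * shot["dist"])
    return 1 / (1 + math.exp(-z))

def expected_goals(shots):
    """A team's expected goals: the sum of its shot-level probabilities."""
    return sum(shot_probability(s) for s in shots)

# Three hypothetical shots by one team
shots = [
    {"wrist": 1, "slot": 1, "dist": 15},  # wrist shot from the slot
    {"wrist": 0, "slot": 0, "dist": 45},  # shot from the point
    {"wrist": 1, "slot": 0, "dist": 30},
]
print(round(expected_goals(shots), 2))  # about 0.38 expected goals
```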

Looking at the first period, I plotted the expected goal rate in each tied-game minute (Minutes 1 through 20), using one color for games that occurred on a Sunday and another for games on all other days. I also included 95% confidence bands for the loess smoother, although I should note that these are the default bands in the ggplot2 package and might not fully account for the variability in each rate.


At any rate, the difference between Sundays and all other days of the week was much larger than I anticipated. It does seem plausible that Sunday NHL games take a less aggressive tone, at least in terms of expected offensive output.

And here’s a plot of Sunday compared to every other weekday:

[Figure: NHL expected goal rates, Sunday vs. each other weekday]

Of course, it’s impossible to tell exactly what drives the potential drop in expected goals on Sundays. For example, it could simply be that Sunday games tend to be the second of back-to-backs for one or both participating teams, which might mean tired legs. It could also be that Sunday games tend to be matinees, although the same could be said for Saturday. (Note: reader Justen Fox points out that nearly all Sunday games are played before nighttime, compared to only about a quarter of Saturday games.)

In any case, I thought that this was an interesting result worth sharing.

Also, I’m working on a more extensive and related project that should finish up within the month, and at that point, I’ll happily share code.

Here’s what I did on the first day of statistics class

Ample literature has gone into what teachers should do on the first day of class. Should they do an ice-breaker? Dive right into notes? Review a few example questions to motivate the course?

I don’t really have control groups to use as a comparison, but I think these two activities were helpful and engaging, and I figured it was worth passing along.

Introduction to Statistics (Intro level, undergrad)

I stole this one from Gelman and Glickman‘s “Demonstrations for Introductory Probability and Statistics.”

When the students came in, I split the course (approximately 25 students) into eight groups. Each group was given a sheet of paper with a picture on it, and the groups were tasked with identifying the age of the subject in question. I had some fun coming up with the pictures – I went back to the 90s with T-Boz from TLC and Javy Lopez of the Atlanta Braves, added an impossibly 52-years-of-age Sheryl Crow, and, my personal favorite, Flo from the Progressive commercials.


How old do you think Flo is?

Anyway, taking approximately one minute with each picture, the groups, without realizing it, started talking confidence intervals (“No way she’s not between 40 and 50”) and point estimates (“My best guess is 45”). That was good. At the end, I collected the pictures and revealed the ages.

Next, using one picture as an example, we made a table of some of the class guesses, and went through and calculated estimated errors for each group. This led to the obvious discussion of what metric would be useful for comparing group accuracy. For example, taking the average error would be problematic because the negatives and positives would cancel out. The class settled on mean absolute error, but we also discussed mean squared error and some form of relative error, which would account for the fact that there might be more error with older subjects.

If you were wondering, most groups had an average absolute error of about 5 or 6 years (the winning group was less than 3), and Flo is 44 years old.
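If you want to make the error-metric discussion concrete, here’s a short Python sketch using made-up guesses for a single subject (the numbers are hypothetical, not my class’s actual guesses):

```python
# Hypothetical group guesses for one subject, plus the subject's true age
guesses = [45, 40, 50, 42, 38, 47, 44, 41]
true_age = 44

errors = [g - true_age for g in guesses]

# Average error: negatives and positives cancel, hiding the inaccuracy
mean_error = sum(errors) / len(errors)

# Mean absolute error and mean squared error avoid the cancellation
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)

print(mean_error, mae, mse)  # -0.625 3.125 13.875
```

Note how the raw average error (-0.625) makes the guesses look nearly perfect, while the MAE shows the groups were off by about three years on average.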

Probability and Statistics (upper level)

Rock-paper-scissors (RPS) is one of my favorite games, so I split the class (approximately 25 students) into pairs to play a best-of-20 RPS series. After each throw, each student was responsible for writing down his or her choice on that turn (R, P, or S). At the end of the series, we had a set of sequences, with each element of the sequence drawn from the letters ‘R’, ‘P’, and ‘S.’

Next, we did some quick analysis of these sequences. While several metrics were possibly of interest, I focused on the longest run of consecutive identical throws. For example, if the sequence went:

R, R, P, P, P, S, P, S

the longest run would have been three (3 consecutive papers).

Next, we compared the distribution of the class to what would have occurred had the throws been randomly chosen. This was easy to do using R/Rstudio.

# Simulate 10,000 random 20-throw sequences and record the longest run in each
maxSeq <- replicate(10000, max(rle(sample(c("R", "P", "S"), 20, replace = TRUE))$lengths))
hist(maxSeq, col = "blue", main = "Histogram of maximum consecutive throws")

And here’s the resulting histogram, which represents the frequency of maximum consecutive throws in 10,000 randomly drawn sequences of 20 RPS throws.

Screen Shot 2015-01-20 at 9.55.18 PM

The mode of the histogram is 3, and the class was quick to pick up on the fact that, if throws were randomly drawn, we would have expected more maximums of 4 than of 2. Of course, in our class of about 25 students, there were many more 2s than 4s. Such evidence is not surprising, however, given that human nature tends to underestimate the true randomness of numbers (Wiki has a few examples). This activity helps to confirm it, and it also gives students the chance to meet one another, learn about Monte Carlo techniques, and gain a quick introduction to R/RStudio, all while playing rock-paper-scissors. I also ended by showing the class this New York Times RPS game, in which you can play the computer on either ‘novice’ or ‘expert’ mode.

Obviously, more went into the courses after these two activities, but I think I’ll go back to them in the future. It was certainly more fun than starting with the syllabus.

Everyone asked for Tebow. So, here he is

My recent article for FiveThirtyEight on QBR segmentation using density curves apparently had one big missing component, and his name is Tim Tebow.

After receiving multiple tweets, Facebook comments, and emails, each of which asked where Tebow was, I figured I should give the people what they want.

Tebow has 15 games to his name, and while that’s not a great sample size to learn much from, here are the QBs with the most similar distributional curves to Tebow’s: Dan Orlovsky and Seneca Wallace.

And here’s the set of density curves for that threesome. Lots of mediocre games (QBR around 50-60) from this group, and while each QB only had a few terrible games, there were very few, if any, outstanding ones.


Going beyond the mean to analyze QB performance

A few months ago, my friend & writer Noah Davis asked me a question that was bothering him. I’ll paraphrase, but this was roughly what he said:

Does consistency matter for quarterbacks? Like would you rather have an average QB who is never really great, or a good QB who occasionally sucks?

Well, fortunately there are ways to measure performance consistency, and one of them is the standard deviation. QBs with high standard deviations in their game-by-game metrics are the less consistent ones, and vice versa.

But perhaps an even better idea than just measuring each QB’s standard deviation of a certain metric is to compare the overall distribution of performance. This can be done with many tools, and we chose density curves, which are rough approximations of the smoothed lines one would fit over a histogram.
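For the curious, a density curve can be built by hand: a kernel density estimate places a small Gaussian bump at each observation and averages them. Here’s a stripped-down Python sketch on hypothetical QBR values (in practice you’d use R’s density() or a stats library):

```python
import math

def kde(data, bandwidth):
    """Return f(x): the average of Gaussian bumps centered at each point."""
    n = len(data)
    norm = bandwidth * math.sqrt(2 * math.pi)
    def f(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                   for xi in data) / (n * norm)
    return f

# Hypothetical game-by-game QBR values for one quarterback
qbr = [55, 60, 58, 30, 75, 62, 48, 80, 25, 65]
density = kde(qbr, bandwidth=8)

# The curve is highest where the games pile up, near QBR 60
print(density(60) > density(10))  # True
```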

The culmination of our project on QB density curves is summarized here on FiveThirtyEight. In addition, I created this Shiny app using the R statistical software, which allows users to (i) graph the density curves of their quarterbacks, (ii) contrast any given QB’s home and away performances, and (iii) identify, for any given QB, the three other players with the closest curves. We chose ESPN’s Total QBR as our metric of interest.


There are a few finer points to the analysis, however, and I figured it was worth describing them in case any readers were interested or had ideas for future work.

First, I considered a few options for grouping the players, including model-based clustering (see this recent post by Brian Mills on pitcher groupings). But the problem I kept running into with a model-based approach is that it assumes that the underlying distribution behind the data is Normal. Given the strange shapes in QB performance (including bimodal curves and curves that were strongly skewed right or left), I wasn’t comfortable with this approach.

We settled on K-means clustering (KMC) with k = 10, which I think did a decent job of grouping players with similar curves. We tried anywhere from k = 2 to 15, and then checked some of the within- and between-group metrics for each k. We found the best performance between k = 8 and k = 10, as judged by the elbow method, and the curves looked much easier to interpret with k = 10. Beyond 10 clusters, there was too good a chance that a cluster would end up with only one quarterback in it, which did not seem ideal.
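For readers unfamiliar with the elbow method, the idea is to plot the within-cluster sum of squares against k and look for the bend where adding clusters stops helping. Here’s a toy Python sketch with a hand-rolled one-dimensional k-means (deterministic initialization and invented data, not our actual QB curves):

```python
def kmeans_1d(points, k, iters=25):
    """Plain 1-D k-means with deterministic starts; returns (centers, WSS)."""
    centers = points[:k]  # naive but deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    wss = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, wss

# Invented data with two obvious clumps: the elbow should appear at k = 2
points = [10, 11, 12, 13, 50, 51, 52, 53]
for k in (1, 2, 3):
    print(k, round(kmeans_1d(points, k)[1], 1))
```

The within-cluster sum of squares collapses going from k = 1 to k = 2 and barely moves at k = 3, which is exactly the bend the elbow method looks for.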

There are a few issues with KMC, however, one of which is that players can jump back and forth between clusters depending on the algorithm and the inputs. Worse, it’s difficult to measure error. For example, Tom Brady ended up in a cluster with Aaron Rodgers in nearly every one, if not all, of our iterations. However, Brady was also sometimes matched with Drew Brees, who, when not in the Brady, Manning, and Rodgers group (the ‘Elites’), was always with Matt Ryan. As a result, cluster membership isn’t fixed. Once we had finalized k = 10, we ran several iterations of the clustering and chose the one with the highest within-cluster similarity.

That said, part of the reason for creating the app was to allow people to compare any quarterbacks they wanted, without having to rely on the clustering. For comparing one quarterback to all of his peers, distributional similarity can be judged in a few ways. I used pairwise Kolmogorov-Smirnov tests of distributional equality, which are preferred over, for example, two-sample t-tests or Mann-Whitney tests, because the former are sensitive to both a distribution’s center and its shape. This is a good thing for us, because quarterbacks with bimodal shapes (e.g., Brett Favre), which signify sets of performances that are both really good and really bad, are matched to others with bimodal shapes (e.g., Michael Vick).
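The K-S statistic itself is simple: it’s the largest vertical gap between the two samples’ empirical CDFs. Here’s a hand-rolled Python sketch on hypothetical QBR samples; the two samples below have nearly identical means (50 vs. 53), so a t-test would see little difference, while the K-S statistic flags the difference in shape (real work would use R’s ks.test() or an equivalent):

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    xs = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in xs)

# Hypothetical QBR samples: a bimodal QB vs. a steady one
bimodal = [10, 15, 20, 80, 85, 90]
steady = [48, 50, 52, 54, 56, 58]
print(ks_statistic(bimodal, steady))  # 0.5
```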

Peyton Manning and the postseason

In the wake of Sunday’s disappointing effort against Indianapolis, several scribes were quick to either (i) degrade Peyton Manning for a questionable postseason history, or (ii) defend Peyton Manning’s postseason history, likely with citations of a small sample size and a more difficult level of competition.

As in many discussions with statistics, the analyses with Manning mostly came down to simple averages or cherry-picked examples.

Ex 1: Peyton Manning career postseason QB rating 89.2 Brady? 88.0 You do the math.

Ex 2: Peyton Manning falls to 8-6 at home in postseason. No other starting QB in Super Bowl era has lost more than 3 home playoff games.

But using these types of arguments is both unimaginative and uninformative. There are better ways to consider Manning’s postseason performance level, right?

Well, one improved method for a player comparison is to use individual game statistics, as opposed to taking a simple average. So, going back to the start of their careers, I extracted Manning and Brady’s game-by-game passer rating, for both regular and postseason games.

A useful tool for comparing the distributions of continuous variables is a density curve, which represents the smoothed line we would fit over a histogram of the distribution. But unlike histograms, which can be difficult to compare, density curves make it easy to compare centers, shapes, and spreads (if you want to read more on density curves in sports, check out War-on-Ice’s post here).

Here’s a graph of the density curves for Brady (blue) and Manning (red), split by game type (posteason, regular season).


The curves for Manning and Brady’s regular season performance are nearly identical. Manning’s curve is slightly to the right, implying that, overall, his game-by-game passer rating distribution tends to be a bit higher than Brady’s. Comparing the top graph to the bottom graph, both quarterbacks have notably higher centers of quarterback rating in the regular season. Of course, this isn’t surprising, given that the competition is more difficult in the playoffs.

In the postseason, the distribution of passer ratings for each QB is once again centered at roughly the same location (about 90), but Manning has a higher density in both tails. This implies that Manning has had more games that have been both really good (QB rating ~ 150) and really bad (QB rating ~ 25).

Perhaps such a conclusion isn’t all that surprising. A few putrid postseason games from Manning can linger in people’s minds, particularly when his biggest adversary (Brady) has avoided them. Such a result could be a sign of several things; we’ll never know if the difference is due to sampling variability, a small sample size, or the chance that #18 is indeed more likely to put up a terrible game come January.

Hopefully this post encourages more looks at the distributions of player metrics (in fact, I’m working on a longer such piece as we speak). Also, I should point out that passer rating isn’t a great metric for evaluating performance, as, among other weaknesses, it is highly dependent on throwing touchdown passes and avoiding interceptions. However, passer rating was the only such game-by-game metric I could get dating back to the start of Manning’s career.

Correlated parlays and the NFL

This upcoming weekend in the NFL playoffs, Seattle is favored to beat Carolina by 11 points, and the over/under for the game has been set at 40 total points.  A loyal reader points out the following:

It's just so unusual (in my mind) to see 11 point spread with a total as low as 40

This brings up a few questions, centered around the idea of a correlated parlay. In a correlated parlay, if one of the bets successfully occurs, it is also increasingly likely that another successful bet will occur. For example, placing a bet on the Cavaliers to win the NBA finals and LeBron to win Finals MVP is a correlated parlay, because if the first occurs, the second is also likely to occur (h/t to @wmguo for the example).
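To see why correlation matters to a parlay bettor, here’s a toy Python simulation with invented probabilities: when the title bet and the MVP bet are linked, the parlay hits far more often than independence would suggest.

```python
import random

random.seed(42)

# Toy model: the team wins the title with prob 0.3; given a title, its star
# wins MVP with prob 0.8, otherwise with prob 0.05. Numbers are invented.
trials = 100_000
title = mvp = both = 0
for _ in range(trials):
    t = random.random() < 0.3
    m = random.random() < (0.8 if t else 0.05)
    title += t
    mvp += m
    both += t and m

p_title, p_mvp, p_both = title / trials, mvp / trials, both / trials
# Empirical P(both) vs. what independence (the product) would predict
print(p_both, p_title * p_mvp)
```

In this setup P(both) is about 0.24, roughly triple the product of the two marginal probabilities, which is why sportsbooks typically refuse or reprice such parlays.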

My goal in this post is to consider whether or not NFL totals and game spreads are correlated. To do so, I used Sunshine’s NFL historical database (here) and extracted every regular season game since 1979, which yields about 8,000 games. Next, I split each game’s line into (roughly) evenly spaced quantiles before answering the following questions.

Question 1: Does the total’s outcome (over vs. under) correlate with the game’s spread?

Here’s a graph of the likelihood of the game hitting the ‘over’ by spread interval. Most of the graph is evenly scattered about 0.5, although at the extremes there is some variation. The dot in each interval is the sample proportion, and the lines represent 95% confidence intervals for each point estimate.


So, overall, it doesn’t look like there’s any obvious trend in betting the game’s total based on its spread. This is expected. It shouldn’t be that easy to predict game totals.
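For what it’s worth, confidence intervals like the ones in the plot above can be approximated with the usual normal-approximation interval for a sample proportion. A quick Python sketch with hypothetical counts:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# E.g., suppose 260 of 500 games in one spread interval hit the over
lo, hi = proportion_ci(260, 500)
print(round(lo, 3), round(hi, 3))  # 0.476 0.564
```

Since that interval comfortably contains 0.5, an interval like this one offers no evidence of an edge on the total.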

Question 2: Conditional on knowing whether or not the favorite covered, what does the graph look like?

As an example, if the Seahawks cover this weekend, that means there will have been at least 12 points scored in the game. Would this information aid in making a decision on the game’s total?

Here’s a graph of the over probabilities in games in which the favorite covered:


And here’s a graph of the over probabilities in games in which the underdog covered:


Overall, it doesn’t look like there’s much association between the game’s line and total, even when knowing whether or not the favorite covered.

Question 3: But what about games with extreme totals?

Here are some more graphs, in which I look at games with lower (<41) or higher (>51) totals.

Looking at games where the total was 40 or below in which the favorite covered, there does seem to be a weak association between the game’s line and the total. On the right and left sides of the graph, notice the point estimates above 0.50.


In games in which the favorite covered such a large spread with a smaller game total, the over hits about 55% of the time. Interesting, but not a crazy strong association.

Finally, looking at games with extremely high totals (51 or above), here’s a graph of the probability of the over conditional on the favorite covering.


This was also interesting. For several of the intervals, when the favorite covered there was a possible tendency toward a higher proportion of unders. In the graph, this is shown by several of the point estimates (particularly in the middle of the graph) falling below 0.50.

Of course, we are looking at several intervals and several situations, and it’s certainly feasible this trend could be due to chance. Further, the league has changed over the past several years, and information from a few decades ago might not be all that applicable.

Overall, it looks like any associations between the game’s total and spread results are limited. Thus, if you like Seattle this weekend to cover against Carolina, don’t feel like you have to take the over or the under to go with it.


As a postscript, I got two great responses when I posed a related question on Twitter.



I followed up with Frank, who was using only more recent data (2003-current), while also including the playoffs. Interesting, as I got between 55% and 60% looking at regular season data over a longer time period.

Stat pundit rankings: 2014 NFL win over/unders

We are back for another edition of the stat pundit rankings, where we rank the accuracy of different team-win predictions from statistics- or simulation-based websites. Team Rankings boasted the best performance last year, outperforming both its competitors and the totals set by sportsbooks in predicting 2013 regular season win totals.

Let’s meet our competitors for 2014:

Team Rankings (TR), predictions listed here

Accuscore (AS), predictions emailed by a loyal reader

FiveThirtyEight (538), predictions extracted the week before the regular season began (missing link)

Prediction Machine (PM), predictions listed here, released just after the season began

Football Outsiders (FO), projections listed here from just before the season began

Aggregate, the average of the stathead predictions from the five sites above

Finally, we will want to compare all the projections to lines set by sportsbooks. To do so, I used the implied lines used by Seth Burn in his preseason post, done a week before the regular season, which accounts for both the sportsbook total and the vig. I label this in my graphs as ‘Vegas’ for simplicity.

I’ll also include last year’s win totals (2013) and a method that assigns eight wins to each team (Eights), just to see how things line up.

1. Which site boasted the most accurate predictions?

To answer this question, we use two metrics: the mean absolute error (the average distance between a team’s observed and predicted win totals) and the mean squared error (the average squared distance). The mean absolute error is more easily interpretable, but the mean squared error places a harsher punishment on predictions that are further from the observed total.

By both metrics, Team Rankings is the only prediction site to outperform Vegas, doing so by about an eighth of a win, on average. Overall, sportsbooks were about 0.15 wins closer to the observed win totals in 2014 than in 2013, missing by an average of about two wins per team. This suggests that the 2014 season was slightly easier to predict than the 2013 one.

Here’s a barplot of both metrics for each of our predictors.

First, mean absolute error. All predictors except for the Outsiders were better at predicting 2014 than simply using each team’s 2013 wins.


Overall, using eight wins for each team leaves you with an average error of about 2.5 wins per team; the best statistics site that we looked at reduced this to about 1.90. The average of the statheads sites also did fairly well, with an average error right around the sportsbooks’ one.

Next is mean squared error. Everyone did better than just using last year’s win totals.


2. How would using the sites to place over and under bets have done?

Let’s look at how each site would have done had we used its predictions to place mythical wagers on each team’s total. Overall, Team Rankings and Accuscore led the way, accurately predicting 20 and 19 of the 32 sides, respectively. FiveThirtyEight hit on exactly half of the 32 sides, while Prediction Machine and the Outsiders hit on only 14 and 12 sides, respectively.

Looking at each site’s favorite predictions – that is, the five for which each site deviated from the sportsbook predictions by the largest absolute value – Team Rankings (3-2) and Accuscore (4-1) were the only ones to finish above 0.500.

The sites’ favorite successful predictions were the unders for Chicago and the New York Jets. On average, however, the sites also liked the unders for New England and Green Bay, doing so unsuccessfully.

Most of the predictors accurately hit on the Minnesota, Houston, and San Diego overs, while missing on over bets for the Jaguars and Rams.

3. What does the graph look like?

I sorted each team by its predicted win total, and plotted observed wins and predicted wins. For predicted wins, I both used the sportsbook totals I described earlier, as well as the predicted wins for each team averaged across each of the five websites (called the ‘statheads’ prediction).

The graph makes obvious the depths to which the Bucs, Titans, and Bears underachieved, while the Cowboys, Cardinals, and Lions were among the biggest surprises.


4. Is there anything else I should know?


Several sites make weekly ATS picks, so let’s summarize the 2014 season. Here, I’ll focus on the websites that make it easily accessible to track their efforts.

Seth Burn tracked Football Outsiders’ weekly ATS picks, which finished an unofficial 111-136-9. This would put the Outsiders around the bottom 6th percentile if we were to simulate all NFL game picks by flipping a coin. Seth also tracked the Outsiders’ selections using the Kelly criterion, where more money is mythically wagered on the strongest selections. Things did not go well, with the picks finishing well in the red. The 2014 results follow a disappointing 2013 effort, in which the Outsiders finished 11 games below 0.500. The website finished above 0.500 during its first four years of making picks, from 2008 through 2012, with yearly finishes as high as 57% on all picks.
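That “bottom 6th percentile” figure is easy to check: setting aside the 9 pushes, the tally works out to 111 wins in 247 coin-flip-style decisions, and an exact binomial calculation puts that around the 6th percentile. A quick Python check:

```python
import math

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1))

# 111 wins, 136 losses against the spread (pushes excluded): n = 247
percentile = binom_cdf(111, 247)
print(round(100 * percentile, 1))  # percentile of a coin-flipping picker
```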

Massey-Peabody finished 36.5-30-1.5 on its official plays, which, while slightly down when compared to its 2013 performance, was still a reasonable effort.

Finally, Team Rankings finished at 31-24 on its most recommended plays of 2014 (56%), after finishing at just 43% in 2013. On all picks, the site finished 2 games below 0.500.

(Thanks to Greg over at for his ideas and help with this!)