On February 20, the National Hockey League announced a partnership with software company SAP. The alliance's primary purpose was to bring a new enhanced stats section to NHL.com, built in the shadow of popular analytics sites like war-on-ice and the now-dormant extra-skater.
It was, it seemed, a partial admission from the league that its best metrics were hosted elsewhere.
“The stats landscape in the NHL is kind of all over the place,” suggested Chris Foster, Director of the NHL’s Digital Business Development, at the time. “One of the goals is to make sure that all of the tools that fans need are on NHL.com.”
One tool presented in February was SAP’s Matchup Analysis, designed to predict the league’s postseason play. The tool claimed 85% accuracy, which Yahoo’s Puck Daddy boasted was good enough to make “TV executives nervous and sports [bettors] rather happy.”
There’s just one problem.
85% is way too high in the NHL.
Specifically, at the series level, 85% accuracy is a crazy-good number in the short term, and likely impossible to achieve long-term. And at the game level, 85% is reachable only in a world with unicorns and tooth fairies. More reasonable upper bounds for game and series predictions, in fact, lie around 60% and 70%, respectively.
So what in the name of Lord Stanley is going on?
To start, the model began with 240 variables, eventually settling on the 37 determined to have the best predictive power. Two sources (one, two) indicate the tool used 15 seasons of playoff outcomes, although an SAP representative is also quoted as saying that the model in fact used four or five years' worth of data. This is a big difference, as starting from 240 variables is risky even with 15 seasons of data (225 playoff series), much less five.
But it's also unclear whether the model was predicting playoff games or playoff series. Puck Daddy, like most others, indicated that it was meant for predicting playoff series, but in its own press release, SAP indicated that the 85% actually applied to game-level data.
So as the details of the algorithm remain spotty, here are two guesses at what happened.
1- SAP’s 85% is an in-sample prediction, not an out-of-sample one.
Let's come up with a silly strategy: always pick the Kings, Bruins, and Blackhawks to win. In all other series, or when two of those teams face one another, we'll pick the team whose city is listed second alphabetically.
This algorithm – using just four variables – wins at a 68% rate over the 2010-2014 postseasons. But note that the 68 percent is measured in-sample, on the very data I used to design the rule. Predictions are useful not in how they perform in-sample, but in how they do out-of-sample – that is, when they are applied to a data set other than the one in which they were generated.
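The picking rule is simple enough to sketch in a few lines of Python. The matchups below are illustrative examples, not the full 2010-2014 bracket:

```python
# The three "always pick" teams from the toy strategy above.
FAVORITES = {"Kings", "Bruins", "Blackhawks"}

def pick_winner(team_a, team_b, city_a, city_b):
    """Apply the toy rule: take a favorite over a non-favorite; otherwise
    (no favorites, or two favorites) take the team whose city is listed
    second alphabetically."""
    a_fav, b_fav = team_a in FAVORITES, team_b in FAVORITES
    if a_fav and not b_fav:
        return team_a
    if b_fav and not a_fav:
        return team_b
    # Tie-break: the city that sorts second alphabetically.
    return team_a if city_a > city_b else team_b

print(pick_winner("Kings", "Sharks", "Los Angeles", "San Jose"))     # Kings
print(pick_winner("Rangers", "Capitals", "New York", "Washington"))  # Capitals
```

A rule this arbitrary has no business winning two-thirds of its picks, which is exactly the point: with enough freedom in choosing the rule, in-sample accuracy is cheap.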
SAP's 85 percent seems feasible as an in-sample number only, and it's overzealous to use in-sample accuracy as a stand-in for out-of-sample performance. Our toy approach with four variables, for example, hits at a not-so-impressive 47% clip between 2006 and 2009.
2- SAP’s model included overfitting and multicollinearity, a dangerous strategy.
It’s easy to assume that using 240 variables is a good thing. It can be, but including so many variables with small sample sizes runs the risk of overfitting, where a statistical model includes too many predictors and ends up describing random noise instead of the underlying relationship.
And with so many candidate predictors, it shouldn't be surprising that at least one will look impressively accurate in a given sample of games. For example, shorthanded goal ratio, a somewhat superfluous metric, predicted more playoff series winners between 1984 and 1990 than either goals for or goals against.
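A quick simulation illustrates the trap: score a couple hundred predictors that are pure coin flips against a small set of outcomes, and the best of them will look predictive in-sample. This sketch uses made-up random data, not real NHL metrics; the 240 and 30 echo the variable count above and a small playoff sample:

```python
# 240 pure-noise predictors scored against 30 hypothetical series outcomes.
# Every predictor is a coin flip, yet the best one looks "predictive."
import random

random.seed(42)
n_series, n_predictors = 30, 240
outcomes = [random.randint(0, 1) for _ in range(n_series)]

best_accuracy = 0.0
for _ in range(n_predictors):
    preds = [random.randint(0, 1) for _ in range(n_series)]
    acc = sum(p == o for p, o in zip(preds, outcomes)) / n_series
    best_accuracy = max(best_accuracy, acc)

# With 240 tries, the winner typically lands well above 70% by luck alone.
print(f"best in-sample accuracy among pure-noise predictors: {best_accuracy:.0%}")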
Further, while it is tempting to simply combine several such predictors, many of these variables are likely correlated with one another. Including highly correlated variables in the same model introduces multicollinearity, which can make predictions sensitive to small changes in the data, particularly when the model is applied to out-of-sample data.
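Here is a minimal numpy sketch of what multicollinearity does, using synthetic data rather than hockey stats. Two nearly identical predictors make the fit ill-conditioned, so a tiny change in the outcomes swings the individual coefficients wildly, even though their sum stays stable:

```python
# Two nearly identical predictors produce an ill-conditioned design matrix:
# a tiny change in y causes a large swing in the fitted coefficients.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)   # almost a copy of x1
y = x1 + 0.1 * rng.normal(size=n)

X = np.column_stack([x1, x2])
print(f"condition number of X: {np.linalg.cond(X):.1e}")  # enormous

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_perturbed = y + 0.001 * rng.normal(size=n)              # tiny data change
b2, *_ = np.linalg.lstsq(X, y_perturbed, rcond=None)

print("coefficients before:", b)
print("coefficients after: ", b2)      # individual coefficients swing wildly
print("sums:", b.sum(), b2.sum())      # but their sum barely moves
```

The model's overall fit barely changes, but the weight assigned to each individual variable is nearly arbitrary, and that arbitrariness is exactly what gets exposed on new data.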
Fortunately for skeptics of SAP’s model, the 2015 postseason provides our first example of out-of-sample data with which to judge SAP’s predictions.
Through rounds one and two, the Matchup Analysis tool looks little different from a fair coin, correctly pegging seven of the twelve series winners. But a tool with 85% accuracy would pick 7 of 12 winners or worse only about 1 in 50 times (2%). In other words, unless the 2015 tournament is a 1-in-50 outlier, we can be confident that the model's true accuracy lies below the 85% threshold. Finally, keep in mind these are rounds one and two, which should be the easiest to predict, given that they tend to feature the largest gaps in team strength.
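The 1-in-50 figure is a straightforward binomial tail calculation, treating series outcomes as independent coin flips weighted at 85% (a simplification, since true per-series accuracy surely varies):

```python
# If each series pick hits with probability p = 0.85, how often does a
# 12-series stretch produce 7 or fewer correct picks?
from math import comb

def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p_seven_or_fewer = prob_at_most(7, 12, 0.85)
print(f"P(7 or fewer of 12 correct at 85%): {p_seven_or_fewer:.3f}")  # about 0.024
```

So a genuinely 85%-accurate tool goes 7-for-12 or worse only about 2.4% of the time, which is where the "1 in 50" comes from.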
The Matchup Analysis tool might be awesome, and perhaps it is more accurate than using any of the NHL's enhanced stats alone. However, it appears likely that the algorithm will fail to meet its own high standards; even if it accurately predicts each of the final three NHL series, the tool will finish at 10 of 15, short of 70%.
It has been said that sports organizations have a cold start problem with analytics. Writes Trey Causey, “How does an organization with no analytics talent successfully evaluate candidates for an analytics position?”
In such situations, it is easy to fall prey to sexy numbers like 85%. But like unicorns and tooth fairies, such predictive capabilities in the NHL are likely too good to be true.