NHL game outcomes using R and Hockey Reference

I’m always impressed with the content and accessibility of the Baseball with R website (here), which features a great cast of statisticians writing about everything from Hall of Fame entry to umpire bias.

In a similar vein, I highly recommend Sam and AC’s nhlscrapr package in R. I’ve used it extensively to analyze play-by-play data from past seasons (for example, this post on momentum in hockey).

However, I have a soft spot for overtime outcomes in the NHL, and while the nhlscrapr package has game-by-game results, there isn’t a straightforward mechanism for identifying whether or not a given game went to overtime. Further, data in the nhlscrapr package only goes back about a decade.

Thankfully, Hockey Reference has easily accessible (and scrapable) tables for us to use. Given that I am doing some updated analyses of NHL overtime rates, and that I wanted an easier method than copying and pasting .csv files from nhl.com, I figured I would post the code that I used to scrape NHL game outcomes. The code that follows extracts each game’s outcome for the last five years; if you are interested in other years, it’s easy enough to change the URLs.
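On that last point, the season URLs don’t have to be edited by hand. Here is a minimal sketch that builds them from a range of years. The helper name season_urls is hypothetical (it is not part of my script), and it assumes the NHL_<year>_games.html pattern used in the listing further down holds for whichever seasons you ask for.

```r
# Hypothetical helper, not part of the original script: build one schedule
# URL per season. Assumes every requested year follows the same
# NHL_<year>_games.html pattern as the five seasons scraped in this post.
season_urls <- function(years) {
  sprintf("http://www.hockey-reference.com/leagues/NHL_%d_games.html",
          as.integer(years))
}

# Same five seasons as the hand-written vector in the script.
urls <- season_urls(2011:2015)
print(urls)
```

Swapping 2011:2015 for another range is then the only change needed, as long as the pages for those years exist.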

Feel free to use, and hope you enjoy!

library(XML)
library(stringr)

nhl<-NULL
urls<- c("http://www.hockey-reference.com/leagues/NHL_2011_games.html",
 "http://www.hockey-reference.com/leagues/NHL_2012_games.html",
 "http://www.hockey-reference.com/leagues/NHL_2013_games.html",
 "http://www.hockey-reference.com/leagues/NHL_2014_games.html",
 "http://www.hockey-reference.com/leagues/NHL_2015_games.html")

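# Read every table on each season page and keep the one with the most rows,
# which is assumed to be the schedule and results table
for (i in seq_along(urls)){
 tables <- readHTMLTable(urls[i])
 n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
 temp<-tables[[which.max(n.rows)]]
 nhl<-rbind(nhl,temp)
}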

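# Name the columns, then check which values appear in the overtime column
names(nhl)<-c("Date","Visitor","VisGoals","Home","HomeGoals","OTCat","Notes")
table(nhl$OTCat)
# Drop rows showing "Get Tickets" instead of a result (presumably games not yet played)
nhl<-nhl[nhl$OTCat!="Get Tickets",]
# TRUE if the game went to overtime (OT) or a shootout (SO)
nhl$OT<-nhl$OTCat=="OT"|nhl$OTCat=="SO"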

Voilà! That easy. We are in business with a few lines of code.

Here’s the output:

[Screenshot of the output, captured 2014-11-21]
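With an overtime flag in hand, the overtime rate per season is a short step away. The sketch below runs on a handful of made-up rows rather than the scraped data, so the numbers mean nothing. It also makes two assumptions that are not in the script: that the Date column has been converted with as.Date, and that a season is labeled by the year it ends in, with games from July onward counted toward the following year.

```r
# Toy stand-in for the scraped data frame; these rows are invented for
# illustration and are not real game results.
games <- data.frame(
  Date  = as.Date(c("2013-10-01", "2013-11-15", "2014-01-20",
                    "2014-10-08", "2015-02-03", "2015-03-12")),
  OTCat = c("", "OT", "SO", "", "OT", ""),
  stringsAsFactors = FALSE
)

# Flag games that went to overtime or a shootout, as in the script.
games$OT <- games$OTCat %in% c("OT", "SO")

# Assumption: label each season by the year it ends in, so games played
# from July onward belong to the next calendar year's season.
yr <- as.integer(format(games$Date, "%Y"))
mo <- as.integer(format(games$Date, "%m"))
games$Season <- ifelse(mo >= 7, yr + 1L, yr)

# Share of games reaching overtime, by season.
ot_rate <- tapply(games$OT, games$Season, mean)
print(round(ot_rate, 3))
```

On the real data, the same tapply call would work once Season is defined; the July cutoff is only one reasonable way to split seasons.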
