
So you want a graduate degree in statistics?


After six years of graduate school – two at UMass-Amherst (MS, statistics), and four more at Brown University (PhD, biostatistics) – I am finally done (I think).

At this point, I have a few weeks left until my next challenge awaits, when I start a position at Skidmore College as an assistant professor in statistics this fall.

While my memories are fresh, I figured it might be a useful task to share some advice that I have picked up over the last several years. Thus, here’s a multi-part series on the lessons, trials, and tribulations of statistics graduate programs, from an n = 1 (or 2) perspective.

Part I: Deciding on a graduate program in statistics

Part II: Thriving in a graduate program in statistics

Part III: What I wish I had known before I started a graduate program in statistics  (with Greg Matthews)

Part IV: What I wish I had learned in my graduate program in statistics (with Greg Matthews)

The point of this series is to be as helpful as possible to students considering statistics graduate programs, now or at some point later in their lives. So if you have any comments, please feel free to share them below.

Also, I encourage anyone interested in this series to read two related pieces:

Cheers, and thanks for reading.

Fun examples of round number bias

Yesterday was a comparatively big day for round number bias.

If you are unfamiliar with this idea, Economist’s View describes round number bias as “the human tendency to pay special attention to numbers that are ‘round’ in some way.”

First, Facebook’s Sean Taylor drew me to this article, which gave several examples of round number bias that I hadn’t heard of before. Here are my two favorites:

Baseball: at the end of the season, the share of players who hit .300 or .301 was more than double the proportion who hit .299 or .298.

SATs: those who take the SAT and end up with a score just below a round number – like 990 or 1090 on what used to be a 1600-point scale – are much more likely to retake the test than those who score a round number or just above.

I’ll add two of my favorite examples to this list.

Male heights, via OkCupid: men near 6 feet tall are more likely to add a few inches to their actual heights, in order to reach the 6-foot benchmark, relative to other men. Here’s a graph – note the horizontal shift among OkCupid users who are just below 6 feet tall.

[Figure: distribution of self-reported male heights on OkCupid]

Marathon times: USC’s Eric Allen and colleagues found that runners make extra efforts to finish just before 5-minute cutoffs, with the largest effects for the 3 and 4 hour marks. Here’s another graph, this time a histogram (link).

[Figure: histogram of marathon finishing times]

It’s amazing how many runners finished just below 4 hours, relative to those who finished just above.

********

Here’s some other round number bragging that also occurred recently:

  • Greg challenged me to a race to a round number of Twitter followers. So, if you aren’t yet following me on Twitter, please change that.


  • The blog you are reading recently hit 50,000 views. Thanks for checking it out!

An NHL shootout lasted 20 rounds. How improbable was it?

Tonight’s Panthers/Capitals shootout lasted an improbable 20 rounds, with the teams recording the following round-by-round outcomes (X for misses, O for goals):

Washington: X X X O X X O X X O O X X X X X O X X X

Florida: X X X O X X O X X O O X X X X X O X X O

Estimating the likelihood of a shootout reaching 20 rounds is actually fairly straightforward, provided that you make some reasonable assumptions.

Assumption 1: The save percentage for each goalie is 67%.

For starters, the overall NHL save rate on shootouts is about 67%. And while I have made the point that not all NHL goalies are created equal in terms of stopping shootout attempts, in this game, both Florida goalie Roberto Luongo (67.3% save rate) and Washington counterpart Braden Holtby (66.0%) are nearly identical to the league average. As a result, 67% seems like a safe bet.

Assumption 2: Independent trials.

I don’t know for sure that each shootout attempt is independent of the others, but I have no reason not to believe in independence in this example.

Of course, you could argue that goalies get ‘hot.’  However, given that there were several goals mixed into the Panthers/Capitals string of attempts, it doesn’t seem like Luongo or Holtby were riding a hot-hand/hot-glove.

Under Assumption 1 and Assumption 2, there are two parts of the shootout that we need to consider. We start with the best-of-3, because NHL shootouts are only extended to extra rounds when the teams are tied after three rounds. Using a 67% save rate and the binomial distribution, the probability that a shootout remains tied after the first three rounds is about 34%. Let’s call this Pr(TiedAfter3).
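As a quick check, here’s that binomial calculation in R: the shootout stays alive exactly when both teams score the same number of goals in their first three attempts.

p.score <- 0.33  # each attempt beats the goalie with probability 1 - 0.67

# Pr(TiedAfter3): both teams score the same number of goals in the best-of-3
pr.tied.after.3 <- sum(dbinom(0:3, size = 3, prob = p.score)^2)
pr.tied.after.3  # about 0.34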

Next, we want to consider the probability that the shootout extends in each successive round. This is just the probability of both goalies making the save (0.67*0.67) or both allowing a goal (0.33*0.33), which sums to about 0.56 and is labelled as Pr(Extends).

So, the probability of a shootout lasting four rounds or more is just a function of Pr(TiedAfter3), Pr(Extends), and (1-Pr(Extends)).

For example, the probability a shootout is decided in 6 rounds is:

Pr(TiedAfter3) * Pr(Extends)* Pr(Extends)*(1-Pr(Extends)) = 0.05.

For 20 rounds, we just have lots of extensions to plug in.

Pr(TiedAfter3) * Pr(Extends)^16*(1-Pr(Extends)) = about 1.5 in 100,000. 

So, I estimate the probability of the shootout lasting exactly 20 rounds to be about 1.5 in 100,000.

Similarly, about 4 out of every 100,000 shootouts would last 20 rounds or greater.
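Here’s the rest of the arithmetic in R, reusing pr.tied.after.3 from above:

# A round extends the shootout when both shooters score or both miss
pr.extends <- 0.67^2 + 0.33^2  # about 0.56

# Decided in exactly 20 rounds: 16 straight extensions (rounds 4-19),
# then a decisive 20th round
pr.tied.after.3 * pr.extends^16 * (1 - pr.extends)

# Lasting 20 rounds or more: still tied after 19 rounds
pr.tied.after.3 * pr.extends^16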

With about 13% of NHL games coming down to the shootout (or 160 games per season), we could expect a shootout to last 20 rounds or longer roughly once every 200 years.

But what if your assumptions are invalid?

Glad you asked.

NHL rules stipulate that shooters may be used twice only after all players on each team have been given one attempt. Because this rule forces coaches to use their weaker offensive players, it’s reasonable to believe that even though Holtby and Luongo started the shootout at a 67% save rate, by the 18th shooter (the last on the bench), that rate was closer to 90%.

Allowing for (i) the 4th shooter to score at a 33% success rate and the 18th shooter at a 10% rate, (ii) a linear decline in offensive ability between shooters No. 4 and No. 18, and (iii) shooters No. 19 and No. 20 returning to a 33% rate, we get much more reasonable probabilities, as follows:

Probability of the shootout lasting exactly 20 rounds: 4 in 10,000

Under this version, we’d see a shootout lasting 20 rounds or more about once a decade.
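Here’s a sketch of this adjusted calculation in R, under assumptions (i) through (iii) above; the last line computes the probability that the shootout reaches a 20th round.

# Scoring rates by round: 33% in the best-of-3, a linear decline from 33%
# (shooter 4) to 10% (shooter 18), then back to 33% for shooters 19 and 20
p.score <- c(rep(0.33, 3),
             seq(0.33, 0.10, length.out = 15),
             0.33, 0.33)

pr.tied.after.3 <- sum(dbinom(0:3, size = 3, prob = 0.33)^2)

# Round-by-round extension probabilities: both shooters score or both miss
pr.extends <- p.score^2 + (1 - p.score)^2

# Probability of reaching a 20th round (still tied after 19 rounds)
pr.tied.after.3 * prod(pr.extends[4:19])  # about 4 in 10,000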

In the future, it would be interesting to see how these probabilities compare to the observed proportions of shootouts lasting each number of rounds.

Voter bias, football polls, and TCU

One of the topics undersold during the arguments over which four NCAA football teams deserved a spot in the college football playoff was the effect of voter bias on decision making.

Specifically, the literature has found NCAA football poll voters to be biased in a few ways.

Bias #1- Associated Press (AP) poll voters are biased towards teams (i) in the voter’s home state, (ii) in the same conference as teams in the voter’s state, (iii) in BCS conferences, and (iv) playing in more televised games.

Bias #2- Coaches poll voters are biased in favor of both their recent opponents and their alma maters.

Bias #3- AP voters are biased in favor of teams which were ranked higher earlier in the season.

It’s obviously too early to tell whether or not these biases will hold with the college football playoff selection committee over the long run. However, it’s particularly curious how the decision-making process manifested itself with respect to one team, at least in 2014.

TCU.

For starters, the Horned Frogs played only five games on national television, while the four playoff teams played the vast majority, if not all, of their games on ABC, ESPN, or FOX.

Even more startling, here’s a plot of the NCAA preseason and postseason AP ranks. For teams not ranked in the pre or postseason polls, I looked at the “others receiving votes” group of the poll, and also Massey rankings, to estimate team location.

Figure: Preseason and postseason AP poll ranks

AP voters had TCU lumped behind, among other teams, Marshall and Louisville in its 2014 preseason poll, and the Horned Frogs didn’t even make the Top-25 until the sixth week of the poll.

By season’s end, the Horned Frogs had jumped to No. 6, and were arguably the last team excluded by the playoff selection committee.

Why does this matter?

On December 7th, the committee ousted TCU in favor of Ohio State for the final spot in the playoff. Interestingly, that’s the same Buckeye team that the voters had ranked No. 5 in the preseason poll. In fact, all four teams selected by the committee ranked among the preseason AP top-5.

The decision to exclude TCU in favor of Ohio State or Florida State didn’t just happen on December 7th. With voters biased against teams starting behind in preseason polls and against teams with fewer games on national television, the decision to exclude TCU began well before that – in all likelihood, the decision-making process started before the regular season even began.

********

Postscript 1: The obvious solution to certain types of voter bias is to not take the first polls of the season until a few weeks after the season actually starts.

Postscript 2: It’s interesting how Mississippi State followed a nearly identical trajectory to TCU. In fact, TCU had 23 votes in the preseason poll, and MSU had 22.

Postscript 3: Given that I have coached or helped coach a high school football team in Massachusetts for more than a decade, I was interested in making the same graph with 2014 high school teams. To do so, I used ESPN Boston’s pre- and postseason team rankings. I also used computer rankings from calpreps to estimate where teams not ranked at the end of the season stood, and in some cases, those teams did not even fit on my graph’s y-axis (Doherty, Auburn, etc.).

Figure: 2014 Massachusetts football rankings (via ESPN Boston)

I don’t really know what this graph means, other than the fact that high school kids should take preseason rankings with a grain of salt. The team ESPN ranked No. 7 in August, for example, was ranked No. 75 by calpreps in January.

My guess is that high school kids worry about these types of things much more than they should.

Polls are mostly crapshoots – and, at least with college football polls, they are biased ones at that.

Building an NCAA men’s basketball prediction model

Last Spring, Loyola University Chicago statistics professor Greg Matthews and I won the March Machine Learning Mania contest run by Kaggle, which involved submitting game probabilities for every possible contest in the 2014 NCAA men’s basketball tournament.

Recently, we co-wrote a paper that motivates and summarizes the prediction model that we used. In addition to describing our entry, we also simulated the tournament 10,000 times in order to help quantify how likely it was that our submission would have won the Kaggle contest.

The paper has been submitted for publication at a journal, and we are crossing our fingers that it gets accepted. The pre-published version of the paper is up on arXiv (linked here).

Quick summary: to estimate the probabilities for each game, we merged two probability models, one using point spreads (Rd. 1) and estimated point spreads (Rds. 2-6) set by sports books, and the other using team efficiency metrics from Ken Pomeroy’s website.
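As a rough illustration of what merging two probability models can look like – the weight and function names below are made up for this sketch, and the paper describes the actual ensemble we used – one simple approach is a weighted average on the log-odds scale:

# Hypothetical merge of two game-probability models on the log-odds scale
logit <- function(p) log(p / (1 - p))
inv.logit <- function(x) 1 / (1 + exp(-x))

merge.probs <- function(p.spread, p.efficiency, w = 0.5) {
  inv.logit(w * logit(p.spread) + (1 - w) * logit(p.efficiency))
}

# e.g., a 65% spread-based probability and a 58% efficiency-based probability
merge.probs(0.65, 0.58)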

According to our simulations, we estimate that our odds of winning increased anywhere between 10-fold and 50-fold, relative to if the contest winner had been randomly chosen. Even under the most optimistic of game probability scenarios, however, that entry had no more than a 50-50 chance of finishing inside the top-10 of the Kaggle contest standings.

Also, we made this density plot of the winning Kaggle scores over all tournament simulations and under the simulations in which our entry was victorious. Overall, the winning scores in simulations our entry won tended to have a lower distribution in the tails.


Figure: Kaggle contest score under the log-loss function

Also, we found that the 2014 winning score was higher than the winning score under most of our tournament simulations. We posit that because a 7-seed (UConn) won the national title, the larger number of upsets likely inflated the loss function.
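For reference, the Kaggle contest scored each entry with the log-loss function; here’s a small R version with toy numbers, where y is 1 when the first-listed team won and p is the submitted probability:

# Log loss heavily penalizes confident predictions that turn out wrong,
# which is why an upset-heavy tournament inflates everyone's score
log.loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

log.loss(y = c(1, 0, 1), p = c(0.9, 0.2, 0.4))  # toy example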

Estimating causal effects with ordinal exposures

Just passing along a quick note from the world of academia: my adviser from Brown, Dr. Roee Gutman, and I published our first paper together.

It’s titled ‘Estimating the average treatment effects of nutritional label use using subclassification with regression adjustment,’ and presents a case study of how to measure the causal effects of an ordinal exposure. The article is currently online in Statistical Methods in Medical Research.

The online version (paywall) of the article is linked here. You can also download a pre-published version on the arXiv by going here. Finally, here’s the abstract and keywords.

[Screenshot: abstract and keywords]

What is the main point of this paper?

Here’s one of my favorite parts, a graph showing the covariates’ bias before and after subclassifying subjects into groups. In this and many other examples, subclassification is an important tool, as it allows for more of an apples-to-apples comparison. Specifically, it only makes sense to contrast the outcomes of subjects in observational data if they are similar on pre-treatment covariates; when groups differ, traditional tools like regression adjustment require extrapolation and are prone to bias.

Kendall's Tau between covariate and ordinal exposure

Figure: Kendall’s Tau between covariate and ordinal exposure. 

In the graph above, the vast majority of covariates were significantly associated with our ordinal exposure (nutritional label use) before the grouping of subjects into subclasses. After subclassifying (especially with 15 subclasses), most of that covariate bias had been removed, as shown by Kendall’s Tau values near 0.
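As a toy illustration of this kind of balance check – the data below are simulated, and in the paper the subclasses come from a model for the ordinal exposure rather than a single covariate – Kendall’s Tau before and within subclasses can be computed like this:

set.seed(42)
n <- 2000
age <- rnorm(n)  # a pre-treatment covariate
exposure <- cut(age + rnorm(n), breaks = 4, labels = FALSE)  # ordinal, 1-4
subclass <- cut(age, breaks = 5, labels = FALSE)  # stand-in for real subclasses

# Before subclassification: a clear covariate-exposure association
cor(age, exposure, method = "kendall")

# Within subclasses: the association should shrink toward 0
sapply(split(data.frame(age, exposure), subclass),
       function(d) cor(d$age, d$exposure, method = "kendall"))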

If you are interested in the published version of the paper, feel free to drop me an email and I can send it over. And if you are still reading, I’m also glad you are interested in causal inference!

NHL game outcomes using R and Hockey Reference

I’m always impressed with the content and accessibility of the Baseball with R website (here), which features a great cast of statisticians writing about everything from Hall of Fame entry to umpire bias.

In a similar vein, I highly recommend Sam and AC’s nhlscrapr package in R. I’ve used it extensively to analyze play-by-play data from past seasons (for example, this post on momentum in hockey).

However, I have a soft spot for overtime outcomes in the NHL, and while the nhlscrapr package has game-by-game results, there isn’t a straightforward mechanism for identifying whether or not a given game went to overtime. Further, data in the nhlscrapr package only goes back about a decade or so.

Thankfully, Hockey Reference has easily accessible (and scrapable) tables for us to use. Given that I am doing some updated analyses of NHL overtime rates, and that I wanted an easier method than copying and pasting .csv files from nhl.com, I figured I would post the code that I used to scrape NHL game outcomes. The code that follows extracts each game’s outcome for the last five years; if you are interested in other years, it’s easy enough to change the URLs.

Feel free to use, and hope you enjoy!

library(XML)  # provides readHTMLTable()

# Game-results pages for the 2010-11 through 2014-15 seasons
urls <- c("http://www.hockey-reference.com/leagues/NHL_2011_games.html",
          "http://www.hockey-reference.com/leagues/NHL_2012_games.html",
          "http://www.hockey-reference.com/leagues/NHL_2013_games.html",
          "http://www.hockey-reference.com/leagues/NHL_2014_games.html",
          "http://www.hockey-reference.com/leagues/NHL_2015_games.html")

nhl <- NULL
for (i in seq_along(urls)) {
  tables <- readHTMLTable(urls[i])
  # The games table is the largest table on each page
  n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
  temp <- tables[[which.max(n.rows)]]
  nhl <- rbind(nhl, temp)
}

names(nhl) <- c("Date", "Visitor", "VisGoals", "Home", "HomeGoals", "OTCat", "Notes")
table(nhl$OTCat)

# Drop unplayed games, which list "Get Tickets" in the OT/SO column
nhl <- nhl[nhl$OTCat != "Get Tickets", ]

# Flag games that were decided in overtime or a shootout
nhl$OT <- nhl$OTCat == "OT" | nhl$OTCat == "SO"

Voila! That easy. We are in business with a few lines of code.

Here’s the output:

[Screenshot: the first few rows of the scraped NHL game data]

A snapshot of math, computer science, and statistics enrollments at a liberal arts institution

Like it probably did at many institutions, student registration opened this past week for the Spring semester at Skidmore College.

As a statistics professor in the Department of Mathematics and Computer Science, I was struck by how quickly the courses in our department filled up. Was it like that elsewhere at Skidmore?

Using public data available from the registrar (here), I extracted the course enrollments for each of the school’s departments. Next, after dropping independent studies and a few similarly designed courses, I categorized each course as either “Closed” (waitlisted at 5 students or more), “Filled” (spots remaining on the waitlist only), or “Open” (enrollment less than capacity).

Then, in a few cases, I aggregated departments that were very similar (say, foreign languages, or math, computer science, and statistics) to simplify things. Further, due to the small number of courses offered in a few departments (ex: Asian studies), I restricted my analysis to departments offering at least 8 courses in the Spring.

In any case, here’s a mosaic plot featuring the course status (y-axis) by department. The width of each department’s x-axis is proportional to the number of courses offered by that department in the Spring. For example, the English department and the Art department (which is a conglomerate of art history and arts administration) offer more courses than any other department.

[Figure: mosaic plot of Spring course status by department]
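If you’re curious how such a plot comes together, here’s a minimal sketch in base R with made-up stand-in data (the registrar fields and department labels are hypothetical):

# Toy stand-in for the scraped registrar data
set.seed(7)
courses <- data.frame(dept = sample(c("MA", "EN", "PA", "AH"), 120, replace = TRUE),
                      enrolled = sample(5:30, 120, replace = TRUE),
                      capacity = 25,
                      waitlist = rpois(120, 2))

# Categorize each course as in the post
courses$status <- with(courses,
                       ifelse(waitlist >= 5, "Closed",
                              ifelse(enrolled >= capacity, "Filled", "Open")))
courses$status <- factor(courses$status, levels = c("Open", "Filled", "Closed"))

# mosaicplot() scales each department's width by its number of courses
mosaicplot(table(courses$dept, courses$status),
           color = c("white", "yellow", "red"),
           main = "Spring course status by department")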

The Department of Math & Computer Science is abbreviated MA, and, as indicated above, boasts among the largest percentages of courses which are closed (red) and filled (yellow) this Spring. The “PA” department, if you are curious, is also doing well; it’s the Department of Physical Activity.

So, apparently, Skidmore students are desperate to differentiate some integrals but it makes them want to exercise afterwards.

Most department abbreviations are what you expect them to be – the first two letters of their name – but because my interest was mostly in comparing mathematics to other departments as a whole, I was purposely vague with the labeling.

So, overall, it appears enrollments in my department’s courses are doing well.

Other notes:

-Restricting to courses designed for larger enrollments (in general, these are 100 and 200 level courses) accentuates the enthusiasm students are showing for courses in the math department. Here’s that plot.

[Figure: course status by department, 100- and 200-level courses only]

-It might be unfair to treat all course offerings in the same manner. This only makes for one quick look at the data.

-If you are curious, statistics courses did well enough that there is a chance another one might be added!

-Perhaps this shouldn’t be surprising; math is a sexy choice for the future, as per Jacob Rosen, who writes: “A math major – or at least several courses in math – can be the differentiation point to lift your resume to the top of the pile.”