In probability class, one of the most frequently cited examples is the birthday problem: Given a class of N students, what is the probability that 2 or more share the same birthday?
While the formula for calculating the answer is straightforward, the true lesson from the birthday problem – that there’s a difference between “any 2 students” and “2 particular students” – lands much faster when the question is simplified by using real data.
In other words, instead of coming up with formulas to calculate exact probabilities (which, even in the birthday problem, require making some unjustifiable assumptions*), it’s also worth looking at data sets of different sizes N and estimating the probabilities from those results.
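As a minimal sketch of that contrast (my own illustration, not code from the original analysis), here is the textbook calculation next to a simulation-based estimate. Both assume birthdays are independent and uniform over 365 days, which is exactly the simplification flagged in the footnote; the function names are hypothetical.

```python
import random


def birthday_exact(n, days=365):
    """P(at least 2 of n students share a birthday), assuming uniform, independent birthdays."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def birthday_particular(n, days=365):
    """P(at least one of the other n - 1 students matches one particular student's birthday)."""
    return 1.0 - ((days - 1) / days) ** (n - 1)


def birthday_simulated(n, trials=20_000, days=365, seed=0):
    """Estimate the 'any 2 students' probability by simulating many classes of size n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(days) for _ in range(n)]
        if len(set(birthdays)) < n:  # a repeated value means a shared birthday
            hits += 1
    return hits / trials


print(birthday_exact(23))       # "any 2 students": about 0.507
print(birthday_particular(23))  # "2 particular students" flavor: far smaller
print(birthday_simulated(23))   # simulation estimate, close to the exact value
```

With real class rosters, the simulated draws would simply be replaced by the observed birthdays.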
It’s with that mindset that I read the following article on Yahoo about a South Carolina town from which 5 residents have recently made the National Football League. Here’s the headline:
The odds that a town of 1,000 would produce five NFL players in 25 years are long. Really long. Like 1 in 10 million billion long.
Those are really long odds – you’re more likely to, among other things, record a perfect March Madness bracket (1 in 128 billion) or win the Powerball (1 in 238 million).
So where’d 1 in 10 million billion come from? To his credit, the article’s author, Eric Adelson**, appears to have done his due diligence, having asked for assistance in estimating these probabilities.
How rare is it? Yahoo Sports asked a handful of experts and mathematicians around the country. One couldn’t come up with an answer. Jeffrey Forrester, associate professor of math at Dickinson College (Pa.), put the chances at approximately 0.0000000000797. Yes, that’s 10 zeros. (Forrester notes the odds of being dealt a royal flush are 0.00000154, or about 20,0000 times more likely.) Dominic Yeo, who is studying math at Oxford, approximated the probability as 1 in “ten million billion.”
It’s not my place to question the calculations of other mathematicians – particularly without seeing the details – but as a statistician, why estimate unknown probabilities when we can use real data?
So that’s what I did.
My question is whether the town in question (Lamar, SC) is truly an outlier, or whether any of the other 43,000 United States municipalities can boast a similar claim. If it’s the latter, we can be pretty confident that the 1 in 10 million billion claim is overstated.
To collect the data, I scraped*** birth cities of NFL players from Pro-Football-Reference and merged those with 2014 population information provided by the Census Bureau. Looking at players per 1,000 people (e.g., Lamar has 5.12), and only using towns with at least 2 NFL players, here are the top 20 places in terms of NFL production rates. The table is sorted by Ratio_1000, which is the number of players per 1,000 residents.
By this measurement, Lamar ranks 19th in the country in NFL players per capita. While many of the towns in front of Lamar only have 2 NFL players (Lamar has 5), and a few stand out as towns that may be drawing from larger populations than given in the census, it’s difficult to conceive of Lamar as one in ten-million-billion when it’s not even ranked near the top of the United States.
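For readers who want to see the shape of that calculation, here is a rough sketch of the merge-and-rank step. It is not the scraping code mentioned in the footnote: the town names and populations are made up for illustration, and the function name and row keys are my own.

```python
from collections import Counter


def players_per_1000(birthplaces, populations, min_players=2):
    """Rank towns by NFL players per 1,000 residents.

    birthplaces: one (town, state) tuple per player.
    populations: dict mapping (town, state) to population.
    Towns with no population match or fewer than min_players are dropped.
    """
    counts = Counter(birthplaces)
    rows = []
    for town, n_players in counts.items():
        population = populations.get(town)
        if population is None or population <= 0:
            continue  # no usable census match for this birthplace
        if n_players < min_players:
            continue
        rows.append({
            "town": town,
            "players": n_players,
            "population": population,
            "ratio_1000": 1000 * n_players / population,
        })
    rows.sort(key=lambda row: row["ratio_1000"], reverse=True)
    return rows


# Made-up inputs, NOT real towns or real census figures.
birthplaces = (
    [("Smallville", "XX")] * 3
    + [("Bigtown", "YY")] * 4
    + [("Onehit", "ZZ")]
    + [("Unmatched", "XX")] * 2
)
populations = {
    ("Smallville", "XX"): 1000,
    ("Bigtown", "YY"): 200_000,
    ("Onehit", "ZZ"): 500,
}

table = players_per_1000(birthplaces, populations)
for row in table:
    print(row)
```

Matching on (town, state) rather than town name alone avoids merging same-named towns in different states, although real birthplace data would still need cleaning before the join.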
We can also plot our data.
Focusing on all US towns with between 800 and 1,200 residents, here’s a histogram of the number of NFL players from each. Note that I dropped the roughly 3,000 towns in this category without NFL players, as including them made the towns that have produced NFL players nearly impossible to see.
There are several other towns of Lamar’s size with similar numbers of NFL players, and 10 towns have at least 3 players.
All of which brings us back to the birthday-style probability questions:
What is the probability that Lamar itself happens to be one of the small towns with an extreme number of players (4 or more)? Probably about 1 in 700, which is many orders of magnitude more likely than the estimate given by Yahoo.
And what is the probability that any town of Lamar’s size could have produced that many players? Given what we see in the data, this estimate is seemingly much, much higher than 1 in 700.
So why are our estimates so different? Primarily, there’s no independence in the NFL-talent production rates of United States towns. Not every town has football, and towns that produce more NFL talent are, for several reasons, more likely to continue to do so. There’s also a spatial structure to our production rates; note that in our table above, nearly every town is in the Midwest or South. So when there’s independence neither within nor between towns in the likelihood of a resident making the NFL, standard methods of calculating probabilities, which rely on independence to use multiplication, are inappropriate.
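To make the two questions concrete, here is a small sketch of how each could be estimated. The counts are hypothetical and do not reproduce the 1 in 700 figure. The second function treats towns as independent, which is only a rough device for showing how quickly "any town" outgrows "this town".

```python
def share_with_at_least(player_counts, k):
    """Fraction of towns whose NFL player count is at least k.

    player_counts must include the towns with zero players; dropping them
    (as in the histogram) would inflate the share.
    """
    if not player_counts:
        raise ValueError("player_counts must not be empty")
    return sum(1 for count in player_counts if count >= k) / len(player_counts)


def prob_any_town(q, m):
    """P(at least one of m towns is extreme) if each is extreme with probability q.

    Assumes towns are independent, so treat the result as a rough illustration.
    """
    return 1.0 - (1.0 - q) ** m


# Hypothetical counts for 1,000 towns in one size band (NOT the real data).
counts = [0] * 990 + [1] * 6 + [2] * 2 + [4, 5]

q = share_with_at_least(counts, 4)   # "this particular town" estimate
print(q)
print(prob_any_town(q, len(counts)))  # "any town in the band" estimate
```

Even when the per-town share is small, the chance that some town in a band of a thousand or more clears the bar is far larger.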
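For a sense of how independence drives the tiny numbers, here is a sketch of a binomial tail calculation, which multiplies per-resident probabilities as if residents were independent. This is not a reconstruction of the experts’ calculations, and the per-resident probability below is a purely hypothetical value chosen for illustration.

```python
def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), i.e. n independent residents each with chance p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    term = (1.0 - p) ** n              # P(X = 0)
    total = term if k <= 0 else 0.0
    for j in range(n):
        # Step from P(X = j) to P(X = j + 1) using the ratio of binomial terms.
        term *= (n - j) / (j + 1) * p / (1.0 - p)
        if j + 1 >= k:
            total += term
    return total


# Hypothetical inputs: a town of 1,000 and a made-up per-resident probability.
town_size = 1000
p_resident = 1e-5  # assumption for illustration only, not an estimate from data

print(binom_tail(town_size, p_resident, 5))
```

Under these made-up inputs the result is vanishingly small, and it shrinks further with every extra player required. That is the arithmetic that breaks down once residents of the same town, or neighboring towns, are not independent.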
Fortunately, in this and many other cases, data can come to our rescue.
*In this case, the incorrect assumption is that birthdays are evenly distributed throughout the year.
**Yes, it’s this Eric Adelson.
***Let me know if you’d like me to post the code! [Update: code is here]