Evaluating sports predictions against the market

A few friends have been working on an algorithm for predicting baseball game outcomes. Roughly, the model uses player-level projections to simulate baseball events, a process that requires substantive MLB and web-scraping knowledge.

Although the full operation is fascinating, this post will primarily focus on the evaluation of the predictions. The particular model in question has had a decent start to the summer. So how can we judge the accuracy of these picks? And what does that tell us about the feasibility of betting on sports?

While much of this post will seem straightforward, answering these questions gave me an increased appreciation for the variability in sporting outcomes with respect to gambling. I’ve posted the code here, in case anyone else is interested in using a similar process with their own projections.


First, some background. The data consist of 659 picks made against each game’s opening money line since the start of the 2017 season. Each pick is based on a model-estimated probability for each team in each game, which is then compared to that team’s market probability. There have been about 950 MLB games thus far, which means that the model has taken a team in about 7 of every 10 contests. In the remaining games, the model’s probabilities were too close to the market’s price to offer an edge; those games were dropped from the data.
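For readers less familiar with money lines, here is a minimal Python sketch of how an American money line maps to an implied probability, and how a pick’s edge could be defined against it. The function names and the example prices (-130 and +120) are illustrative assumptions, not taken from the model’s code.

```python
def implied_probability(moneyline: int) -> float:
    """Break-even win probability implied by an American money line."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100.
        return -moneyline / (-moneyline + 100)
    # Underdog: risk 100 to win `moneyline`.
    return 100 / (moneyline + 100)


def edge(model_prob: float, moneyline: int) -> float:
    """Model probability minus the market-implied probability."""
    return model_prob - implied_probability(moneyline)


# Hypothetical game: favorite priced at -130, underdog at +120.
fav = implied_probability(-130)
dog = implied_probability(120)

# The two implied probabilities sum to more than 1; the excess is the vig.
overround = fav + dog - 1

print(f"favorite {fav:.3f}, underdog {dog:.3f}, overround {overround:.3f}")
print(f"edge on the underdog if the model says 50%: {edge(0.50, 120):.3f}")
```

A pick would only be made when the edge is large enough to matter; the threshold the model actually uses is not stated here.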

The data also contain the observed differences between the model-estimated probability and the market-implied probability, relative investments (made assuming an equal balance prior to all games), the amount to be won or lost depending on the game’s result, the actual game results (win or lose), closing money line prices, and the difference in implied team probabilities between the opening and closing odds. Bets are made in “units”: this could be dollars, pistachio shells, or whatever your mind can imagine. Generally, more units are placed on bigger edges; the average stake per pick is about 0.60 units. The largest stake is capped at 1.0 unit, given the non-zero chance that probabilities are off on account of lineup or pitching changes.
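To make “the amount to be won or lost” concrete, here is a small sketch of how profit on a single pick could be computed from the stake and the money line. It assumes that the units are the amount risked (some bettors instead size favorites to win one unit); the function name is hypothetical.

```python
def pick_profit(units: float, moneyline: int, won: bool) -> float:
    """Profit in units for risking `units` at an American money line."""
    if not won:
        # A losing pick costs exactly the stake.
        return -units
    if moneyline < 0:
        # Favorite: each unit risked wins 100 / |moneyline| units.
        return units * 100 / -moneyline
    # Underdog: each unit risked wins moneyline / 100 units.
    return units * moneyline / 100


# 0.6 units on a +150 underdog that wins: about +0.9 units.
print(pick_profit(0.6, 150, True))
# 1.0 unit on a -200 favorite that wins: +0.5 units.
print(pick_profit(1.0, -200, True))
# Any losing pick returns minus the stake.
print(pick_profit(0.6, 150, False))
```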

Next, some summary statistics. While a nearly identical number of picks have backed the away team as the home team (51% to 49%), nearly twice as many underdogs have been backed as favorites (64% to 36%). Altogether, the model is up about 27 units thus far, roughly a 7% return on investment (27 units on about 659 × 0.60 ≈ 395 units staked). Game results have been kinder to picks on the home team (+24.5 units) than on the visiting team (+2.5 units), with underdogs slightly more profitable than favorites (+19.9 to +7.1 units). While a deeper investigation could look into whether these differences are meaningful, that’s not a primary goal.


One thing that I picked up quickly is how variable things can appear over short periods of time. Here’s the cumulative profit from day one of the season (shown in red). In the background are 200 simulated season-to-date profits, generated by treating the market-implied probabilities as the true probabilities for each team.

[Figure: cumulative season-to-date profit of the model’s picks (red) against 200 simulated season-to-date profits]

Within any given week (say, 75 picks), profits could vary by as much as 15 or so units. And at certain time points (say, between picks 100 and 210), all appears lost, with picks going into a deep dive. Even for me, as someone whose job entails having a decent understanding of randomness, it’s tempting to look for patterns in the red line, even though none likely exist.

Relative to random season outcomes simulated using the opening market probabilities, model picks currently stand in the 96th percentile. That is, only about 4% of sequences using random game outcomes would be doing this well if the opening market probabilities reflected the true probabilities. And note the center of the above sequences: roughly -10 units, which accounts for vig taken in by betting markets.
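The simulation described above can be sketched roughly as follows. This is not the posted code: the picks are made up, every stake is fixed at 0.6 units, and I am assuming that the “true” market probability is the implied probability with the vig removed by normalizing both sides. That assumption matters, because it is what makes the simulated paths drift below zero.

```python
import random


def implied_probability(moneyline):
    """Break-even probability implied by an American money line."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def win_per_unit(moneyline):
    """Units won per unit risked when the pick wins."""
    return 100 / -moneyline if moneyline < 0 else moneyline / 100


def no_vig_probability(pick_line, other_line):
    """Implied probability of the picked side after normalizing out the vig."""
    a = implied_probability(pick_line)
    b = implied_probability(other_line)
    return a / (a + b)


def simulate_season(picks, rng):
    """Cumulative profit path when games follow the no-vig market odds."""
    total, path = 0.0, []
    for pick_line, other_line, units in picks:
        if rng.random() < no_vig_probability(pick_line, other_line):
            total += units * win_per_unit(pick_line)
        else:
            total -= units
        path.append(total)
    return path


def percentile_rank(observed, simulated_finals):
    """Percent of simulated final profits that fall below the observed one."""
    below = sum(1 for s in simulated_finals if s < observed)
    return 100 * below / len(simulated_finals)


rng = random.Random(2017)
# Hypothetical picks: (picked team's line, opponent's line, units staked).
line_pairs = [(-130, 120), (120, -130), (140, -150), (-150, 140), (105, -115)]
picks = [(*rng.choice(line_pairs), 0.6) for _ in range(659)]

finals = [simulate_season(picks, rng)[-1] for _ in range(200)]
print(f"mean simulated profit: {sum(finals) / len(finals):.1f} units")
print(f"percentile of a +27 unit season: {percentile_rank(27.0, finals):.0f}")
```

With real data, the picks list would hold the actual opening lines and stakes, and the observed profit would be ranked against the simulated finals in the same way.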


In addition to the chart above, I made a similar one (not shown) with one important difference: instead of treating market-implied prices as the truth, I used the model-generated probabilities. In expectation, this simulation will yield positive profits. But in what was a total shocker for me, it was still reasonable (it happened about 5% of the time) for such a model to turn a negative profit through 650 picks. That is, even with known, better-than-market probabilities for each game outcome, it’s still feasible to lose money across 650 games. First thoughts that went through my mind:

- 650 games is three NFL seasons worth. That is, an NFL bettor taking every game could have three straight losing seasons in a row while still having better than market odds for each of his or her picks.

- Related: I could not be a professional gambler.
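To get a feel for why a genuinely good model can still lose, here is a toy version of that second simulation. All of the numbers are invented for illustration: every pick is a +110 underdog that truly wins 51% of the time, staked at 0.6 units. That works out to an expected return of 0.51 × 1.10 − 0.49 = 0.071 units per unit risked, in the same neighborhood as the 7% return mentioned earlier.

```python
import random


def win_per_unit(moneyline):
    """Units won per unit risked when the pick wins."""
    return 100 / -moneyline if moneyline < 0 else moneyline / 100


def losing_share(picks, n_sims, rng):
    """Share of simulated seasons that finish with a negative profit."""
    losing = 0
    for _ in range(n_sims):
        total = 0.0
        for true_prob, moneyline, units in picks:
            # Outcomes are drawn from the (assumed correct) model probability.
            if rng.random() < true_prob:
                total += units * win_per_unit(moneyline)
            else:
                total -= units
        if total < 0:
            losing += 1
    return losing / n_sims


rng = random.Random(650)
# Hypothetical picks: (true win probability, money line, units staked).
picks = [(0.51, 110, 0.6)] * 650

share = losing_share(picks, 1000, rng)
print(f"share of losing 650-pick seasons: {share:.1%}")
```

Even in this idealized setup, where the bettor’s probabilities are exactly right, a small but non-trivial share of 650-pick stretches end in the red.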


I thought it would be interesting to take a look at which teams the model has picked most often (both for and against). Here’s that plot. On the x-axis is the total investment made, either for (on the left) or against (on the right) each team, and on the y-axis is the season-to-date profit.

[Figure: season-to-date profit versus total investment, for (left) and against (right) each team]

This particular model continues to back the Padres and Mets at most opportunities, while picking against the Red Sox. Altogether, those picks have mostly broken even.

Meanwhile, the model has had some success taking the Rockies, White Sox, and Rays, while likewise performing well when fading the Indians, Giants, and Blue Jays. Picking the Phillies has not been so fruitful, nor has picking against the Diamondbacks.


Our final check looks at how the model has done relative to line movement. If the model can “predict” the direction where prices will go in the moments leading up to the game, that would generally be a good thing. From what I’ve been told, closing market prices are generally more efficient than opening numbers.

Here’s a histogram showing line movement (on the probability scale). Positive changes reflect movement in the direction of the model’s chosen team.

[Figure: histogram of line movement between opening and closing prices, on the probability scale]

Among the picks to date, about 1 in 20 opening lines precisely match closing lines. A tick under 58% of games have moved in the direction of the model’s team, while about 37% have moved against.

Across all contests, the average price has moved about 0.6% in the direction of the model’s chosen team. While this seems like a small number, across several hundred games, that type of advantage would seemingly add up.

There’s also a decent link between the model’s projected edge for a team and the likelihood of movement in the direction of that team. The average game moved 0.25% among games with smaller-sized edges, 0.5% on games with medium-sized edges, and a full 1.0% on games with the largest edges (putting about 200 games in each of these categories).
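The grouping described above could be done along these lines. This is only a sketch with made-up numbers: each pick is an (edge, movement) pair, where movement is the closing implied probability minus the opening implied probability for the model’s chosen team, and picks are split into three equal-sized groups by edge.

```python
from statistics import mean


def movement_by_edge_tercile(picks):
    """Average line movement within small, medium, and large edge groups.

    `picks` is a list of (edge, movement) pairs with at least three entries.
    """
    ordered = sorted(picks, key=lambda p: p[0])
    n = len(ordered)
    cuts = [0, n // 3, 2 * n // 3, n]
    labels = ["small", "medium", "large"]
    return {
        label: mean(move for _, move in ordered[cuts[i]:cuts[i + 1]])
        for i, label in enumerate(labels)
    }


def share_moving_toward(picks):
    """Fraction of picks whose line moved toward the model's team."""
    return sum(1 for _, move in picks if move > 0) / len(picks)


# Hypothetical picks: (model edge, movement toward the model's team).
picks = [
    (0.02, 0.000), (0.03, 0.005),
    (0.04, 0.004), (0.05, 0.006),
    (0.07, 0.012), (0.08, 0.008),
]

for label, avg in movement_by_edge_tercile(picks).items():
    print(f"{label:>6} edges: average movement {avg:.4f}")
print(f"moved toward the pick: {share_moving_toward(picks):.0%}")
```

On the real data, a rising average from the small-edge group to the large-edge group is the pattern reported above.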


Assorted final notes:

- Log-loss is a proper scoring rule for binary outcomes, but it is less evident how log-loss can precisely evaluate this model, given that some picks are made with more of an edge than others (perhaps a weighted log-loss?). Additionally, there’s no immediate interpretability to log-loss. In any case, the average log-loss is -0.6845 for the market-implied probabilities and -0.6836 for the model-estimated probabilities (closer to 0 is better).

- It is tempting to tie team allocations (as far as supporting or fading) to changes to the game that have been seen this summer. This includes the supposed juiced ball and increases to HR/FB ratio. Something to keep an eye on.

- How do others evaluate picks, either their own or from others? My prior is to trust the market until proven otherwise, and that’s a very strong prior.
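On the log-loss note above: the figures use the sign convention where values are negative and closer to 0 is better, which corresponds to the average log-likelihood of the observed results. Here is a minimal sketch of that calculation, with hypothetical inputs. For reference, always predicting 50% scores log(0.5) ≈ -0.6931, so both figures quoted above are only slightly better than a coin flip.

```python
import math


def mean_log_likelihood(probs, outcomes):
    """Average log-likelihood of binary outcomes (negative; closer to 0 is better).

    `probs` are predicted win probabilities for the picked team and
    `outcomes` are 1 for a win and 0 for a loss.
    """
    total = 0.0
    for p, won in zip(probs, outcomes):
        total += math.log(p) if won == 1 else math.log(1 - p)
    return total / len(probs)


# Hypothetical results for four picks.
outcomes = [1, 0, 1, 0]
coin_flip = mean_log_likelihood([0.5, 0.5, 0.5, 0.5], outcomes)
sharper = mean_log_likelihood([0.6, 0.4, 0.55, 0.45], outcomes)

print(f"coin flip: {coin_flip:.4f}")
print(f"sharper forecasts: {sharper:.4f}")
```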



  1. Does the profit calculation take into account the vigorish? (+27 on 650 bets doesn’t strike me as very good performance.)

    Assuming that the bets are proportional to the predicted probability, Log-Loss seems reasonable to me. Conveniently the bet is in the same range as a probability. Are you using the bet directly in the log-loss computation, or an output probability from the model?

    1. The profit takes into account the vig…otherwise, it wouldn’t seem appropriate to call it a profit.

      The bet is an output of the probability difference from the model … to be honest I’m not sure of the formula, but assume it’s related to team price and the difference size.

  2. What is your source for opening odds?

    Is “Change of price of model picks” the raw implied pct. change? I.e., are 50% to 52% and 70% to 72% both counted as 2%?

    Could you elaborate on what variables are included/excluded from your model? E.g., weather, platoon tendencies, recent performance.

    1. Hi DS,

      I should’ve been more specific there – I treated all changes in probabilities as the same. The opening and closing line stuff was a bit new for me to think about, and there are assuredly other & better options. This seemed like one way.

      I’m not involved enough in making the model itself to know all of the steps, but I don’t think weather or platoons are a part of it.

      1. Thanks. Just curious: do you know what source they used for opening lines, and/or what time of night (the night before the game) the lines are released?

  3. Great stuff. Fun trip down memory lane for this former MLB bettor.

    Here’s another thing I used to use to evaluate my results: compare my W-L record to my “Pythag” record. Just as you can evaluate a team based on whether they’re outperforming/underperforming their runs scored/allowed, you can do the same for bettors. It’s another way (similar to the line movements) to reassure yourself you’re on the right track.

    Granted, the simulation method that I used (and that your friends use) should be able to identify specific games where teams should over/under perform Pythag. But (as with MLB teams) that signal is still very small compared to the noise.

    And yes, there will always be teams that you consistently over/under value, and what to do about that is a very interesting question. For me, it was always the Phillies. My system always told me to bet on them. I think part of it was because I got lazy and didn’t code up reliever handedness or smart management of bullpens, so when they used to always bat three lefties in a row, that would hurt them in real life but not in my sim.

    1. Hi DFL,

      That’s a good idea, and certainly feasible. I may take another go at this again in August.

      Bullpens are dicey given the team-level differences in allocations, some of which change on a daily basis. Not sure there’s a perfect way, and I suppose it is just a matter of how lazy you are willing to be 🙂

  4. Maybe a silly question, did the sim outcome probabilities incorporate the vig? I.e. a 50% / 50% outcome in the model needs to be adjusted by +4ish% to properly compare to opening line.

    Also how much did the model add to home team for home field advantage? I always wonder in baseball how much home team’s win% advantage is due to playing at home (friendly confines) versus being the team to bat last in the bottom 9th.

    Good article, thanks for sharing.

    1. Hi MikeN,

      Thanks for reading. The simulated outcome probabilities all incorporate the vig – it’s why in the figure shown, more profits are negative than positive (with the average around -8 or so).

      Home field is worth a few percentage points, although I don’t know the exact number. I think it has more to do with ball and strike calls than anything else.

  5. In order to understand the code, I need to run it line by line. Could I have the mlb.csv file please?
    Thanks in advance

    1. Hi Alfredo,

      Unfortunately, I wasn’t given permission to share the .csv file, but hopefully enough of the variables make sense that you get *something* out of the code. Sorry to not be more helpful here.


  6. “650 games is three NFL seasons worth. That is, an NFL bettor taking every game could have three straight losing seasons in a row while still having better than market odds for each of his or her picks.”

    That’s why it’s critical to measure performance in low-sample sports by margin of victory against the spread. E.g., if you bet on a team at -3 and they win by 7, that’s a +4 margin for you. It’s quite surprising how few people do this in self-evaluation, whereas the same people would never even consider using a pure W-L model to evaluate a sports team.

    By the way, how did your friends’ model perform in the second half of 2017?

    1. That’s a great point/idea. Seems a bit more reasonable to do in the NFL than in MLB — in the latter, I am not sure run differential is as worth it? But maybe I am wrong.

      I shared some final thoughts on the model’s performance here:
