The Winner's Curse in Machine Learning
- Will Landecker
- May 2
An exploration of the Winner’s Curse, how it causes us to overestimate model performance, and what to do about it.

Part 1: Winning.
We constantly ask our data to compete. We rank billions of pieces of content, serving only the most engaging. We cross-validate millions of models, putting into production the one that promises the best performance. We estimate click-through rates for thousands of ads, displaying only those that win the auction.
The data scientist, the machine-learning engineer, and the AI researcher sit back and relax while the data and models fight for their lives. Competition is good, we tell ourselves. The ecosystem will flourish. Our products will be the best. We will be winners.
The dust settles. The few remaining models and data are barely still standing next to a pile of bloody ones and zeroes. The survivors limp onward, ready for their next life: serving the highest-value ads amid the most engaging content. Their struggles will be repaid.
But something isn’t working. The model’s performance isn’t quite living up to what it promised during the competition. The content doesn’t engage users as much as the simulations said it would. The ads don’t generate the clicks that they did in offline testing. The scientist, engineer, and researcher harrumph. They adjust their seats and put fingers to keyboard to investigate.
And investigate they should. What went wrong?
Part 2: The Curse.
To understand what went wrong, let’s revisit when we thought things were going right: the moment when we selected the winner. This could be when selecting the highest-ranking content (as in recommender systems, ad servers, or any other ranker), the highest-performing model (as in hyperparameter tuning and cross-validation), or the experiment variant with the greenest metrics. The curse is versatile and will haunt us in any domain where we select something based on measurements. Whenever we select a winner, that winner will inevitably fail to perform as well as it appeared in our measurements.
Statisticians call this “regression to the mean”. Economists call this “the winner’s curse”. Machine learning engineers call this “it's not time to retrain already, is it?”
Let’s make it concrete with an example: we’re ranking content based on estimated click-through rate (CTR). We rank piles and piles of content, searching for the highest estimated CTR to display to a user. The higher the estimated CTR, the more likely we think the user will click on the content we serve them.
Imagine that we have a CTR-prediction model that scores a few items, and imagine further that these items all have the same true CTR. The items are different from each other, but users are exactly equally likely to click on any of them. However, models are imperfect, and our CTR-prediction model scores each item slightly differently. We can simulate these chance differences by adding a little random noise, such as Gaussian noise drawn from 𝒩(0, σ), to each item’s true CTR. This randomness brings us closer to the production setting, where we can only ever estimate CTR.

Now ask yourself: which item will be selected to show to the user? Remember, in the production setting we don’t actually know any item’s true CTR; we only know the estimated CTR produced by our model. Therefore, we will choose the one furthest to the right in the above images, as this is the item with the highest estimated CTR. And already, the curse has taken root. Because what can we say about the error in the served item’s estimated CTR? By virtue of the fact that it is furthest to the right, its error was the most positive. We have chosen the content about which our model was the most overly optimistic.

This is the curse in action. By choosing items with the highest estimated value, we inherently choose the items whose value we overestimate the most.
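If you want to watch the curse happen, here is a minimal sketch of the setup above (toy numbers chosen for illustration; this isn’t the code behind the figures):

```python
# A minimal sketch: every item has the same true CTR, the model's estimate is
# the true CTR plus Gaussian noise, and we always serve the highest estimate.
import numpy as np

rng = np.random.default_rng(0)

TRUE_CTR = 0.20       # every item is exactly equally clickable
N_ITEMS = 10          # items competing in each auction
NOISE_SD = 0.05       # model error, assumed Gaussian for illustration
N_AUCTIONS = 100_000

true_ctr = np.full((N_AUCTIONS, N_ITEMS), TRUE_CTR)
estimated_ctr = true_ctr + rng.normal(0.0, NOISE_SD, size=true_ctr.shape)

# Serve the item with the highest *estimated* CTR in each auction.
winners = estimated_ctr.argmax(axis=1)
winning_estimates = estimated_ctr[np.arange(N_AUCTIONS), winners]

# Averaged over all items, the model is unbiased...
print(f"mean error, all items:    {(estimated_ctr - true_ctr).mean():+.4f}")
# ...but conditioned on winning, the error is reliably positive.
print(f"mean error, winners only: {(winning_estimates - TRUE_CTR).mean():+.4f}")
```

The first number is essentially zero; the second comes out around +0.08, which is the expected maximum of ten independent 𝒩(0, 0.05) errors.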
Part 3: Denial, Anger, Bargaining
Surely this doesn’t apply to me, you might tell yourself. Ranking systems, auctions, and hyperparameter searches are pervasive in our field! We can’t all be suffering from this malady, can we? Moreover, you point out, there’s a flaw in the argument above: we were only discussing a funny little result that arises when items with the exact same true value are scored. In reality, this never happens. Our production systems deal with diverse data whose true value is distributed across a wide range. So I don’t need to worry about it. It doesn’t apply to me. Right?
Dear reader: wrong. The curse still haunts us in real-world settings when our data are more spread out. Imagine the effects illustrated above happening for every group of data with similar CTRs simultaneously. The result is that the right-most data — the data with the highest estimated value across your entire dataset — is likely overvalued by the model that helped you choose it. It is equally true when you rank content, when you run ad auctions, when you select winning experiment variants, and when you choose your best-performing model. Whenever you use a measurement to help make a selection, you are biasing your measurement. I am so sorry.
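To convince yourself, extend the sketch above so that the true CTRs vary widely across items (again, toy numbers for illustration):

```python
# Same sketch as before, but now each item has its own true CTR drawn from a
# wide range. The model is still unbiased over all items, yet the winner's
# estimate still overshoots the winner's own true CTR.
import numpy as np

rng = np.random.default_rng(1)
N_AUCTIONS, N_ITEMS, NOISE_SD = 100_000, 10, 0.05

true_ctr = rng.uniform(0.10, 0.40, size=(N_AUCTIONS, N_ITEMS))
estimated_ctr = true_ctr + rng.normal(0.0, NOISE_SD, size=true_ctr.shape)

winners = estimated_ctr.argmax(axis=1)
rows = np.arange(N_AUCTIONS)

bias = (estimated_ctr[rows, winners] - true_ctr[rows, winners]).mean()
print(f"mean (estimate - truth) for auction winners: {bias:+.4f}")
```

The bias should come out smaller than in the equal-CTR case, because some items now win on genuine quality rather than on lucky noise, but it is still firmly positive.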
We are doubly cursed when we build ranking models. First, we overestimate the model's performance during our model search process. Second, we overestimate the value of the content that the model serves when it performs the ranking and selection. We just can't catch a break.
Even if you’ve never heard of the winner’s curse, you know it. You feel it deep in your bones. You’ve been accommodating it for your whole career, adopting industry-wide best practices born of it without being told as much. After finding the model with the best performance in offline simulations, you warn your product manager not to report its performance until you run an experiment in production. The real-world performance will be lower, you tell them. For some reason, it always is. You mentor your junior colleagues about the dangers of trusting the exact metrics reported in multi-arm experiments. Once you select the winner, it won’t work quite as well, you explain. For some reason, it never does.
The winner’s curse is a thorn in your foot, and you’ve been limping so long you can’t remember any other reality. The best you can do is warn everyone else to stock up on crutches.
Part 4: Acceptance
As with other forms of grief, a thorny technical problem can only truly be solved once we accept how pervasive it is. And because this measurement bias is so ubiquitous, we can draw on solutions developed in diverse fields, including statistics, economics, causal inference, public health, and epidemiology. Some solutions require assumptions about the distribution of data and errors that you may or may not feel comfortable making. If you want to head down a rabbit hole, check out the research on topics like “post-selection inference,” “selection bias correction,” and “James-Stein estimation”.
One flexible solution I repeatedly reach for in these situations is Double ML. (Some folks call it Debiased ML — either way, we all agree that the acronym is DML.) The idea behind DML is that whenever we build a ranking or scoring model, we also build a second model that estimates each item’s probability of being selected. We then use that selection model to de-bias the winning scores coming from the ranking model.
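To make that recipe concrete, here is a minimal, self-contained sketch of the two-model idea (a toy data-generating process and off-the-shelf learners chosen for illustration; this is not the simulation behind the figures below): an outcome model and a selection model combined in a cross-fitted, doubly-robust score, recovering an unbiased average CTR from logs in which clicks are only observed for the items the ranker chose to serve.

```python
# Sketch of the two-model recipe: an outcome model plus a selection
# ("propensity") model, combined in a doubly-robust score with cross-fitting.
# The data-generating process below is a toy stand-in for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 20_000

X = rng.normal(size=(n, 3))
true_ctr = 0.10 + 0.30 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))  # CTR in [0.10, 0.40]

# The ranker prefers high-CTR items, so being served (S=1) correlates with clicking.
z = (true_ctr - true_ctr.mean()) / true_ctr.std()
served = rng.binomial(1, 1 / (1 + np.exp(-3.0 * z)))
clicks = rng.binomial(1, true_ctr) * served          # clicks observed only when served

naive = clicks[served == 1].mean()                   # click rate of served items only

# Cross-fitted nuisance models: m(x) ≈ E[click | x, served], e(x) ≈ P(served | x).
psi = np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    served_train = train[served[train] == 1]
    m = GradientBoostingRegressor().fit(X[served_train], clicks[served_train])
    e = LogisticRegression().fit(X[train], served[train])
    m_hat = m.predict(X[test])
    e_hat = e.predict_proba(X[test])[:, 1].clip(0.01, 0.99)
    # Doubly-robust score: outcome model plus a propensity-weighted residual.
    psi[test] = m_hat + served[test] / e_hat * (clicks[test] - m_hat)

print(f"true average CTR:         {true_ctr.mean():.3f}")
print(f"naive (served-only) rate: {naive:.3f}")      # biased upward by selection
print(f"doubly-robust estimate:   {psi.mean():.3f}")  # close to the truth
```

This sketch targets a population-level average rather than per-auction scores, but the ingredients are the same two models described above: one predicts the outcome, the other predicts selection, and the second corrects the first.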
There are plenty of other tutorials explaining how DML works, so I won’t get into those details here. Instead, I’ll just demonstrate how effective it is as a tool to combat the winner’s curse. I wrote some code that generates 20,000 random pieces of data and builds an (unbiased!) model to estimate CTR for each item. Then, I simulated 1,000 auctions (or, equivalently, 1,000 runs of a recommendation or ranking system). Each auction considered 100 random pieces of data and chose the top-scoring 10.
My code trains a (naïve) CTR-prediction model as well as a DML-debiased CTR prediction model. The blue histogram below shows the bias of the naïve model’s CTR estimate compared to the true CTR when considering the auction winners: the blue distribution is off to the right of the dashed red line. The orange histogram shows the bias of the DML-corrected estimates, which are centered at 0.
DML has eliminated the selection bias. The thorn has been dislodged.

The careful reader will notice a few tiny orange bumps far away from the center of the distribution. This is well documented with DML: the corrected estimates have a lower bias, but a higher variance. This tradeoff might not cut the mustard for your particular setting, but I usually find it worthwhile. One way to help make that decision is to combine the lower bias and higher variance into a single score. Here, I’ll do that by computing the root mean squared error (RMSE) of the estimates. With my simulated data, the lower bias of DML more than makes up for the higher variance: it has lowered the RMSE from 0.06 to 0.009.
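For reference, RMSE is a natural way to fold the two together, because mean squared error decomposes into squared bias plus variance (a tiny helper, not the code behind the figures):

```python
import numpy as np

def rmse(estimates, truths):
    """Root mean squared error: one score combining bias and variance,
    since mean(error**2) == mean(error)**2 + var(error)."""
    errors = np.asarray(estimates) - np.asarray(truths)
    return np.sqrt(np.mean(errors ** 2))
```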

Part 5: So what?
DML has de-biased our estimates after selection. But what exactly does that give us?
Prior to de-biasing, the CTR model in my example overpredicted CTRs by an average of 6 percentage points. In the simulated data, true CTR ranged from 10% to 40%, of which 6 percentage points is a sizeable portion. Being positively biased by 6 percentage points means that, after offline testing, we wouldn’t actually know what CTR to expect in production. Before we even have a chance to estimate the impact of our work properly, we’d need to kick off experiments and wait for the data to settle.
In some cases we would even find that new models, trained to improve over a baseline model in offline experiments, actually perform worse than the baseline when put into production! This measurement bias clearly wastes the time of researchers, developers, and everybody on their team. Debiased estimates significantly reduce these kinds of problems.
In addition to bringing us more accurate information earlier in the development cycle, DML helps us monitor our models in production. Without a solution like DML, our model's biased CTR estimates are logged in production, and we immediately learn to stop looking at the CTR model output values themselves. “Oh, those never line up with the true CTR,” we explain to our junior teammates. “If we want to catch a degraded model, we have to look… somewhere else.” Our teammate follows our gaze toward the supply closet, stocked with spare crutches.
An unbiased model estimate means better observability for our ML-powered systems. We can catch model and data drift more quickly, as soon as ground truth stops matching model outputs. Earlier signals of model degradation mean fewer and less severe production incidents. Unbiased estimates mean more useful information earlier in development and higher-fidelity signals in production. We are well on our way to building robust, well-monitored ML systems.
The scientist, engineer, and researcher toss a handful of crutches back into the supply closet.
The code that generated these figures and implements DML for a simulated auction is available here. That code, and this article, were written by Will Landecker. If you’d like some help working through thorny problems related to ML evaluation and bias, please learn about working with me or reach out to hello@accountablealgorithm.com.
Thanks to Kevin Dyer, Rob Story, and Sara Staszak for helpful feedback about this article.