Data Judging Itself
By Will Landecker
A dive into the world of ML and AI evaluation, and a reminder to be brutally honest about the ways in which our data, models, and evaluations can mislead us.
I was recently a guest on the Data Neighbor podcast, which ended with a great question: if I could convince every ML org to make a single change, what would it be? My answer: Spend more time questioning whether you’re optimizing the right thing.
What we want from an evaluation
When we design ML systems, we set out to build a series of useful simplifications.
We start with a real-world problem – perhaps we want to recommend an entertaining movie, or keep a user safe from financial fraud. The goals of these problems can be challenging to articulate and quantify. They often involve many tangled and competing motivations of the individual, or complex interactions within a society. Often, they involve both.
So we simplify. We collect data. We use it as a proxy for the squishy, hard-to-articulate needs and desires of our users. We log data from phones and websites, building massive datasets. We hope that our cold, hard data works as a (precise enough) stand-in for what would otherwise be too translucent to work with: intentions and feelings. We don't know exactly how this movie made the user feel. But we know that they clicked and watched it. So let's count the clicks.
And we simplify. We compress all that data into a model, which embodies the most essential relationships of the data. The model learns to make useful predictions based on a user's history, and based on other users' histories. It becomes a matchmaker. Will this user watch that movie? Ask the model. It has all of the signal, none of the noise.
And we simplify. We test the model by throwing even more data at it. We test the model's matchmaking abilities on data where we already knew the answer. We grade the model's work. We distill this test down to a single essential number—an evaluation of our model's quality.
For these simplifications to be useful, they must be well aligned. A well-aligned process forms a series of concentric compressions from the real-world problem, to data, to model, to our final evaluation, all perfectly centered on top of each other. If our simplifications and compressions form a dartboard, then a well-aligned evaluation is the bullseye in the very center. It captures the essence of the problem we are trying to solve. It defines the goal.

What ML evaluation is
The most common type of ML evaluation is based on held-out data: unseen data that was peeled off from our dataset before training. Let's make it concrete with an example. Say we're building a model to detect whether a financial transaction is fraudulent. We sample a dataset from a large history of previous transactions: information about each purchase, the merchants, the customers, the dollar amounts, and contextual information like the types of devices used to make the transaction.
The majority of this data is funneled straight to our model-training algorithms, where it will be compressed and simplified into a model. However, a small amount (say, 10%) is withheld from the model until training is complete. We will use this held-out data purely for assessment. It will tell us how well our model has learned to distinguish between fraudulent and non-fraudulent transactions. This is the ML evaluation.
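To make the mechanics concrete, here is a minimal sketch of a held-out evaluation in Python. Everything in it is illustrative: the `transactions.csv` file, the feature columns, and the choice of a gradient-boosted classifier with AUC as the metric are assumptions for the example, not a description of any particular production system.

```python
# Minimal sketch of a held-out evaluation (illustrative names and choices).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical transaction history with a binary `is_fraud` label.
transactions = pd.read_csv("transactions.csv")
features = ["amount_usd", "merchant_category", "device_type", "hour_of_day"]
X = pd.get_dummies(transactions[features])   # one-hot encode categorical columns
y = transactions["is_fraud"]

# Withhold ~10% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

model = GradientBoostingClassifier().fit(X_train, y_train)

# The "single essential number": how well the model and the held-out data agree.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUC: {auc:.3f}")
```

Note that every number this script prints is computed from the same dataset the model was trained on; nothing in it consults the outside world.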
In technology companies, a tremendous amount of organizational and financial pressure rides on this evaluation, and it's easy to see why. Promotions and layoffs hinge on whether we can successfully nudge this number in the right direction. We need this evaluation to be our True North: it is responsible for letting us know whether we did something useful. It should tell us which experiments were fruitful and which were not, guiding us toward the best possible solution. No matter what compromises we have had to make along the way, we tell ourselves, the evaluation will keep us honest.
It's easy to see why we place our hopes and dreams in this type of evaluation. If a model performs well on unseen data, then everything worked, right? Our model successfully captured the most useful relationships in our data, didn't it?
Sure. But let's be brutally honest about what that means and what it doesn't.
Remember that our model itself was distilled from the very same data that produced the evaluation. The model and the evaluation are cut from the same cloth, and they are both simplifications of the original problem. Neither is able to say anything about whether those simplifications were correct. And therefore, they cannot tell us if we've solved the original problem. They only tell us if the evaluation (which comes from the data) agrees with the model (which also comes from the data).
In a very real way, standard ML evaluations ask the data to judge itself.

Despite this circularity, ML evaluations are useful! They are a critical step in building an ML system. The ML evaluation measures consistency, and helps us catch any errors that might have snuck in during our attempt to distill the data down to its most essential relationships. It is helpful to know how well the model and data agree.
But it doesn't tell the whole story.
What ML evaluation isn't
ML evaluation based on held-out data doesn't tell us whether we have solved the original real-world problem. In particular, it can't tell us whether our data captures the essential nature of our original problem. It doesn't know whether the people who clicked the button ended up where they wanted to be, or whether they just wanted to click. It doesn't know if people are scrolling through their feed because it improves their lives or simply because they are struggling to look away.
The evaluation can't tell us whether we've distilled our data into the relationships that truly represent progress in the real world, or if we've accidentally found spurious correlations that will take us somewhere unintended. It doesn't know these things because the evaluation, like the model itself, is based on the data. And the data doesn't know what it doesn't know. Without consulting some outside source for validation, an ML evaluation can't tell us if it's the perfect bullseye or whether our design decisions have led us away from the original problem we intended to solve. And therefore, even if we do a great job on the evaluation, we don't automatically know if we've made real progress.
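One concrete (and hypothetical) way this plays out is label leakage. In the sketch below, a `chargeback_filed` column, a signal that only exists after fraud has already been confirmed, leaks the answer into both the training and held-out splits. The held-out score comes back nearly perfect, even though the feature will not be available when the model has to make real predictions. The data, the feature name, and the model choice are all invented for illustration.

```python
# Hypothetical illustration: a leaked feature fools a held-out evaluation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
is_fraud = rng.random(n) < 0.02                 # ~2% fraud rate
amount = rng.lognormal(3, 1, n)                 # genuinely uninformative feature
chargeback_filed = is_fraud.astype(float)       # leaked: only known after fraud is confirmed

X = np.column_stack([amount, chargeback_filed])
X_train, X_test, y_train, y_test = train_test_split(
    X, is_fraud, test_size=0.10, stratify=is_fraud, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUC: {auc:.3f}")  # close to 1.0

# The held-out split contains the same leak as the training split,
# so the evaluation reports near-perfect performance and raises no alarm.
```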

ML evaluation based on held-out data is not objective. Every bit of math and code we use to compute a metric is a compromise between many competing goals. A thorough explanation of these goals is well beyond the scope of this article, and thankfully, smarter people than me have already done a great job. I recommend Sean Taylor's article on Designing and Evaluating Metrics. Sean explains how metrics are designed artifacts, aiming for a balance between opposing forces like faithfulness, simplicity, and cost. Deciding the right balance is as much an art as a science. It requires subjective decisions, and that subjectivity becomes a part of the foundation upon which the evaluation is built.
ML evaluation is not neutral. Data is recorded from systems that exist in the real world, and like those systems, it is subject to upstream inaccuracies and biases. History is reflected in it. Bias that existed before we built our system (e.g., income inequality, education gaps, hiring discrimination) is pulled into our data, then into our model, and then into our evaluation. Once embedded in the evaluation, these biases become dead weight. They create drag as we try to build products that serve everyone's needs and companies that grow into new markets.
ML evaluation based on held-out data doesn't tell us if we missampled the data, nor does it warn us if we misinterpreted the data. Critically, it cannot observe itself. The evaluation cannot tell us if the evaluation is well aligned, or if we are spending our time and energy optimizing something that is leading us astray. And so we must ask the question ourselves.

There is an argument that says: Of course the ML evaluation based on held-out data does none of these things. That was never its intended purpose. We use different methodologies to accomplish those things. We test the model in production and look at user behavior. If the model results in more movies watched, more dollars spent, and more time scrolling, then we know that we have accomplished our original goal.
No, we don't. At least not as often as we lead ourselves to believe. Because even those tests in production rely on simplifications and assumptions about our data. We still have the opportunity to be misled by misalignment, oversimplifications, and incorrect assumptions. We still cannot rely on the evaluation to tell us if the data used in that evaluation is a good representation of the problem we meant to solve.
Know Your Limits
This article may sound like a diatribe against ML evaluation with data, but let me assure you: I am a big fan of the stuff. As I said before, it's one critical step toward understanding whether or not we've built something useful with ML.
But sometimes, it's too easy to make ML evaluations out to be more than they are. And the evidence of this mistake is all around us. We've all spent time on social media platforms whose recommendation systems try to maximize time-on-site and end up distributing contagious conspiracy theories or morbidly fascinating public conflicts. We've all been spammed by advertisements served by models that are trained to accumulate clicks but end up polluting websites with meaningless click-bait: chopped-up word salads that drive hopeful traffic to disappointing dead ends.
The metrics go up. Users click, and read, and dwell. The evaluation tells us that our models look good. But only because we have our backs turned to the real problem.

So the next time you find yourself looking at your metrics, I hope you and your team ask yourselves a few extra questions:
- What compromises did we make in our evaluations and measurements? Could there be a growing gap between the progress we record through proximate signals in our data and the real-world effects we have on our users?
- How do we hope to grow? What do growth and change look like for our product, our team, our business? Will our metrics see this change and report it as the positive movement we have been working toward? Or do our evaluations assume that future data should look like the past, and that any deviation from previous patterns must be a problem?
- Does our data describe all users equally well? Could the numbers go up because some groups benefit while others experience friction or harm?
- What do domain experts know about the problem we are trying to solve? Do they see any holes in our definitions of success? Are we missing any data that they think is important? Are we including any data that they believe is irrelevant?
Question your metrics. And invite others to do the same.
This article was written by Will Landecker. If you would like help thinking through the challenges of ML evaluation, you can learn about working with me or reach out at hello@accountablealgorithm.com.