
Right for the wrong reason

  • Writer: Will Landecker
  • 8 min read

Interpretable machine learning gives us tools to find and fix the blind spots of our data and models.

Illustration by Erica Fagin.

Recently, some heavy hitters have been making the case for greater explainability and interpretability in AI. Dario Amodei, the co-founder and CEO of Anthropic, wrote a piece titled "The Urgency of Interpretability." In it, he argues that black-box AI systems could harbor dangerous internal machinery that we might not detect until it’s too late.


As someone who has spent 16 years working with interpretable ML (first researching, then implementing and practicing), I think this is a good time to share my experiences. If there's one thing I want everyone to walk away with, it's the fact that interpretable ML can help us solve a problem that silently derails ML systems: it helps us find deficiencies in a dataset that allow a model to be right for the wrong reason.


Cracking open the black box

For the past 15 years, we have been building bigger. Our datasets keep growing, taking advantage of both increasing data availability and new learning paradigms that require less structure in the data. More data means we can extract more complex patterns, so our models have grown in size and complexity as well. Whereas we used to train an image-detection model containing hundreds to thousands of parameters on 100 megabytes of hand-labeled images, today we train models with hundreds of billions of parameters on terabytes of data.


The size and complexity of these models are overwhelming. Each word produced by a modern LLM requires on the order of 100 billion arithmetic operations. Even though every individual operation can be seen and audited, many feel that the sheer scale of the network is too large to be understandable. Some researchers liken the process of training these massive models to poorly understood alchemy. They are black boxes, and many have given up on seeing the forest for the trees.


Others look at this black-box challenge and see a fight worth fighting. Research in the field of Interpretable Machine Learning (as well as Explainable AI, which I will use synonymously throughout this article) continues to grow, trying to keep up with the frantic pace of new AI developments. Some of these novel interpretable methods dissect the internal structure of the massive AI models, looking for meaning in the constellation of components that responded to a particular input. Others try to simplify a model's logic by analyzing the model's repeated interactions with a large dataset.


These tools allow us to examine a model's output and ask various follow-up questions, each a different flavor of "why?" Why did the model deny this applicant's loan? With an interpretable method, we can get an answer. Why did the model fail to predict this storm? Interpretability lets us peek inside and find out. By design, interpretable methods offer us an interface between human understanding and machine understanding. They aim to answer our (very human) questions about a (very inhuman) complex system.


(It's important to understand that simply querying a generative text model with the input, "Why did you say that?" is not the same as interpretability. Asking a generative model to explain itself will result in the same thing that generative models always create: plausible hallucinations. The model's response may sound reasonable. But it was never trained to be a witness under oath, and it encapsulates no notion of truth. Interpretable ML methods, on the other hand, are designed to yield the truth. A well-founded method of interpretability yields an explanation that is epistemologically consistent with the logic of the black box.)
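To make that contrast concrete, here is a minimal sketch of one well-founded method: permutation importance, which measures how much a model's held-out accuracy actually depends on each feature. The loan-style dataset and feature names below are fabricated purely for illustration; a real audit would also use richer, instance-level attribution methods.

```python
# A minimal sketch, assuming a hypothetical tabular loan dataset.
# Feature names ("income", "debt_ratio", "years_employed") are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(60_000, 15_000, 2_000),
    "debt_ratio": rng.uniform(0, 1, 2_000),
    "years_employed": rng.integers(0, 30, 2_000),
})
# Synthetic labels: approval mostly driven by debt ratio and income.
y = ((X["debt_ratio"] < 0.5) & (X["income"] > 45_000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

# Ask "why?" at the dataset level: how much does each feature actually
# contribute to the model's decisions on held-out data?
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name:>16}: {score:.3f}")
```

Unlike asking a chatbot to explain itself, an answer like this is derived directly from the model's measured behavior on data.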


Interpretable ML is designed to be, well, interpretable. By a human. With knowledge. The more knowledge we have about the problem we’re trying to solve (domain knowledge, sociological knowledge, historical knowledge), the more thoughtfully we can scrutinize the model’s solution. We can compare our knowledge to the machine’s and spot any misalignments.


Uncovering these misalignments is especially critical in high-stakes settings. If a model is involved in a decision that affects anyone's health or well-being, it's hard to trust the AI without the ability to audit it deeply. In highly regulated fields like banking, employment, housing, and medicine, the ability to explain decisions is sometimes even a legal requirement.


Diagnosing mistakes

More than just ticking a regulatory checkbox, interpretability helps us build models that are better aligned and more accurate at solving the actual problems we built them for.


When we see systematic mistakes from a black-box model, the standard fix is to patch the dataset. We find more examples similar to the ones causing the errors and add them to the model's training data. This patch will help teach the model whatever patterns it had missed earlier, or so we hope.


But already, we are engaging with some form of the question "why?" when we decide what types of new data to include in the patch. The patch represents a hypothesis about why the model made a systematic error.


Let's say our medical diagnostic model displays lower accuracy with 20 to 40-year-olds in the Pacific Northwest. A proper patch for this model requires a hypothesis about why this might happen. Is it because of their age? Or is it something about the Pacific Northwest region? Is it the conjunction of those things? Or does this demographic group happen to correlate with a medical fact (such as a higher incidence of multiple sclerosis) that was not adequately represented in our dataset?
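One way to test those competing hypotheses before committing to a patch is to slice the evaluation results along each candidate factor and its conjunctions. The sketch below assumes a hypothetical evaluation log; the file and column names ("eval_results.csv", "age", "region", "label", "pred") are stand-ins for whatever the real logs contain.

```python
# A minimal sketch of slice-based error analysis, assuming hypothetical
# per-patient evaluation logs with demographic columns.
import pandas as pd

results = pd.read_csv("eval_results.csv")  # assumed: one row per patient
results["age_band"] = pd.cut(results["age"], bins=[0, 20, 40, 60, 120],
                             labels=["<20", "20-40", "40-60", "60+"])
results["correct"] = results["pred"] == results["label"]

# Accuracy by age band, by region, and by their conjunction: if only the
# conjunction is low, the problem isn't age or geography alone.
print(results.groupby("age_band", observed=True)["correct"].mean())
print(results.groupby("region")["correct"].mean())
print(results.groupby(["region", "age_band"], observed=True)["correct"].mean())
```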


Here again, interpretability gives us a bridge between the machine's knowledge and our own. We can ask an interpretable method to reveal the model's logic, brilliant and flawed. When interpretable methods point out those flaws, they help us chart a course for improving the model. Without answers to these questions, we can only play whack-a-mole, hoping to stumble on the combination of patches that yields better results. Without interpretability, we are back to alchemy.


Figure 1: A demonstration of interpretable ML lifted from my own PhD dissertation, "Interpretable machine learning and sparse coding for computer vision," published in 2014.

I first uncovered this use for interpretable ML in 2009 while looking for a topic for my PhD dissertation. I was analyzing a model that was trained to determine whether or not an image contained an animal. (This is a simple-enough task for today's models, but back in 2009, it was a stretch for computer vision.)


In Figure 1 above, you can see three images that the model misclassified. They all contain animals, but for some reason, the model couldn't see them. The bottom row uses false color to explain the model's logic, and the blue splotches show us which regions of the image caused it to be misclassified. In some cases (L and M), the misclassification was due to artifacts in the background that carried more weight with the classifier than the animal did; in others (N), the animals themselves were not represented well enough in the training data. Each of these reasons for misclassification called for a different type of patch.
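For readers who want a feel for how maps like these can be produced, here is a minimal sketch of an occlusion-style explanation in the same spirit as Figure 1 (not the dissertation's exact method): hide one region of the image at a time and record how much the "animal" score drops. The `model.predict_proba` interface is an assumption made for illustration.

```python
# A minimal sketch of occlusion sensitivity. Assumes a hypothetical model
# whose predict_proba maps a batch of HxWxC images to class probabilities,
# with class 1 meaning "contains an animal".
import numpy as np

def occlusion_map(model, image, patch=16, stride=8, fill=0.5):
    """Per-pixel score drop for the 'animal' class when each region is hidden."""
    h, w = image.shape[:2]
    base = model.predict_proba(image[None])[0, 1]
    heat = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill  # hide this region
            drop = base - model.predict_proba(occluded[None])[0, 1]
            heat[y:y + patch, x:x + patch] += drop
            counts[y:y + patch, x:x + patch] += 1
    return heat / np.maximum(counts, 1)

# Large positive regions mattered most to the "animal" decision; negative
# regions (like the blue splotches in Figure 1) pushed the model away from it.
```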


Silent failures

Fixing the machine's incorrect answers is all well and good, but when we are building high-stakes algorithms, there are scarier things to worry about.


When we talk about an algorithm's errors or mistakes, we are talking about a disagreement between the data and the model. Let's say the model is attempting to diagnose the malignancy of tumors from medical images. An error means that the model believes a benign tumor is malignant, or a malignant tumor is benign. As long as we have sufficient data that has been labeled by human experts, we can detect and patch these mistakes.


But what about the mistakes that can't be measured? What happens if the model correctly guesses the tumor's malignancy, but for a reason that indicates flawed, brittle, untrustworthy logic? What if the model completely misinterprets the data, but gives the correct answer by chance? How do we even detect those kinds of mistakes?


Let's make it concrete with an example. To train and evaluate our algorithm, we need labeled data. In this case, that's a large pile of tumor images, and each image is annotated by an expert as being either malignant or benign. By poring over that labeled data, our model learns to tell the two groups apart.


Labeled data also allows us to verify how well the algorithm has learned to make this distinction. We can measure the model's accuracy by comparing the model's responses to the human experts' labels. A highly accurate model is one that we believe will be more helpful when deployed.


However, sometimes a mistake slips into the data. Let's say we collected our tumor images from a doctor who scribbled notes in the corner of all the malignant tumor images, but not in the benign ones. Our algorithm may learn to simply look for the scribble. If it does so, its answers will agree with the experts' labels. It will score perfectly in our evaluation despite not having solved the real problem at all.


The model that accidentally catches malignant tumors simply by detecting scribbles is giving the right answer for the wrong reason. And if our metrics only measure accuracy, then the metrics will look good. But what happens if we take this model to market, and try to detect malignancy on images coming from other doctors and hospitals? It will fail. Horribly and spectacularly.
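The whole failure mode fits in a few lines of fabricated data. The sketch below is purely synthetic; `has_scribble` stands in for the doctor's annotation artifact, and none of the numbers mean anything beyond the illustration.

```python
# A minimal synthetic sketch of "right for the wrong reason".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
tumor_signal = rng.normal(0, 1, n)                      # the real (weak) signal
malignant = (tumor_signal + rng.normal(0, 1, n) > 0).astype(int)

# Training hospital: a scribble appears on exactly the malignant images.
has_scribble = malignant.copy()
X_train = np.column_stack([tumor_signal, has_scribble])
model = LogisticRegression().fit(X_train, malignant)
print("training-hospital accuracy:", model.score(X_train, malignant))  # near 1.0

# New hospital: no scribbles at all. The shortcut vanishes, and so does
# the apparent performance.
X_new = np.column_stack([tumor_signal, np.zeros(n)])
print("new-hospital accuracy:", model.score(X_new, malignant))         # much lower
```

Accuracy measured at the training hospital looks nearly perfect; the moment the scribbles disappear, so does the performance, and nothing in the standard metrics warned us.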


For developers of trustworthy AI, this is the stuff of nightmares.



Sometimes a dataset encapsulates both a statistical bias and a social one. This was the case when a company tried to automate hiring decisions by training on resumes from its own corner of the gender-imbalanced tech industry. It ended up with an algorithm that penalized resumes listing women-focused activities (e.g., "women's chess club captain").


In each of these cases, the models appeared to be accurate, but only because of biases and undesired correlations in the data used to train and evaluate them.


I've written elsewhere about how standard ML metrics don't catch these kinds of blind spots in our data. The moment we settle on a particular dataset to build a model, our model begins to embody that dataset. It will internalize the data's information, its patterns, and its relationships. But also its inaccuracies, its biases, and, just as importantly, our own misunderstandings and incorrect assumptions about the data.


This is why I say there are worse things to worry about than simple inaccuracy. An inaccurate model is one whose performance problems are detected with labeled data. The failure is loud. The data does the hard work for us. But a model that exploits unknown spurious correlations in our dataset is one whose failure will slip through the gates undetected. Its failure is silent.


Interpretable ML can be a (literal) lifesaver here. While standard accuracy metrics rely on labeled data as the source of truth, interpretable ML allows us to use our own knowledge (or that of a trusted expert) as the frame of reference. The doctor can inspect not just the answer, but the explanation of the answer. Suddenly, human expertise becomes a guardrail that helps us build trustworthy algorithms.
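As a small, hedged illustration of what that guardrail can look like in practice: compare the features an attribution method says the model leaned on against a list a domain expert considers meaningful, and flag anything suspicious for human review. The feature names and thresholds below are hypothetical.

```python
# A minimal sketch of an "explanatory guardrail". Assumes some attribution
# method has already produced a feature -> importance mapping per decision.
EXPERT_APPROVED = {"lesion_diameter", "border_irregularity", "tissue_density"}

def audit_attributions(attributions, top_k=5, threshold=0.05):
    """Flag highly weighted features the expert did not expect to matter."""
    top = sorted(attributions.items(), key=lambda kv: -abs(kv[1]))[:top_k]
    return [(name, score) for name, score in top
            if name not in EXPERT_APPROVED and abs(score) >= threshold]

# "image_corner_texture" is not clinically meaningful, so it gets flagged
# for a human expert to investigate.
flags = audit_attributions({
    "lesion_diameter": 0.41,
    "image_corner_texture": 0.37,
    "border_irregularity": 0.12,
})
print(flags)
```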


I'll write more in a future article about the process of using interpretability as a guardrail, as it's worthy of its own space. For the moment, I'll share a few quick highlights:

  • Red-teamers and algorithm auditors should be comfortable using interpretable methods, and they should be given those methods' explanations for both accurate and inaccurate machine outputs.

  • Domain experts should be included in the process of "explanatory auditing," and disagreements between their understanding of a problem and the machine's are worth identifying and investigating.

  • Those disagreements will usually result in one of two judgments: either the machine's internal logic relies on spurious correlations, or it doesn't. In the former case, we learn how to improve the data and model. In the latter, we grow our own understanding of the problem. (The latter case is arguably more exciting.)



This article was written by me, Will Landecker. If you would like help thinking through challenging issues in interpretable ML, dataset bias, and trustworthy AI evaluation, you can click here to learn about working with me or simply send an email to hello@accountablealgorithm.com

 
 
 



