Why Explaining AI Is Not the Same as Understanding It


Quick Intro: My name is Strad and I am a new grad working in tech who wants to learn and write more about AI safety and how tech will affect our future. I’m trying to challenge myself to write a short article a day to get back into writing. I’d love any feedback on the article and any advice on writing in this field!

Currently, AI models are black boxes. They take in inputs and produce useful outputs, but we don’t understand what is truly going on in between.

Because AI models are becoming more ingrained in society, the need to truly understand how they work has grown. There are two main approaches to understanding and predicting how models will act: explainability and interpretability.

The Difference Between Explainability and Interpretability

Explainability works by assessing the outputs a model generates after the fact. This strategy attempts to explain why a model produced a given output. Interpretability goes a step deeper and tries to explain the underlying reasoning within a model that caused it to produce an output.

This difference can be demonstrated through a simple analogy. Imagine you are on trial for a crime and the jury finds you guilty. One way to understand the verdict is to hear the judge’s explanation of the evidence that persuaded the jury to vote against you. This is like explainability: you are given the inputs (the evidence) and the output (your guilty verdict) and told which pieces of evidence led to the verdict.

If you weren’t satisfied with this answer, you could go one step deeper and refer to the legal framework that defines guilt itself. You would understand what it truly means to be guilty and how the rules associated with guilt, combined with the evidence, produce your guilty verdict. This is like interpretability: you are given the underlying logic that lets you fully understand the reasoning behind the verdict.

Based on this analogy, you might notice the advantage interpretability seems to have over explainability. While it is always useful to know how inputs led to a given output, that knowledge can feel surface-level compared to an understanding of the inner reasoning that connected them. Both strategies have their uses, but interpretability offers far more value because it lets us better understand, and therefore more reliably control, models. To see the limitations of explainability, it helps to look at the methods commonly used for it.

Methods for Explainability

There are three popular methods for explainability: LIME, SHAP, and Anchors.

LIME (Local Interpretable Model-Agnostic Explanations) works by building a simple model that imitates the main model near the point of a given prediction. First, an input is given to the main model, and then that input is slightly altered many times (e.g. for a job application: changes in name, location, skills, etc.).

All of these altered inputs are then fed into the main model, and the resulting outputs are used to train the simpler model. That simple model can then be used to determine the important features and their positive/negative contributions for that particular prediction, which acts as an explanation for why the output was generated. While useful, LIME is limited: the actual model is never explained, and the explanations depend on the quality of the altered inputs you choose.
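To make this concrete, here is a rough, from-scratch sketch of the LIME recipe (not the official lime package): perturb an input, query the black-box model, weight the perturbed samples by how close they stay to the original, and fit a simple linear surrogate. The function name, the Gaussian perturbation scheme, and the kernel width are my own illustrative choices, and predict_proba is assumed to be any scikit-learn-style probability function.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_explanation(predict_proba, x, num_samples=500, kernel_width=0.75):
    """Sketch of the LIME idea: perturb an input, query the black-box model,
    and fit a weighted linear surrogate that imitates it locally."""
    rng = np.random.default_rng(0)
    # 1. Perturb the input many times (here: small Gaussian noise per feature).
    perturbed = x + rng.normal(scale=0.1, size=(num_samples, x.shape[0]))
    # 2. Query the black-box model on every perturbed input.
    preds = predict_proba(perturbed)[:, 1]  # probability of the positive class
    # 3. Weight samples by how close they are to the original input.
    dists = np.linalg.norm(perturbed - x, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))
    # 4. Fit a simple, interpretable surrogate model on the perturbed data.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbed, preds, sample_weight=weights)
    # The surrogate's coefficients act as the "explanation": positive values
    # push the prediction up near x, negative values push it down.
    return surrogate.coef_
```

The surrogate’s coefficients only describe the model’s behavior in the small neighborhood around the input, which is exactly why LIME’s answers depend so heavily on how the perturbed inputs are generated.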

SHAP (SHapley Additive exPlanations) calculates a feature’s contribution to an output by averaging its marginal contribution across all possible combinations of the other features. In effect, SHAP asks questions such as: “What happens if we include this feature?” “What happens if we leave this feature out?” “What happens if we include this feature first?” “What happens if we include this feature last?”

Again, while this method is useful, it has limitations: it is computationally heavy, and it still does not explain how the model actually represents concepts internally.
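Here is a minimal sketch of that averaging, computing exact Shapley values by brute force over every coalition of features. The toy additive “value function” and the feature weights are made up purely for illustration; real SHAP implementations use approximations precisely because this exhaustive loop is exponential in the number of features.

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shapley_values(value_fn, num_features):
    """Exact Shapley values: average each feature's marginal contribution
    over every possible coalition of the other features."""
    features = list(range(num_features))
    phi = np.zeros(num_features)
    for i in features:
        others = [f for f in features if f != i]
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                s = set(subset)
                # Weight = how often this coalition appears across all orderings.
                weight = factorial(len(s)) * factorial(num_features - len(s) - 1) / factorial(num_features)
                # Marginal contribution: value with feature i minus value without it.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Toy "value function": the model's output when only the given subset of
# features is present (a made-up additive score for illustration).
feature_weights = np.array([2.0, -1.0, 0.5])
def value_fn(subset):
    return sum(feature_weights[j] for j in subset)

print(exact_shapley_values(value_fn, num_features=3))  # -> [ 2.  -1.   0.5]
```

For a purely additive model like this toy one, each feature’s Shapley value is just its own weight, which is a handy sanity check on the averaging.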

Anchors work by finding rules that are almost always true when a given prediction occurs. This is done by picking a prediction, testing different rules (“has x skill,” “age > 25,” etc.), and stacking them until the prediction becomes stable. At the end, the strategy is left with a set of conditions that lock the prediction in place; in other words, a set of conditions under which the prediction is output with very high probability.

This method is useful for identifying when a prediction is reliable, yet it fails to generalize well and shares the same major disadvantage as the other methods: it does not truly explain how the model actually reasons.
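Below is a greedy, simplified sketch of the Anchors idea (the real algorithm uses a smarter bandit-style search). Here a “rule” is simply “hold this feature at its original value,” and we keep locking the feature that most stabilizes the prediction until perturbed inputs that respect the locked features almost always get the original prediction. The predict function, the Gaussian perturbations, and the precision target are assumptions for the sketch.

```python
import numpy as np

def find_anchor(predict, x, precision_target=0.95, num_samples=500):
    """Greedy sketch of Anchors: repeatedly 'lock' the feature that most
    stabilizes the prediction until it is almost always reproduced."""
    rng = np.random.default_rng(0)
    original_pred = predict(x.reshape(1, -1))[0]
    anchor = set()  # indices of locked features

    def precision(locked):
        # Resample every feature except the locked ones, then check how often
        # the model still gives the original prediction.
        samples = x + rng.normal(scale=1.0, size=(num_samples, x.shape[0]))
        for j in locked:
            samples[:, j] = x[j]  # locked features keep their original value
        return np.mean(predict(samples) == original_pred)

    while precision(anchor) < precision_target and len(anchor) < x.shape[0]:
        # Greedily add the single feature whose locking helps precision the most.
        candidates = [j for j in range(x.shape[0]) if j not in anchor]
        best = max(candidates, key=lambda j: precision(anchor | {j}))
        anchor.add(best)

    return anchor  # e.g. {0, 3}: the prediction is stable whenever these features hold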

The approximate nature of these methods makes explainability unreliable on its own as a way to understand and predict how models think. For one, since these methods only give a post-hoc explanation of why a model does what it does, they can miss alternative or unusual explanations for its behavior. Maybe a model praises a student’s essay because the essay seems to have the specific qualities identified through these methods, when in reality the model has simply learned that giving positive praise makes the student happier.

This example highlights a second problem with these methods: models can, and in certain situations have been shown to, deceive their users. This would be inherently difficult to detect through pure explainability strategies, since the outputs we receive from the model would themselves be unreliable.

Why Interpretability Is Needed

Because of the limitations of explainability, interpretability can’t be ignored if we want a true understanding of AI models. Interpretability would allow us to look directly into the AI’s reasoning and see the actual “thoughts” behind its outputs.

This would make our predictions much more reliable, since we would be getting our explanations from the source rather than guessing after the fact why the model did something. Furthermore, this would likely make detecting deception within a model much easier, since we would be able to see the actual “thoughts” in the model where the intent to deceive arose.

Interpretability would also give us the added advantage of potentially being able to precisely control how a model acts. If we want to change the output of a model using explainability alone, we are limited to reframing prompts in ways that lead to more desirable responses.

With interpretability, however, we could find the specific thought processes within a model related to our desired change and potentially alter that part of the model to drive its behavior in the direction we want.
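As a toy illustration of what that kind of intervention could look like, here is a minimal sketch that nudges a hidden layer of a small PyTorch network along a “steering vector.” The model, the layer choice, and the vector itself are all hypothetical stand-ins; in practice the hard part is the interpretability work that identifies which direction actually corresponds to the behavior you care about.

```python
import torch
import torch.nn as nn

# Toy model standing in for a real network; layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

# Hypothetical "steering vector": a direction in the hidden layer that (we
# assume) interpretability work has linked to the behavior we want to change.
steering_vector = torch.randn(32) * 0.1

def steer(module, inputs, output):
    # Nudge the hidden activations along the identified direction.
    return output + steering_vector

# Register the intervention on the hidden layer and run the model as usual.
handle = model[0].register_forward_hook(steer)
logits = model(torch.randn(1, 16))
handle.remove()  # remove the hook to restore the original behavior
```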

Takeaway

As AI takes on more and more responsibilities within society, it will become increasingly important for us to understand the black boxes taking them on. While explainability strategies have helped us get to where we are now, and will likely remain useful, a deeper and more reliable understanding of these models will be crucial for maintaining their performance and ensuring their safety.
