Senior leaders across industries face growing pressure to demonstrate that their AI systems are transparent, fair, and well governed. Regulators want explanations. Boards commission dashboards. Customers expect to understand how decisions that affect them are reached.
In response, many organizations have turned to interpretability tools — visual summaries, plots, and model explanations that appear to show how algorithms work. Sometimes called explainability or transparency tools, these methods are now widely treated as evidence that AI systems are behaving responsibly.
Our research suggests this confidence is misplaced. AI and machine learning models can be made to look fair and neutral in their interpretability outputs while continuing to produce biased real-world decisions. Apparent transparency can provide reassurance without providing protection.
The Promise and the Trap of Explainable AI
Explainable AI, the use of interpretation methods to make complex models more understandable, has become a cornerstone of responsible AI practice. Complex models may be opaque, but interpretability tools let decision-makers see how inputs relate to outputs. The working assumption behind most AI governance frameworks today is that if those relationships look reasonable, the system must be behaving reasonably.
In regulated industries, the assumption is especially attractive. In insurance pricing, models are routinely accompanied by plots showing how predicted risk varies with age, vehicle characteristics, or location. When these plots appear smooth and intuitive, and sensitive attributes appear neutral, they provide comfort to executives, regulators, and boards alike.
Apparent transparency can provide reassurance without providing protection.
The problem is that some of these interpretation tools, partial dependence (PD) plots in particular, do not directly test how a model behaves on real customer data. Instead, they probe the model using synthetic feature combinations that, when features are strongly correlated, can fall well outside the range of data the model was trained on and is likely to encounter in practice.
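To see why, it helps to look at how a PD curve is actually produced. The sketch below is a minimal illustration in Python using a made-up two-feature pricing model and a simulated portfolio in which age and years licensed are strongly correlated; none of it comes from the underlying research, but it follows the standard substitute-and-average recipe that PD plots rely on.

```python
import numpy as np

# Hypothetical portfolio: age and years licensed are strongly correlated,
# as they are in real motor insurance data.
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, 5000)
years_licensed = np.clip(age - 18 - rng.exponential(2.0, 5000), 0, None)
X = np.column_stack([age, years_licensed])

def predict(X):
    """Stand-in pricing model: young, inexperienced drivers cost more."""
    return 400 + 600 * np.exp(-(X[:, 0] - 18) / 10) - 3 * X[:, 1]

def partial_dependence(predict, X, feature, grid):
    """For each grid value, overwrite the feature in every record and average."""
    curve = []
    for value in grid:
        X_probe = X.copy()
        X_probe[:, feature] = value   # e.g. force age = 20 for all 5,000 rows
        curve.append(predict(X_probe).mean())
    return np.array(curve)

pd_age = partial_dependence(predict, X, feature=0, grid=np.arange(18, 81, 2))

# When age is forced to 20, most probed rows keep driving histories far longer
# than any real 20-year-old could have.
share = (years_licensed > 2).mean()
print(f"Share of probed rows that become implausible at age 20: {share:.0%}")
```

The averaging step is the vulnerability: a large share of the rows being averaged describe customers who cannot exist.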
How a Model Can Look Fair Without Being Fair
In our work, we studied modern machine-learning models used for insurance pricing and accompanied by standard interpretability tools, specifically partial dependence plots.
We found that it is possible to modify a model so that its interpretation plots appear neutral with respect to certain characteristics, even while the model’s real pricing decisions remain largely unchanged.
The mechanism exploits a subtle feature of how PD plots work. For each test value of age, the plot substitutes that value into every customer record and averages the model’s predictions. This means it inevitably considers combinations, such as a young age paired with decades of driving history, that almost no real customer exhibits. A model can be quietly tailored to behave differently in precisely these sparse, near-empty regions, neutralizing any discriminatory pattern in the plot while leaving predictions for real customers essentially unchanged.
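The toy example below illustrates the idea, not the specific construction used in Xin, Hooker, and Huang (2025). It modifies the stand-in model from the earlier sketch only where a stated age is incompatible with the recorded driving history, a region no real customer occupies. That change removes most of the age effect from the PD curve while leaving every realistic record's price untouched.

```python
import numpy as np

# Same toy setup as before: correlated age and driving history, a stand-in model.
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, 5000)
years_licensed = np.clip(age - 18 - rng.exponential(2.0, 5000), 0, None)
X = np.column_stack([age, years_licensed])

def original(X):
    return 400 + 600 * np.exp(-(X[:, 0] - 18) / 10) - 3 * X[:, 1]

def manipulated(X):
    """Identical on realistic records; altered only where the recorded driving
    history exceeds what the stated age allows, which no real customer exhibits."""
    impossible = X[:, 1] > X[:, 0] - 18
    offset = np.where(impossible, -600 * np.exp(-(X[:, 0] - 18) / 10), 0.0)
    return original(X) + offset

def pd_curve(predict, X, grid):
    return np.array([predict(np.column_stack([np.full(len(X), v), X[:, 1]])).mean()
                     for v in grid])

grid = np.arange(18, 81, 2)
print("Age effect visible in the PD plot (max minus min):")
print("  original model:   ", round(float(np.ptp(pd_curve(original, X, grid))), 1))
print("  manipulated model:", round(float(np.ptp(pd_curve(manipulated, X, grid))), 1))
print("Largest change in price for any real customer:",
      round(float(np.abs(manipulated(X) - original(X)).max()), 1))
```

In this toy, the manipulated model's PD curve for age is far flatter than the original's, yet no real customer's price changes at all.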

Figure 1 presents an illustrative example of a model's explanation before and after manipulation. After the change, the explanation appears neutral, yet the model's claim predictions for real customers remain largely unchanged, consistent with the findings reported in Xin, Hooker, and Huang (2025). The interpretation changed. The discrimination did not.
The result is what we call interpretability arbitrage: interpretation outputs satisfy governance and regulatory expectations while the underlying decisions remain untouched.
Why This Should Concern Every Executive
The implications reach well beyond insurance. Any organization using AI to set prices, assess eligibility, allocate resources, or manage risk faces the same exposure.
When governance relies too heavily on interpretability tools, compliance risk grows quietly. Organizations may approve models that appear transparent even as they fall short of anti-discrimination and fairness requirements. And extensive documentation of interpretation outputs offers limited protection if real-world outcomes are later shown to be systematically biased.
Board oversight is weakened in a subtler way. Polished dashboards make it easy to confuse interpretability with accountability. Leaders end up scrutinizing plots rather than the decisions those models actually produce. A board that approves an AI system based on favorable interpretation outputs has not necessarily approved the decisions that system will make on real customers. The two can diverge, and our research shows exactly how.
Interpretability tools are one input into governance, not a substitute for it.
When customers or regulators eventually discover that transparency mechanisms provided reassurance without real protection, the reputational damage extends beyond the AI system itself.
What Leaders Should Do Now
If interpretability tools can mislead without any intent to mislead, leaders need to rethink what AI governance actually requires. Four changes in practice would substantially reduce the risks we have identified:
- Ask what the tool is actually measuring. When features are correlated, partial dependence plots estimate model behavior using feature combinations that are partly synthetic, and those combinations can include pairings that no real customer resembles. Before relying on any interpretation output for governance purposes, ask whether the model is being tested on realistic inputs or on statistical artifacts.
- Test model behavior on real cohorts, not synthetic scenarios. The most direct check on fairness is to examine actual model outputs for defined customer groups, by age band, location, vehicle type, or whatever attributes are relevant. This is harder than reading a plot, but it reflects what the model genuinely does; a minimal sketch of this kind of cohort check follows this list.
- Treat interpretation outputs as signals, not proof. Standardized plots should inform governance conversations, but they shouldn’t be the final word. Leaders should require teams to draw on multiple methods and check whether interpretation outputs hold up against individual-level inspection.
- Build internal capacity to challenge models, not just read reports. An interpretability dashboard is only as useful as the people reviewing it. Leaders and risk functions need access to staff or advisors who understand how these tools work and what they cannot show, people who know to ask whether a clean plot reflects clean behavior, or just a well-constructed proxy.
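The cohort check behind the second recommendation can be quite simple. The sketch below assumes a scored portfolio exported to a CSV file; the file name and the columns "age" and "predicted_price" are placeholders for illustration, not any specific system's schema.

```python
import pandas as pd

# Hypothetical export of real, scored quotes; file and column names are placeholders.
scored = pd.read_csv("scored_portfolio.csv")

# Compare what the model actually charges each age band on real records,
# rather than what an interpretation plot suggests it would charge.
scored["age_band"] = pd.cut(scored["age"], bins=[18, 25, 35, 50, 65, 100])
by_band = (
    scored.groupby("age_band", observed=True)["predicted_price"]
          .agg(["count", "mean", "median"])
)
print(by_band)

# A large gap between bands here is evidence about real decisions;
# a flat PD curve is not.
relative_gap = by_band["mean"].max() / by_band["mean"].min() - 1
print(f"Largest relative gap in average price across age bands: {relative_gap:.0%}")
```

The same pattern extends to any attribute of concern: group real, scored records, summarize the outputs, and compare the gaps against the organization's fairness commitments.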
Interpretability Is Not the Same as Accountability
Explainable AI has given organizations a way to demonstrate responsible practice. That is genuinely valuable. But interpretability tools are one input into governance, not a substitute for it. Treating them as sufficient creates exactly the kind of exposure our research describes.
Accountability for AI decisions has to rest on what those models actually do to real people. A clean plot is not evidence of that. It is, at best, a starting point.
This article draws on research published in: Xin, X., Hooker, G., & Huang, F. (2025). “Pitfalls in Machine Learning Interpretability: Manipulating Partial Dependence Plots to Hide Discrimination.” Insurance: Mathematics and Economics, 103-135.