How AI earns trust in Healthcare
Co-written with Samuel Deting. Thanks to Kevin Hou, Chris Chiu, Daniel Cheng, Claire Wang, Jacob Thai, and Selena You for providing valuable feedback.
Trust enables cognitive offloading
There’s a lot to be learned about trust in Medicine from the dynamic between an attending/consultant and their junior staff. The well-liked interns are distinct. They proactively work up the patient, think ahead to sensible differentials, order the necessary investigations and ask the right questions to rule diagnoses in or out - then package it all up nicely to present back to the attending, who makes a decision if necessary. Initially attentive to every detail, the attending gradually pays less attention as the junior consistently makes the right calls. Through trust, cognitive labour has been outsourced.
However, reliability and accuracy are not enough. Trust also depends on understanding the why when mistakes inevitably happen - whether that lies in the failure modes of the tool or the cognitive biases of the person. Gmail sometimes misclassifies my important emails as spam - but I know to check. The intern misdiagnoses chest pain as musculoskeletal because the ECG and troponin are normal. The experienced cardiology registrar knows to ask whether the onset of pain was within three hours, because early infarctions can produce false-negative troponins.
This “knowing how it works” is given an amorphous name in AI - model interpretability.
Interpretability exists on a spectrum
It’s a slippery concept that escapes definition, but it’s best to think of AI interpretability as a spectrum. At one extreme you have mechanistic interpretability: reverse engineering the neural network into human-interpretable circuits. It’s a lot like pharmacology. Consider aspirin - its precursor, found in willow bark, was first used by Hippocrates 2,400 years ago for pain relief during childbirth (Jones, 2005). Only in 1971 did Vane discover its mechanism of action and how it modified the arachidonic acid pathway (a staple of first-year medical education) (Vane, 1971). In a sense we have aspirin: our AI models solve many daily pains (like replying to emails), but we have yet to reverse engineer them into circuits and algorithms that we can comprehend.
Somewhere in the middle you have early post-hoc interpretability work, e.g. SHAP and Grad-CAM. These instruments occupy an interesting middle ground: they don’t quite reveal the model’s anatomy, but they offer clues about what the model seems to focus on when generating a prediction. Their role in building trust with radiologists is evident in the institutional adoption of chest radiograph and CT interpretation tools.
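To make the post-hoc flavour concrete, here is a minimal sketch using SHAP on a toy tabular model. The features, labels and model are synthetic stand-ins of my own, not a real clinical dataset or product; the point is that the output is an attribution over inputs, not a map of the network’s internals.

```python
# A minimal sketch of post-hoc interpretability with SHAP on a toy model.
# The features, labels and model are synthetic stand-ins, not clinical data.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                      # synthetic "patient" features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # synthetic outcome
feature_names = ["age", "troponin", "heart_rate", "creatinine"]

model = GradientBoostingClassifier().fit(X, y)

# SHAP attributes each prediction to the input features: clues about what the
# model appears to rely on, rather than a view of its internal mechanism.
explainer = shap.Explainer(model.predict_proba, X, feature_names=feature_names)
explanation = explainer(X[:5])
print(explanation.values[0])   # per-feature contributions for one prediction
```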

Perception of Interpretability is all you need?
On the other end you have what I call the perception of interpretability: you have a rough intuition for what the model is doing, or at least you think you do. I remember munching on a blueberry danish at a cardioimmunometabolic conference in the beautiful Keystone range the morning DeepSeek R1, a reasoning model, was released in January 2025. There was pleasant surprise and praise amongst the cardiologists eating with me, lauding its ability to think and backtrack.

The simulacrum of human-like backtracking (“but wait”, “could it be”) in the ‘reasoning’ section is a fascinating emergent behaviour that arose from reinforcement learning that rewards correct final answers - longer, self-correcting responses simply turned out to earn more reward.
The model feels like it’s thinking out loud. Yet the ‘reasoning’ part isn’t generated any differently from the answer. It is still ultimately just extra decoded tokens - the first part of a longer answer. Whilst it no doubt leads to better answers by increasing test-time compute, it does not tell us anything about the intrinsic mechanisms within the network.
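As a sketch of what I mean - assuming the DeepSeek-R1 convention of wrapping the chain-of-thought in <think> tags, and with an invented transcript - separating the ‘reasoning’ panel from the answer panel is just string processing over a single decoded sequence:

```python
# A minimal sketch assuming the DeepSeek-R1 convention of <think>...</think>
# tags around the chain-of-thought. There is no second mechanism here: both
# panels come out of the same stream of decoded tokens.

def split_reasoning(decoded_text: str) -> tuple[str, str]:
    """Split one generated sequence into a 'reasoning' panel and an answer panel."""
    if "</think>" in decoded_text:
        reasoning, answer = decoded_text.split("</think>", 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", decoded_text.strip()

# Illustrative output only; not a real model transcript.
sample = ("<think>But wait - the troponin is normal, but could the onset have been "
          "within three hours?</think> Repeat the troponin at three hours.")
reasoning, answer = split_reasoning(sample)
print("Reasoning panel:", reasoning)
print("Answer panel:", answer)
```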
The visual separation, which implies a distinct ‘thought’ mechanism, and the anthropomorphisation of the AI’s behaviour together create a compelling perception of interpretability. Indeed, people are quick to surrender their trust.
I don’t think technical illiteracy entirely explains the readiness to accept this opaque paradigm of interpretability. I think we all have an acceptable level of opacity when it comes to trust and as long as we are over that trust threshold, we welcome abstraction.
A key insight here is that the mere perception of interpretability is often sufficient to gain an individual practitioner’s trust. Indeed, when you look at AI clinical scribes, practitioners rush to sacrifice interpretability for utility and convenience. Furthermore, once critical mass is achieved, society is forced to ride the wave of path dependency, and institutions like the RACGP and the NHS must retrofit regulation and guidelines. In these cases, adoption begets legitimacy, not vice versa.
Risk & Reliability set the Trust Threshold
Yet not every product is an AI scribe. If you’re building a fully autonomous surgical robot, then the perception of interpretability alone isn’t enough. It’s helpful to think of AI products on the axes of risk and reliability. High risk refers either to severity (the morbidity or mortality an error could inflict on an individual) or to scale (errors that affect many people at once), should something go wrong.
AI scribes are low risk and decently reliable, so they sit well within the trust threshold. No matter how reliable a fully automated surgical robot becomes, its high risk seems to preclude adoption.

Particularly in the case of institutions like hospital networks, high-risk applications require systematic defensibility, i.e. an intrinsic insurance policy - any decision made always has a why, which can be produced in the face of scrutiny. Companies building such AI healthcare products may initially substitute this intrinsic insurance with extrinsic sources of risk mitigation, such as brand (e.g. Epic), but this will only shift the trust threshold so far.
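One way to picture what systematic defensibility could demand of a product - my own illustrative sketch, not a standard or anyone’s actual schema - is that every output ships with a structured, retrievable record of its why:

```python
# An illustrative sketch of "a why for every decision": each model output is
# stored alongside its rationale and supporting references so it can be
# produced under scrutiny. Field names and values are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    patient_id: str
    model_version: str
    recommendation: str
    rationale: str                                     # the "why" behind the output
    evidence: list[str] = field(default_factory=list)  # guidelines, pathways, attributions
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = DecisionRecord(
    patient_id="anon-0042",
    model_version="triage-model-1.3.0",
    recommendation="escalate to cardiology review",
    rationale="rising troponin trend carried the largest weight in the risk score",
    evidence=["local chest pain pathway (illustrative reference)"],
)
print(record)
```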
Interpretability shifts the Trust Threshold
Indeed, for certain extremely high-risk applications, the only path to adoption is to walk across the interpretability spectrum towards mechanistic interpretability.
Eventually, surgical robots become reliable enough at laparoscopic cholecystectomies to enter the zone of sufficient trust, freeing up registrars to focus on what they came for: Whipple procedures.
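To make the geometry of this argument concrete, here is a toy model of the threshold - the numbers are entirely invented, purely to illustrate how risk raises the reliability bar and interpretability pulls it back down:

```python
# A toy model (invented numbers, not a validated framework) of the trust
# threshold: the reliability a product must demonstrate rises with risk,
# and interpretability pulls that requirement back down.

def required_reliability(risk: float, interpretability: float) -> float:
    """Reliability bar for adoption; values above 1.0 mean no reliability is enough."""
    base = 0.5 + 0.6 * risk                     # riskier products demand more reliability
    discount = 0.4 * risk * interpretability    # interpretability shifts the threshold
    return base - discount

examples = [
    ("AI scribe", 0.1, 0.2),
    ("autonomous surgical robot, black box", 0.95, 0.1),
    ("autonomous surgical robot, mechanistically interpretable", 0.95, 0.9),
]
for name, risk, interp in examples:
    bar = required_reliability(risk, interp)
    note = " (effectively unreachable)" if bar > 1.0 else ""
    print(f"{name}: required reliability ~ {bar:.2f}{note}")
```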

At this point a fair question is: what in tarnation is mechanistic interpretability? I’ve found it becomes far more intuitive when you focus on the gaping chasm between perception, i.e. what we think the model is doing, and mechanism, i.e. what it is actually doing under the hood. Take the thinking-out-loud example (known as chain-of-thought) I gave earlier. It increased the perception of interpretability by appearing to show a separate thinking process. Anthropic, in some fascinating mechanistic work (Biology of LLMs), collapsed the complex web of neurons in a language model, Claude (similar to ChatGPT), into simple interpretable circuits and observed that this chain-of-thought did not at all reflect the actual computation inside the model.
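For a flavour of what that kind of work involves - and this is a bare-bones sketch of one common tool in the field, a sparse autoencoder, not Anthropic’s actual method or code - the idea is to decompose a model’s internal activations into a small set of sparsely active features that humans can then label and wire together into circuits:

```python
# A minimal sketch of a sparse autoencoder, one common tool for decomposing a
# network's activations into interpretable features. Everything here is a toy:
# the "activations" are random stand-ins, not captured from a real model.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return features, reconstruction

d_model, d_features = 64, 512
sae = SparseAutoencoder(d_model, d_features)
optimiser = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(1024, d_model)                    # stand-in for captured activations

for _ in range(200):
    features, reconstruction = sae(activations)
    # Reconstruction keeps the features faithful; the L1 penalty keeps them sparse,
    # so each one tends to fire for a narrow, nameable pattern.
    loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

# In real interpretability work, each learned feature is then inspected (which
# inputs light it up?) and chained with others into circuit diagrams like the
# one described below.
```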
Below is one such example, where the left panel shows the chain-of-thought as if you were talking to ChatGPT, and the right panel shows the interpretable circuit derived from the internal activations for the 7th-grade maths problem floor(5*cos(23423)). Here, a human answer of 4 was given. What you see is that the model - with a desire to confirm the incorrect human answer of 4 - realises it needs cos(23423) to be 0.8 and so bullshits the calculation to reach that answer.
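For reference, the honest computation (mine, not part of the Anthropic example) comes out differently:

```python
# Checking floor(5*cos(23423)) directly, rather than working backwards from
# the suggested answer of 4. The angle is in radians, as in the original problem.
import math

value = 5 * math.cos(23423)   # cos(23423) is roughly 0.755, not the ~0.8 the model "needed"
print(value)                  # roughly 3.78
print(math.floor(value))      # 3 - so the suggested answer of 4 was indeed wrong
```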

Had an operator only seen the left-hand side - not unlike how most people use chat models daily - one would deem this highly trustworthy, when the operations were anything but. This motivated reasoning, whereby a model works backwards from a preconceived answer to fabricate a plausible justification, is particularly dangerous. A yes-man co-pilot paradoxically increases your error rate, because you feel even more confident in your incorrect decision when presented with reasoning that is ostensibly first-principled. It’s not difficult to see how this could be a problem in differential diagnosis or clinical management.
As an aside, one could argue that humans - indeed clinicians - are not immune to motivated reasoning either. Sure, there’s a first-principles physiological argument for why cardiac antiarrhythmic drugs work, but most doctors would agree that when they choose to initiate them, they are definitely not thinking about ion channels and the phases of nodal and myocyte depolarisation. However, if pushed for justification, they might reconstruct some ad-hoc reasoning based on a combination of evidence-based guidelines and cardiac electrophysiology to justify a decision already made. That’s the best one can do. It would certainly be infeasible to unpack the activations of every neuron in my colleague’s head to understand why they actually made said decision - an endeavour more akin to mechanistic interpretability.
Whilst these circuit representations are an imperfect attempt at mechanistic transparency, they are the type of work required to earn institutional trust for increasingly complex and risky applications of AI in healthcare - and, by extension, to create valuable products.
What does this mean for AI builders in healthcare?
For builders, the common shibboleth within healthtech - “AI isn’t adopted because it’s a black box” - is unhelpfully vague. It implies that some mechanistic understanding of our models is a prerequisite to adoption, when that isn’t necessarily the bottleneck for all applications of AI in healthcare. For relatively low-risk AI products, the challenge isn’t reverse-engineering the model but rather the far more exciting journey of discovering how to surface enough insight - through a combination of thoughtful product design and good engineering decisions - to create the perception of interpretability. However, when the product carries high risk and success depends on top-down institutional buy-in, with deep, unavoidable hospital and healthcare network integration, inroads towards mechanistic interpretability become essential to earning trust.