Why LLMs Are Hard to Explain — and Why That Matters for Lending Decisions
- Leland Burns & Jim McGuire

Large language models (LLMs) are everywhere right now.
They write emails, summarize documents, draft code, screen résumés, and answer questions with remarkable fluency. It’s not surprising that lenders are asking whether the same tools could be used to make—or support—credit decisions.
At Ensemblex, we’re excited about LLMs. We use them internally, we track their progress closely, and we expect them to influence how analytical work gets done over time.
But we’re also cautious—especially when it comes to underwriting borrowers.
The core issue isn’t performance, speed, or intelligence. It’s explainability—and the lack of it makes LLMs a poor fit for certain high-stakes lending use cases today.
Here’s why.
Not all “AI” is doing the same job
One reason this debate gets muddled is that the term AI gets used very broadly.
In credit modeling, the tools we rely on most—logistic regression, gradient-boosted trees, random forests—are examples of supervised machine learning. They’re trained on structured data, with a clearly defined target:
Given this borrower’s attributes, what is the likelihood they default within a specified time window?
The model’s job is narrow, explicit, and directly tied to the business decision.
LLMs, by contrast, are trained to do something fundamentally different. At a high level, they’re designed to predict what comes next in a sequence of language. They absorb massive amounts of text and learn how to produce responses that sound coherent, relevant, and human-like.
That distinction matters. A supervised credit model is optimized to answer one specific question. An LLM is optimized to produce plausible language across a huge range of contexts.
Those are very different objectives.
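To make the contrast concrete, here is a minimal sketch of the supervised objective. The feature names and weights are invented for illustration—a real scorecard would be fit on historical outcomes—but the shape of the task is accurate: a fixed set of structured inputs mapped to one probability.

```python
import numpy as np

# Illustrative only: hypothetical features and hand-set coefficients for a
# toy credit scorecard. A real model is fit on historical default outcomes.
FEATURES = ["utilization", "delinquencies_24m", "income_log"]
COEF = np.array([2.1, 0.8, -0.6])  # hypothetical weights
INTERCEPT = -3.0

def probability_of_default(x: np.ndarray) -> float:
    """Logistic score: P(default within the window | borrower attributes)."""
    z = INTERCEPT + COEF @ x
    return 1.0 / (1.0 + np.exp(-z))

# One applicant: 55% utilization, one recent delinquency, low income signal.
applicant = np.array([0.55, 1.0, 0.2])
pd_score = probability_of_default(applicant)
```

The model answers exactly one question, and every input it can consult is enumerated up front. An LLM's objective—predict the next token given arbitrary text—has no analogue of this fixed input schema or single target.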
Hallucinations aren’t just a novelty problem
Most people have seen lighthearted examples of LLMs getting things wrong—confidently miscounting letters in a word, inventing facts, or contradicting themselves when pressed.
But similar behavior shows up in much more serious settings.
We’ve seen documented cases where:
- LLMs generated legal citations that didn’t exist
- Academic references were fabricated
- Hiring tools produced opaque and inconsistent evaluations
In all of these cases, the issue wasn’t malicious intent. The model was doing exactly what it was designed to do: generating output that looked right, not output that was provably correct.
That’s manageable when the cost of being wrong is low. It’s far more problematic when the output affects people’s access to credit.
The explainability gap is the real problem
Every predictive model can produce errors. That’s not unique to LLMs.
The difference is what you can do after something looks wrong.
When a supervised credit model produces a counterintuitive decision, we have tools to investigate:
- Which features mattered most?
- How did this applicant differ from the average?
- Did the data drift?
- Is a feature behaving in an unexpected way?
Techniques like SHAP values, monotonic constraints, stability analysis, and feature attribution allow us to trace a decision back to the inputs that drove it. That’s not just helpful for modelers—it’s essential for compliance, governance, and customer explanations.
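What "traceability" means here can be shown in miniature. For a linear model, each feature's contribution relative to a baseline is exactly `coef_i * (x_i - baseline_i)`, and the contributions sum to the score difference with no residual. (SHAP values extend this exact-additivity property to nonlinear models like gradient-boosted trees.) All names and numbers below are hypothetical:

```python
import numpy as np

# Hypothetical linear credit model (weights and baselines invented).
COEF = np.array([2.1, 0.8, -0.6])
INTERCEPT = -3.0
BASELINE = np.array([0.30, 0.2, 0.5])  # e.g., portfolio-average attributes

applicant = np.array([0.55, 1.0, 0.2])

# Additive attribution: for a linear score, feature i contributes exactly
# COEF[i] * (applicant[i] - BASELINE[i]) to the gap versus the baseline.
contributions = COEF * (applicant - BASELINE)

# The per-feature contributions reconstruct the score gap exactly --
# the faithful decomposition that an LLM's self-explanation cannot offer.
score_gap = (INTERCEPT + COEF @ applicant) - (INTERCEPT + COEF @ BASELINE)
assert np.isclose(contributions.sum(), score_gap)
```

The point is not the arithmetic but the guarantee: the explanation is derived from the same parameters that produced the decision, so it cannot silently diverge from what the model actually did.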
With LLMs, that level of traceability doesn’t exist in any practical sense.
You can prompt an LLM to explain itself, but that explanation is just more generated text. It’s not a faithful decomposition of the internal decision process. There’s no clear way to say:
“This specific input caused this specific outcome.”
And there’s no reliable way to fix a specific failure mode without retraining or constraining the entire system in broad, indirect ways.
Why this matters specifically in lending
In consumer lending—especially in regulated markets—models don’t just need to work. They need to be defensible.
Lenders must be able to:
- Explain adverse actions to borrowers
- Demonstrate fairness and consistency
- Show regulators how decisions are made
- Audit and monitor model behavior over time
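The first of those obligations shows why faithful attributions matter operationally: adverse-action reasons are typically generated from the features that pushed a score most toward decline. A minimal sketch, with reason text and contribution values invented for illustration:

```python
# Hypothetical mapping from model features to adverse-action reason text.
REASONS = {
    "utilization": "Proportion of balances to credit limits is too high",
    "delinquencies_24m": "Recent delinquency on one or more accounts",
    "income_log": "Income insufficient for amount of credit requested",
}

def adverse_action_reasons(contributions: dict[str, float], k: int = 2) -> list[str]:
    """Reason text for the k features pushing the score most toward decline."""
    top = sorted(contributions, key=contributions.get, reverse=True)[:k]
    return [REASONS[name] for name in top]

# Per-feature contributions from the model's attribution step (illustrative).
contribs = {"utilization": 0.53, "delinquencies_24m": 0.64, "income_log": 0.18}
reasons = adverse_action_reasons(contribs)
```

This pipeline only works because each contribution is provably tied to the decision. Text generated by prompting an LLM to "explain" its output has no such tie, so it cannot serve as the basis for a compliant adverse-action notice.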
Even modern supervised ML models required a decade of work—across industry, academia, and regulators—before they were accepted for these use cases. The explainability tooling that exists today wasn’t an afterthought; it was a prerequisite.
LLMs don’t yet meet that bar.
That doesn’t mean they’re “bad models.” It means they’re optimized for a different class of problems—ones where ambiguity, creativity, and probabilistic language generation are features, not bugs.
Where LLMs do make sense
None of this is an argument against using LLMs broadly in lending organizations.
They can be extremely effective for:
- Drafting policies and documentation
- Summarizing analyst notes
- Assisting customer support teams
- Exploring data and generating hypotheses
- Improving internal productivity
These are areas where outputs can be reviewed, corrected, and contextualized by humans—and where perfect traceability isn’t required.
But underwriting decisions sit in a different category. They’re high-stakes, regulated, and they directly affect consumers. For those use cases, the lack of explainability isn’t just inconvenient—it’s disqualifying.
Looking ahead
We don’t believe the current state of affairs is permanent.
The ideas behind LLMs—large-scale representation learning, flexible architectures, and powerful pattern recognition—will absolutely influence the future of credit modeling. We expect continued innovation in models designed specifically for structured, tabular decisioning problems.
But today, it’s important not to confuse impressive conversational ability with decision-making suitability.
Understanding where a tool works—and where it doesn’t—is part of responsible model governance.
At Ensemblex, that distinction guides how we advise clients: excited about what’s new, realistic about what’s ready, and focused on using the right tool for the right job.