Why Two Good Credit Models Can Disagree — and Why That’s Not a Problem
- Leland Burns & Jim McGuire
In many of our projects, we’ll build two candidate models that both look strong on paper. Similar AUC. Similar stability. Both clearly predictive.
And yet — when we start comparing scores more closely — they don’t line up.
The disagreement isn’t always dramatic. It doesn’t show up as one model approving and the other declining across the board. It’s subtler than that. Borrowers shift between deciles. Score correlations aren’t as high as expected. ROC curves have similar AUCs but noticeably different shapes.
This usually happens during development — when we’re testing different data sources, target definitions, or modeling approaches. And the first instinct can be: Shouldn’t these models agree more than this?
Not necessarily.
In fact, when two good models disagree, that’s often where the opportunity begins.
First: What Do We Mean by a “Good” Model?
In credit risk, “good” typically means strong rank ordering.
We’re not trying to perfectly predict who will default. We’re trying to separate higher-risk from lower-risk borrowers in a way that supports profitable decisions.
We often use AUC as a summary metric. But “good” is contextual — a 0.65 AUC might be sufficient in one business, while another product demands 0.75+.
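For intuition, here is a minimal sketch on synthetic data showing what rank ordering means in practice: AUC is exactly the probability that a randomly chosen defaulter scores riskier than a randomly chosen non-defaulter. The data and score construction below are illustrative only.

```python
# Toy illustration: AUC as pairwise rank ordering, not prediction accuracy.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, size=5000)  # 1 = default, ~10% bad rate

# An informative but noisy risk score: defaulters score higher on average.
score = np.where(y == 1, rng.normal(1.0, 1.0, y.size), rng.normal(0.0, 1.0, y.size))

auc = roc_auc_score(y, score)

# Fraction of (defaulter, non-defaulter) pairs the score ranks correctly.
pos, neg = score[y == 1], score[y == 0]
concordance = (pos[:, None] > neg[None, :]).mean()

print(f"AUC: {auc:.3f}  pairwise concordance: {concordance:.3f}")  # they match
```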
What matters is that the model:
- Discriminates risk effectively
- Supports approval, pricing, and product decisions
- Improves portfolio economics
Two models can both accomplish that — and still disagree.
What Does Disagreement Actually Look Like?
Disagreement shows up in a few common ways (the first two are quantified in the sketch below):
- Score correlation lower than expected
- Meaningful movement across deciles in swap-set analysis
- Different ROC curve shapes despite similar AUC
- Segment-level performance differences
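Here is a minimal sketch of those first two diagnostics, assuming you already have two scores for the same borrowers on a holdout sample; the synthetic data and variable names are illustrative, not from a real engagement.

```python
# Two disagreement diagnostics: rank correlation and decile swap-set.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 10_000
shared = rng.normal(size=n)                        # risk signal both models see
score_a = shared + rng.normal(scale=0.6, size=n)   # model A's view
score_b = shared + rng.normal(scale=0.6, size=n)   # model B's view

# 1) Rank correlation between the scores (rank order is what matters here).
rho, _ = spearmanr(score_a, score_b)
print(f"Spearman correlation: {rho:.2f}")

# 2) Decile swap-set: how many borrowers land in different risk deciles?
dec_a = pd.qcut(score_a, 10, labels=False)
dec_b = pd.qcut(score_b, 10, labels=False)
print(f"changed decile: {(dec_a != dec_b).mean():.0%}  "
      f"moved 2+ deciles: {(abs(dec_a - dec_b) >= 2).mean():.0%}")
```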
We rarely look at AUC alone. The shape of the ROC curve often reveals more. One model may climb steeply near the origin, excelling at catching the worst risks at low false-positive rates. Another may gain area more gradually, better distinguishing among safer borrowers.
Same AUC. Different behavior.
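To make that concrete, here is a sketch using scikit-learn's standardized partial AUC (the max_fpr argument of roc_auc_score) to compare how much discrimination each model delivers at low false-positive rates, where the worst risks are declined. The two synthetic "models" are constructed so that overall AUC is similar but the shape is not; the construction is an assumption for illustration.

```python
# Similar AUC, different ROC shape, exposed via partial AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.1, size=20_000)

# Model A: extremely sharp on ~30% of defaulters, uninformative on the rest.
score_a = rng.normal(size=y.size) + np.where(
    (y == 1) & (rng.random(y.size) < 0.3), 3.0, 0.0)
# Model B: a uniform, moderate lift across all defaulters.
score_b = rng.normal(size=y.size) + 0.5 * y

for name, s in [("A", score_a), ("B", score_b)]:
    full = roc_auc_score(y, s)
    early = roc_auc_score(y, s, max_fpr=0.1)  # shape near the origin
    print(f"model {name}: AUC {full:.3f}  partial AUC (FPR<=0.1) {early:.3f}")
```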
Why Do Two Good Models Disagree?
Several structural reasons drive this — none of them imply something is broken.
1. Different Training Data
Even modest differences in time window or sample construction can shift learned risk patterns. More recent data alone can materially change rankings.
2. Different Modeling Techniques
A lean logistic scorecard and a gradient boosting model with a broader feature set will capture risk differently. Both can be predictive — but not identical.
3. Different Data Sources
Two bureaus with partial overlap won’t encode the same signals. Traditional bureau data and alternative data naturally capture different dimensions of borrower behavior.
Disagreement is expected.
4. Different Targets
Short-term outcomes (e.g., first payment default) behave differently from longer-term loss measures.
We’ve built separate models optimized for each. Both were strong. Both were meaningfully decorrelated — because they were solving slightly different problems.
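As a concrete illustration, here is a minimal sketch of how two such targets might be defined on the same loan book. The frame and column names are hypothetical placeholders, not a real schema.

```python
# Two defensible "bad" definitions that label some loans differently.
import pandas as pd

loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "first_missed_payment_num": [1, None, 3, None],  # which payment was missed first
    "cum_loss_rate_12m": [0.90, 0.00, 0.40, 0.02],   # loss as share of balance
})

# Short-term target: first payment default.
loans["fpd"] = (loans["first_missed_payment_num"] == 1).astype(int)

# Longer-term target: material loss within 12 months on book.
loans["loss_12m"] = (loans["cum_loss_rate_12m"] >= 0.10).astype(int)

# Loan 3 is good on the short-term target but bad on the long-term one,
# so models trained on each will learn different risk patterns.
print(loans[["loan_id", "fpd", "loss_12m"]])
```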
When Disagreement Is an Opportunity
If two models are nearly perfectly correlated, they’re essentially encoding the same view of risk.
But if they’re predictive and moderately decorrelated, that’s where incremental lift often lives.
Each model is seeing something slightly different.
In those cases, we explore:
- Simple ensembles — averaging or blending scores
- Waterfall approaches — using one model to screen, another to refine
- Segment-specific deployment — applying each model where it performs best
When combined thoughtfully, decorrelated models frequently outperform either model alone, as in the sketch below.
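Here is a minimal sketch of the first two patterns, a weighted blend and a simple waterfall. It assumes higher scores mean higher risk and puts both scores on a common percentile scale before combining; all names, cutoffs, and data are illustrative assumptions, not a prescribed recipe.

```python
# Blending and waterfalling two decorrelated scores.
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def to_percentile(score):
    """Common 0-1 rank scale so the two scores are comparable."""
    return rankdata(score) / len(score)

def blend(score_a, score_b, y_valid, weights=np.linspace(0, 1, 21)):
    """Pick the blend weight that maximizes AUC on a validation sample."""
    pa, pb = to_percentile(score_a), to_percentile(score_b)
    w = max(weights, key=lambda wt: roc_auc_score(y_valid, wt * pa + (1 - wt) * pb))
    return w, w * pa + (1 - w) * pb

def waterfall(score_a, score_b, screen_cutoff=0.9):
    """Model A screens out the riskiest decile; model B ranks the remainder."""
    pa, pb = to_percentile(score_a), to_percentile(score_b)
    return np.where(pa >= screen_cutoff, "decline", "rank_with_b"), pb

# Example on synthetic, partially correlated scores:
rng = np.random.default_rng(3)
y = rng.binomial(1, 0.1, 5000)
a = y + rng.normal(scale=2.0, size=y.size)
b = y + rng.normal(scale=2.0, size=y.size)
w, blended = blend(a, b, y)
print(f"A: {roc_auc_score(y, a):.3f}  B: {roc_auc_score(y, b):.3f}  "
      f"blend (w={w:.2f}): {roc_auc_score(y, blended):.3f}")
```

For brevity the weight here is tuned on the same sample that reports the AUC; in practice it should be chosen on a separate validation set and checked for stability over time.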
Real-World Examples
First Payment Default vs. Long-Term Risk
At a small-dollar lender, we built one model focused on first payment default and another targeting longer-term loss.
Each performed well — but on different subsets of the population.
Combining them improved overall risk assessment by capturing both short- and long-term dynamics.
Non-Overlapping Data Histories
In emerging markets, newer data sources often lack deep history. Rather than discard older bureau data, we build component models on each dataset.
They naturally disagree — because they see different time horizons and features.
Blending them provides better coverage than choosing one.
Inconsistent Bureau Reporting
In some Central American markets, different lenders report to different bureaus. No single dataset is complete.
Models built on each bureau can both work — and both disagree.
Using both is often the most prudent solution.
The Better Question
When two good models disagree, the instinct is to pick a winner.
But the better question is:
What is each model capturing that the other isn’t?
Disagreement can reveal blind spots, segment-specific dynamics, or incremental performance opportunities.
Perfect agreement often means two models are doing the same thing.
Structured disagreement — when both models are predictive and stable — is often where the real lift lives.
At Ensemblex, we don’t just compare AUCs. We study where models diverge, why they diverge, and whether that divergence can be harnessed.
Sometimes disagreement isn’t a warning.
It’s the start of insight.