What Are Leaky Variables and Why They Ruin Credit Models

  • Writer: Leland Burns & Jim McGuire
  • Jul 7
  • 2 min read

What Exactly Is a Leaky Variable?


A leaky variable is any feature in your training data that contains information you won't have at decision time. The most extreme example is using default status to predict default: that produces a perfect model in development with zero real-world utility, because the model simply learns to predict the outcome from itself.


Of course, any serious data scientist would catch an error that massive. But leakage can be subtle:


  • Post-application updates to data that are used to train models (e.g., a bureau snapshot taken weeks after an application).

  • Features that embed future information, such as line amounts or rate changes after repayment performance has been observed.

  • Self-generated features from internal data that accidentally include post-decision performance due to sloppy development work.


These variables give the model a form of hindsight—knowledge it won't have at the moment of decisioning.
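One simple way to catch this hindsight mechanically is to compare each feature's observation timestamp against the decision date. A minimal sketch, assuming a hypothetical applications table where every feature carries an as-of timestamp (the column names here are illustrative, not a real schema):

```python
import pandas as pd

# Hypothetical data: each feature value carries the date it was observed.
applications = pd.DataFrame({
    "app_id": [1, 2, 3],
    "decision_date": pd.to_datetime(["2024-01-10", "2024-01-15", "2024-02-01"]),
    "bureau_score": [700, 640, 580],
    "bureau_score_asof": pd.to_datetime(["2024-01-05", "2024-01-14", "2024-02-20"]),
})

# Any feature observed after the decision date leaks future information.
leaky = applications["bureau_score_asof"] > applications["decision_date"]
print(applications.loc[leaky, "app_id"].tolist())  # [3]
```

In practice not every field comes with an as-of timestamp, which is exactly why the subtler cases above slip through.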


Why This Is Such a Problem


Leaky variables don't just make your model wrong. They make it confidently wrong, with impressive AUC metrics to boot. In production, where those "crystal ball" features aren't available, performance tanks. In credit, where timelines are long and capital exposure is real, this kind of mistake is expensive: decisions based on overfit models lead to elevated defaults and months of missed opportunity.


How to Spot Leaky Variables


Check the timing on everything. For each feature, double- and triple-check whether you'll have that information when making a decision. Be conservative with your snapshot when using external data. For example, offset retro bureau data by at least one month to avoid capturing loans granted during the application month.
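The one-month offset can be made explicit in code. A small sketch of a helper (the function name and convention are illustrative) that maps an application date to the last day of the month one or more months earlier, so the snapshot cannot contain loans opened in the application month:

```python
import pandas as pd

def safe_snapshot_date(application_date: pd.Timestamp,
                       offset_months: int = 1) -> pd.Timestamp:
    """Last day of the month `offset_months` before the application month."""
    month_start = application_date.to_period("M").to_timestamp()
    return month_start - pd.DateOffset(months=offset_months - 1) - pd.Timedelta(days=1)

print(safe_snapshot_date(pd.Timestamp("2024-03-15")))  # 2024-02-29
```

Pinning this logic in one shared helper, rather than ad hoc date arithmetic per feature, makes the snapshot convention auditable.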


Scrutinize internal data. Internal systems often update fields retroactively. Check whether fields like credit line, rate, or balance were set or updated post-decision.
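When the internal system keeps a value history rather than only the current value, you can reconstruct the field as it stood at decision time instead of using the latest (possibly post-decision) value. A sketch, assuming a hypothetical history table with `valid_from` timestamps:

```python
import pandas as pd

# Hypothetical history table: credit line was raised twice after origination.
history = pd.DataFrame({
    "loan_id": [7, 7, 7],
    "credit_line": [3000, 5000, 8000],
    "valid_from": pd.to_datetime(["2023-12-01", "2024-02-01", "2024-07-01"]),
})

def value_as_of(history: pd.DataFrame, loan_id: int, as_of: pd.Timestamp) -> int:
    """Return the credit line value that was in effect on `as_of`."""
    rows = history[(history["loan_id"] == loan_id) & (history["valid_from"] <= as_of)]
    return rows.sort_values("valid_from")["credit_line"].iloc[-1]

print(value_as_of(history, 7, pd.Timestamp("2024-01-15")))  # 3000, not 8000
```

If no history table exists, that absence is itself a warning sign: you cannot prove the field wasn't updated after the decision.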


Question amazing performance. If your model shows an AUC near 1.0, or if a single feature accounts for an overwhelming share of feature importance, pause. Investigate whether that feature encodes future information.
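A cheap screen for this is to score each candidate feature on its own: a single variable with near-perfect AUC is a strong leakage signal. A minimal sketch on synthetic data (the feature names and the 0.95 threshold are illustrative assumptions, not a standard):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)  # synthetic default outcomes
features = {
    "bureau_score": rng.normal(size=1000) + 0.3 * y,                 # plausible signal
    "days_past_due_at_12m": y + rng.normal(scale=0.01, size=1000),   # leaks the outcome
}

aucs = {}
for name, x in features.items():
    auc = roc_auc_score(y, x)
    aucs[name] = max(auc, 1 - auc)  # orientation-agnostic
    flag = "SUSPICIOUS" if aucs[name] > 0.95 else "ok"
    print(f"{name}: AUC={aucs[name]:.3f} ({flag})")
```

The leaky feature scores near 1.0 alone; the legitimate one sits at a believable level. Run the same screen on real candidates before they ever reach the model.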


Review features individually. For every top feature, ask: Does this definition make sense? Could this include future data? Do various metrics and trends align with domain intuition?


Validate with production-like conditions. Where possible, simulate production data conditions in your validation sets, or "shadow score" your new model in production. This helps flag leaky variables, which tend to look quite different under production conditions.
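One way to quantify "looks quite different in production" is a Population Stability Index (PSI) between each feature's development and shadow-scoring distributions. A sketch with synthetic data (the 0.25 threshold is a common rule of thumb, not a law):

```python
import numpy as np

def psi(dev: np.ndarray, prod: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between dev and production samples."""
    edges = np.quantile(dev, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    dev_pct = np.clip(np.histogram(dev, edges)[0] / len(dev), 1e-6, None)
    prod_pct = np.clip(np.histogram(prod, edges)[0] / len(prod), 1e-6, None)
    return float(np.sum((prod_pct - dev_pct) * np.log(prod_pct / dev_pct)))

rng = np.random.default_rng(1)
stable = psi(rng.normal(size=5000), rng.normal(size=5000))
shifted = psi(rng.normal(size=5000), rng.normal(loc=1.0, size=5000))
print(f"stable feature PSI:  {stable:.3f}")   # near 0
print(f"shifted feature PSI: {shifted:.3f}")  # well above the 0.25 rule of thumb
```

A leaky variable is one common cause of a large PSI, though any upstream data change can trigger it, so treat a high value as a prompt to investigate rather than proof of leakage.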


The Bottom Line


Leaky variables seem obvious in theory, but they're easy to miss when you're in the weeds. The models that work best in production are usually the ones with more modest development metrics, built on trustworthy features and thoroughly validated. Catching leaky variables early saves you headaches down the line. If you're concerned about leaky variables in your current modeling pipeline—or want help designing your next model the right way—we’d be happy to help.
