New Research on ML Fairness and Explainability

Leland Burns & Jim McGuire
Sep 7, 2023

Updated: May 21, 2025

FinRegLab, a nonprofit focused on innovations that advance responsibility and inclusiveness in the financial sector, has released the results of a broad research project entitled Explainability and Fairness in Machine Learning for Credit Underwriting (1). Noteworthy for its range of empirical testing and numerous collaborators, including many leading financial technology companies, the paper is a comprehensive review of the tools available to help lenders use ML for credit underwriting responsibly and compliantly, namely by ensuring that ML underwriting models are both explainable and fair.

The stakes are high. ML underwriting is already driving many credit decisions, and that share is growing. With its ability to handle lots of data and give more accurate decisions, ML has the potential to improve business results and expand access to credit. At the same time, “the very quality that fuels ML models’ greater predictive power—their ability to detect more complex data patterns than prior generations of credit algorithms—makes them more difficult to understand and increases concerns that they could exacerbate inequalities and perform poorly in changing data conditions.” (FinRegLab)

FinRegLab’s review concludes that many of the emerging tools for improving accuracy and reducing disparities in ML models show great promise. But despite that promise, there is still no easy answer to these concerns. Critically, FinRegLab found “no ‘one size fits all’ technique or tool that performed the best across all regulatory tasks.” Accordingly, the paper concludes that more industry and regulatory guidance is needed to help stakeholders navigate the available tools and their associated trade-offs.

3 Key Facets of Compliance

While much of FinRegLab’s latest project focuses on fairness and inclusion, its overview helpfully buckets compliance into three areas: adverse action, model risk management, and fair lending.

Adverse Action

Adverse action refers to the regulations that require disclosure of the key factors that lead to either a denial of credit or negative effects on pricing. For traditional regression-based scorecards, the process is relatively simple. Point values from an applicant’s scorecard are compared to some baseline values, and the differences in the applicant’s scores for specific attributes are ranked. These scorecard attributes are mapped to reason statements that group features together and state their meaning in accessible language.
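
The scorecard process described above can be sketched in a few lines of Python. This is a minimal illustration, not a compliance-ready implementation: the attribute names, point values, reason statements, the choice of baseline, and the top_n cutoff are all hypothetical assumptions made for the example.

```python
from typing import Dict, List


def rank_adverse_action_reasons(
    applicant_points: Dict[str, int],
    baseline_points: Dict[str, int],
    reason_statements: Dict[str, str],
    top_n: int = 2,
) -> List[str]:
    """Return reason statements for the attributes that cost the most points."""
    # Shortfall = baseline points minus the applicant's points, per attribute.
    shortfalls = {
        attr: baseline_points[attr] - pts for attr, pts in applicant_points.items()
    }
    # Keep only attributes where the applicant scored below baseline,
    # ranked from largest shortfall to smallest.
    ranked = sorted(
        (attr for attr, gap in shortfalls.items() if gap > 0),
        key=lambda attr: shortfalls[attr],
        reverse=True,
    )
    reasons: List[str] = []
    for attr in ranked:
        statement = reason_statements[attr]
        # Several attributes may map to the same statement; report it once.
        if statement not in reasons:
            reasons.append(statement)
        if len(reasons) == top_n:
            break
    return reasons


# Hypothetical scorecard values, for illustration only.
applicant = {"utilization": 20, "inquiries": 35, "delinquencies": 10, "age_of_file": 40}
baseline = {"utilization": 55, "inquiries": 40, "delinquencies": 60, "age_of_file": 40}
statements = {
    "utilization": "Balances are too high relative to credit limits",
    "inquiries": "Too many recent credit inquiries",
    "delinquencies": "Delinquency on accounts",
    "age_of_file": "Length of credit history is too short",
}

reasons = rank_adverse_action_reasons(applicant, baseline, statements)
print(reasons)
```

With these made-up values, the largest shortfalls are on delinquencies (50 points) and utilization (35 points), so those two statements are returned in that order.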

For ML credit models, coefficients or point values for specific features are not readily available. Moreover, ML models use more data and find interactions in that data that more traditional models miss. ML models therefore require different adverse action methods. Many methods are already in use across the industry. FinRegLab’s research is a valuable, comprehensive comparison of the performance of these tools for different compliance tasks. Our decade of experience supports FinRegLab’s conclusion: the best tool will vary by use case, but the sound performance of popular methods is reassuring and promising.

Model Risk Management

Model Risk Management (“MRM”) is a broader category that refers to the oversight and governance of the entire model lifecycle. Put simply, organizations need to have a deliberate framework that ensures their business fully understands their credit model and that it performs as expected. As with adverse action, explainability is a critical component of MRM, and ML models demand the use of more advanced explainability techniques. That said, much of sound MRM is unchanged by the evolution of ML.

Fair Lending

Of the three areas of compliance framed by FinRegLab, fair lending is unquestionably the one generating the most attention across the lending industry. Fair lending covers both the specific prohibition on using race, gender, or other protected characteristics in underwriting models and restrictions on using superficially neutral features that still lead to disparate impacts in decisions for protected classes. Traditional fair lending compliance has revolved around ensuring that obviously discriminatory features are excluded from your model build, then testing and removing features post hoc. Typically, this testing is grounded in comparing a measure of model performance (for example, Area Under the Receiver Operating Characteristic Curve, or “AUC”) and a measure of disparate impact (for example, Adverse Impact Ratio, or “AIR”, which is the ratio of a protected class’s approval rate to that of a benchmark group).
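
To make the two measures concrete, here is a small, dependency-free Python sketch of AIR and AUC. It is illustrative only: the group labels, decisions, and scores are invented, and we assume here that a label of 1 means the loan was repaid and that a higher score means a higher predicted likelihood of repayment.

```python
from typing import List, Sequence


def adverse_impact_ratio(
    approved: Sequence[int], group: Sequence[str], protected: str, reference: str
) -> float:
    """Approval rate of the protected group divided by that of the reference group."""

    def approval_rate(name: str) -> float:
        decisions = [a for a, g in zip(approved, group) if g == name]
        if not decisions:
            raise ValueError(f"no applicants in group {name!r}")
        return sum(decisions) / len(decisions)

    return approval_rate(protected) / approval_rate(reference)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos: List[float] = [s for s, y in zip(scores, labels) if y == 1]
    neg: List[float] = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented data: four reference-group and four protected-group applicants.
approved = [1, 1, 0, 1, 1, 0, 0, 1]
group = ["reference"] * 4 + ["protected"] * 4
air = adverse_impact_ratio(approved, group, "protected", "reference")

# Invented scores and repayment outcomes for four booked loans.
model_auc = auc([0.9, 0.4, 0.6, 0.3], [1, 1, 0, 0])

print(f"AIR = {air:.3f}, AUC = {model_auc:.2f}")
```

In this toy data the reference group is approved 75% of the time and the protected group 50% of the time, so AIR is about 0.667; three of the four positive-negative score pairs are ordered correctly, so AUC is 0.75. The pairwise AUC loop is fine for a handful of records but is quadratic, so a production analysis would use a library routine instead.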

Naturally, many lenders have been applying this same technique to ML models. But because of the complex structure of ML models, with much longer feature lists and interactions between features, the efficacy of this method is questionable. In our experience, post hoc feature removal on ML models has minimal impact on AIR. This is no surprise given the great care taken to remove potentially problematic features throughout the development process. Indeed, FinRegLab found that post hoc feature removal had minimal impact on the actual treatment of protected classes by ML models while often negatively impacting performance.

An alternative approach for fair lending and the search for less disparate alternatives (“LDAs”) is evolving in the ML space. It takes a proactive approach toward fairness and inclusion, incorporating it into the model development process. Automated platforms scan a wide array of available model features and iterate on potential models with the aim of identifying fairer model versions. According to FinRegLab’s research, this method is far more effective at reducing disparity. However, that does not mean this method is overall a better approach to fair lending compliance. There are significant tensions between this method and modeling best practices which must be considered.
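
As a rough illustration of the final step of such a search, the Python sketch below picks the candidate with the highest AIR among those whose AUC stays within a tolerance of the baseline model. The selection rule, the max_auc_drop tolerance, and all model names and metric values are hypothetical assumptions for this example; they are not drawn from FinRegLab’s paper or from any particular platform, and real searches weigh many more considerations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    auc: float  # model performance
    air: float  # adverse impact ratio (higher means less disparity)


def pick_less_disparate_alternative(
    baseline: Candidate, candidates: List[Candidate], max_auc_drop: float = 0.01
) -> Optional[Candidate]:
    """Return the eligible candidate with the highest AIR, or None if none qualifies."""
    eligible = [
        c
        for c in candidates
        # Must not give up more performance than the tolerance allows...
        if c.auc >= baseline.auc - max_auc_drop
        # ...and must actually reduce disparity relative to the baseline.
        and c.air > baseline.air
    ]
    return max(eligible, key=lambda c: c.air, default=None)


# Invented metrics for a baseline model and three alternatives.
baseline_model = Candidate("baseline", auc=0.780, air=0.82)
alternatives = [
    Candidate("A", auc=0.776, air=0.88),
    Candidate("B", auc=0.760, air=0.93),  # fairest, but loses too much AUC
    Candidate("C", auc=0.779, air=0.85),
]

chosen = pick_less_disparate_alternative(baseline_model, alternatives)
print(chosen)
```

Here candidate B has the best AIR but falls outside the performance tolerance, so candidate A is selected. Tightening max_auc_drop to zero leaves no eligible alternative, which shows how sensitive the outcome is to the chosen trade-off.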

Fair Lending Trade-offs

If an automated search for LDAs is conducted as a post hoc process, for example by a separate compliance team or vendor, then it’s in tension with the fundamentals of MRM. Sound MRM, even for advanced ML models that can handle more data, demands careful stewardship of data: examining feature importance, correlations, special values, stability over time, and monotonicity. In our engagements, tools and algorithms support feature selection, but manual processes and discussions are invaluable.

Removing selected features from your carefully selected list as part of post hoc compliance review is one thing. But pivoting to an LDA discovered through an automated de-biasing search, which may completely overhaul your chosen features, is another matter entirely. It negates much of the work done throughout the development process and could produce a poorly performing model divorced from your business goals.

The obvious way to avoid this dilemma is to incorporate automated de-biasing methods and the search for LDAs into the development process itself. Much of the MRM outlined above could take place in concert with the search for LDAs. But as FinRegLab points out, this would be a fundamental departure from the safeguards lenders have traditionally put into place to avoid disparate treatment in their models. Put simply, a basic way to ensure that data such as race or gender is not blatantly used in underwriting is to wall off demographic data and other personal information from model development. FinRegLab notes: “A threshold question is whether specific de-biasing techniques are permissible under fair lending laws to the extent that they use data about protected class membership in different ways than traditional mitigation approaches.” So, until more regulatory guidance or clarity is forthcoming, lenders using traditional safeguards will have to conduct these LDA de-biasing tests separately from core model development. As a result, the tension between these techniques and more holistic MRM will remain.

No One-Size-Fits-All Approach

FinRegLab’s approach and conclusions resonated with our team at Ensemblex. We helped pioneer the use of ML for credit underwriting over a decade ago. When we put our first ML underwriting model into production, ML in financial services was the domain of a few edgy start-ups. We now see widespread acceptance of ML’s potential in financial services, with techniques for understanding and managing ML growing in tandem. These tools show tremendous promise and are continually improving.

But after deploying ML underwriting models for the past decade across a range of geographies and verticals, one of our key lessons is that there is indeed no “one-size-fits-all” approach to ML, explainability, and fairness. Each business must make the best use of available tools for their specific products, data, and business considerations. If you’d like to understand more about these trade-offs and how to navigate them, give us a call. We’d love to hear from you.

Notes

(1) Unless noted otherwise, all quotations and other references to FinRegLab are drawn from Machine Learning Explainability & Fairness: Insights from Consumer Lending and/or Explainability & Fairness in Machine Learning for Credit Underwriting: Policy & Empirical Findings Overview.