Loan Default Prediction
with Machine Learning

Philippe Lambot (GitHub)

March 27, 2023



This Data Science project consists of building a predictive Machine Learning model on home equity loans. The final results are as follows.


Sensitivity on Validation Set: 87%
Precision on Validation Set: 60%


Executive Summary

Welcome to this Data Science project dedicated to default prediction on home equity loans. The prediction objective is to maximize sensitivity and precision under the double constraint of a minimum sensitivity of 75% and a minimum precision of 50%. After an Exploratory Data Analysis of the label and the predictors, Machine Learning models have been built on the training set and the test set.

Data have been downloaded from this Kaggle repository.

The predictive model has been split in two, due to missing predictor information in 43% of loans in the training set.

On the one hand, the subset with complete information has been dealt with in Machine Learning in a multi-tier process:

  • pre-testing twenty-one algorithms such as eXtreme Gradient Boosting, Stochastic Gradient Boosting, or Monotone Multi-Layer Perceptron Neural Network;
  • shortlisting six of them based on the chosen performance metrics — sensitivity and precision — for the standard probability threshold of 0.5;
  • fine-tuning on a broad range of probability thresholds and selecting three algorithms, namely AdaBoost Classification Trees, Random Forest, and Weighted Subspace Random Forest;
  • combining the three selected algorithms into a model ensembling procedure by majority vote;
  • eventually ensembling the three selected algorithms on the basis of the average probabilities.
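The two ensembling steps can be sketched as follows. This is a minimal Python illustration with made-up probabilities; the project itself is implemented in R, and the three vectors only stand in, by assumption, for the outputs of AdaBoost Classification Trees, Random Forest, and Weighted Subspace Random Forest.

```python
import numpy as np

# Hypothetical per-model default probabilities for five loans (made up).
p_ada = np.array([0.62, 0.10, 0.55, 0.30, 0.81])
p_rf = np.array([0.48, 0.20, 0.70, 0.25, 0.77])
p_wsrf = np.array([0.51, 0.05, 0.60, 0.45, 0.90])

threshold = 0.5  # the probability threshold is tuned separately in the project

# Ensembling by majority vote: each model casts a 0/1 vote at the threshold.
votes = (np.stack([p_ada, p_rf, p_wsrf]) >= threshold).sum(axis=0)
majority_pred = (votes >= 2).astype(int)

# Ensembling on the basis of the average probabilities.
p_avg = (p_ada + p_rf + p_wsrf) / 3
average_pred = (p_avg >= threshold).astype(int)
```

The two schemes can disagree near the threshold, which is why both were tried in the project.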

On the other hand, the subset with incomplete predictor information has been dealt with separately in Machine Learning by utilizing the relation between missing information and default rate, which has been evidenced in the Exploratory Data Analysis.

The whole predictive model could be quite easily generalized and customized, as suggested in the last section on takeaways. These promising avenues are open.

The HTML document presenting this Data Science project and its results is lodged in this GitHub repository.

You are most welcome to contact me. My GitHub account contains contact information.


TAGS: loan default prediction, finance, banking, home equity loan, Machine Learning, classification, missing values, eXtreme Gradient Boosting, AdaBoost Classification Trees, Random Forest, probability threshold, thresholding, model ensembling, precision-recall curve, R, HTML


Business Case

Why?

Why this domain? — Predicting default on loans is an important issue for lenders, and probably even more so for borrowers. Let us briefly recall a few points about home equity loans in particular.

First, lenders are at risk for instance if they offer an amount worth more than 100% of the equity. And indeed some lenders offer home equity loans of an amount worth up to 125% of the equity, as referred to by Adam Barone in Investopedia, in his update from June 28, 2022.

They are also at risk, for instance, if there is a general decrease in real estate prices. In that case, the equity might no longer suffice to cover the loans. Moreover, individual defaults are no longer statistically independent, since multiple borrowers can be hit by the same factor.

Second, borrowers can be at risk as well, especially if they do some reloading. Reloading has been very clearly defined by Adam Barone in Investopedia in the article already referred to above:

‘The main pitfall associated with home equity loans is that they sometimes seem to be an easy solution for a borrower who may have fallen into a perpetual cycle of spending and borrowing, spending and borrowing—all the while sinking deeper into debt. Unfortunately, this scenario is so common that lenders have a term for it: “reloading,” which is basically the habit of taking a loan in order to pay off existing debt and free up additional credit, which the borrower then uses to make additional purchases.’

Third, home equity loans can also be a sensitive matter of economic, financial, and social policy, as could be seen some time ago.

More generally, lending and borrowing are important business and societal issues, and not just in the case of home equity lending in particular.


Why these Machine Learning algorithms? — This project is a classification challenge on structured data. It is an opportunity to use some algorithms that are often dithyrambically praised for classification challenges on structured data, such as eXtreme Gradient Boosting, Stochastic Gradient Boosting, and Random Forest.

Such praise can be read for instance in an article by Vishal Morde in Towards Data Science, published in 2019:

“In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now.”

“Since its introduction, this algorithm [XGBoost] has not only been credited with winning numerous Kaggle competitions but also for being the driving force under the hood for several cutting-edge industry applications.”

It is an opportunity to try this algorithm among others, even if, of course, competition ranking is not the only criterion in Data Science.


Why this dataset? — The chosen dataset has a flavor of professionalism: indeed, the predictors from this dataset are relevant from a professional point of view, even if the very list of predictors seems far from complete and even if the documentation included is limited.

From a Data Science point of view, this dataset represents a challenge in terms of data profiling and data preparation since there are numerous missing values as can be read in the Kaggle repository. This is an interesting feature since it is a trait common to many real-world Data Science projects, where data preparation is often time-consuming. An article published on ProjectPro and updated on February 2, 2023 reads:

‘Steve Lohr of The New York Times said: “Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”’


What?

Predictive approach — The current project is essentially Data Analytics, with Machine Learning algorithms being used to optimize performance metrics in loan default prediction.


Descriptive approach — In the current project, there is also Data Analysis, with an important descriptive approach, especially in data profiling and Exploratory Data Analysis, which are often essential to efficient prediction. In Exploratory Data Analysis, some essential insights will be retrieved about frequencies of default according to category. This will also pave the way for enriching the set of predictors with composite predictors. Default frequencies when information is missing will also be examined, looking for more impactful use of this type of information.


Prescriptive approach — Third, there might be a somewhat prescriptive message as well although this is not the main objective.


Explicative approach — Last, there are some explicative considerations. Since domain knowledge is an essential part of the job in real-world projects, some domain considerations will also be expressed, marginally though. There is an interpretative aspect about Data Science as well. But there is no impact quantification of each predictor as this could be the case with logistic or linear regressions where the influence of each predictor would be quantified and its statistical significance would be tested.


How?

How will prediction performance be evaluated? — As stated on the Kaggle website, in the dataset under review, defaults are a minority case among loans, more specifically 20%.

In the event of a clear imbalance between the positive and negative classes, accuracy — the percentage of correctly predicted outcomes — might not be the most appropriate performance metric. Indeed, a base model — predicting, in this project, no default for any borrower — would already produce an accuracy of 80%, but sensitivity — the percentage of defaults correctly predicted — would be zero percent!

Consequently, let us turn to a combination of two other performance metrics: sensitivity — also called recall — and precision — also called positive predictive value. Precision is the percentage of correctly predicted defaults with respect to the total number of predicted defaults.
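The contrast between accuracy and the two chosen metrics can be sketched on a toy example mirroring the 20% default rate; the following Python fragment (the project itself uses R) computes sensitivity and precision from first principles.

```python
import numpy as np

# Toy labels: 20 defaults out of 100 loans, mirroring the class imbalance.
y_true = np.array([1] * 20 + [0] * 80)

# Base model: predict "no default" for every borrower.
y_base = np.zeros(100, dtype=int)

accuracy = (y_base == y_true).mean()  # 80% accurate, yet useless

def sensitivity(y_true, y_pred):
    # Share of actual defaults that are correctly predicted (recall).
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return tp / (tp + fn)

def precision(y_true, y_pred):
    # Share of predicted defaults that are actual defaults.
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    return tp / (tp + fp) if (tp + fp) else 0.0

base_sens = sensitivity(y_true, y_base)  # no default is ever caught
```

The base model reaches 80% accuracy while its sensitivity is exactly zero, which is why accuracy is disregarded here.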

Setting targets in advance for sensitivity and precision is pretty hard for at least two reasons. First, data have not been analyzed yet. Second, there is no loan manager to tell us what trade-off they wish between sensitivity and precision: are they strongly risk-averse, or rather tempted to minimize the number of rejected loan requests? It will be assumed that they are risk-averse: priority will be given to sensitivity. Thus, the objective will be to maximize sensitivity and precision, but with a minimum set at 75% for sensitivity and 50% for precision.
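Under such a double constraint, a natural procedure is to sweep the probability threshold, keep the feasible values, and retain the best one. The Python sketch below uses made-up labels and probabilities and an assumed selection rule (maximizing the sum of the two metrics); the project's own threshold tuning is done in R and may weigh the metrics differently.

```python
import numpy as np

# Toy example: 4 defaults out of 16 loans, with hypothetical
# predicted default probabilities (all numbers are made up).
y_true = np.array([1, 1, 1, 1] + [0] * 12)
p = np.array([0.9, 0.8, 0.7, 0.4,
              0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05, 0.0, 0.0])

def sens_prec(y_true, y_pred):
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    return tp / (tp + fn), prec

# Keep the thresholds meeting the double constraint (sensitivity >= 75%,
# precision >= 50%), then retain the one maximizing their sum.
feasible = []
for t in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    s, pr = sens_prec(y_true, (p >= t).astype(int))
    if s >= 0.75 and pr >= 0.50:
        feasible.append((t, s, pr))

best = max(feasible, key=lambda x: x[1] + x[2])
```

Lowering the threshold raises sensitivity at the expense of precision, which is exactly the trade-off the double constraint arbitrates.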


How will data be prepared and treated? — In light of the many missing values, data profiling, data preparation, and Exploratory Data Analysis are essential. In particular, a lot of time will be spent on Exploratory Data Analysis, as it often informs Data Science and Machine Learning processes in critical ways.


Who?

In this project, I am excited about using my dual professional experience in finance and data treatment, as well as the outstanding hands-on experience provided by Data Science MOOCs delivered on the edX platform and on Udemy. I especially remember my first two MOOCs.

All possible shortcomings in my application of these and other MOOCs are — of course — mine.

My GitHub account contains Data Science and Data Analysis projects — among others this Data Science project — and contact information. You are welcome to get in touch.

Let me also thank my friend Richard Careaga, former Associate General Counsel, JPMorgan Chase, N.A., for indefatigable questioning and discussion that were both valuable and invaluable from a domain point of view and from a Data Science standpoint. Interaction was most inspiring and supportive.

I also thank Cybernetic’s drawing function for confusion matrices, published on Stack Overflow; I have used it twice in this project.


Where?

This Data Science project — as well as some other Data Science or Data Analysis projects of mine — can be found on my GitHub account. This account also contains contact information. You are most welcome to get in touch.


Data Profiling

Data Source & Type

On the Kaggle site, there is a description of the dataset. Here is part of it:

“The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%).”

This seems to relate to real data about granted home equity loans.

There is also a reference to the Equal Credit Opportunity Act. Wikipedia, consulted on March 19, 2023, reads:

“The Equal Credit Opportunity Act (ECOA) is a United States law (codified at 15 U.S.C. § 1691 et seq.), enacted 28 October 1974,[1] that makes it unlawful for any creditor to discriminate against any applicant, with respect to any aspect of a credit transaction, on the basis of race, color, religion, national origin, sex, marital status, or age (provided the applicant has the capacity to contract) […]”

Consequently, it can be reasonably assumed that the dataset under review relates to the United States.


Data Content & Subsetting

Let us have a look at the list of variables — label and predictors — with the short name and the content of each variable.


Short Name            Content of Label and Predictors
y                     “Default” or “Repaid”
loan                  Amount of the loan request
mort_due              Mortgage balance due (amount due on the main mortgage)
prop_val              Property current market value
reason                DebtCon = debt consolidation; HomeImp = home improvement
job                   Six occupational categories
job_years             Years at present job
derog                 Number of major derogatory reports
delinq                Number of delinquent credit lines
oldest_trade          Age of oldest trade line in months
recent_cred           Number of recent credit lines
credits               Number of credit lines
debt_to_inc           Debt-to-income ratio

The label is binary: “Default” or “Repaid”. There are twelve predictors, two of them being categorical variables and the other ten being numerical values. Let us have a look at the first rows from the dataset.


y loan mort_due prop_val reason job job_years derog delinq oldest_trade recent_cred credits debt_to_inc
Default 1100 25860 39025 HomeImp Other 10.5 0 0 94.36667 1 9 NA
Default 1300 70053 68400 HomeImp Other 7.0 0 2 121.83333 0 14 NA
Default 1500 13500 16700 HomeImp Other 4.0 0 0 149.46667 1 10 NA
Default 1500 NA NA NA NA NA NA NA NA NA NA NA
Repaid 1700 97800 112000 HomeImp Office 3.0 0 0 93.33333 0 14 NA
Default 1700 30548 40320 HomeImp Other 9.0 0 0 101.46600 1 8 37.11361

A few statements can be made readily, and among them:

  • there are numerous missing values;
  • in the factors — that is to say the attributes “reason” and “job” — missing values are not marked as “NA”.

Tackling the second challenge is no problem: it suffices to replace empty spaces with NAs to indicate that these values are actually missing. The first challenge will be addressed later, in the Exploratory Data Analysis and the subsequent data preparation.
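In pandas-flavored pseudocode, that replacement is a one-liner; the sketch below uses a made-up toy frame (the project itself performs this step in R).

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the two factors, with empty strings standing
# for unmarked missing values (the rows are made up).
df = pd.DataFrame({"reason": ["DebtCon", "", "HomeImp", ""],
                   "job": ["Other", "Office", "", "Mgr"]})

# Replace empty strings with proper missing-value markers in the two factors.
df[["reason", "job"]] = df[["reason", "job"]].replace("", np.nan)
```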

The first step now is to randomly divide the data into training, test, and validation sets. One-third will be dedicated to the validation set, and the remaining two-thirds will again be split into one-third for the test set and two-thirds for the training set.
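A sketch of such a split on shuffled indices follows; the seed, the integer rounding, and the use of Python are assumptions of this illustration, so the exact set sizes in the project may differ slightly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5960                       # number of loans in the dataset
idx = rng.permutation(n)       # shuffle row indices once

# One-third for validation; the remaining two-thirds are split again
# into one-third test and two-thirds training.
n_valid = n // 3
valid_idx = idx[:n_valid]
rest = idx[n_valid:]
n_test = len(rest) // 3
test_idx = rest[:n_test]
train_idx = rest[n_test:]
```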


EDA on Label & Predictors

Exploratory Data Analysis is an important — and often decisive — part of Data Science. This section deals with Exploratory Data Analysis on the label and the predictors — whether included in the dataset under analysis or created by assembling. The next section will deal with Exploratory Data Analysis on the numerous missing values.


Label

There is no missing value in the label.

As already explained here, the positive class — loan default — is a minority. Consequently, when formulating the prediction objective, accuracy has been disregarded as a prediction performance metric in favor of sensitivity and precision. So, the objective is maximizing sensitivity and precision under the double constraint of a minimum of 75% for sensitivity and 50% for precision.


Loan Amount

Let us split the default rate according to quartiles of loan amount.



There is a significant difference in the average default rate between the first interval and the other intervals, all the more so for the third and fourth intervals.

The graph above does not show any missing value in the loan amount.

From an arithmetical point of view, some rounding of the quartiles has been performed for clarity of the graph and table. If this affected the quartile distribution in a significant way, it would generate substantial disparity among the numbers of loans by interval in the table below — which is obviously not the case.


Loan Amount Interval Loans Count Average Default Rate
[1300 — 11000] 678 29 %
(11000 — 16300] 654 19 %
(16300 — 23000] 655 16 %
(23000 — 89900] 661 16 %
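The quartile-based tables of this section all follow the same recipe: bin a numeric predictor by quartile, then average the default flag per bin. A Python sketch on made-up data (the project itself does this in R, and the figures below are not the real ones):

```python
import pandas as pd
import numpy as np

# Made-up loan amounts and default flags; in the real dataset the
# first quartile defaults noticeably more often than the others.
rng = np.random.default_rng(1)
loan = rng.integers(1300, 89900, size=400)
default = (rng.random(400) < np.where(loan < 11000, 0.29, 0.17)).astype(int)

df = pd.DataFrame({"loan": loan, "default": default})

# Bin by quartile, then compute loan count and average default rate per bin.
df["interval"] = pd.qcut(df["loan"], q=4)
rates = df.groupby("interval", observed=True)["default"].agg(["size", "mean"])
```

The same groupby pattern, applied predictor by predictor, generates every table in this Exploratory Data Analysis.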


Mortgage Balance Due

The mortgage balance due is the amount still due on the main mortgage — also known as the first mortgage.

It is an essential piece of information since the difference between the current market value of the property and the mortgage balance due determines the collateral left to back additional borrowing.

More information can be found in Julia Kagan’s article in Investopedia, consulted in its February 13, 2023 update.

Let us have a look at a quartile-based graph.



There is a substantial difference in average default rate between the first and the third interval. The next table expresses information in numbers.


Interval of
Mortgage Balance Due
Loans Count Average Default Rate
[2619 — 46727] 602 25 %
(46727 — 65000] 603 19 %
(65000 — 91170] 601 16 %
(91170 — 399550] 602 18 %
NAs 240 22 %


Property Market Value

The current market value of the property is the third predictor. Let us have a look at a quartile-based graph.

It goes without saying that “current” refers to the moment at which the loans were assessed. This clarification will not be repeated later.



Where information is available, there is some difference in average default rate especially between the first and the third interval.

But the most striking statement is — by far — the impressive average default rate where information is missing about the current market value of the property. But how big is the subgroup of loans without information about the current market value of the property? The next table tells us.


Property Value Interval Loans Count Average Default Rate
[8000 — 66206] 651 25 %
(66206 — 88844] 650 16 %
(88844 — 119561] 650 14 %
(119561 — 854114] 651 19 %
NAs 46 91 %

The average default rate is 25% in the first interval and 14% in the third interval.

But much more extreme: among the 46 loans without information about property value, the average default rate jumps to 91%! In other words, 42 loans out of 46 have defaulted.

This predictor seems rather promising to discriminate between quartiles, but it is much more promising for missing values: the absence of property value information is a clear indicator of default! Of course, the percentage of 91 is calculated on only 46 loans: from a statistical point of view, standard error can be rather important. Nevertheless, it seems worthwhile to use this piece of information to predict default in case of missing information on the current market value of the property! Although limited in occurrence, this finding might upgrade both sensitivity and/or precision.
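Turned into a rule, this finding amounts to overriding the model wherever the property value is missing. The following Python fragment is only a sketch of that idea on a made-up frame; how exactly the project combines the rule with the models is described later in the document.

```python
import pandas as pd
import numpy as np

# Toy frame: property values (some missing) and a hypothetical model output.
df = pd.DataFrame({"prop_val": [39025.0, np.nan, 68400.0, np.nan],
                   "model_pred": ["Repaid", "Repaid", "Default", "Repaid"]})

# Rule suggested by the EDA: when the property value is missing,
# predict "Default" regardless of the model's output.
df["final_pred"] = np.where(df["prop_val"].isna(), "Default", df["model_pred"])
```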


Added: Collateral

Let us evaluate the predictive power of the collateral amount. Actually, the collateral amount is not included in the dataset under review. But it can be built up as the difference between, on the one hand, the current market value of the property, and, on the other hand, the total of the mortgage balance due and the loan amount. Then, default rate can be computed by collateral quartile.
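The construction of this composite predictor is a simple column arithmetic, sketched here in Python on three made-up rows (the project builds it in R); note that the collateral is automatically missing whenever any component is missing.

```python
import pandas as pd
import numpy as np

# Made-up rows with the three components of the collateral.
df = pd.DataFrame({"prop_val": [39025, 68400, np.nan],
                   "mort_due": [25860, 70053, 13500],
                   "loan":     [1100, 1300, 1500]})

# Collateral = property value minus (mortgage balance due + loan amount).
# NaN propagates: a missing component yields a missing collateral.
df["collateral"] = df["prop_val"] - (df["mort_due"] + df["loan"])
```

A negative value means the loan is under-collateralized, a case discussed below.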



Firstly, the lowest average default rate relates to under-collateralized loans — that is to say loans with “negative collateral” or in other words with property current market value lower than the sum of the mortgage balance due and the loan amount.

One explanation could be that lenders are particularly cautious under such circumstances: from a domain standpoint, it may make sense to extend a loan that is under-collateralized (except in a few states with so-called “anti-deficiency laws”) if the lender is extending a “character loan” on the basis of good credit history, capacity to pay, and liquid assets that could be recovered in a lawsuit.

Secondly, loans without information about collateral have a rather high average default rate. Of course, we should know the number of loans without information about collateral in order to check statistical representativeness. It is provided in the next table.


Collateral Interval Loans Count Average Default Rate
[-233642 — 2870] 595 14 %
(2870 — 8584] 593 23 %
(8584 — 17718] 594 19 %
(17718 — 267300] 594 19 %
NAs 272 31 %

The number of loans without information about collateral is 272, which provides statistical representativeness.

Since collateral seems to have some predictive power, this predictor will be added to the training, test, and validation sets.


Tested: Loan-to-Collateral Ratio

In the section above, a composite predictor has been added — namely the collateral available for the home equity loan after deducting the mortgage balance still due from the current market value of the property.

Other composite predictors — in the form of ratios — will be analyzed. The loan-to-collateral ratio is the first one.



In the graph above, the subgroup of observations with missing information for the loan-to-collateral ratio has not been reported, for two reasons. Firstly, it would provide redundant information compared to the information already provided by its components. Secondly, it would reduce the scale on the y-axis and therefore the visibility of the quartile distribution.

The graph above shows little variation in average default rate among quartiles, which does not look very promising in terms of predictive power.

This impression has been confirmed by running predictive models on the training set and on the test set. Adding the loan-to-collateral ratio has not boosted performance metrics — if anything, the contrary.

Consequently, this candidate predictor is dropped.


Added: Loan-to-Property Ratio

Let us visualize the quartile distribution of default rate based on the loan-to-property ratio. It is a ratio of the loan amount to the current market value of the property.



In the graph above, the subgroup of observations with missing information for the loan-to-property ratio has not been reported, for two reasons. On the one hand, it would provide redundant information compared to the information already provided by the predictor on the value of the property. On the other hand, it would reduce the scale on the y-axis and therefore the visibility of the quartile distribution.

There is some variation in default rate. This is converted to numbers in the table below.


Interval of the
Loan-to-Property Ratio
Loans Count Average Default Rate
[0 — 0.123785] 651 22 %
(0.123785 — 0.16985] 650 17 %
(0.16985 — 0.241611] 650 20 %
(0.241611 — 3] 651 16 %

The table above shows some variation in average default rate by quartile. The variation is of an order of magnitude comparable to that recorded for the predictors linked to the reason for the loan, the number of years in the current professional activity, and the number of credits.

This candidate predictor and the next one — the collateral-to-property ratio — have been tested when running predictive models on the training set and on the test set. Adding them has contributed to prediction performance as measured by sensitivity and precision. Consequently, these candidate predictors have been validated and added to the training, test, and validation sets.

The following section presents the ratio between the collateral amount and the current market value of the property.


Added: Collateral-to-Property Ratio

Let us visualize the quartile distribution of default rate based on the collateral-to-property ratio. It is a ratio of the collateral left available — after deducting the mortgage balance due and the loan amount from the property market value — to the property market value.



In the graph above, the subgroup with missing information for the collateral-to-property ratio has not been reported. Indeed, it would provide redundant information compared to the information already provided, and it would reduce the scale on the y-axis and therefore the visibility of the quartile distribution.

There is some variation in default rate. This is presented in numerical form in the table below.


Interval of the
Collateral-to-Property Ratio
Loans Count Average Default Rate
[-9 — 0.0341609] 594 14 %
(0.0341609 — 0.0916691] 594 17 %
(0.0916691 — 0.153883] 594 22 %
(0.153883 — 1] 594 22 %

The table above shows some variation in average default rate by quartile.

As explained in the section above, the loan-to-property ratio and the collateral-to-property ratio have been tested when running predictive models on the training set and on the test set. Adding them has contributed to prediction performance as measured by sensitivity and precision. Consequently, these candidate predictors have been validated and added to the training, test, and validation sets.


Reason for Loan Request

In this predictor, there are three cases:

  • debt consolidation,
  • home improvement,
  • not available.



The average default rate varies in a limited way across categories, somewhat more when the subgroup with missing information is taken into account.

The next table gives the average default percentages.


Loan Reason Loans Count Average Default Rate
Debt Consolidation 1 732 18 %
Home Improvement 816 24 %
NAs 100 15 %


Professional Occupation


The average default rate fluctuates substantially by occupational category. Average default percentages are available in the next table.


Professional Occupation Loans Count Average Default Rate
Mgr 336 22 %
Office 395 14 %
Other 1078 23 %
ProfExe 568 17 %
Sales 48 42 %
Self 103 31 %
NAs 120 5 %

Where information is available about professional occupation, the lowest average default rate is noticed for office professional occupations and the highest average default rate for sales.

But where information is lacking, the average default rate is much lower, falling to 5%!

From a domain point of view, no explanation is available on the Kaggle website; could it come from special attention having been paid to borrowers without reported professional occupation?

From a Data Science point of view, this is most interesting: when information is missing about professional occupation, “Repaid” could be predicted for all loans, on condition, of course, that the 5% default rate has some statistical representativeness — which can be assumed with 120 observations — and also on condition that other predictors do not bring nuances.

Some predictive impact is expected from this predictor.


Years in Current Job


Average default rate varies rather slightly according to the number of years in the current professional occupation. There is a slight downward trend, which makes sense from a domain perspective.

In case of missing values, there is some interesting information: the average default rate is only 12%, as shown in the table below. With 224 observations, a decent level of statistical representativeness is provided!


Number of Years
in Current Job
Loans Count Average Default Rate
[0 — 3] 705 24 %
(3 — 7] 580 23 %
(7 — 13] 596 18 %
(13 — 41] 543 17 %
NAs 224 12 %


Number of Major Derogatory Reports

Let us evaluate the predictive power of the number of major derogatory reports.



There is vast variation in average default rate according to the number of major derogatory reports.

In ascending order of the number of major derogatory reports, the average default rate follows an upward trend for one to four reports, before plummeting for five and six reports and then peaking at 100% for more than six reports — which happens in ten cases.

This looks like a solid candidate for predicting default, since percentage differences are large — even without taking into account the exceptional rate of 100% — and since the numbers of loans involved in these large percentage differences are substantial. Indeed, as shown in the following table, the average default rate is 16% for the subgroup of 2,001 loans with zero report, 39% for 207 loans with one report, 54% for 72 loans with two reports, and 77% for 31 loans with three reports.

Where there is no information available about the number of major derogatory reports, the default rate is — surprisingly enough — the lowest, at 12%, even lower than where there is no report at all! It is all the more surprising as this very low percentage has statistical representativeness, with 308 cases as indicated in the next table. This very low default rate can be a strong predictor of “Repaid” in the event of a missing value!


Number of
Major Derogatory Reports
Loans Count Average Default Rate
0 2 001 16 %
1 207 39 %
2 72 54 %
3 31 77 %
4 6 83 %
5 8 62 %
6 5 40 %
7 4 100 %
8 3 100 %
9 1 100 %
10 2 100 %
NA 308 12 %


Number of Delinquent Lines


The graph above shows some similarities with the previous one about the number of major derogatory reports. In the case of delinquent lines, we can observe that

  • there is an upward trend in default rate across loans ranked according to the number of delinquent lines;
  • the average default percentage culminates at 100% for more than five delinquent lines;
  • when information about the number of delinquent lines is missing, default rate is low — 13% as indicated in the next table.

This looks promising in terms of default prediction.


Number of
Delinquent Lines
Loans Count Average Default Rate
0 1 836 14 %
1 299 35 %
2 122 42 %
3 63 59 %
4 30 50 %
5 17 88 %
6 10 100 %
7 7 100 %
8 1 100 %
10 1 100 %
11 1 100 %
15 1 100 %
NA 260 13 %


Age of Oldest Trade Line

This predictor represents the age of the borrower’s oldest credit line in months.



There is a substantial difference in average default rate between the first and the fourth interval: as displayed in the next table, in the first interval average default rate is 29% while it falls to 11% in the fourth interval. It looks like a rather helpful predictor of loan default.


Number of Months of
Oldest Trade Line
Loans Count Average Default Rate
[0 — 115] 627 29 %
(115 — 173] 635 25 %
(173 — 230] 616 14 %
(230 — 650] 633 11 %
NAs 137 25 %


Number of Recent Credits

Let us evaluate the predictive power of the number of recent credits.



There is much variation in the average default rate by number of recent credits, even if the 100% rate is ignored because it only relates to two observations! This predictor could be effective.

On the subset without any information about the number of recent credits, the average default rate is very low — namely 14% as indicated in the next table. This is a powerful predictor of “Repaid” in case of missing information, relating to 236 cases in the training set.


Number of
Recent Credits
Loans Count Average Default Rate
0 1119 15 %
1 580 20 %
2 354 25 %
3 185 28 %
4 67 43 %
5 33 48 %
6 27 59 %
7 13 31 %
8 10 60 %
9 4 25 %
10 14 29 %
11 4 25 %
13 2 100 %
NA 236 14 %


Number of Credits

Can the number of credits be indicative of repayment prospects?



Average default rate varies in a rather limited way. Numerical quantification is provided in the next table.


Number of Credits Loans Count Average Default Rate
[0 — 15] 729 23 %
(15 — 20] 569 17 %
(20 — 26] 619 16 %
(26 — 65] 633 23 %
NAs 98 21 %


Debt-to-Income Ratio

The debt-to-income ratio is the final candidate predictor in this Data Science project on diagnosing home equity loans for default risk.



In the fourth interval, the average default rate is 15%, against 5 to 8% in the other intervals. This is not at all surprising.

But the most striking information, by far, is the average default rate of 61% in the event of missing information on the debt-to-income ratio, especially since the subgroup without such information contains 576 cases.


Interval of
Debt-to-Income-Ratio
Loans Count Average Default Rate
[0 — 29] 514 6 %
(29 — 35] 553 5 %
(35 — 39] 487 8 %
(39 — 204] 518 15 %
NAs 576 61 %


Insights about Predictors

In the Exploratory Data Analysis about label and predictors, categories have been associated with average default rates.

In some predictors, default rates can vary substantially by category. In the case of professional occupation, average default rates by category vary from 5% to 42%. In the case of the number of major derogatory reports, variation is between 12% and 100% — the latter concerning the ten loans with more than six reports. Concerning the number of delinquent lines, variation is between 13% and 100% when there are more than five delinquent lines — which happens in twenty-one cases in the training set.

A priori, larger variation can boost the predictive power of a predictor, if, of course, the number of observations by category permits statistical representativeness.


EDA on Incomplete Information

Insights have already been extracted, predictor by predictor, about missing values.

Let us now get a more global picture of incomplete information all over the training set and quantify the magnitude of the challenge.


Global Inventory

The global inventory table below is based on the version of the training set without the addition of the three composite predictors — collateral, loan-to-property ratio, and collateral-to-property ratio. Otherwise some missing values would be counted more than once.


Cells without Information  Cells with Information  Loans with Partial Information  Loans with Full Information
2,345  32,079  1,141  1,507
7 %  93 %  43 %  57 %

As indicated in the table above, 7% of the cells in the training set contain no information. This affects 43% of loans.

These counts and percentages are rather impressive. To better evaluate the possible impact on predictive power, let us localize the missing values by attribute: some predictors may be less pivotal than others, and missing values in those fields may consequently matter less. The next table shows the number and the percentage of missing values by attribute.


Short Name of
Predictor
Number of
Missing Values
Percentage of
Missing Values
y 0 0 %
loan 0 0 %
mort_due 240 9 %
prop_val 46 2 %
reason 100 4 %
job 120 5 %
job_years 224 8 %
derog 308 12 %
delinq 260 10 %
oldest_trade 137 5 %
recent_cred 236 9 %
credits 98 4 %
debt_to_inc 576 22 %
collateral 272 10 %
loan_to_property 46 2 %
collateral_to_property 272 10 %
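An inventory like the table above is nearly a one-liner in most data stacks. Here is a minimal Python/pandas sketch on a toy frame, reusing some of the table's short names; the real schema is an assumption, and the project itself is in R.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the training set; column names follow the
# table's short names (an assumption about the real schema).
df = pd.DataFrame({
    "loan":        [1100, 1300, 1500, 2000],
    "mort_due":    [25860.0, np.nan, 13500.0, np.nan],
    "debt_to_inc": [np.nan, 37.1, np.nan, np.nan],
})

# Number and percentage of missing values by attribute.
na_count = df.isna().sum()
na_pct = (100 * na_count / len(df)).round(0)
inventory = pd.DataFrame({"missing": na_count, "pct": na_pct})
print(inventory)
```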


Missing information about the debt-to-income ratio, which occurs in 22% of loans, should be considered a red flag. The same holds for the mortgage balance due, missing in 9% of cases, since it seems essential when granting a home equity loan. The same again applies to several other predictors: property value (missing in 2% of cases), the number of major derogatory reports (12%), the number of delinquent lines (10%), the number of recent credit lines (9%), and the number of credit lines (4%).

Missing values in these predictors should all be considered red flags, since these predictors carry valuable and decisive information.

The missing values are not concentrated on a limited number of loans. On the contrary, the second-to-last table showed that 43% of loans are affected.

Concomitance between missing information and the default rate would hardly be surprising! Let us have a look.


Missing Information/Default

Let us check for possible concomitance between missing information and the default rate. Of course, it is not about causation: concomitance does not imply causation.

The default rate on loans with full information and the default rate on loans with partial information are both reported in the table below.


Default Rate
on Training Set
Default Rate
if Full Information
Default Rate
if Partial Information
20 % 8 % 35 %

As it appears in the table above, the global default rate is 20%, but it is only 8% on loans with complete information and it jumps to 35% on loans with incomplete information.
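These three default rates can be derived by grouping loans on the presence of at least one missing predictor value. A hedged Python/pandas sketch on toy data (the project itself is in R):

```python
import pandas as pd
import numpy as np

# Toy data: 'y' is the default label; any NaN in a predictor marks partial information.
df = pd.DataFrame({
    "y":        [1, 0, 0, 1, 0, 0],
    "mort_due": [np.nan, 1.0, 2.0, np.nan, 3.0, np.nan],
})

partial = df.drop(columns="y").isna().any(axis=1)
global_rate  = df["y"].mean()
partial_rate = df.loc[partial, "y"].mean()
full_rate    = df.loc[~partial, "y"].mean()
print(global_rate, partial_rate, full_rate)
```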


Way Forward in Data Science

In the dataset under review, missing information looks like a rather powerful predictor of default! This has been indicated globally in the table just above and for some predictors taken individually in the Exploratory Data Analysis on label and predictors.

In Data Science, the challenge is twofold:

  • utilizing the predictive value of missing information
  • and at the same time providing a technical solution for empty cells in Machine Learning algorithms.

Should an imputation method based on medians or averages (means) be used? First, missing values seem to have some predictive power for default, which might disappear if they were replaced with imputed values. Second, with 2,345 missing values in the training set and 1,141 rows affected, imputing medians or averages seems risky, even reckless, especially so in finance!

Consequently, a completely different avenue of research will be investigated:

  • splitting the dataset into a subset with all rows containing full information and a second subset with all rows missing at least one piece of information;
  • wrangling data in a completely separate way in the subset with incomplete information, focusing on extracting the predictive power of missing information.
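The splitting step above can be sketched as follows; this is illustrative Python/pandas with hypothetical column names, while the actual project code is in R.

```python
import pandas as pd
import numpy as np

# Toy frame; 'y' is the label, other columns are predictors (names assumed).
df = pd.DataFrame({
    "y":    [1, 0, 1, 0],
    "loan": [1100, 1300, 1500, 2000],
    "job":  ["Office", np.nan, "Mgr", "Sales"],
})

# Split on the presence of at least one missing predictor value.
has_gap = df.drop(columns="y").isna().any(axis=1)
df_partial = df[has_gap]    # at least one missing value
df_full    = df[~has_gap]   # complete information
print(len(df_full), len(df_partial))
```

The same mask, applied to the test and validation sets, yields the parallel subsets used later in the report.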


ML on Complete Information

So, the training set — as well as the test set and the validation set — will be split into two parts:

  • a training subset with full information,
  • a training subset with partial information.

In this section, we are dealing with the training subset with full information.


ML on Training Set

This is the Machine Learning part of this Data Science and Data Analytics project.

Twenty-one algorithms have been pre-tested with the train() and predict() functions from the caret package. All algorithms have been trained on the training set with full information and used for prediction on the test set with full information. Here is the list of the algorithms, giving for each one its name as it stands in Max Kuhn’s caret package and, in parentheses, the argument to be passed to the train() function:

  • AdaBoost Classification Trees (adaboost),
  • Generalized Additive Model using Splines (gam),
  • Boosted Generalized Additive Model (gamboost),
  • Generalized Additive Model using LOESS (gamLoess),
  • Stochastic Gradient Boosting (gbm),
  • Generalized Linear Model (glm),
  • k-Nearest Neighbors (knn),
  • Linear Discriminant Analysis (lda),
  • Monotone Multi-Layer Perceptron Neural Network (monmlp),
  • Naive Bayes (naive_bayes),
  • Quadratic Discriminant Analysis (qda),
  • Random Forest (Rborist),
  • Random Forest (rf),
  • CART (rpart),
  • Support Vector Machines with Linear Kernel (svmLinear),
  • Support Vector Machines with Radial Basis Function Kernel (svmRadialCost),
  • Support Vector Machines with Radial Basis Function Kernel (svmRadialSigma),
  • Weighted Subspace Random Forest (wsrf),
  • eXtreme Gradient Boosting (xgbDART),
  • eXtreme Gradient Boosting (xgbLinear),
  • eXtreme Gradient Boosting (xgbTree).

Model performance has been evaluated on the basis of sensitivity and precision. Six models have emerged: adaboost, gam, gamLoess, rf, wsrf, and xgbTree. For these six models, performance has been summarized in the table below.


Algorithm Sensitivity
on Training Set
Precision
on Training Set
adaboost 100 % 100 %
gam 44 % 93 %
gamLoess 43 % 96 %
rf 100 % 100 %
wsrf 100 % 99 %
xgbTree 100 % 100 %

Sensitivity and precision are both equal to 100% for adaboost, rf, and xgbTree. For wsrf, sensitivity and precision are respectively 100% and 99%.

This looks like overfitting. The same level cannot be expected on the test set.

Before calculating sensitivity and precision on the test set, let us have a look at broader statistics. This will be done for the models wsrf and gam.



This illustrates the overfitting diagnosis: sensitivity (recall), specificity, precision, F1, accuracy, and Kappa are close to 100% or even equal to 100%. For adaboost, rf, and xgbTree, all performance metrics would reach 100%.

The situation is quite different for the other two algorithms — namely gam and gamLoess. Let us visualize the confusion matrix and performance metrics on the training set for gam.



As the table above shows, for gam — and it is the same for gamLoess — precision is a bit lower than with the other four algorithms — adaboost, rf, wsrf, and xgbTree — but sensitivity is substantially lower.

So, let us summarize performances in a nutshell:

  • the algorithms adaboost, rf, wsrf, and xgbTree outperform the other two algorithms with performance metrics of 100% or close to 100%, but they are probably overfitting;
  • gam and gamLoess do not give indications of overfitting; they have rather high precision levels but low sensitivity levels.

At this stage, the six algorithms will remain preselected.
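For reference, the two retained metrics can be computed directly from the confusion counts. A minimal Python sketch with toy predictions (the project itself relies on R's caret, whose confusionMatrix() reports the same quantities):

```python
# Sensitivity (recall) and precision from scratch; 1 = default is the positive class.
def sens_prec(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed defaults
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
    return tp / (tp + fn), tp / (tp + fp)

# Toy predictions for illustration.
sensitivity, precision = sens_prec([1, 1, 1, 1, 0, 0, 0, 0],
                                   [1, 1, 1, 0, 1, 0, 0, 0])
print(sensitivity, precision)  # both 0.75 here
```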


Predicting on Test Set

Now let us turn to sensitivity and precision on the test set for loans with full information.


Algorithm Sensitivity
on Test Set
Precision
on Test Set
adaboost 43 % 100 %
gam 36 % 83 %
gamLoess 44 % 86 %
rf 40 % 93 %
wsrf 41 % 94 %
xgbTree 40 % 100 %

On the training set, four algorithms — adaboost, rf, wsrf, and xgbTree — had sensitivity and precision levels of 100% or almost 100%. On the test set, their sensitivity levels have sunk to 40% or somewhat higher. This confirms the overfitting diagnosis. Precision remains between 93% and 100%.

Two algorithms, gam and gamLoess, had more modest performance levels on the training set. They almost maintain those levels on the test set, which is not so surprising since there was no overfitting diagnosis for them on the training set. Moreover, gamLoess obtains the highest sensitivity level, though by a small margin.

Annoyingly enough, sensitivity is now much lower than the minimum target of 75%! All models face the same challenge: rebalancing sensitivity and precision by raising sensitivity to at least 75% while sacrificing part of the precision, but not below 50%.

Actually, there is some leeway, since precision is above the minimum target and since we know that there is a trade-off between sensitivity and precision. Indeed, these performance levels have been obtained by predicting default when the probability of default was larger than 50%. If the probability threshold is lowered, sensitivity can hopefully be boosted, of course to the detriment of precision. But since precision is above the minimum target, there is room for maneuver! Let us use it!


Boosting Sensitivity on Test Set

So far, the six preselected algorithms have only produced outcomes — default or repaid — for the standard probability threshold of 0.50.

Now, from each algorithm, we will extract probabilities for each loan, instead of outcomes. These probabilities will be converted into outcomes using different probability thresholds for each algorithm. The probability thresholds will range from 0.05 to 0.5 (the default threshold) in increments of 0.001. Thus, there will be 451 probability thresholds, and for each algorithm we will have 451 sets of outcomes. For each of these 2,706 outcome sets, we will calculate sensitivity and precision. This will provide six precision-sensitivity curves, which can be expected to cross the target area of all combinations with at least 75% sensitivity and at least 50% precision.
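The threshold sweep just described can be sketched as follows; this is an illustrative Python version on toy probabilities, while the project's actual code is in R.

```python
import numpy as np

def sens_prec(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tp / max(tp + fp, 1)  # guard against zero predictions

y_true = np.array([1, 1, 1, 0, 0, 0])
proba  = np.array([0.9, 0.4, 0.1, 0.45, 0.2, 0.05])  # toy default probabilities

# Thresholds 0.05 to 0.50 by 0.001: 451 points per algorithm.
thresholds = np.round(np.arange(0.05, 0.5001, 0.001), 3)
curve = [(t, *sens_prec(y_true, (proba > t).astype(int))) for t in thresholds]
print(len(curve))  # 451 sensitivity-precision points
```

Plotting precision against sensitivity over these 451 points yields one precision-sensitivity curve per algorithm.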

For the sake of clarity, a table will first provide part of the precision-sensitivity curves: it will be comprised of sensitivity and precision levels for the three best performing algorithms and for a limited number of probability thresholds that allow these three algorithms to get close to — or even meet — the minimum target of 75% sensitivity and 50% precision. Later on, a graph will show the whole precision-sensitivity curves for the six algorithms.


Thresh ada_s ada_p rf_s rf_p wsrf_s wsrf_p
0.11 94 12 76 46 79 43
0.12 94 13 76 49 79 48
0.13 94 14 74 53 77 50
0.14 94 15 73 57 73 50
0.15 93 17 73 63 71 53
0.22 83 45 67 82 64 79
0.23 83 50 67 84 63 80
0.24 80 53 64 87 61 83


The three best performing algorithms are adaboost, rf, and wsrf.

For these three algorithms, there are some most interesting results in the table above.

In the upper part of the table, we can see a combination meeting the minimum target for wsrf: 77% sensitivity and 50% precision. For rf, there is a combination that almost reaches the minimum target: 76% sensitivity and 49% precision.

In the lower part of the table, adaboost culminates at 83% sensitivity and 50% precision.

Let us have a look at a graph that summarizes available information in a broader picture.


[Figure: precision-sensitivity curves for adaboost, gam, gamLoess, rf, wsrf, and xgbTree]

In the graph above, a precision-sensitivity curve is allocated to each model. The word “recall” has been added as a synonym for “sensitivity” since Data Science literature often refers to precision-recall curves.

For each model, the precision-sensitivity curve gives from right to left all the precision-sensitivity combinations corresponding to the probability threshold moving from 0.5 to 0.05 by 0.001 increment.

The upper-right rectangle, marked with a deep blue perimeter, represents the target area — that is to say the area with sensitivity-precision combinations being equal to the 75-50 minimum target or superior to it (for instance 75-55 but not 74-55).

This graph makes it possible to rank the models on the test set with respect to the 75-50 benchmark: adaboost comes clearly first, followed by rf and wsrf; xgbTree, gam, and gamLoess remain further away, outside the target area.

Three caveats have to be clearly issued.


Caveats about Boosted Results

First, performance results have been obtained on the test set with complete information, which is just a sample — more precisely two-ninths of the dataset with full information. From a statistical point of view, taking the standard error into account, different results are probable on another sample, for instance the validation set.

Second, the encouraging results on the test set have been “snatched”, after much trial and error, on a test set that the algorithms already “know”. Repeated tuning on the test set has probably flattered the results, a flattering effect that we may miss on the validation set.

Third, performance results have been calculated on loans with full information. But the minimum target relates to both loans with complete or incomplete information. Results on both the test set with complete information and the test set with incomplete information have to be combined in order to check whether they altogether meet the minimum target of 75-50 and the same will have to be done on the validation set. This duality can bring some stability but also makes the global performance level dependent on a subset with incomplete information.

For these three reasons, an additional boost will be sought on the test set with complete information, before moving on to the test set with incomplete information. What is sought is an improvement in the level of prediction performance metrics and/or more prospective stability in prediction.


Ensembling by Majority Vote

Taking into account the three caveats, a boost is sought in terms of prediction stability and/or performance.

Model ensembling is expected to contribute prediction stability. But will it contribute prediction performance?

Model ensembling will be applied to the three best performing algorithms on the test set — adaboost, rf, and wsrf. It will be conducted by majority vote among the three best performing algorithms, taking for each observation from the test set with complete information the outcome — “default” or “repaid” — that gathers at least two votes. Common predicted values will be established for each probability threshold ranging from 0.5 to 0.05 by 0.001 increment. For each probability threshold, a sensitivity-precision combination will be calculated. Together, the 451 sensitivity-precision combinations will form a precision-sensitivity curve for the model ensembling process.
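The majority vote itself reduces to a sum of outcome vectors. A minimal Python sketch with toy predictions (1 = predicted default), standing in for the R implementation:

```python
import numpy as np

# Toy outcome vectors from the three retained algorithms at one threshold.
pred_ada  = np.array([1, 0, 1, 1, 0])
pred_rf   = np.array([1, 0, 0, 1, 0])
pred_wsrf = np.array([0, 1, 1, 1, 0])

# Majority vote: "default" needs at least two of the three votes.
votes = pred_ada + pred_rf + pred_wsrf
pred_vote = (votes >= 2).astype(int)
print(pred_vote)
```

Repeating this for each of the 451 thresholds yields the composite model's precision-sensitivity curve.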

For the sake of clarity, the best points from the precision-sensitivity curve of the composite model are shown in the next table.


Probability
Threshold
Ensembling by Vote

Test Set Sensitivity
Ensembling by Vote

Test Set Precision
0.13 81 46
0.14 79 51
0.15 77 53

The model ensembling procedure by majority vote culminates at combinations of 79-51 and 77-53, somewhat below adaboost, which reaches 83-50. As a composite structure, this ensembling procedure might bring some stability, but it is outperformed, at least on the test set with complete information. Consequently, another way forward will be sought: majority voting is not the only model ensembling technique.


Ensembling on Probabilities

Having already extracted probabilities of default from the three best performing algorithms, it is easy to directly work on probabilities to build a model ensembling procedure.

Average default probabilities will be calculated for all observations from the test set with complete information. Using these average probabilities of default, outcomes will be predicted for all probability thresholds ranging from 0.5 to 0.05 by 0.001 increment. Then once again a precision-sensitivity curve will be calculated.
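Ensembling on probabilities is even simpler. A hedged Python sketch on toy probabilities, using an arbitrary threshold of 0.17 for illustration:

```python
import numpy as np

# Toy default probabilities from the three retained algorithms.
proba_ada  = np.array([0.60, 0.10, 0.20])
proba_rf   = np.array([0.30, 0.05, 0.25])
proba_wsrf = np.array([0.45, 0.15, 0.30])

# Ensemble by average probability, then convert to outcomes at one threshold.
proba_avg = (proba_ada + proba_rf + proba_wsrf) / 3
pred = (proba_avg > 0.17).astype(int)
print(proba_avg, pred)
```

Sweeping the threshold from 0.5 down to 0.05 over these averaged probabilities produces the ensemble's precision-sensitivity curve.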

Here is the precision-sensitivity curve of the model ensembling procedure by average probabilities, together with the precision-sensitivity curves of the three best performing algorithms.


[Figure: precision-sensitivity curves for adaboost, rf, wsrf, and the ensembling procedure]

The graph above sheds some additional light:

  • the model ensembling procedure hits the target area several times;
  • adaboost is often better, except on some ranges above 80% sensitivity.

Greater predictive stability is expected from the ensembling procedure, which reaches the target area several times. Consequently, the ensembling procedure by probabilities is chosen to predict on the validation set with complete information.

Let us take a closer look at some probability thresholds for which the ensembling procedure crosses the target area in order to pick up a probability threshold that looks appropriate to reach the target area when predicting on the validation set.


Probability
Threshold
Ensembling by Probabilities

Test Set Sensitivity
Ensembling by Probabilities

Test Set Precision
0.15 81 49
0.16 77 51
0.17 77 58
0.18 76 60
0.19 73 65

After looking at the table just above, and to be on the safe side, let us choose a probability threshold of 0.17 with performance metrics of 77-58. This threshold will be used when predicting on the validation set with complete information.

By the way, the combination 77-58 is better than the combination 77-53 reached by the model ensembling procedure by majority vote.


ML on Incomplete Information

This is the second part of the Machine Learning process, this time on the training and test sets with incomplete information. Methodology has to be adapted as explained here.


Methodology

The analysis of loans with incomplete information will strongly combine Exploratory Data Analysis and Machine Learning.

Exploratory Data Analysis has provided very valuable information about default rates in the case of missing values, predictor by predictor. For instance, missing information about the current market value of the property or the debt-to-income ratio is associated with average default rates of 91% and 61% respectively. By contrast, missing information about the professional occupation or the number of years in the current professional occupation is related to average default rates of 5% and 12% respectively.

We are going to make the most of it.

In the training, test, and validation sets, missing values will be replaced with 1s and observed values with 0s. In this way, the predictors will consist of the pattern of missing values itself.
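The recoding just described is a single vectorized operation. An illustrative Python/pandas sketch with assumed column names (the project itself uses R):

```python
import pandas as pd
import numpy as np

# Toy predictors with gaps; column names follow the report's short names.
df = pd.DataFrame({
    "mort_due":    [np.nan, 25860.0, np.nan],
    "debt_to_inc": [34.8, np.nan, np.nan],
})

# Replace every missing value with 1 and every observed value with 0:
# the predictors become the pattern of missing information itself.
indicators = df.isna().astype(int)
print(indicators)
```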

Outcomes will be predicted with adaboost. Actually, other algorithms have also been tried on the training and test sets with incomplete information, such as glm, lda, naive_bayes, qda, rf, Rborist, xgbDART, and xgbTree. But adaboost outperformed them all.


Results on Training and Test Sets

Performance metrics obtained with adaboost have been summarized in the table below.


adaboost
Sensitivity
adaboost
Precision
Training Set with Incomplete Information 93 % 66 %
Test Set with Incomplete Information 92 % 64 %

On the training set, results are 93% sensitivity and 66% precision, well above the 75–50 minimum target. But there might be some overfitting.

On the test set, this model shows remarkable resilience, with performance remaining almost at the training-set level: a combination of 92–64.

This finding allows the following considerations:

  • there is no indication that the model for loans with incomplete information should be changed;
  • it solidly restores hopes of reaching the 75-50 minimum target on the combination of the validation set with complete information and the validation set with incomplete information.


Predicting on the Validation Set

Let us complete both Machine Learning processes by predicting on the validation set with complete information and on the validation set with incomplete information.


Predicting on Full Information

Here are the results when predicting loan default on the validation set with complete information.


Threshold Sensitivity Precision
Validation Set with Complete Information 0.17 83 % 48 %

Regarding loans with complete information, the performance level on the validation set, namely a sensitivity-precision combination of 83–48, compares to the best performance level reported on the test set with complete information, namely 81–49. But, on the validation set, the gap between sensitivity and precision is a little wider.


Predicting on Partial Information

Here are the results when predicting loan default on the validation set with incomplete information — that is to say the validation set comprised of all rows containing at least one missing value.


Sensitivity Precision
Validation Set with Incomplete Information 89 % 66 %

On the validation set with incomplete information, predicting loan default reaches a performance level of 89% sensitivity and 66% precision.

The performance level on the validation set with incomplete information

  • compares to the 92–64 combination reached on the test set with incomplete information,
  • and largely exceeds the minimum target of 75–50.


Global Prediction Exceeds Target

Modeling and analysis have followed a double track: loans with complete information have been treated apart from loans missing information in at least one predictor.

But the 75–50 minimum target applies globally — namely on the global validation set comprised of the validation set with complete information and the validation set with incomplete information.

The next table reports the performance of the dual approach on the global validation set.


Sensitivity Precision
Global Validation Set 87 % 60 %
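Global figures like these are obtained by pooling the confusion counts of the two validation subsets before computing the metrics. A sketch with toy counts, since the report does not disclose the underlying counts:

```python
# Global sensitivity/precision by summing confusion-matrix counts from the
# two validation subsets (toy counts, for illustration only).
full    = {"tp": 50, "fn": 10, "fp": 54}   # complete-information subset
partial = {"tp": 80, "fn": 10, "fp": 41}   # incomplete-information subset

tp = full["tp"] + partial["tp"]
fn = full["fn"] + partial["fn"]
fp = full["fp"] + partial["fp"]

sensitivity = tp / (tp + fn)   # pooled recall
precision   = tp / (tp + fp)   # pooled precision
print(round(sensitivity, 2), round(precision, 2))
```

Pooling counts, rather than averaging the two subsets' percentages, correctly weights each subset by its size.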


Takeaways

After reaching, and exceeding, the minimum target, let us devote some time to takeaways.


Data Science

In this Data Science project, Machine Learning models have predicted home equity loan default with a sensitivity of 87% and a precision of 60%, well above the minimum target of 75–50.

Exploratory Data Analysis has paved the way for this result. In many predictors, it has shown subcategories with different average default rates. But, perhaps more importantly, it has shown that missing information is associated, in some predictors, with extreme (high or low) average default rates.

This finding has unblocked a situation with many missing values. A double-track approach has been applied in Machine Learning, loans with complete information and loans with partial information being dealt with separately in the Machine Learning procedure.

The partial-information track has harnessed the predictive power of the dichotomy between absence and presence of information. All missing values have been replaced with 1s and all observed values with 0s. This has enabled the adaboost algorithm to produce, on the validation set, a sensitivity of 89% and a precision of 66%.

The complete-information track has unfolded in several stages: training of twenty-one algorithms on the training set, shortlisting of six of them, thresholding, construction of precision-sensitivity (precision-recall) curves, model ensembling, choice of a probability threshold, and loan default prediction.

In its current version, the global Machine Learning model has produced a binary classification into predicted default and predicted repayment. Two model extensions would be straightforward.

On the one hand, another balance between sensitivity and precision could be triggered by modulating the probability threshold, corresponding to another risk management policy.

On the other hand, the model could produce probabilities of default — or repayment — making it possible to refine the assessment of a loan file.


Domain Perspective

In the dataset under analysis, missing information should be considered a red flag since the default rate is 35% on loans with missing information and only 8% on other loans.

More specifically, in some fields, missing information should be considered a deal breaker. These fields are the current market value of the property and the debt-to-income ratio. Indeed, missing information in the current market value of the property is associated with an average default rate of 91% in the training set; missing values in the debt-to-income ratio correspond to an average default rate of 61%.

These considerations are expressed on the sole basis of the dataset under review, which is supposed to be comprised of real-world data. If these data are real, complementary information may have been available: some borrowers may have had strong points not mentioned in the dataset, such as repayment capacity, liquid assets, or guarantors.