Loan Prediction Model

March 1, 2017
This Azure Machine Learning experiment creates a statistical model to predict if a customer will default or fully pay off their loan.
Author: Edward Ansong

Description
-----------

**Binary Classification: Loan Granting**

This experiment creates a statistical model to predict whether a customer will default on or fully pay off a loan.

**Data**

A synthetic data set based on real data was created for the competition. The data set included the following columns:

Loan ID <br>
Customer ID <br>
Loan Status <br>
Current Loan Amount <br>
Term <br>
Credit Score <br>
Years in current job <br>
Home Ownership <br>
Annual Income <br>
Purpose <br>
Monthly Debt <br>
Years of Credit History <br>
Months since last delinquent <br>
Number of Open Accounts <br>
Number of Credit Problems <br>
Current Credit Balance <br>
Bankruptcies <br>
Tax Liens <br>

For this experiment, Loan Status serves as the label, or attribute to predict. The remaining columns, excluding Customer ID, are used to predict the outcome of Loan Status for each customer. The data set was provided by Azure at a web URL in CSV form.

**Data Preparation**

Part 1:

![Data Cleaning Modules][1]

Data preparation was carried out in several steps. First, the Select Columns module was used to filter out Customer ID, since the variable has no statistical significance: it only uniquely identifies each customer. The second step was to clean the data with R scripts.
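The effect of the Select Columns module can be pictured in plain R. This is only an illustrative sketch with a toy data frame; the real experiment does the filtering in the Azure ML designer, not in a script:

```r
# Minimal sketch of the Select Columns step: drop Customer ID, which only
# uniquely identifies each customer. The data frame here is toy data.
loan_df <- data.frame(
  "Customer ID"  = c("C-001", "C-002"),
  "Loan Status"  = c("Fully Paid", "Charged Off"),
  "Credit Score" = c(720, 640),
  check.names = FALSE
)
loan_df <- loan_df[, setdiff(names(loan_df), "Customer ID"), drop = FALSE]
names(loan_df)
# [1] "Loan Status"  "Credit Score"
```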
**R code:**

```r
library("stats")
loan_df <- maml.mapInputPort(1) # class: data.frame

# Strip non-numeric characters (keeping ".") and typecast to numeric
loan_df$"Maximum Open Credit" <- gsub("[^0-9.]", "", loan_df$"Maximum Open Credit")
loan_df$"Maximum Open Credit" <- as.numeric(loan_df$"Maximum Open Credit")
loan_df$"Monthly Debt" <- gsub("[^0-9.]", "", loan_df$"Monthly Debt")
loan_df$"Monthly Debt" <- as.numeric(loan_df$"Monthly Debt")
loan_df$"Months since last delinquent" <- gsub("[^0-9.]", "", loan_df$"Months since last delinquent")
loan_df$"Months since last delinquent" <- as.numeric(loan_df$"Months since last delinquent")
loan_df$"Bankruptcies" <- gsub("[^0-9.]", "", loan_df$"Bankruptcies")
loan_df$"Bankruptcies" <- as.numeric(loan_df$"Bankruptcies")
loan_df$"Tax Liens" <- gsub("[^0-9.]", "", loan_df$"Tax Liens")
loan_df$"Tax Liens" <- as.numeric(loan_df$"Tax Liens")

# Condense "Years in current job" into bins
loan_df$"Years in current job" <- gsub("[1-5]\\syears", "1-5 years", loan_df$"Years in current job")
loan_df$"Years in current job" <- ifelse(loan_df$"Years in current job" == "1 year", "1-5 years", loan_df$"Years in current job")
loan_df$"Years in current job" <- gsub("[6-9]\\syears", "6-9 years", loan_df$"Years in current job")
loan_df$"Years in current job" <- gsub("n/a", NA, loan_df$"Years in current job")

# Cap implausible credit scores
loan_df$"Credit Score" <- ifelse(loan_df$"Credit Score" > 1000, 650, loan_df$"Credit Score")

maml.mapOutputPort("loan_df")
```

First, the Loan Status column was altered slightly, making the "Charged Off" status 1 and the "Fully Paid" status 0. The next R script addresses most columns individually. Many of the numeric columns had extra non-numerical characters in them, such as "$". With the exception of ".", non-numerical characters were removed from the numerical columns, and the columns meant to be numerical were all typecast to numeric. In addition, the "Years in current job" column was condensed: instead of individual years ranging from 1-9, the years of experience were placed into bins. For example, people with 1-5 years of experience were all placed into one bin.
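The binning logic above can be spot-checked on a few sample values. This standalone snippet repeats the same `gsub`/`ifelse` steps on a toy vector:

```r
# Spot check of the "Years in current job" binning from the script above.
years <- c("1 year", "3 years", "7 years", "10+ years")
years <- gsub("[1-5]\\syears", "1-5 years", years)
years <- ifelse(years == "1 year", "1-5 years", years)
years <- gsub("[6-9]\\syears", "6-9 years", years)
years
# [1] "1-5 years" "1-5 years" "6-9 years" "10+ years"
```

Note that "10+ years" passes through untouched, since the character classes only match a single digit followed by whitespace and "years".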
Finally, some credit scores were over 1000. After many iterations of testing, credit scores of that nature were lowered to 650, as they all belonged to people who were Charged Off.

Next, extreme outliers were clipped and replaced with the median. Some columns contained entries with extremely high values such as 99999999; these values are most likely not valid and were put in place to signify that the value is missing. Duplicate Customer IDs were also removed.

In the next Execute R Script module, feature engineering was used to create the dti (debt-to-income ratio) column by dividing monthly debt by monthly income (annual income divided by 12):

```r
loan_df <- maml.mapInputPort(1)

# Debt-to-income ratio: monthly debt over monthly income
month_income <- loan_df$"Annual Income" / 12
loan_df$"dti" <- loan_df$"Monthly Debt" / month_income

# Guard against division by zero income
loan_df$"dti" <- ifelse(is.infinite(loan_df$"dti"), 0, loan_df$"dti")

maml.mapOutputPort("loan_df")
```

Part 2:

![Data Cleaning Modules Part 2][2]

The next few modules convert the string columns into categorical columns, and the Purpose column is condensed into two main groups (debt, other) from the several that previously existed.

Part 3:

![Building and Testing the Algorithm][3]

From here the model is created. The label (Loan Status) is split off from the other columns before creating the model. A boosted decision tree is used to build the model, and the Tune Model Parameters module is used to find the most efficient parameters for it. A random sweep was used, with the module set to maximize AUC (Area Under the Curve). The best parameters found by Tune Model Parameters are then used to train the model. The data is split into training and test sets using the Split Data module: the training data trains the model, and the test data is fed to the Score Model module to evaluate performance. The confusion matrix can be seen in the Evaluate Model module.

**Final Thoughts:** Creating a model to accurately predict the hidden dataset for this competition was very tough for several reasons.
Many of the participants, myself included, struggled with the imbalanced data. About 75% of the data set consisted of people who fully paid off their loan, and the remainder were people who were charged off. This posed many challenges, since the models became biased towards people who fully paid off their loans. In addition, the distributions of credit score, current loan amount, and many other columns were very similar between customers who fully paid off their loans and those who did not. If there were more variability between the two types of customers, this may have been easier. Overall, this was a fun model to create, and I hope to tackle some more fun data science challenges in the future.

[1]:
[2]:
[3]:
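One common way to deal with class imbalance of the kind described above is to downsample the majority class before training. The sketch below is purely illustrative and is not necessarily what any participant used; the toy label vector stands in for the real Loan Status column:

```r
# Downsampling sketch (illustrative only; not the method used in the experiment).
# 0 = Fully Paid (majority, ~75%), 1 = Charged Off (minority, ~25%).
set.seed(42)
loan_status <- c(rep(0, 75), rep(1, 25))

majority <- which(loan_status == 0)
minority <- which(loan_status == 1)

# Keep every minority row and an equally sized random sample of majority rows
keep <- c(sample(majority, length(minority)), minority)
balanced <- loan_status[keep]
table(balanced)  # 25 of each class
```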