This experiment was created by following an example credit risk prediction walkthrough.
When we apply for a credit card, the bank representative asks us to fill out an application, and we receive the card only if we get a positive rating. Have you ever wondered how the bank makes that decision? Today I have the answers. In fact, I will use the very intuitive Azure Machine Learning platform to demonstrate the process.

Once you fill out the form, the bank representative inputs those values into the system and gives you the answer. Let's assume that your application was rejected and you asked why. The representative told you that you scored '2', which means bad credit risk. Now you are really upset and ask why you did not receive good credit risk, or '1'. The representative explains that although you have been a customer for a long time, you are not currently employed. You were still not convinced, and when you returned home, you searched the internet for information about the credit card process. What you found was the German credit dataset from UCI and an online tutorial for classifying credit card approval as 'Yes/No'. Since Azure Machine Learning does not require programming skills, you decide to replicate this process by performing data engineering, selecting the algorithms, training the models, and finally evaluating their performance. Don't be scared by the diagram below; you are going to understand each step one by one. To check the details of the experiment, here is the link.

Data Engineering: In the dataset description, you saw that there is no risk for the bank when an application is classified correctly. The problem lies in misclassification: a misclassified bad credit application carries five times the risk of a misclassified good one, because the bank suffers a financial loss if the loan is not repaid. That is why you need to build a model that takes this cost asymmetry into account. So you decided to separate the bad-credit-risk data (column 21) and replicate it five times. This can be done with a simple R script.
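The original experiment does this replication with an R script inside Azure Machine Learning. As a minimal sketch of the same idea, here is an equivalent in Python with pandas; the tiny DataFrame and its column names are hypothetical stand-ins for the real German credit data, where the class label is 1 for good and 2 for bad credit:

```python
import pandas as pd

# Hypothetical miniature of the German credit data: the class label
# (1 = good credit, 2 = bad credit) sits in the last column.
df = pd.DataFrame({
    "duration": [6, 48, 12, 42],
    "credit_amount": [1169, 5951, 2096, 7882],
    "credit_risk": [1, 2, 1, 2],
})

# Up-weight the costlier class: keep good-risk rows once and append
# four extra copies of each bad-risk row, so every bad-risk row
# appears five times in total.
bad = df[df["credit_risk"] == 2]
weighted = pd.concat([df] + [bad] * 4, ignore_index=True)
```

After this step the training algorithm sees each bad-credit example five times as often, which approximates the 5:1 misclassification cost described in the dataset documentation.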
Before doing that, you must rename the columns and split the data into one set for training the model and another for evaluating its performance. A 70/30 split is commonly advised. Now your data is ready for building the model: 70% for training on the left and 30% for test/validation on the right.

Model Selection: There are many types of models, but what you are interested in is a simple and fast algorithm. Again, I repeat: you don't need to know any programming. Microsoft has made a cheat sheet that makes our life easier. Following this sheet, you decided to choose Logistic Regression and the Naive Bayes classifier, both for two classes: good or bad credit. To avoid confusion, you should go through each technique separately and compare the results at the end.

Logistic Regression: This is a statistical model that predicts a categorical response variable, labeled as discrete or binary, e.g. 'Yes or No', 'Good or Bad', 'Buying or Not Buying', 'Win or Loss', '0 or 1', 'Success or Failure'. Logistic regression outputs probabilities between 0 and 1 through the sigmoid function, and its parameters are fitted by Maximum Likelihood Estimation.

Naive Bayes Classifier: It is all about how much you can trust the evidence. Naive Bayes is a conditional probability model that predicts a class quickly and easily, even from a small amount of training data. The model assumes that the predictors are independent, which is almost never true in practice. Even so, the Naive Bayes classifier often performs better than other models such as logistic regression.

Train & Score Models: You must run both models separately, which means you need two different nodes for training and scoring the data. The model node, e.g. Two-Class Logistic Regression, connects to the Train Model node along with the 70% training data, while the Score Model node takes its inputs from the trained model and the 30% test/validation data.
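The split-train-score pipeline above can be sketched in a few lines of Python with scikit-learn. This is not the Azure Machine Learning implementation, only an illustration of the same steps under stated assumptions: a synthetic dataset stands in for the credit data (with labels 0/1 instead of 1/2), and scikit-learn's GaussianNB stands in for the Naive Bayes node.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in for the credit data: a synthetic binary
# classification set (label 1 playing the role of "bad credit").
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# The advised 70/30 split: 70% for training, 30% for test/validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Run each model separately, mirroring the two Train/Score branches.
accuracy = {}
for model in (LogisticRegression(max_iter=1000), GaussianNB()):
    model.fit(X_train, y_train)                  # the "Train Model" step
    probas = model.predict_proba(X_test)[:, 1]   # "Score Model": probability of bad credit
    accuracy[type(model).__name__] = model.score(X_test, y_test)
```

Each branch produces per-application probabilities, which is exactly what the Score Model node passes on to evaluation.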
The process is the same for the Two-Class Bayes Point Machine, an advanced relative of the Naive Bayes classifier. Both models generate probabilities to predict good or bad credit.

Evaluate Model: Let's see which model performed better. Your target variable is bad credit (positive label, 2), which you would like to minimize, while 1 stands for good credit and is labeled as negative. Looking at the curve, the area under the curve is similar for both models, so it is hard to draw a conclusion from it alone. However, the confusion matrix gives a useful insight: the number of False Negatives, each of which carries five times the credit risk of a False Positive, is reduced drastically, and Recall also improves significantly. To learn more about this terminology, please click this link. For classifying bad credit, Naive Bayes is the better model for this purpose.

Now you know the science behind Credit Risk Analysis. You can also deploy this model online and predict the score of new credit applications. Not only that, you can update the model with fresh or additional data. Next time you apply for any type of loan, you will already know what credit risk is.
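To make the evaluation terms concrete, here is a small sketch of how the confusion matrix and Recall are computed with scikit-learn. The scored labels below are made up for illustration; in this setup 2 (bad credit) is the positive label and 1 (good credit) is the negative label.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels: 2 = bad credit (positive),
# 1 = good credit (negative).
y_true = [2, 2, 2, 1, 1, 1, 1, 2]
y_pred = [2, 2, 1, 1, 1, 2, 1, 2]

# With 2 as the positive label, a False Negative is a bad-credit
# application classified as good -- the costly mistake the bank
# wants to minimize.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 2]).ravel()

# Recall = tp / (tp + fn): the share of bad-credit applications
# the model actually caught.
recall = recall_score(y_true, y_pred, pos_label=2)
```

A model that drives the False Negative count down, and hence the Recall up, is the better choice here even if its overall area under the curve looks similar.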