Detecting credit card fraud problems. The dataset is downloaded from Kaggle: Give Me Some Credit (https://www.kaggle.com/c/GiveMeSomeCredit/).
The dataset has two main problems. First, it contains missing values: around 20% of the observations are incomplete (stored as NA). Second, most of the input features contain outliers that need to be dealt with. In total, the dataset has 250,000 observations, of which 150,000 are in the training set and 100,000 are in the test set.

After data cleaning and transformation, we split the data into two parts: 70% for training and the remaining 30% for testing. Because the class labels of the training data are given, we use supervised learning. The label is 0 or 1, indicating default or not, so we treat the task as a binary classification problem. We train two algorithms: a two-class boosted decision tree and logistic regression. The trained models are then evaluated on the testing data. In overall accuracy, one model reaches 93% and the other 92%.
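The cleaning, splitting, and scoring steps can be sketched in plain Python. This is only an illustration, not the project's actual pipeline: the text does not say how missing values or outliers were handled, so median imputation and quantile clipping are assumptions here, and the function names (impute_median, clip_outliers, split_rows, accuracy) are hypothetical.

```python
import random
import statistics


def impute_median(column):
    """Replace missing entries (None) with the median of the observed values.

    Median imputation is an assumed strategy for this sketch.
    """
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in column]


def clip_outliers(column, lower_q=0.01, upper_q=0.99):
    """Clamp values to the range between two empirical quantiles.

    Quantile clipping is an assumed way of dealing with outliers.
    """
    ordered = sorted(column)
    n = len(ordered)
    low = ordered[round(lower_q * (n - 1))]
    high = ordered[round(upper_q * (n - 1))]
    return [min(max(v, low), high) for v in column]


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows reproducibly and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true 0/1 labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy usage with made-up values, only to show the call pattern.
ages = impute_median([35, None, 52, 41])
incomes = clip_outliers(list(range(100)) + [10000])
train, test = split_rows(list(range(10)))
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```

With a 70/30 split, every row ends up in exactly one of the two parts, and the accuracy function gives the kind of overall accuracy figure reported for the two models. In practice a data-frame and machine-learning library would likely replace these helpers.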