
Predicting Credit Card Approvals using Random Forest

shilpsgohil

Commercial banks receive many credit card applications on a day-to-day basis. Some of these applications get rejected based on factors such as high bank loan balances, low income levels or a history of unemployment. Manually analyzing each of these applications is mundane, error-prone and time-consuming. Therefore, we can implement a machine learning algorithm to do this task for us.


brief introduction to random forest


Random forest is an ensemble method that uses decision trees as its base estimators. Each tree in the forest is trained on a different bootstrap sample of the same size as the training set. Random forest thus introduces randomization in two ways: each tree is trained on a different sample of the data, and only a subset of d features (sampled without replacement) is considered at each node. Random forest also allows variable importance to be determined via permutation importance, a measure that tracks how prediction accuracy on the out-of-bag samples changes when the values of a variable are randomly permuted.
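As a minimal sketch of how such a forest is configured in scikit-learn (on a synthetic dataset, since the real data is loaded later), the bootstrap and max_features arguments control the two sources of randomization described above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the credit card dataset is prepared later.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Each tree is fit on a bootstrap sample of the training set, and only
# sqrt(n_features) candidate features are considered at each split.
rf = RandomForestClassifier(n_estimators=400, max_features="sqrt",
                            bootstrap=True, oob_score=True, random_state=42)
rf.fit(X, y)
print(rf.oob_score_)  # accuracy estimated from the out-of-bag samples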



Each decision tree in a random forest learns a sequence of if-else questions about individual features in order to predict the labels. In contrast to linear models, trees can capture non-linear relationships between features and labels. The root is the node at which the decision tree starts growing; it poses a question that gives rise to two child nodes through two branches. An internal node has a parent and likewise poses a question that gives rise to two child nodes. Lastly, a leaf node has one parent node and poses no question; it is where the final predictions are made. Classification trees are grown recursively, and the splitting of information at each internal node depends on the state of its predecessors. At each node, a question involving a single feature is asked, and the tree is split so that information gain is maximized at each split.
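As a toy illustration (using scikit-learn's built-in iris data rather than the credit card dataset), the sequence of if-else questions learned by a single tree can be printed like this:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Print the learned questions from the root down to the leaf nodes.
print(export_text(tree, feature_names=list(iris.feature_names)))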


An example of a decision tree:


Data


The dataset used for this model was the "Credit Approval Data Set", obtained from the UCI Machine Learning Repository. All feature names and values were changed to meaningless symbols to protect the confidentiality of the data, which adds complexity to the dataset and calls for a fair amount of data cleaning. The data also contains a mix of continuous and categorical variables, with missing values in each.


The data in a Jupyter notebook looks as follows:
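As a sketch, the data can be loaded with pandas; the local file name and the column names (taken from the blog mentioned in the next section) are assumptions:

import pandas as pd

# The UCI file has no header row, so the (anonymized) columns are named
# here using the names suggested in the blog referenced below.
cols = ["Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
        "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
        "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus"]
df = pd.read_csv("crx.data", header=None, names=cols)
df.head()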



exploring the dataset


The column names were found in the following blog, which was informative in terms of knowing which features I was working with.


From the image above, we see a mix of ints, floats and objects; therefore the dataset has a mixture of numerical and non-numerical features. We will clean and process the data later.



Data cleaning


  • The missing values in the dataset are labelled with "?", and these needed to be replaced by NaNs in order to work with them further in a pandas context.

  • The NaNs pertaining to numerical variables were subsequently imputed with the respective mean values.

  • The NaNs pertaining to the categorical variables were imputed with the most frequent values.

  • In order to verify that all NaN values were replaced, "isnull().sum()" was utilized; all NaN values had been successfully replaced with values, as shown below (a code sketch of these cleaning steps follows).
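A sketch of these cleaning steps, continuing from the loading snippet above:

import numpy as np

# Replace the "?" placeholders with NaNs so pandas can detect them.
df = df.replace("?", np.nan)

# Impute numerical columns with their respective means.
df.fillna(df.mean(numeric_only=True), inplace=True)

# Impute the remaining (categorical) columns with their most frequent value.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].value_counts().index[0])

# Verify that no NaNs remain.
print(df.isnull().sum())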


Exploratory data analysis


Histogram for the numerical variables in the dataset


From the bar plot above, we see that there are some differences between education level attained and the respective incomes. The "aa" and "c" groups have significantly higher levels of income than some of the other groups.


We see that some ethnicities have higher debt levels than others. For example, the "v" ethnicity has profoundly higher rates of debt than other ethnicities.


Different genders also have varying levels of debt; there is a significant spread of data for the "a" gender, but significantly more outliers for the "b" gender.


We also see varying levels of debt based on different categories of marital status.


We see a dense cluster of individuals between the ages of 20 and 40 years with moderate to low amounts of debt. We also see that as individuals get older, they tend to have higher levels of debt.


No significant correlations observed.


High variance observed in the income variable.

There is a slight class imbalance between credit card approvals and disapprovals.


Data Preprocessing


The data preprocessing steps performed include the following (a code sketch of these steps appears after the list):


  1. Converting the categorical variables into a format suitable for machine learning using the pd.get_dummies method in pandas.

  2. Dropping variables such as drivers_license and zipcode, as these were deemed irrelevant.

  3. The target variable (approval status) used "+" to represent approvals and "-" to represent disapprovals. These had to be replaced with 1 (approved) and 0 (not approved).

  4. The target variable was assigned to the variable y and the rest of the variables (the predictors) were assigned to X.

  5. There were 11 rows found to have "b" in place of a numerical value in the age column, and these rows were dropped from the final dataset.

  6. The numerical variables were standardized using the StandardScaler from scikit-learn.
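A sketch of these preprocessing steps, continuing from the cleaning snippet above (the column names are the assumed ones from the loading snippet):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Drop the features deemed irrelevant.
df = df.drop(["DriversLicense", "ZipCode"], axis=1)

# Drop the 11 rows with a stray "b" in the age column, then cast to float.
df = df[df["Age"] != "b"]
df["Age"] = df["Age"].astype(float)

# Encode the target: "+" (approved) -> 1, "-" (not approved) -> 0.
df["ApprovalStatus"] = df["ApprovalStatus"].map({"+": 1, "-": 0})

# One-hot encode the categorical predictors.
df = pd.get_dummies(df)

# Predictors in X, target in y.
X = df.drop("ApprovalStatus", axis=1)
y = df["ApprovalStatus"]

# Standardize the features (kept as a DataFrame to preserve column names).
X = pd.DataFrame(StandardScaler().fit_transform(X),
                 columns=X.columns, index=X.index)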


t-SNE visualization for high-dimensional data (unsupervised machine learning)


t-SNE stands for t-distributed stochastic neighbour embedding, and it maps samples from their multi-dimensional space into a 2- or 3-dimensional space for the purposes of visualization. Although some distortion is involved, t-SNE gives an approximate representation of the distances between the samples in the data and can hence be a vital visual aid for further understanding the dataset. Some disadvantages of this methodology are that different values of the learning rate (a hyperparameter) need to be tried in order to achieve an ideal image, that the axes of a t-SNE plot have no particular interpretable meaning, and that every time t-SNE is applied a different visual is generated, even on the same dataset.
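A minimal sketch of producing such a plot with scikit-learn's TSNE, continuing from the preprocessing snippet (the learning rate shown is just one value to try):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Map the high-dimensional feature matrix down to 2 components.
X_embedded = TSNE(n_components=2, learning_rate=200,
                  random_state=42).fit_transform(X)

plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap="coolwarm", s=10)
plt.title("t-SNE of the credit card dataset")  # axes have no intrinsic meaning
plt.show()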


The t-SNE visualization for the credit card dataset is shown below.

From the image above, we do see some good demarcation between the non-approvals and approvals (the target variable); however, we also see some overlap between the two classes. Thus we can conclude that, based on the different predictor variables, some data points will be more difficult to predict than others due to this overlap.


Implementing random forest


In order to implement the random forest, the data was first divided into X_train, X_test, y_train and y_test using scikit-learn's train_test_split, with 20% of the data assigned to the test set. I first trained the model with 5-fold cross-validation and no hyperparameter tuning, for model comparison purposes, and obtained the following metrics (a sketch of this step follows):
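A sketch of the split and the baseline cross-validation (the random_state values are arbitrary):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Hold out 20% of the data for final testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Baseline model: default hyperparameters, 5-fold cross-validation.
rf = RandomForestClassifier(random_state=42)
scores = cross_val_score(rf, X_train, y_train, cv=5)
print(scores.mean())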



I repeated this with hyperparameter tuning of n_estimators, max_features and max_depth using 5-fold cross-validation and randomized search, and got the best parameters as n_estimators: 400, max_features: "auto" and max_depth: None.
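A sketch of the randomized search; the candidate grid here is hypothetical, and note that newer scikit-learn versions spell max_features="auto" as "sqrt" for classifiers:

from sklearn.model_selection import RandomizedSearchCV

# Hypothetical candidate grid; "sqrt" corresponds to the older "auto".
param_dist = {
    "n_estimators": [100, 200, 400, 800],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [None, 5, 10, 20],
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=param_dist,
                            n_iter=10, cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)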


I then trained the model on the training set using these best hyperparameters and predicted values for the test dataset (X_test). The final model, evaluated on the test dataset, gave the following results (a sketch of this step follows):
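A sketch of the final fit and evaluation, with max_features="sqrt" standing in for the older "auto":

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Refit on the full training set with the best hyperparameters found above.
best_rf = RandomForestClassifier(n_estimators=400, max_features="sqrt",
                                 max_depth=None, random_state=42)
best_rf.fit(X_train, y_train)

y_pred = best_rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))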


The ROC curve, which plots the true positive rate on the y-axis against the false positive rate on the x-axis, can be seen below.
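A sketch of how such a curve can be produced from the fitted model:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class (approval) for each test sample.
y_proba = best_rf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()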


Feature importance


With random forest, feature importance can be determined, and the feature importance plot for the dataset is shown below:
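A sketch of extracting and plotting the impurity-based importances from the fitted forest:

import matplotlib.pyplot as plt
import pandas as pd

importances = pd.Series(best_rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh")
plt.title("Random forest feature importances")
plt.show()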


We can see that the most important features for determining credit card approval or disapproval are prior default, years employed, debt, credit score, age and income. We also notice that variables such as whether the person is a bank customer and whether the person is married are not important features in determining whether the credit card is approved.
