Trends in defaulted loans based on financial and lifestyle information
This study explores a dataset of financial and lifestyle information on more than 31,000 borrowers, with loan default status as the response variable. We examine which variables are associated with default and identify the ones that best indicate whether a person is a reliable borrower.
Rows: 32,581
Columns: 11
$ age <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22, 21, …
$ income <dbl> 59000, 9600, 9600, 65500, 54400, 9900, 77100, 78…
$ home_ownership <fct> RENT, OWN, MORTGAGE, RENT, RENT, OWN, RENT, RENT…
$ employment_length <dbl> 123, 5, 1, 4, 8, 2, 8, 5, 8, 6, 6, 2, 2, 4, 2, 7…
$ loan_intent <fct> PERSONAL, EDUCATION, MEDICAL, MEDICAL, MEDICAL, …
$ loan_amnt <dbl> 35000, 1000, 5500, 35000, 35000, 2500, 35000, 35…
$ loan_int_rate <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12.42, …
$ loan_status <fct> Default, Non-Default, Default, Default, Default,…
$ loan_percent_income <dbl> 0.59, 0.10, 0.57, 0.53, 0.55, 0.25, 0.45, 0.44, …
$ default_history <fct> Y, N, N, N, Y, N, N, N, N, N, N, N, N, N, N, N, …
$ credit_history_length <dbl> 3, 2, 3, 2, 4, 2, 3, 4, 2, 3, 4, 2, 2, 4, 4, 3, …
age: Person age in years
income: Annual income in USD
home_ownership: Home ownership classifications on current home
employment_length: Employment length in years
loan_intent: What the loan is being used for
loan_amnt: Total amount of the loan
loan_int_rate: Interest rate of the loan in %
loan_status: Whether or not the person is in default on their loan
loan_percent_income: Loan amount as a proportion of the person's annual income (0.59 means 59%)
default_history: If a person has ever defaulted on a loan before or not
credit_history_length: Length of credit history in years
Before starting the analysis, we need to check that the dataset contains no values that are implausible or impossible. We will look at the five-number summary of each quantitative variable to check for problems.
     age  income employment_length loan_amnt loan_int_rate loan_percent_income credit_history_length
[1,]  20    4000                 0       500          5.42                0.00                     2
[2,]  23   38500                 2      5000          7.90                0.09                     3
[3,]  26   55000                 4      8000         10.99                0.15                     4
[4,]  30   79200                 7     12200         13.47                0.23                     8
[5,] 144 6000000               123     35000         23.22                0.83                    30
This table shows some age and employment length values that don’t make sense: the maximum age is 144 and the maximum employment length is 123 years. Since the oldest person ever recorded died at 122, it is safe to remove ages over 122. Assuming that even the oldest person alive didn’t start working before age 10, employment lengths over 112 years can be removed as well. We will remove the rows with these implausible values.
     age  income employment_length loan_amnt loan_int_rate loan_percent_income credit_history_length
[1,]  20    4000                 0       500          5.42                0.00                     2
[2,]  23   39366                 2      5000          7.90                0.09                     3
[3,]  26   56000                 4      8000         10.99                0.15                     4
[4,]  30   80000                 7     12500         13.48                0.23                     8
[5,]  94 2039784                41     35000         23.22                0.83                    30
Now that the quantitative data looks plausible, it can be used in the study.
# A tibble: 4 × 2
home_ownership n
<fct> <int>
1 RENT 16072
2 OWN 2410
3 MORTGAGE 13090
4 OTHER 107
# A tibble: 6 × 2
loan_intent n
<fct> <int>
1 PERSONAL 5367
2 EDUCATION 6288
3 MEDICAL 5897
4 VENTURE 5553
5 HOMEIMPROVEMENT 3510
6 DEBTCONSOLIDATION 5064
# A tibble: 2 × 2
loan_status n
<fct> <int>
1 Default 6825
2 Non-Default 24854
# A tibble: 2 × 2
default_history n
<fct> <int>
1 Y 5628
2 N 26051
All of the categorical data looks plausible, so we can continue to the study.
First we will look at the categorical variables and see how loan status differs across their categories.
From these graphs we can see that:
Home ownership and default history are the categorical variables most strongly associated with loan status
One possible explanation is that people who own their home manage their money well enough to make their mortgage payment every month and keep their house. Similarly, a past default suggests that a person has had trouble managing money to pay off loans
Now we will look at the quantitative variables and see how each one relates to whether a person defaults on their loan.
From these graphs we can see:
Several quantitative variables, such as age, employment length, and credit history length, make little difference to whether a person has defaulted on their loan
Income, loan amount, loan interest rate, and loan percentage of income show a clear trend with loan status
Now we will look at pairs of numerical variables and how they relate to loan status.
From these graphs we can see that:
Younger people with lower incomes default on their loans much more often than older people with higher incomes
Loan amount makes little difference to whether people default, but loan interest rate makes a large difference
People whose loan is a low percentage of their income and has a low interest rate rarely default
People whose loan is a high percentage of their income and has a high interest rate often default
In this study we can conclude that some variables in this dataset do a fairly good job of indicating whether a person is likely to default on their loan. Among the categorical variables, people who own their home are much less likely to default than people who rent or are in an “other” living situation. Default history also shows a clear trend: people with a history of default are much more likely to default again. Among the quantitative variables, income, loan amount, loan interest rate, and loan percentage of income are all clearly related to default. The median income of people who defaulted is 42,000 USD annually, while the median income of people who did not default is considerably higher at 60,000 USD annually. People with a higher loan amount and a higher interest rate are more likely to default, plausibly because both make a person’s monthly payment much higher. The median loan percentage of income is 24% for people who defaulted but only 13% for people who did not. This analysis shows that there is financial and lifestyle information a lender should review when a person applies for a loan to judge how likely they are to repay it.
---
title: "Loan Default Trends"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: yeti
      navbar-bg: "purple"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
<style>
.chart-title { /* chart_title */
font-size: 16px;
}
body{ /* Normal */
font-size: 18px;
}
</style>
```{r setup, include=FALSE}
library(flexdashboard)
```
Basic Information
===
Column {data-width=500}
---
***Trends in defaulted loans based on financial and lifestyle information***
### Introduction
This study explores a dataset of financial and lifestyle information on more than 31,000 borrowers, with loan default status as the response variable. We examine which variables are associated with default and identify the ones that best indicate whether a person is a reliable borrower.
<head>
<base target="_blank">
</head>
Here is a glimpse of the data:
```{r package_data}
library(pacman)
# Load the packages used in the dashboard (p_load installs any that are missing)
p_load(cowplot,dplyr,DT,gridExtra,plotly,tidyverse, maps, viridis)
# Column types: n = number, f = factor
df<-read_csv("Credit_Risk_Data_Set.csv",col_types="nnfnfnnfnfn")
# Shorten the original column names
df<-df %>%
  rename("age"="person_age",
         "income"="person_income",
         "home_ownership"="person_home_ownership",
         "employment_length"="person_emp_length",
         "default_history"="cb_person_default_on_file",
         "credit_history_length"="cb_person_cred_hist_length")
# Replace the 0/1 codes with readable labels
df$loan_status<-df$loan_status %>% recode(
  '0'="Non-Default",
  '1'="Default")
glimpse(df)
```
Column {data-width=500}
---
### Variable Definitions
age: Person age in years
income: Annual income in USD
home_ownership: Home ownership classifications on current home
employment_length: Employment length in years
loan_intent: What the loan is being used for
loan_amnt: Total amount of the loan
loan_int_rate: Interest rate of the loan in %
loan_status: Whether or not the person is in default on their loan
loan_percent_income: Loan amount as a proportion of the person's annual income (0.59 means 59%)
default_history: If a person has ever defaulted on a loan before or not
credit_history_length: Length of credit history in years
Outliers
===
Column {data-width=500}
---
### Quantitative Data
Before starting the analysis, we need to check that the dataset contains no values that are implausible or impossible. We will look at the five-number summary of each quantitative variable to check for problems.
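For reference, `fivenum()` returns Tukey's five-number summary in a fixed order: minimum, lower hinge, median, upper hinge, maximum. The following is a minimal sketch on made-up ages (not values from the dataset) that attaches those labels, which makes an implausible maximum easy to spot:

```r
# Made-up ages for illustration; 144 stands in for an implausible entry
ages <- c(20, 23, 26, 30, 144)

# fivenum() returns min, lower hinge, median, upper hinge, max in that order
summary_ages <- setNames(
  fivenum(ages),
  c("min", "lower_hinge", "median", "upper_hinge", "max")
)

print(summary_ages)
```

The rows `[1,]` to `[5,]` in the summary table follow the same order.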
```{r outliers}
sapply(df[c('age','income','employment_length','loan_amnt','loan_int_rate','loan_percent_income','credit_history_length')],fivenum)
```
This table shows some age and employment length values that don't make sense: the maximum age is 144 and the maximum employment length is 123 years. Since the oldest person ever recorded died at 122, it is safe to remove ages over 122. Assuming that even the oldest person alive didn't start working before age 10, employment lengths over 112 years can be removed as well. We will remove the rows with these implausible values.
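```{r remove_data}
# Keep plausible rows: age at most 122 and employment length at most 112 years.
# filter() also drops any row where a condition evaluates to NA (a missing value).
df.keep<-df %>%
  filter(age<=122,employment_length<=112)
# Everything that was not kept, so the "Removed Values" table matches the filter above
df.removed<-df %>%
  filter(is.na(age) | is.na(employment_length) | age>122 | employment_length>112)
sapply(df.keep[c('age','income','employment_length','loan_amnt','loan_int_rate','loan_percent_income','credit_history_length')],fivenum)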
```
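One detail of this step is easy to miss: `dplyr::filter()` keeps a row only when every condition is `TRUE`, and a comparison against a missing value gives `NA`, so rows with a missing age or employment length are dropped along with the implausible ones. This may account for the row count falling from 32,581 to 31,679. A minimal sketch on a hypothetical four-row data frame (the values are made up, not taken from the file):

```r
library(dplyr)

# Hypothetical mini data set for illustration only
toy <- data.frame(
  age               = c(25, 144, 30, 40),
  employment_length = c(3, 4, NA, 123)
)

# Row 2 fails the age check, row 4 fails the employment check,
# and row 3 is dropped because NA <= 112 evaluates to NA
kept <- toy %>%
  filter(age <= 122, employment_length <= 112)

# Everything that did not survive the filter, including the NA row
removed <- anti_join(toy, kept, by = c("age", "employment_length"))

print(kept)
print(removed)
```

Here only the first row is kept and the other three end up in `removed`.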
Now that the quantitative data looks plausible, it can be used in the study.
Column {data-width=500}
---
### Categorical Data
Next we check the categorical data to make sure every category in the dataset is plausible.
```{r}
dplyr::count(df.keep,home_ownership)
dplyr::count(df.keep,loan_intent)
dplyr::count(df.keep,loan_status)
dplyr::count(df.keep,default_history)
```
All of the categorical data looks plausible, so we can continue to the study.
Kept and Removed Values
===
Column {data-width=500}
---
### Kept Values
```{r}
DT::datatable(df.keep)
```
Column {data-width=500}
---
### Removed Values
```{r}
DT::datatable(df.removed)
```
Categorical Variables
===
Column {.tabset data-width=500}
---
First we will look at the categorical variables and see how loan status differs across their categories.
### Home Ownership
```{r}
g1<-ggplot(df.keep,aes(x=home_ownership,fill=loan_status))+
geom_bar(width=0.75,position="fill")+
scale_y_continuous(breaks=seq(0,1,by=0.2),labels=scales::percent)+
labs(y="Percent",x="Home Ownership",fill="Loan Status",title="Home Ownership vs. Loan Status")+
theme(text=element_text(size=15),axis.text.x=element_text(angle=90),legend.position = "top")
# Convert to an interactive plot and show only the y value on hover
ggplotly(g1) %>%
  style(hoverinfo="y")
```
### Loan Intent
```{r}
g2<-ggplot(df.keep,aes(x=loan_intent,fill=loan_status))+
geom_bar(width=0.75,position="fill")+
scale_y_continuous(breaks=seq(0,1,by=0.2),labels=scales::percent)+
labs(y="Percent",x="Loan Intent",fill="Loan Status",title="Loan Intent vs. Loan Status")+
theme(text=element_text(size=15),axis.text.x=element_text(angle=90),legend.position = "top")
ggplotly(g2) %>%
  style(hoverinfo="y")
```
### Default History
```{r}
g3<-ggplot(df.keep,aes(x=default_history,fill=loan_status))+
geom_bar(width=0.75,position="fill")+
scale_y_continuous(breaks=seq(0,1,by=0.2),labels=scales::percent)+
labs(y="Percent",x="Default History",fill="Loan Status",title="Default History vs. Loan Status")+
theme(text=element_text(size=20),legend.position = "top")
ggplotly(g3) %>%
  style(hoverinfo="y")
```
Column {data-width=500}
---
### Explanation
From these graphs we can see that:
- Home ownership and default history are the categorical variables most strongly associated with loan status
- One possible explanation is that people who own their home manage their money well enough to make their mortgage payment every month and keep their house. Similarly, a past default suggests that a person has had trouble managing money to pay off loans
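The stacked bars show the share of defaults within each category. The same shares can be computed as numbers, which makes the categories easier to compare. The following is a minimal sketch on a made-up data frame that reuses the column names from this dashboard; the rows are hypothetical:

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- data.frame(
  home_ownership = c("RENT", "RENT", "RENT", "RENT", "OWN", "OWN", "OWN", "OWN"),
  loan_status    = c("Default", "Default", "Non-Default", "Non-Default",
                     "Default", "Non-Default", "Non-Default", "Non-Default")
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the default rate within each category
default_rates <- toy %>%
  group_by(home_ownership) %>%
  summarise(n = n(), default_rate = mean(loan_status == "Default"))

print(default_rates)
```

Swapping `home_ownership` for `loan_intent` or `default_history` would give the corresponding table for the other two charts.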
Numeric Variables
===
Column {.tabset data-width=500}
---
Now we will look at the quantitative variables and see how each one relates to whether a person defaults on their loan.
### Age
```{r}
g4<-ggplot(df.keep,aes(x=loan_status,y=age,fill=loan_status))+
geom_boxplot()+
labs(y="Age",x="Loan Status",fill="Loan Status",title="Age vs. Loan Status")+
theme(text=element_text(size=15),legend.position = "none")
# Convert to an interactive plot and show only the y value on hover
ggplotly(g4) %>%
  style(hoverinfo="y")
```
### Income
```{r}
g5<-ggplot(df.keep,aes(x=loan_status,y=income,fill=loan_status))+
geom_boxplot(outlier.shape=NA)+
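# coord_cartesian() zooms the axis without dropping incomes above 150,000 from the box statistics
coord_cartesian(ylim=c(0,150000))+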
labs(y="Income",x="Loan Status",fill="Loan Status",title="Income vs. Loan Status")+
theme(text=element_text(size=15),legend.position = "none")
ggplotly(g5) %>%
  style(hoverinfo="y")
```
### Employment Length
```{r}
g6<-ggplot(df.keep,aes(x=loan_status,y=employment_length,fill=loan_status))+
geom_boxplot()+
labs(y="Employment Length",x="Loan Status",fill="Loan Status",title="Employment Length vs. Loan Status")+
theme(text=element_text(size=15),legend.position = "none")
ggplotly(g6) %>%
  style(hoverinfo="y")
```
### Loan Amount
```{r}
g7<-ggplot(df.keep,aes(x=loan_status,y=loan_amnt,fill=loan_status))+
geom_boxplot()+
labs(y="Loan Amount",x="Loan Status",fill="Loan Status",title="Loan Amount vs. Loan Status")+
theme(text=element_text(size=15),legend.position = "none")
ggplotly(g7) %>%
  style(hoverinfo="y")
```
### Loan Interest Rate
```{r}
g8<-ggplot(df.keep,aes(x=loan_status,y=loan_int_rate,fill=loan_status))+
geom_boxplot()+
labs(y="Interest Rate",x="Loan Status",fill="Loan Status",title="Loan Interest Rate vs. Loan Status")+
theme(text=element_text(size=15),legend.position = "none")
ggplotly(g8) %>%
  style(hoverinfo="y")
```
### Loan Percent of Income
```{r}
g9<-ggplot(df.keep,aes(x=loan_status,y=loan_percent_income,fill=loan_status))+
geom_boxplot()+
labs(y="Loan Percent of Income",x="Loan Status",fill="Loan Status",title="Loan Percent of Income vs. Loan Status")+
theme(text=element_text(size=15),legend.position = "none")
ggplotly(g9) %>%
  style(hoverinfo="y")
```
### Credit History Length
```{r}
g10<-ggplot(df.keep,aes(x=loan_status,y=credit_history_length,fill=loan_status))+
geom_boxplot()+
labs(y="Credit History Length",x="Loan Status",fill="Loan Status",title="Credit History Length vs. Loan Status")+
theme(text=element_text(size=15),legend.position = "none")
ggplotly(g10) %>%
  style(hoverinfo="y")
```
Column {data-width=500}
---
### Explanation
From these graphs we can see:
- Several quantitative variables, such as age, employment length, and credit history length, make little difference to whether a person has defaulted on their loan
- Income, loan amount, loan interest rate, and loan percentage of income show a clear trend with loan status
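The centre line of each boxplot is the group median, so the comparison between the two loan status groups can also be read off a small table. A minimal sketch, assuming dplyr 1.0 or later for `across()`, on made-up values that are not taken from the dataset:

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- data.frame(
  loan_status = c("Default", "Default", "Default",
                  "Non-Default", "Non-Default", "Non-Default"),
  income              = c(30000, 40000, 55000, 45000, 65000, 90000),
  loan_percent_income = c(0.20, 0.25, 0.40, 0.05, 0.10, 0.20)
)

# One row per loan status, one median per selected variable
medians <- toy %>%
  group_by(loan_status) %>%
  summarise(across(c(income, loan_percent_income), median))

print(medians)
```

Adding more column names inside `c(...)` extends the table to the other quantitative variables.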
EDA
===
Column {.tabset data-width=500}
---
Now we will look at pairs of numerical variables and how they relate to loan status.
### Age vs. Income
```{r}
ggplot(df.keep,aes(x=age,y=income,color=loan_status))+
geom_point()+
labs(y="Income",x="Age",color="Loan Status",title="Age vs. Income and Loan Status")+
theme(text=element_text(size=15),legend.position = "top")
```
### Loan Interest Rate vs. Loan Amount
```{r}
ggplot(df.keep,aes(x=loan_int_rate,y=loan_amnt,color=loan_status))+
geom_point()+
labs(y="Loan Amount",x="Loan Interest Rate",color="Loan Status",title="Loan Interest Rate vs. Loan Amount")+
theme(text=element_text(size=15),legend.position = "top")
```
### Loan Interest Rate vs. Loan Percent of Income
```{r}
ggplot(df.keep,aes(x=loan_int_rate,y=loan_percent_income,color=loan_status))+
geom_point()+
labs(y="Loan Percent of Income",x="Loan Interest Rate",color="Loan Status",title="Loan Interest Rate vs. Loan Percent of Income")+
theme(text=element_text(size=15),legend.position = "top")
```
Column {data-width=500}
---
### Explanation
From these graphs we can see that:
- Younger people with lower incomes default on their loans much more often than older people with higher incomes
- Loan amount makes little difference to whether people default, but loan interest rate makes a large difference
- People whose loan is a low percentage of their income and has a low interest rate rarely default
- People whose loan is a high percentage of their income and has a high interest rate often default
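Scatter plots with tens of thousands of points overlap heavily, so a pattern like the last two bullets can be cross-checked by cutting both variables into bands and computing the default rate in each cell. The following is a rough sketch on made-up rows. The cut points (10% interest, 0.2 of income) are arbitrary assumptions chosen for illustration, not thresholds from this study:

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- data.frame(
  loan_int_rate       = c(6, 7, 8, 9, 14, 15, 16, 17),
  loan_percent_income = c(0.05, 0.10, 0.08, 0.12, 0.35, 0.40, 0.30, 0.45),
  loan_status         = c("Non-Default", "Non-Default", "Non-Default", "Default",
                          "Default", "Default", "Non-Default", "Default")
)

banded <- toy %>%
  mutate(
    # Assumed cut points; include.lowest keeps values equal to the lower bound
    rate_band   = cut(loan_int_rate, breaks = c(0, 10, 25),
                      labels = c("low rate", "high rate"), include.lowest = TRUE),
    burden_band = cut(loan_percent_income, breaks = c(0, 0.2, 1),
                      labels = c("low burden", "high burden"), include.lowest = TRUE)
  ) %>%
  group_by(rate_band, burden_band) %>%
  summarise(n = n(), default_rate = mean(loan_status == "Default"), .groups = "drop")

print(banded)
```

On the real data, the same table would show whether the high-rate, high-burden cell carries a markedly higher default rate than the low-rate, low-burden cell.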
Conclusion
===
Column {data-width=600}
---
### Conclusion
In this study we can conclude that some variables in this dataset do a fairly good job of indicating whether a person is likely to default on their loan. Among the categorical variables, people who own their home are much less likely to default than people who rent or are in an "other" living situation. Default history also shows a clear trend: people with a history of default are much more likely to default again. Among the quantitative variables, income, loan amount, loan interest rate, and loan percentage of income are all clearly related to default. The median income of people who defaulted is 42,000 USD annually, while the median income of people who did not default is considerably higher at 60,000 USD annually. People with a higher loan amount and a higher interest rate are more likely to default, plausibly because both make a person's monthly payment much higher. The median loan percentage of income is 24% for people who defaulted but only 13% for people who did not. This analysis shows that there is financial and lifestyle information a lender should review when a person applies for a loan to judge how likely they are to repay it.
Column {data-width=400}
---
### References
Find this dataset on [Kaggle](https://www.kaggle.com/datasets/laotse/credit-risk-dataset).
See data on the oldest person to ever live on [Wikipedia](https://en.wikipedia.org/wiki/Oldest_people).
### About the Author
My name is Evan Dolley.
I am a senior Mechanical Engineering major at the University of Dayton.
Connect with me on [LinkedIn](https://www.linkedin.com/in/evan-dolley-46624b180/).