Basic Information

Column

Trends in loan defaults based on financial and lifestyle information

Introduction

This study explores a dataset of financial and lifestyle information on over 31,000 people, with loan default as the response variable. We look at how each variable relates to whether a person has defaulted on a loan and identify the variables that best indicate whether a person is a reliable borrower.

Here is a glimpse of the data:
Rows: 32,581
Columns: 11
$ age                   <dbl> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22, 21, …
$ income                <dbl> 59000, 9600, 9600, 65500, 54400, 9900, 77100, 78…
$ home_ownership        <fct> RENT, OWN, MORTGAGE, RENT, RENT, OWN, RENT, RENT…
$ employment_length     <dbl> 123, 5, 1, 4, 8, 2, 8, 5, 8, 6, 6, 2, 2, 4, 2, 7…
$ loan_intent           <fct> PERSONAL, EDUCATION, MEDICAL, MEDICAL, MEDICAL, …
$ loan_amnt             <dbl> 35000, 1000, 5500, 35000, 35000, 2500, 35000, 35…
$ loan_int_rate         <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12.42, …
$ loan_status           <fct> Default, Non-Default, Default, Default, Default,…
$ loan_percent_income   <dbl> 0.59, 0.10, 0.57, 0.53, 0.55, 0.25, 0.45, 0.44, …
$ default_history       <fct> Y, N, N, N, Y, N, N, N, N, N, N, N, N, N, N, N, …
$ credit_history_length <dbl> 3, 2, 3, 2, 4, 2, 3, 4, 2, 3, 4, 2, 2, 4, 4, 3, …

Column

Variable Definitions

age: Person age in years

income: Annual income in USD

home_ownership: Home ownership status for the current home (RENT, OWN, MORTGAGE, or OTHER)

employment_length: Employment length in years

loan_intent: What the loan is being used for

loan_amnt: Total amount of the loan

loan_int_rate: Interest rate of the loan in %

loan_status: Whether or not the person is in default on their loan

loan_percent_income: Loan amount as a share of the person's annual income, stored as a proportion (0.59 means 59%)

default_history: Whether the person has previously defaulted on a loan (Y or N)

credit_history_length: Length of credit history in years

Outliers

Column

Quantitative Data

Before starting the analysis, we need to check that the dataset contains no values that are implausible or impossible. We look at the five-number summary of each quantitative variable; the rows of the table are the minimum, lower hinge, median, upper hinge, and maximum.

     age  income employment_length loan_amnt loan_int_rate loan_percent_income
[1,]  20    4000                 0       500          5.42                0.00
[2,]  23   38500                 2      5000          7.90                0.09
[3,]  26   55000                 4      8000         10.99                0.15
[4,]  30   79200                 7     12200         13.47                0.23
[5,] 144 6000000               123     35000         23.22                0.83
     credit_history_length
[1,]                     2
[2,]                     3
[3,]                     4
[4,]                     8
[5,]                    30

From this table we can see that some age and employment length values do not make sense: the maximum age is 144 and the maximum employment length is 123 years. Since the oldest person to ever live died at 122 years old, it is safe to remove ages over 122. If we also assume that even the oldest person alive did not start working until they were 10 years old, employment lengths over 112 years can be removed as well. Rows where age or employment length is missing are also dropped by this filter.

     age  income employment_length loan_amnt loan_int_rate loan_percent_income
[1,]  20    4000                 0       500          5.42                0.00
[2,]  23   39366                 2      5000          7.90                0.09
[3,]  26   56000                 4      8000         10.99                0.15
[4,]  30   80000                 7     12500         13.48                0.23
[5,]  94 2039784                41     35000         23.22                0.83
     credit_history_length
[1,]                     2
[2,]                     3
[3,]                     4
[4,]                     8
[5,]                    30

Now that the quantitative data looks plausible, it can be used in the study.

Column

Categorical Data

Now we need to look at our categorical data and make sure all of the data in the dataset is plausible.
# A tibble: 4 × 2
  home_ownership     n
  <fct>          <int>
1 RENT           16072
2 OWN             2410
3 MORTGAGE       13090
4 OTHER            107
# A tibble: 6 × 2
  loan_intent           n
  <fct>             <int>
1 PERSONAL           5367
2 EDUCATION          6288
3 MEDICAL            5897
4 VENTURE            5553
5 HOMEIMPROVEMENT    3510
6 DEBTCONSOLIDATION  5064
# A tibble: 2 × 2
  loan_status     n
  <fct>       <int>
1 Default      6825
2 Non-Default 24854
# A tibble: 2 × 2
  default_history     n
  <fct>           <int>
1 Y                5628
2 N               26051

All of the categorical data looks good, so we can continue to the study.

Kept and Removed Values

Column

Kept Values

Column

Removed Values

Categorical Variables

column

First we will look at the categorical data and see how the different categories impact whether people have defaulted on their loans or not.

Home Ownership

Loan Intent

Default History

column

Explanation

From these graphs we can see that:

  • Home ownership and default history are the categorical variables with the clearest relationship to loan status

  • One possible explanation is that people who own their home already manage their money well enough to make a mortgage payment every month and keep their house, while a past default suggests that a person has had trouble managing money to pay off loans

Numeric Variables

column

Now we will look at the trends of the quantitative data and see how the different variables affect if a person defaults on their loan or not.

Age

Income

Employment Length

Loan Amount

Loan Interest Rate

Loan Percent of Income

Credit History Length

column

Explanation

From these graphs we can see:

  • Age, employment length, and credit history length make little difference to whether a person has defaulted on their loan

  • Income, loan amount, loan interest rate, and loan percent of income show a noticeable difference between people who have defaulted and people who have not

EDA

column

Now we will look at how pairs of numerical variables relate to each other and to loan status.

Age vs. Income

Loan Interest Rate vs. Loan Amount

Loan Interest Rate vs. Loan Percent of Income

column

Explanation

From these graphs we can see that:

  • Younger people with lower incomes default on their loans much more often than older people with higher incomes

  • In this plot, loan amount does little to separate defaults from non-defaults, but loan interest rate makes a large difference

  • People whose loan is a low percentage of their income and has a low interest rate rarely default on their loans

  • People whose loan is a high percentage of their income and has a high interest rate often default on their loans

Conclusion

Column

Conclusion

We can conclude that several variables in this data set do a good job of indicating whether a person is likely to default on their loan.

Among the categorical variables, a person who owns their home is much less likely to default than a person who rents their home or is in an “other” living situation. Default history shows a similarly clear pattern: a person who has defaulted before is much more likely to default again.

Among the quantitative variables, income, loan amount, loan interest rate, and loan percent of income are all related to default. The median income of people who have defaulted is 42,000 USD annually, compared with 60,000 USD annually for people who have not. People with a higher loan amount and a higher interest rate are more likely to default, plausibly because both make a person's monthly payment higher. The median loan percent of income is 24% for people who have defaulted and only 13% for people who have not.

Overall, this analysis shows that financial and lifestyle information is worth reviewing when a person applies for a loan, to judge how likely they are to repay it.

Column

References

Find this data set on Kaggle.

See data on the oldest person to ever live here.

About the Author

My name is Evan Dolley.

I am a senior Mechanical Engineering major at the University of Dayton.

Connect with me on LinkedIn.

---
title: "Loan Default Trends"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: yeti
      navbar-bg: "purple"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

<style>
.chart-title {  /* chart_title  */
   font-size: 16px;
  }
body{ /* Normal  */
      font-size: 18px;
  }
</style>


```{r setup, include=FALSE}
library(flexdashboard)
```

Basic Information
===

Column {data-width=500}
---
***Trends in loan defaults based on financial and lifestyle information***

### Introduction

This study explores a dataset of financial and lifestyle information on over 31,000 people, with loan default as the response variable. We look at how each variable relates to whether a person has defaulted on a loan and identify the variables that best indicate whether a person is a reliable borrower.

<head>
  <base target="_blank">
</head>

Here is a glimpse of the data:
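```{r package_data}
# Load (and install if missing) the packages used in this dashboard
library(pacman)
p_load(cowplot,dplyr,DT,gridExtra,plotly,tidyverse, maps, viridis)

# Read the data; col_types gives one letter per column (n = number, f = factor)
df<-read_csv("Credit_Risk_Data_Set.csv",col_types="nnfnfnnfnfn")

# Shorten the column names
df<-df %>% 
  rename("age"="person_age",
         "income"="person_income",
         "home_ownership"="person_home_ownership",
         "employment_length"="person_emp_length",
         "default_history"="cb_person_default_on_file",
         "credit_history_length"="cb_person_cred_hist_length")

# Relabel the response: 0 = Non-Default, 1 = Default
df$loan_status<-df$loan_status %>% recode(
  '0'="Non-Default",
  '1'="Default")

# Preview the column types and the first few values
glimpse(df)
```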

Column {data-width=500}
---

### Variable Definitions

age: Person age in years

income: Annual income in USD

home_ownership: Home ownership status for the current home (RENT, OWN, MORTGAGE, or OTHER)

employment_length: Employment length in years

loan_intent: What the loan is being used for

loan_amnt: Total amount of the loan

loan_int_rate: Interest rate of the loan in %

loan_status: Whether or not the person is in default on their loan

loan_percent_income: Loan amount as a share of the person's annual income, stored as a proportion (0.59 means 59%)

default_history: Whether the person has previously defaulted on a loan (Y or N)

credit_history_length: Length of credit history in years
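
As a quick check on the `loan_percent_income` definition, the first three rows in the glimpse are consistent with `loan_amnt` divided by `income`, rounded to two decimals. The sketch below recomputes those three values by hand. That this ratio is how the column was built is an assumption inferred from those rows, not something stated by the data source.

```r
# First three rows of loan_amnt and income, copied from the glimpse
loan_amnt <- c(35000, 1000, 5500)
income    <- c(59000, 9600, 9600)

# Assumed relationship: loan amount as a share of annual income
ratio <- round(loan_amnt / income, 2)
print(ratio)
```

The result can be compared against the first three `loan_percent_income` values shown in the glimpse.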


Outliers
===

Column {data-width=500}
---

### Quantitative Data 

Before starting the analysis, we need to check that the dataset contains no values that are implausible or impossible. We look at the five-number summary of each quantitative variable; the rows of the table are the minimum, lower hinge, median, upper hinge, and maximum.
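
Because `fivenum()` returns an unnamed vector, the rows of the summary print as `[1,]` to `[5,]`. A minimal sketch of how the five values could be labeled, using a made-up vector of ages (the values and the label names are illustrative, not taken from the data set):

```r
# Hypothetical ages, including one implausible value
ages <- c(20, 23, 26, 30, 144)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(ages)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)
```

An implausible value such as an age of 144 shows up immediately in the `max` position.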

```{r outliers}
sapply(df[c('age','income','employment_length','loan_amnt','loan_int_rate','loan_percent_income','credit_history_length')],fivenum)
```

From this table we can see that some age and employment length values do not make sense: the maximum age is 144 and the maximum employment length is 123 years. Since the oldest person to ever live died at 122 years old, it is safe to remove ages over 122. If we also assume that even the oldest person alive did not start working until they were 10 years old, employment lengths over 112 years can be removed as well. Rows where age or employment length is missing are also dropped by this filter.

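```{r remove_data}
# Keep rows with a plausible age and employment length.
# filter() also drops rows where either value is missing (NA).
df.keep<-df %>% 
  filter(age<122,employment_length<112)
# Removed rows are everything that did not pass the filter above
df.removed<-df %>% 
  filter(is.na(age)|is.na(employment_length)|age>=122|employment_length>=112)
sapply(df.keep[c('age','income','employment_length','loan_amnt','loan_int_rate','loan_percent_income','credit_history_length')],fivenum)
```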
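
One thing to be aware of is that `dplyr::filter()` keeps only rows where every condition is `TRUE`, so a row with a missing age or employment length is dropped along with the implausible ones. The sketch below shows this on a small made-up tibble (the `id` column and all values are hypothetical) and uses `anti_join()` to collect every row that did not survive the filter.

```r
library(dplyr)

# Hypothetical rows: one bad age, one bad employment length, one missing value
toy <- tibble(
  id                = 1:5,
  age               = c(22, 144, 30, 25, 40),
  employment_length = c(3, 5, 123, NA, 10)
)

# Same thresholds as the dashboard filter; the NA row fails the test and is dropped
toy_keep <- toy %>%
  filter(age < 122, employment_length < 112)

# Everything in toy that is not in toy_keep
toy_removed <- anti_join(toy, toy_keep, by = "id")

print(toy_keep$id)
print(toy_removed$id)
```

Building the removed set as the complement of the kept set guarantees that the two tables together account for every original row.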

Now that the quantitative data looks plausible, it can be used in the study.

Column {data-width=500}
---

### Categorical Data

Now we need to look at our categorical data and make sure all of the data in the dataset is plausible.
```{r}
dplyr::count(df.keep,home_ownership)
dplyr::count(df.keep,loan_intent)
dplyr::count(df.keep,loan_status)
dplyr::count(df.keep,default_history)
```

All of the categorical data looks good, so we can continue to the study.


Kept and Removed Values
===

Column {data-width=500}
---

### Kept Values
```{r}
DT::datatable(df.keep)
```

Column {data-width=500}
---

### Removed Values
```{r}
DT::datatable(df.removed)
```

Categorical Variables
===

column{.tabset data-width=500}
---

  First we will look at the categorical data and see how the different categories impact whether people have defaulted on their loans or not.

### Home Ownership
```{r}
g1<-ggplot(df.keep,aes(x=home_ownership,fill=loan_status))+
  geom_bar(width=0.75,position="fill")+
  scale_y_continuous(breaks=seq(0,1,by=0.2),labels=scales::percent)+
  labs(y="Percent",x="Home Ownership",fill="Loan Status",title="Home Ownership vs. Loan Status")+
  theme(text=element_text(size=15),axis.text.x=element_text(angle=90),legend.position = "top")
g1<-g1 %>% 
  style(hoverinfo="y")
ggplotly(g1)
```

### Loan Intent
```{r}
g2<-ggplot(df.keep,aes(x=loan_intent,fill=loan_status))+
  geom_bar(width=0.75,position="fill")+
  scale_y_continuous(breaks=seq(0,1,by=0.2),labels=scales::percent)+
  labs(y="Percent",x="Loan Intent",fill="Loan Status",title="Loan Intent vs. Loan Status")+
  theme(text=element_text(size=15),axis.text.x=element_text(angle=90),legend.position = "top")
g2<-g2 %>% 
  style(hoverinfo="y")
ggplotly(g2)
```

### Default History
```{r}
g3<-ggplot(df.keep,aes(x=default_history,fill=loan_status))+
  geom_bar(width=0.75,position="fill")+
  scale_y_continuous(breaks=seq(0,1,by=0.2),labels=scales::percent)+
  labs(y="Percent",x="Default History",fill="Loan Status",title="Default History vs. Loan Status")+
  theme(text=element_text(size=20),legend.position = "top")
g3<-g3 %>% 
  style(hoverinfo="y")
ggplotly(g3)
```


column{data-width=500}
---
### Explanation

From these graphs we can see that:

- Home ownership and default history are the categorical variables with the clearest relationship to loan status

- One possible explanation is that people who own their home already manage their money well enough to make a mortgage payment every month and keep their house, while a past default suggests that a person has had trouble managing money to pay off loans
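
The proportions shown in the stacked bars can also be computed directly as a default rate per category. The sketch below does this on a small made-up tibble: the rows are hypothetical and only the column names match `df.keep`, so the numbers it prints are not results from the real data.

```r
library(dplyr)

# Hypothetical rows, for illustration only
toy <- tibble(
  home_ownership = c(rep("RENT", 4), rep("MORTGAGE", 4), rep("OWN", 2)),
  loan_status    = c("Default", "Default", "Non-Default", "Non-Default",
                     "Default", "Non-Default", "Non-Default", "Non-Default",
                     "Non-Default", "Non-Default")
)

# Share of loans in default within each home ownership category
default_rates <- toy %>%
  group_by(home_ownership) %>%
  summarise(n = n(), default_rate = mean(loan_status == "Default")) %>%
  arrange(desc(default_rate))

print(default_rates)
```

Swapping `home_ownership` for `loan_intent` or `default_history` would give the same kind of table for the other two graphs.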

Numeric Variables
===

column{.tabset data-width=500}
---

Now we will look at the trends of the quantitative data and see how the different variables affect if a person defaults on their loan or not.

### Age
```{r}
g4<-ggplot(df.keep,aes(x=loan_status,y=age,fill=loan_status))+
  geom_boxplot()+
  labs(y="Age",x="Loan Status",fill="Loan Status",title="Age vs. Loan Status")+
  theme(text=element_text(size=15),legend.position = "none")
g4<-g4 %>% 
  style(hoverinfo="y")
ggplotly(g4)
```

### Income
```{r}
g5<-ggplot(df.keep,aes(x=loan_status,y=income,fill=loan_status))+
  geom_boxplot(outlier.shape=NA)+
  ylim(c(0,150000))+
  labs(y="Income",x="Loan Status",fill="Loan Status",title="Income vs. Loan Status")+
  theme(text=element_text(size=15),legend.position = "none")
g5<-g5 %>% 
  style(hoverinfo="y")
ggplotly(g5)
```

### Employment Length
```{r}
g6<-ggplot(df.keep,aes(x=loan_status,y=employment_length,fill=loan_status))+
  geom_boxplot()+
  labs(y="Employment Length",x="Loan Status",fill="Loan Status",title="Employment Length vs. Loan Status")+
  theme(text=element_text(size=15),legend.position = "none")
g6<-g6 %>% 
  style(hoverinfo="y")
ggplotly(g6)
```

### Loan Amount
```{r}
g7<-ggplot(df.keep,aes(x=loan_status,y=loan_amnt,fill=loan_status))+
  geom_boxplot()+
  labs(y="Loan Amount",x="Loan Status",fill="Loan Status",title="Loan Amount vs. Loan Status")+
  theme(text=element_text(size=15),legend.position = "none")
g7<-g7 %>% 
  style(hoverinfo="y")
ggplotly(g7)
```

### Loan Interest Rate
```{r}
g8<-ggplot(df.keep,aes(x=loan_status,y=loan_int_rate,fill=loan_status))+
  geom_boxplot()+
  labs(y="Interest Rate",x="Loan Status",fill="Loan Status",title="Loan Interest Rate vs. Loan Status")+
  theme(text=element_text(size=15),legend.position = "none")
g8<-g8 %>% 
  style(hoverinfo="y")
ggplotly(g8)
```


### Loan Percent of Income
```{r}
g9<-ggplot(df.keep,aes(x=loan_status,y=loan_percent_income,fill=loan_status))+
  geom_boxplot()+
  labs(y="Loan Percent of Income",x="Loan Status",fill="Loan Status",title="Loan Percent of Income vs. Loan Status")+
  theme(text=element_text(size=15),legend.position = "none")
g9<-g9 %>% 
  style(hoverinfo="y")
ggplotly(g9)
```

### Credit History Length
```{r}
g10<-ggplot(df.keep,aes(x=loan_status,y=credit_history_length,fill=loan_status))+
  geom_boxplot()+
  labs(y="Credit History Length",x="Loan Status",fill="Loan Status",title="Credit History Length vs. Loan Status")+
  theme(text=element_text(size=15),legend.position = "none")
g10<-g10 %>% 
  style(hoverinfo="y")
ggplotly(g10)
```

column{data-width=500}
---
### Explanation


From these graphs we can see:

- Age, employment length, and credit history length make little difference to whether a person has defaulted on their loan

- Income, loan amount, loan interest rate, and loan percent of income show a noticeable difference between people who have defaulted and people who have not
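
The center line of each boxplot is the group median, which can also be tabulated to put a number on these differences. A minimal sketch on a made-up tibble (all values are hypothetical, and only the column names match `df.keep`):

```r
library(dplyr)

# Hypothetical rows, for illustration only
toy <- tibble(
  loan_status         = c("Default", "Default", "Default",
                          "Non-Default", "Non-Default", "Non-Default"),
  income              = c(30000, 40000, 50000, 50000, 65000, 90000),
  loan_percent_income = c(0.35, 0.25, 0.20, 0.08, 0.12, 0.18)
)

# Median of each numeric variable within each loan status group
medians <- toy %>%
  group_by(loan_status) %>%
  summarise(across(c(income, loan_percent_income), median))

print(medians)
```

Adding more column names inside `across()` would extend the table to the other numeric variables.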

EDA
===

column{.tabset data-width=500}
---
Now we will look at how pairs of numerical variables relate to each other and to loan status.

### Age vs. Income
```{r}
ggplot(df.keep,aes(x=age,y=income,color=loan_status))+
  geom_point()+
  labs(y="Income",x="Age",color="Loan Status",title="Age vs. Income and Loan Status")+
  theme(text=element_text(size=15),legend.position = "top")
```

### Loan Interest Rate vs. Loan Amount
```{r}
ggplot(df.keep,aes(x=loan_int_rate,y=loan_amnt,color=loan_status))+
  geom_point()+
  labs(y="Loan Amount",x="Loan Interest Rate",color="Loan Status",title="Loan Interest Rate vs. Loan Amount")+
  theme(text=element_text(size=15),legend.position = "top")
```

### Loan Interest Rate vs. Loan Percent of Income
```{r}
ggplot(df.keep,aes(x=loan_int_rate,y=loan_percent_income,color=loan_status))+
  geom_point()+
  labs(y="Loan Percent of Income",x="Loan Interest Rate",color="Loan Status",title="Loan Interest Rate vs. Loan Percent of Income")+
  theme(text=element_text(size=15),legend.position = "top")
```

column{data-width=500}
---
### Explanation

From these graphs we can see that:

- Younger people with lower incomes default on their loans much more often than older people with higher incomes

- In this plot, loan amount does little to separate defaults from non-defaults, but loan interest rate makes a large difference

- People whose loan is a low percentage of their income and has a low interest rate rarely default on their loans

- People whose loan is a high percentage of their income and has a high interest rate often default on their loans
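
The last two points describe an interaction between interest rate and loan percent of income, which could be checked by splitting each variable into a low and a high band and computing the default rate in each combination. The sketch below does this on made-up rows. The cutoffs (12% interest, 0.25 of income) and the band labels are arbitrary choices for illustration, not values taken from the analysis.

```r
library(dplyr)

# Hypothetical rows, for illustration only
toy <- tibble(
  loan_int_rate       = c(6.5, 7.2, 8.0, 15.5, 16.0, 17.3, 7.0, 16.5),
  loan_percent_income = c(0.08, 0.10, 0.12, 0.40, 0.45, 0.50, 0.42, 0.09),
  loan_status         = c("Non-Default", "Non-Default", "Non-Default",
                          "Default", "Default", "Non-Default",
                          "Default", "Non-Default")
)

# Band each variable with an arbitrary cutoff, then compute the default rate per cell
banded <- toy %>%
  mutate(
    rate_band = if_else(loan_int_rate >= 12, "high rate", "low rate"),
    pct_band  = if_else(loan_percent_income >= 0.25, "high share", "low share")
  ) %>%
  group_by(rate_band, pct_band) %>%
  summarise(n = n(), default_rate = mean(loan_status == "Default"), .groups = "drop")

print(banded)
```

On the real data, quantile-based cutoffs may be a more defensible choice than fixed ones.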


Conclusion
===

Column {data-width=600}
---
### Conclusion

We can conclude that several variables in this data set do a good job of indicating whether a person is likely to default on their loan.

Among the categorical variables, a person who owns their home is much less likely to default than a person who rents their home or is in an "other" living situation. Default history shows a similarly clear pattern: a person who has defaulted before is much more likely to default again.

Among the quantitative variables, income, loan amount, loan interest rate, and loan percent of income are all related to default. The median income of people who have defaulted is 42,000 USD annually, compared with 60,000 USD annually for people who have not. People with a higher loan amount and a higher interest rate are more likely to default, plausibly because both make a person's monthly payment higher. The median loan percent of income is 24% for people who have defaulted and only 13% for people who have not.

Overall, this analysis shows that financial and lifestyle information is worth reviewing when a person applies for a loan, to judge how likely they are to repay it.

Column {data-width=400}
---

### References

Find this data set on [Kaggle](https://www.kaggle.com/datasets/laotse/credit-risk-dataset).

See data on the oldest person to ever live [here](https://en.wikipedia.org/wiki/Oldest_people).

### About the Author
My name is Evan Dolley. 

I am a senior Mechanical Engineering major at the University of Dayton. 

Connect with me on [LinkedIn](https://www.linkedin.com/in/evan-dolley-46624b180/).