Statistical Analysis with Linear Regression on Insurance Data
Problem Statement
Given a person’s attributes (age, sex, BMI, smoking status, and so on), we want to predict their insurance cost.
Data description
- age is a numeric variable
- sex is a binary variable (male or female)
- bmi (body mass index) is a real number
- children is the number of children a person has
- smoker is a binary variable
- region is a categorical variable
- charges is the dependent variable (y), the insurance cost
Importing required libraries and loading dataset
Data Analysis
Distribution of Insurance cost: Right Skewed
How many people smoke?
Most people in the dataset do not smoke.
Effect of age on insurance cost: smokers vs. non-smokers
- Overall, age and insurance cost are positively correlated for both smokers and non-smokers: as age increases, insurance cost also tends to increase.
Effect of smoking on insurance cost
- The first graph shows that the average cost for non-smokers (around 9000) is far lower than for smokers (more than 30000).
- The second graph shows the overall distribution of insurance charges for smokers and non-smokers, with some outliers in the non-smoker category.
Effect of BMI on insurance cost based on smoking habit
- For non-smokers, the correlation between BMI and insurance cost is almost 0, which suggests that a non-smoker’s insurance cost changes little even when his/her BMI is high.
- For smokers, however, there is a strong positive correlation between BMI and insurance cost, which suggests that a smoker’s insurance cost tends to increase as his/her BMI increases.
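One way to check these two observations is to compute the Pearson correlation between BMI and charges separately for each smoking group. The sketch below uses a handful of invented rows (not the real dataset) and a hypothetical helper named `pearson`; with the actual data, the same grouping would be applied to the bmi, smoker and charges columns.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative rows only: (bmi, smoker, charges). These values are invented.
rows = [
    (22.0, "no", 8200), (27.5, "no", 7900), (31.0, "no", 8400), (36.0, "no", 8000),
    (22.0, "yes", 19000), (27.5, "yes", 24000), (31.0, "yes", 36000), (36.0, "yes", 42000),
]

# Correlation between bmi and charges, computed within each smoking group.
corr_by_group = {}
for group in ("no", "yes"):
    bmi = [r[0] for r in rows if r[1] == group]
    cost = [r[2] for r in rows if r[1] == group]
    corr_by_group[group] = pearson(bmi, cost)
    print(f"smoker={group}: correlation(bmi, charges) = {corr_by_group[group]:.2f}")
```

On these toy rows the non-smoker correlation comes out close to 0 and the smoker correlation close to 1, which is the pattern described in the bullets.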
Distribution of smokers among the different age groups
Distribution of smokers among males and females
Distribution of people across BMI values
Converting BMI into BMI categories
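A minimal sketch of how a numeric BMI can be mapped to a category label. The cut-offs below are assumed WHO-style thresholds, and the labels other than O.W., N and Mild T. are hypothetical; the original analysis may have used different bins and names.

```python
from bisect import bisect_right

# Assumed cut-offs (WHO-style). The bins used in the original analysis are not stated.
BMI_EDGES = [16.0, 17.0, 18.5, 25.0, 30.0]
# "Mild T.", "N" and "O.W." appear in the text; the other labels are hypothetical.
BMI_LABELS = ["Severe T.", "Moderate T.", "Mild T.", "N", "O.W.", "Obese"]

def bmi_category(bmi):
    """Return the label of the bin that contains the given BMI value."""
    # bisect_right counts how many edges are <= bmi, which is the bin index.
    return BMI_LABELS[bisect_right(BMI_EDGES, bmi)]

for value in (17.9, 22.0, 27.5, 33.0):
    print(value, "->", bmi_category(value))
```

Each lower edge is inclusive, so under these assumed thresholds a BMI of exactly 25.0 falls into O.W. rather than N.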
Effect of BMI category on insurance cost based on smoking habit
- In general, smokers pay more than non-smokers across all BMI categories. However, smokers in the O.W. (Overweight), N (Normal) and Mild T. (Mild thin) categories pay less than smokers in the other BMI categories.
Statistical modelling
We build a linear model that predicts insurance cost from the age, bmi and smoker variables.
Understanding the model.
Consider a simple linear regression with one independent variable (X1) and one dependent variable (y). R square is one way to check model performance: it is the proportion of the variability in y that is explained by X1. In simple words, if R square is 0.85, then X1 explains 85% of the variation in y.
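To make the definition concrete, here is a small sketch that fits a one-variable regression by least squares and computes R square as 1 minus the ratio of residual variation to total variation. The data points and the helper names `fit_simple_regression` and `r_squared` are made up for illustration.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

def r_squared(ys, predictions):
    """Share of the variation in ys that the predictions explain."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                  # total
    return 1 - ss_res / ss_tot

# Toy data, invented for this example.
x1 = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_regression(x1, y)
predictions = [b0 + b1 * x for x in x1]
r2 = r_squared(y, predictions)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, R square = {r2:.4f}")
```

Predicting y_mean for every point gives an R square of 0, so R square measures how much better the model does than that baseline.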
Problem with R square
Now let’s say we add one more independent variable (X2). R square will never decrease, and it usually increases a little, even when there is no real relationship between X2 and y. Adjusted R square was introduced to solve this problem.
Adjusted R Square
Adjusted R square penalizes the addition of insignificant variables: it decreases when a new variable explains too little extra variation to justify the added complexity, and increases only when the variable improves the model by more than would be expected by chance. Hence adjusted R square is more reliable than R square for comparing multiple linear regression models.
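The penalty can be seen in the standard formula, adjusted R square = 1 - (1 - R square) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of features. The sketch below uses hypothetical values of n, p and R square (not figures from this analysis) to show R square rising slightly while adjusted R square falls.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R square for a model with an intercept and n_features predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical scenario: adding a 4th, uninformative feature nudges
# R square from 0.740 to 0.741 on a sample of 50 observations.
before = adjusted_r_squared(0.740, n_samples=50, n_features=3)
after = adjusted_r_squared(0.741, n_samples=50, n_features=4)

print(f"adjusted R square with 3 features: {before:.4f}")
print(f"adjusted R square with 4 features: {after:.4f}")
```

With a large sample and few features the correction factor (n - 1) / (n - p - 1) is close to 1, which is why R square and adjusted R square can agree to two decimals.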
Our model
Our model has both R square and adjusted R square equal to 0.74, which means that with 3 variables we can explain 74% of the variability in insurance cost.
Coefficient:
The regression equation with p features is given as:
y = b0 + b1X1 + b2X2 + b3X3 + .... + bpXp
For simplicity let’s consider y is only dependent on X1.
y = b0 + b1X1
Interpretation:
With every unit increase in X1, y changes by b1 units (an increase if b1 is positive).
- What if there are multiple coefficients?
y = b0 + b1X1 + b2X2 + b3X3
Interpretation:
With every unit increase in X1, y changes by b1 units, with X2 and X3 held constant.
Our model Interpretation:
We are predicting insurance costs with age, bmi and smoking habits.
insurance = - 11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes
- With every unit increase in age, the predicted insurance cost increases by 259.54, with bmi and smoker_yes held constant.
- With every unit increase in bmi, the predicted insurance cost increases by 322.61, with age and smoker_yes held constant.
- When smoker_yes changes from 0 to 1 (non-smoker to smoker), the predicted insurance cost increases by 23820, with age and bmi held constant.
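The "held constant" reading can be checked by plugging values into the fitted equation. The sketch below wraps the equation in a hypothetical function `predict_insurance` and changes one input at a time; the example person (age 40, bmi 30) is arbitrary.

```python
def predict_insurance(age, bmi, smoker_yes):
    """Predicted insurance cost from the fitted equation in the text."""
    return -11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes

# Arbitrary reference person: a 40-year-old non-smoker with bmi 30.
base = predict_insurance(age=40, bmi=30.0, smoker_yes=0)

# Change one input by one unit while the other two stay fixed.
age_effect = predict_insurance(41, 30.0, 0) - base
bmi_effect = predict_insurance(40, 31.0, 0) - base
smoker_effect = predict_insurance(40, 30.0, 1) - base

print(f"baseline prediction: {base:.2f}")
print(f"one more unit of age: +{age_effect:.2f}")
print(f"one more unit of bmi: +{bmi_effect:.2f}")
print(f"smoker instead of non-smoker: +{smoker_effect:.2f}")
```

Each difference equals the corresponding coefficient, and it would be the same for any other reference person because the model is linear.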
Dealing with categorical variables:
Regression models cannot use categorical variables directly. To use them in our ML model we have to convert them into numbers, and one effective way to do that is one hot encoding, which creates columns containing binary values {0,1}.
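A minimal sketch of one hot encoding in plain Python, using a hypothetical helper `one_hot`. It drops the first category by default so that one category acts as the baseline; this is assumed to be how a single smoker_yes column (with no smoker_no column) ends up in the model, and the region labels are invented.

```python
def one_hot(values, prefix, drop_first=True):
    """Turn a list of category labels into 0/1 columns, one per category."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline (all remaining columns are 0).
        categories = categories[1:]
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }

smoker = ["yes", "no", "no", "yes"]
print(one_hot(smoker, prefix="smoker"))

# Hypothetical region labels, used only to show a variable with more than two classes.
region = ["north", "south", "east", "south"]
print(one_hot(region, prefix="region"))
```

A variable with k categories produces k - 1 columns this way; keeping all k columns together with an intercept would make the columns perfectly collinear.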
In our model:
insurance = - 11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes
- The smoker variable is binary: smoker_yes is 1 if the person smokes and 0 if the person doesn’t smoke.
If the person smokes (smoker_yes = 1), the regression equation is
insurance = - 11680 + 259.54 * age + 322.61 * bmi + 23820
If the person doesn’t smoke (smoker_yes = 0), the regression equation is
insurance = - 11680 + 259.54 * age + 322.61 * bmi
The coefficient of the smoker variable is positive (23820), which means a smoker is predicted to pay 23820 more than a non-smoker of the same age and bmi.
P value:
Let’s say we are building a regression model with just one independent variable (y ~ X1), so y depends only on X1. At the start we assume that there is no relationship between X1 and y, which means the best prediction of y is y_mean.
Arithmetically,
y = b0 + b1.X1 (with b1 = 0 assumed under H0)
y = b0, that is, y = y_mean
H0 : b1 = 0 (y is not dependent on X1) null hypothesis
H1 : b1 != 0 (y is dependent on X1) alternative hypothesis
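In simple words, the p value is the probability of seeing an estimate of b1 at least as far from 0 as ours if b1 were really 0.
- If the p value is greater than or equal to 0.05, data like ours would not be unusual under b1=0, so we cannot reject H0 and the variable is considered not significant. Conversely, if the p value is less than 0.05, such an estimate would occur less than 5% of the time under b1=0, so we reject H0 and consider the variable significant. This is equivalent to checking whether the 95% confidence interval of b1 excludes 0.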
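The link between the p value and the 95% confidence interval can be sketched as follows. This is an illustration under stated assumptions: it uses a normal approximation (regression software typically uses the t distribution, which is very close for large samples), and the coefficient estimates and standard errors are hypothetical because the text does not report standard errors. The helper name `coefficient_test` is also made up.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def coefficient_test(b1, std_err, alpha=0.05):
    """Two-sided test of H0: b1 = 0, using a normal approximation."""
    z = b1 / std_err
    p_value = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    ci = (b1 - z_crit * std_err, b1 + z_crit * std_err)
    return p_value, ci

# Hypothetical estimates and standard errors, not taken from the fitted model.
strong_p, strong_ci = coefficient_test(b1=2.0, std_err=0.5)
weak_p, weak_ci = coefficient_test(b1=0.3, std_err=0.5)

print(f"b1 = 2.0: p = {strong_p:.5f}, 95% CI = ({strong_ci[0]:.2f}, {strong_ci[1]:.2f})")
print(f"b1 = 0.3: p = {weak_p:.5f}, 95% CI = ({weak_ci[0]:.2f}, {weak_ci[1]:.2f})")
```

In the first case the interval lies entirely above 0 and the p value is below 0.05; in the second the interval contains 0 and the p value is above 0.05.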
In our model
Variable      P Value
age           0.000
bmi           0.000
smoker_yes    0.000
For all the variables the p value is less than 0.05 (the values shown are rounded to three decimals), which means our features are significant.
Thank you