Statistical Analysis with Linear Regression on Insurance Data

Mayank Dubey
6 min read · Apr 3, 2021

Problem Statement

Given a person’s attributes (age, sex, BMI, smoking status, etc.), we have to predict their insurance cost.

Data description

  • age is a real number
  • sex is a binary variable (male or female)
  • bmi (body mass index) is a real number
  • children is the number of children a person has
  • smoker is a binary variable
  • region is a categorical variable
  • charges is the dependent variable (y)

Importing required libraries and loading dataset

importing libraries
loading dataset
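The code for this step is not reproduced here, so the following is only a minimal sketch of how the data might be loaded with pandas. The file name insurance.csv is an assumption, and the inline rows are made-up values that only mirror the column layout described above.

```python
import io

import pandas as pd

# Made-up sample rows (NOT real data) that mirror the columns described above.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,male,26.2,0,no,northeast,3200.50
42,female,31.0,2,yes,southeast,38000.00
33,female,22.7,1,no,southwest,5100.75
58,male,29.4,3,no,northwest,12500.10
"""

# With the real file this would be something like
# pd.read_csv("insurance.csv")  (hypothetical file name).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())    # first rows of the table
print(df.dtypes)    # column types: numeric vs text (categorical) columns
```

Checking the column types early is useful because sex, smoker and region are read as text and need encoding before modelling.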

Data Analysis

Distribution of Insurance cost: Right Skewed

How many people smoke?

Generated using python

Most people do not smoke.

Effect of age on insurance cost: smokers vs non-smokers

  • Overall, age and insurance cost are positively correlated for both smokers and non-smokers: as age increases, insurance cost also increases.

Effect of smoking on insurance cost

Generated using python
  • The first graph shows that the average cost for non-smokers (around 9000) is much lower than for smokers (more than 30000).
  • The second graph shows the overall distribution of insurance charges for smokers and non-smokers, with some outliers in the non-smoker category.

Effect of BMI on insurance cost based on smoking habit

Generated using python
  • For non-smokers there is almost zero correlation between BMI and insurance cost, which suggests that if a person doesn’t smoke, a higher BMI is not associated with a higher insurance cost.
  • For smokers, however, there is a strong positive correlation between BMI and insurance cost, which suggests that if a person smokes, a higher BMI goes along with a higher insurance cost.

Distribution of smokers among the different age group

Age wise distribution of smokers

Distribution of smokers among male and female

Sex wise distribution of smokers

Distribution of people among the different BMI

Converting BMI values into BMI categories
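The exact cut-offs used for the categories are not given in this article. As a rough sketch, the conversion could look like the function below, which assumes WHO-style BMI cut-offs; both the thresholds and the label names are assumptions made for illustration.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value to a category label (assumed WHO-style cut-offs)."""
    # (upper bound, label) pairs in increasing order; a value belongs to the
    # first bin whose upper bound is greater than the value.
    bins = [
        (16.0, "Severe thin"),
        (17.0, "Moderate thin"),
        (18.5, "Mild thin"),
        (25.0, "Normal"),
        (30.0, "Overweight"),
        (35.0, "Obese class I"),
        (40.0, "Obese class II"),
    ]
    for upper, label in bins:
        if bmi < upper:
            return label
    return "Obese class III"


print(bmi_category(22.0))  # Normal
print(bmi_category(27.5))  # Overweight
print(bmi_category(17.8))  # Mild thin
```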

Distribution of people among the different BMI

Effect of BMI on insurance cost based on smoking habit

Insurance cost based on smoking habit of different BMI categories
  • In general, smokers pay more than non-smokers across all BMI categories. However, smokers in the O.W. (Overweight), N (Normal) and Mild T. (Mild thin) categories pay less than smokers in the other BMI categories.

Statistical modelling

Building a linear model to predict insurance cost from the age, bmi and smoker variables.

Regression model summary, generated using the statsmodels library in Python

Understanding the model.

Let’s say we have a simple linear regression with one independent variable (X1) and one dependent variable (y). R square is one way to check model performance: it is the proportion of the variability in y that is explained by X1. In simple words, if R square is 0.85, then X1 explains 85% of the variation in y.
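To make the definition concrete, here is a small sketch that computes R square as 1 minus the ratio of the residual sum of squares to the total sum of squares. The numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R square = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Variation left unexplained by the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]   # made-up observed values
y_pred = [1.1, 1.9, 3.2, 3.8]   # made-up model predictions
r2 = r_squared(y_true, y_pred)
print(round(r2, 2))  # 0.98
```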

Problem with R square

Now let’s say we add one more independent variable (X2). R square will never decrease, and usually increases slightly, even when there is no real relationship between X2 and y. To solve this problem there is something called Adjusted R square.

Adjusted R Square

Adjusted R square penalizes the addition of variables: when an insignificant variable is added it tends to decrease rather than increase. It increases only when the new variable improves the fit by more than would be expected by chance. Hence Adjusted R square is more reliable than R square for multiple linear regression.
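The penalty can be seen in the usual formula, Adjusted R square = 1 - (1 - R square) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of features. The sketch below uses a hypothetical sample size and a hypothetical tiny gain in R square from an extra, useless variable.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R square for n observations and p features."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 1000  # hypothetical sample size

# Three features, R square = 0.7400.
base = adjusted_r_squared(0.7400, n, p=3)

# A fourth, useless feature nudges R square up to 0.7401 (hypothetical),
# but the penalty for the extra feature outweighs that gain.
extra = adjusted_r_squared(0.7401, n, p=4)

print(round(base, 4), round(extra, 4))  # 0.7392 0.7391
print(extra < base)                     # True
```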

Our model

Our model has both R square and adjusted R square equal to 0.74, which means the 3 variables together explain about 74% of the variability in insurance cost.

Coefficient:

The regression equation with p features is given as:

y = b0 + b1X1 + b2X2 + b3X3 + .... + bpXp

For simplicity let’s consider y is only dependent on X1.

y = b0 + b1X1

Interpretation:

With every unit increase in X1, y changes by b1 units.

  • What if there are multiple coefficients?

y = b0 + b1X1 + b2X2 + b3X3

Interpretation:

With every unit increase in X1, y changes by b1 units, with X2 and X3 held constant.

Our model Interpretation:

We are predicting insurance costs with age, bmi and smoking habits.

insurance  = - 11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes
  • With every unit increase in age, the predicted insurance cost increases by 259.54, with bmi and smoker_yes held constant.
  • With every unit increase in bmi, the predicted insurance cost increases by 322.61, with age and smoker_yes held constant.
  • When smoker_yes goes from 0 to 1 (non-smoker to smoker), the predicted insurance cost increases by 23820, with age and bmi held constant.

Dealing with Categorical variable:

Machine learning models cannot work with categorical variables directly. To use those variables in our ML model we have to convert them into numbers, and one effective way to do this is one hot encoding, which creates columns containing binary values {0,1}.
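As a small illustration of the idea, the sketch below one hot encodes a list of category values in plain Python. The helper one_hot and its drop_first option are hypothetical names written for this example; in practice a library function would normally be used.

```python
def one_hot(values, prefix, drop_first=False):
    """Return {column_name: [0/1, ...]} with one column per category."""
    categories = sorted(set(values))
    if drop_first:
        # Drop the first category: it is implied when all other columns are 0.
        categories = categories[1:]
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


smoker = ["yes", "no", "no", "yes"]  # made-up values

print(one_hot(smoker, "smoker"))
# {'smoker_no': [0, 1, 1, 0], 'smoker_yes': [1, 0, 0, 1]}

print(one_hot(smoker, "smoker", drop_first=True))
# {'smoker_yes': [1, 0, 0, 1]}
```

Dropping one column for a binary variable leaves a single 0/1 column, which matches the single smoker_yes term in the model.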

In our model:

insurance  = - 11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes
  • The smoker variable is binary: if a person smokes it has the value 1, and if the person doesn’t smoke the value is 0.

If the person smokes (smoker_yes = 1), the regression equation becomes

insurance  = - 11680 + 259.54 * age + 322.61 * bmi + 23820

If the person doesn’t smoke (smoker_yes = 0), the regression equation becomes

insurance  = - 11680 + 259.54 * age + 322.61 * bmi

The coefficient of the smoker variable is positive (23820), which means a smoker is predicted to pay 23820 more than a non-smoker of the same age and bmi.
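The fitted equation can be tried out directly. The sketch below plugs in a hypothetical 30-year-old with a BMI of 25 and compares the predicted cost with and without smoking; the function name predict_charges is made up for this example.

```python
def predict_charges(age, bmi, smoker):
    """Predicted insurance cost from the fitted equation above."""
    smoker_yes = 1 if smoker else 0
    return -11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes


non_smoker = predict_charges(age=30, bmi=25, smoker=False)
smoker = predict_charges(age=30, bmi=25, smoker=True)

print(round(non_smoker, 2))           # 4171.45
print(round(smoker, 2))               # 27991.45
print(round(smoker - non_smoker, 2))  # 23820.0
```

The gap between the two predictions is exactly the smoker_yes coefficient, whatever age and bmi are used.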

P value:

Let’s say we are building a regression model with just one independent variable (y ~ X1), i.e. y depends on X1. At the start it is assumed that there is no relationship between X1 and y, which means the best prediction of y is y_mean.

Arithmetically,

y = b0 + b1 * X1 (under H0, b1 = 0)
y = b0, i.e. y = y_mean

H0 : b1 = 0 (y is not dependent on X1) null hypothesis

H1 : b1!=0 (y is dependent on X1) alternate hypothesis

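In simple words, the p value is the probability of seeing an estimate of b1 at least as far from 0 as the one we observed, assuming H0 (b1 = 0) is true. It is not the probability that b1 = 0.

  • If the p value is greater than or equal to 0.05, the data are reasonably consistent with b1 = 0, so we cannot call the variable significant at the 5% level. Conversely, if the p value is less than 0.05, such an estimate would be unlikely under b1 = 0 and the variable is considered significant. This is equivalent to checking whether the 95% confidence interval of b1 excludes 0.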
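As a rough sketch of the mechanics, the code below turns a coefficient and its standard error into a two-sided p value and a 95% confidence interval. It uses the normal approximation (reasonable for large samples) instead of the t distribution a regression library would use, and the coefficient and standard error are hypothetical.

```python
import math


def two_sided_p_value(coef, std_err):
    """Two-sided p value for H0: coefficient = 0 (normal approximation)."""
    z = coef / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def confidence_interval_95(coef, std_err):
    """Approximate 95% confidence interval: coef +/- 1.96 * std_err."""
    return coef - 1.96 * std_err, coef + 1.96 * std_err


coef, std_err = 2.5, 1.0  # hypothetical estimate and standard error

p = two_sided_p_value(coef, std_err)
low, high = confidence_interval_95(coef, std_err)

print(p < 0.05)             # True: significant at the 5% level
print(low > 0 or high < 0)  # True: the interval excludes 0
print(round(two_sided_p_value(1.96, 1.0), 3))  # 0.05
```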

In our model

Variable      P value
age           0.000
bmi           0.000
smoker_yes    0.000

For all the variables the p value is below 0.05, which means our features are significant.

click to check the code

Thank you
