# Statistical Analysis with Linear Regression on Insurance Data

**Problem Statement**

Given a person’s attributes (age, sex, BMI, smoking status, etc.), we want to predict their insurance cost.

# Data description

- **Age** is a real number
- **Sex** is a binary variable (male or female)
- **bmi** (body mass index) is a real number
- **children** is the number of children a person has
- **smoker** is a binary variable
- **region** is a categorical variable
- **charges** is the dependent variable (y)

# Importing required libraries and loading dataset

# Data Analysis

**Distribution of Insurance cost: Right Skewed**

**How many people smoke?**

Most people in the dataset do not smoke.

**Effect of age on insurance cost: smokers vs. non-smokers**

- Overall, age and insurance cost are positively correlated for both smokers and non-smokers: as age increases, insurance cost also increases.

**Effect of smoking on insurance cost**

- The first graph shows that the average cost for non-smokers (around 9000) is far lower than for smokers (more than 30000).
- The second graph shows the overall distribution of insurance charges for smokers and non-smokers, with some outliers in the non-smoker category.

## Effect of BMI on insurance cost based on smoking habit

- For non-smokers there is almost no correlation between BMI and insurance cost: if a person doesn’t smoke, a high BMI is not associated with a higher insurance cost.
- For smokers, however, there is a strong positive correlation between BMI and insurance cost: if a person smokes, a higher BMI goes with a higher insurance cost.

## Distribution of smokers among the different age groups

## Distribution of smokers among males and females

## Distribution of people among the different BMI categories

Convert the BMI values into BMI categories.
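The conversion code is not shown here, so the following is only a minimal sketch of how such a binning step could look. The cut-off values and the labels `Thin` and `Obese` are assumptions made for illustration; only `Mild T.`, `N` and `O.W.` are labels that appear in this article.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value to a category label.

    The cut-offs and the labels "Thin" and "Obese" are assumptions made for
    this sketch; only "Mild T.", "N" and "O.W." appear in the article.
    """
    if bmi < 17.0:
        return "Thin"      # hypothetical label
    if bmi < 18.5:
        return "Mild T."   # mild thin
    if bmi < 25.0:
        return "N"         # normal
    if bmi < 30.0:
        return "O.W."      # overweight
    return "Obese"         # hypothetical label


bmis = [17.8, 22.4, 27.9, 33.1]   # made-up BMI values
categories = [bmi_category(b) for b in bmis]
print(categories)  # ['Mild T.', 'N', 'O.W.', 'Obese']
```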

## Effect of BMI category on insurance cost based on smoking habit

- In general, smokers pay more than non-smokers across all BMI categories. However, smokers in the `O.W.` (Overweight), `N` (Normal) and `Mild T.` (Mild thin) categories pay less than smokers in the other BMI categories.

# Statistical modelling

We build a linear model that predicts insurance cost from the age, bmi and smoker variables.
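The fitting code itself is not shown here, so below is a rough sketch of the same kind of fit using NumPy’s least-squares solver. The data is synthetic (not the real insurance dataset): it is generated from the coefficients reported further down in this article plus some noise, and the sample size, value ranges and share of smokers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the insurance data (NOT the real dataset).
age = rng.uniform(18, 64, n)
bmi = rng.uniform(16, 45, n)
smoker_yes = (rng.random(n) < 0.2).astype(float)  # 1 = smoker, 0 = non-smoker

# Generate charges from the equation reported in this article, plus noise.
charges = (-11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes
           + rng.normal(0, 50, n))

# Design matrix: a column of ones for the intercept b0, then the features.
X = np.column_stack([np.ones(n), age, bmi, smoker_yes])

# Ordinary least squares: choose b to minimise ||X b - charges||^2.
coef, *_ = np.linalg.lstsq(X, charges, rcond=None)
b0, b_age, b_bmi, b_smoker = coef
print(f"intercept={b0:.1f} age={b_age:.2f} bmi={b_bmi:.2f} smoker_yes={b_smoker:.1f}")
```

Because the synthetic data was built from known coefficients, the fitted values should land close to them, which is a quick sanity check that the fitting step works.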

**Understanding the model.**

Consider a simple linear regression with one independent variable (X1) and one dependent variable (y). R square is one way to check model performance: it is the proportion of the variability in y that is explained by X1. In simple words, if R square is 0.85, then X1 explains 85% of the variation in y.
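To make the definition concrete, here is a small sketch that computes R square as 1 minus the ratio of unexplained variation to total variation. The observations and predictions are made-up numbers, not values from the insurance data.

```python
def r_squared(y_true, y_pred):
    """R square = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Variation left unexplained by the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]        # made-up observations
y_hat = [1.1, 1.9, 3.2, 3.8]    # made-up model predictions
r2 = r_squared(y, y_hat)
print(round(r2, 2))  # 0.98
```

Predicting the mean for every observation gives an R square of 0, and perfect predictions give 1.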

## Problem with R square

Now let’s say we add one more independent variable (X2). R square will increase (or at least never decrease) even if there is no relationship between X2 and the dependent variable (y). To solve this problem there is something called Adjusted R square.
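This behaviour can be checked with a small experiment. The sketch below uses synthetic data (an assumption, not the insurance dataset): y depends only on X1, X2 is pure noise, and we compare R square with and without X2.

```python
import numpy as np


def fit_r2(X, y):
    """Fit OLS with an intercept and return R square."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(size=n)   # y depends on x1 only
x2 = rng.normal(size=n)             # pure noise, unrelated to y

r2_one = fit_r2(x1.reshape(-1, 1), y)
r2_two = fit_r2(np.column_stack([x1, x2]), y)
print(f"R2 with X1: {r2_one:.4f}   R2 with X1 and X2: {r2_two:.4f}")
```

The second value is never smaller than the first, because the larger model can always reproduce the smaller one by setting the extra coefficient to 0.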

## Adjusted R Square

Adjusted R square penalizes each additional variable, so adding an insignificant variable usually lowers it. It increases only when the new variable improves the fit enough to outweigh that penalty. Hence Adjusted R square is more reliable than R square for multiple linear regression.
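As a rough illustration, the usual formula is Adjusted R square = 1 - (1 - R square) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of independent variables. The numbers in the sketch below are hypothetical and are not taken from our model.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R square for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical numbers: a 4th variable nudges R square from 0.7400 to 0.7401.
before = adjusted_r2(0.7400, n=100, p=3)
after = adjusted_r2(0.7401, n=100, p=4)
print(round(before, 4), round(after, 4))  # 0.7319 0.7292
```

R square went up slightly, but Adjusted R square went down, which signals that the extra variable did not earn its place in the model.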

## Our model

Our model has both R square and adjusted R square equal to 0.74, which means the 3 variables explain 74% of the variability in insurance cost.

# Coefficient:

The regression equation with p features is given as:

`y = b0 + b1X1 + b2X2 + b3X3 + .... + bpXp`

For simplicity let’s consider y is only dependent on X1.

`y = b0 + b1X1`

## Interpretation:

With every unit increase in X1, y increases by b1 units.

- What if there are multiple coefficients?

`y = b0 + b1X1 + b2X2 + b3X3`

## Interpretation:

With every unit increase in X1, y increases by b1 units, with X2 and X3 held constant.

## Our model Interpretation:

We are predicting insurance costs with age, bmi and smoking habits.

`insurance = - 11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes`

- With every unit increase in age, insurance cost increases by 259.54, with bmi and smoker_yes held constant.
- With every unit increase in bmi, insurance cost increases by 322.61, with age and smoker_yes held constant.
- When smoker_yes goes from 0 to 1 (non-smoker to smoker), insurance cost increases by 23820, with age and bmi held constant.
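These interpretations can be verified by plugging numbers into the fitted equation quoted above. The sketch below does that for a hypothetical person (age 30, BMI 25); the function name and the example values are assumptions made for illustration.

```python
def predict_insurance(age: float, bmi: float, smoker: bool) -> float:
    """Evaluate the fitted equation quoted in this article."""
    smoker_yes = 1 if smoker else 0
    return -11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes


# Hypothetical person: 30 years old with a BMI of 25.
base = predict_insurance(30, 25, smoker=False)
print(round(base, 2))                                            # 4171.45
print(round(predict_insurance(31, 25, smoker=False) - base, 2))  # 259.54
print(round(predict_insurance(30, 26, smoker=False) - base, 2))  # 322.61
print(round(predict_insurance(30, 25, smoker=True) - base, 2))   # 23820.0
```

Changing one input by one unit while holding the others fixed moves the prediction by exactly that input’s coefficient.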

## Dealing with Categorical variables:

Linear regression and most other machine learning models work on numbers, so they cannot use categorical variables directly. To use those variables in our ML model we have to convert them into numbers. One effective way to do this is `one hot encoding`, which creates columns containing binary values {0,1}.
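Here is a minimal pure-Python sketch of the idea. The helper name `one_hot`, the sample values and the region names are assumptions made for illustration; they are not the code or data used in this article.

```python
def one_hot(values, prefix, drop_first=True):
    """Return {column_name: [0/1, ...]} with one column per category.

    With drop_first=True the first category (in sorted order) is dropped,
    because it is implied when all remaining columns are 0.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    return {f"{prefix}_{c}": [1 if v == c else 0 for v in values]
            for c in categories}


smoker = ["yes", "no", "no", "yes"]      # made-up sample
print(one_hot(smoker, "smoker"))         # {'smoker_yes': [1, 0, 0, 1]}

region = ["north", "south", "north"]     # hypothetical region names
print(one_hot(region, "region", drop_first=False))
# {'region_north': [1, 0, 1], 'region_south': [0, 1, 0]}
```

Dropping the first category is why a binary variable such as smoker ends up as a single `smoker_yes` column. In pandas, `get_dummies` with `drop_first=True` does a similar job.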

In our model:

`insurance = - 11680 + 259.54 * age + 322.61 * bmi + 23820 * smoker_yes`

- The smoker variable is binary: it has the value 1 if the person smokes and 0 if the person doesn’t.

If the person smokes (smoker_yes = 1), the regression equation is:

`insurance = - 11680 + 259.54 * age + 322.61 * bmi + 23820`

If the person doesn’t smoke (smoker_yes = 0), the regression equation is:

`insurance = - 11680 + 259.54 * age + 322.61 * bmi`

The coefficient of the smoker variable is positive (23820), which means a smoker is predicted to pay 23820 more than a non-smoker of the same age and bmi.

# P value:

Let’s say we are building a regression model with just one independent variable (y~X1), so y depends on X1. At the start we assume that there is no relationship between X1 and y, which means the best prediction of y is y_mean.

Arithmetically,

`y = b0 + b1.X1` (with b1 = 0 assumed under H0)

so y = b0, that is, y = y_mean

H0 : b1 = 0 (y is not dependent on X1) null hypothesis

H1 : b1!=0 (y is dependent on X1) alternate hypothesis

`In simple words, the p value is the probability of seeing an estimate of b1 at least as far from 0 as ours if b1 were really 0.`

- If the p value is greater than or equal to 0.05, an estimate this extreme would occur at least 5% of the time even with b1=0, so we cannot reject H0 and the variable is considered not significant. Conversely, if the p value is less than 0.05, we reject H0 and treat the variable as significant. This is equivalent to checking whether the 95% confidence interval of b1 excludes 0.
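To show where such p values come from, here is a hedged sketch that computes them by hand on synthetic data (not the insurance dataset). It divides each coefficient by its standard error and, as a simplifying assumption, uses a normal approximation to the t distribution, which is only reasonable for large samples. The function name `ols_p_values` is hypothetical.

```python
import math
import numpy as np


def ols_p_values(X, y):
    """Return (coefficients, two-sided p values) for OLS with an intercept.

    Uses a normal approximation to the t distribution, which is only
    reasonable when the sample is large.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = resid @ resid / (n - A.shape[1])                 # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))    # standard errors
    t = coef / se                                             # statistic for H0: b = 0
    p = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])
    return coef, p


rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)                  # truly related to y
x2 = rng.normal(size=n)                  # unrelated to y
y = 5.0 + 2.0 * x1 + rng.normal(size=n)

coef, p = ols_p_values(np.column_stack([x1, x2]), y)
for name, b, pv in zip(["intercept", "x1", "x2"], coef, p):
    print(f"{name:9s} b={b:6.3f}  p={pv:.3f}")
```

The variable that really drives y gets a p value far below 0.05, while the p value of the unrelated variable can land anywhere between 0 and 1.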

**In our model**

| Variable | P Value |
| --- | --- |
| age | 0.000 |
| bmi | 0.000 |
| smoker_yes | 0.000 |

For all the variables the p value is less than 0.05, which means our features are significant.


Thank you