For this article the prerequisite is Andrew Ng's linear regression lecture.
Click here to check out the course.
- Given the years of experience, we have to predict the salary of a person. Since the dependent variable (salary) is continuous, this is a regression problem.
- We will use three methods, i.e. gradient descent (an optimization method), the statistical method (a formula), and the scikit-learn LinearRegression library, to estimate the regression parameters.
- After that we will compare the parameters learned from all three methods.
Importing libraries and Loading dataset
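The article's own loading cell is not shown, so here is a minimal sketch of this step. The real post reads a salary CSV; the file name and column names are assumptions, and a tiny inline frame is used here so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the real dataset: the article loads a salary CSV
# (file name assumed, e.g. pd.read_csv("Salary_Data.csv")).
# We build a small frame inline so this snippet is self-contained.
df = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.1, 6.8, 8.7, 10.5],
    "Salary": [39343, 43525, 54445, 61111, 66029, 91738, 109431, 121872],
})
X = df["YearsExperience"].to_numpy()
y = df["Salary"].to_numpy()
print(df.shape)  # (8, 2)
```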
Plotting X, y
- The graph shows a positive linear relationship between years of experience and salary: as experience increases, salary also increases linearly.
Normalizing the data to make learning faster
- Min-max normalization is a scaling technique: for every feature it brings the values between 0 and 1.
- The minimum value gets transformed to 0.
- The maximum value gets transformed to 1.
- The rest of the values fall between 0 and 1.
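The min-max rule above can be written in one line. A sketch on the stand-in experience values:

```python
import numpy as np

X = np.array([1.1, 2.0, 3.2, 4.5, 5.1, 6.8, 8.7, 10.5])

# Min-max scaling: (x - min) / (max - min) maps min -> 0, max -> 1,
# and everything else to a value strictly in between.
X_scaled = (X - X.min()) / (X.max() - X.min())
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```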
Reason: why scaling?
- One of the important reasons we apply scaling is that optimization algorithms such as gradient descent converge faster (reach the minimum faster).
Problem with scaling
- The problem with scaling is that it makes the parameters harder to interpret.
- The scatter plot shows that if we draw a straight line through the data, we can predict a person's salary from their years of experience.
- The equation of a straight line is given by: y = mX + c
- We know the value of X, i.e. years of experience, and we also know the value of y, i.e. salary. We don't know m, i.e. the slope of the line, or c, i.e. the y-intercept.
- If we knew m and c with respect to X (years of experience) and y (salary), we could draw the best-fitted straight line.
What the ML algorithm learns
- In machine learning terms, m and c are called the parameters, and the parameters are what the machine learns.
initialize: randomly initializes the parameters (theta_0, theta_1); called only once, when we start training.
itr: list of iteration numbers.
cost_itr: list which contains the cost at every iteration.
predict: predicts y after updating the parameters (theta_0, theta_1).
plotting: not required to implement gradient descent, but it can be used to understand how the machine is actually learning.
The most important function: it updates the parameters at every iteration, calls the above functions (initialize, predict, plotting), and helps the model train and visualize.
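The article's own training code is not shown here, so below is a minimal sketch of the loop those function descriptions imply. The names `initialize`, `predict`, `itr`, and `cost_itr` come from the text; the `train` name, learning rate, toy scaled data, and halved-MSE cost are assumptions (the plotting helper is omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(42)

def initialize():
    """Randomly initialize theta_0 (intercept) and theta_1 (slope)."""
    return rng.random(), rng.random()

def predict(X, theta_0, theta_1):
    """Predict y with the current parameters: y_hat = theta_0 + theta_1 * X."""
    return theta_0 + theta_1 * X

def cost(y, y_hat):
    """Halved mean squared error, as in Andrew Ng's course."""
    return np.mean((y_hat - y) ** 2) / 2

def train(X, y, lr=0.1, iterations=3000):
    theta_0, theta_1 = initialize()
    itr, cost_itr = [], []  # iteration numbers and cost at every iteration
    n = len(X)
    for i in range(iterations):
        y_hat = predict(X, theta_0, theta_1)
        # Gradients of the cost with respect to each parameter
        d_theta_0 = np.sum(y_hat - y) / n
        d_theta_1 = np.sum((y_hat - y) * X) / n
        theta_0 -= lr * d_theta_0
        theta_1 -= lr * d_theta_1
        itr.append(i)
        cost_itr.append(cost(y, y_hat))
    return theta_0, theta_1, itr, cost_itr

# Toy data after min-max scaling of both X and y, as in the article
X = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([0.05, 0.3, 0.5, 0.72, 1.0])
t0, t1, itr, cost_itr = train(X, y)
print(round(t0, 2), round(t1, 2))
```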
Plotting after randomly initializing theta_0 and theta_1
Plotting after some learning has happened (theta_0 and theta_1 updated)
Compare the lines at iteration = 300 and iteration = 600; the model is improving.
A further improved model
Finally, the model is trained for 3000 iterations.
Other than the gradient descent approach, there is a statistical approach to finding the parameters (m, c) of the regression line. The statistical approach is a direct, formula-based way to compute the parameters.
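The standard closed-form least-squares formulas are m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and c = ȳ − m·x̄. A sketch on the same stand-in scaled data used above:

```python
import numpy as np

def least_squares(X, y):
    """Closed-form least-squares estimates of slope m and intercept c."""
    x_mean, y_mean = X.mean(), y.mean()
    m = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c

X = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([0.05, 0.3, 0.5, 0.72, 1.0])
m, c = least_squares(X, y)
print(round(m, 3), round(c, 3))  # 0.928 0.05
```

Unlike gradient descent, this gives the exact minimizer in one step, with no learning rate or iteration count to tune.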
Comparing: Actual vs Predicted by Statistical parameter vs Gradient descent parameters
The statistical approach predicts y more accurately than the gradient descent approach; hence the parameters from the statistical approach are more accurate than those from gradient descent.
Scikit Learn: Linear Regression API
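A minimal sketch of the scikit-learn route, again on the stand-in scaled data. Since `LinearRegression` solves the same least-squares problem, its coefficients match the closed-form m and c.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([0.0, 0.25, 0.5, 0.75, 1.0]).reshape(-1, 1)  # sklearn expects 2-D X
y = np.array([0.05, 0.3, 0.5, 0.72, 1.0])

model = LinearRegression()
model.fit(X, y)
# coef_[0] is the slope m, intercept_ is c; both match the closed form
print(model.coef_[0], model.intercept_)
```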
The coefficients estimated by the statistical approach and scikit-learn are the same, so the error will be low; however, there is still room for improvement in the self-defined gradient descent approach.
Improving Gradient Descent
Training the model for 6001 iterations; earlier it was trained for only 3001 iterations.
Comparing: Actual vs Predicted by Statistical parameter vs Improved Gradient descent parameters
Our improved gradient descent has almost reached the best parameters. If we train for a few more iterations, it will surely find the best parameters.
Click here to check out the code.