Tuesday, July 14, 2020

ML Series: Simple Linear Regression in Python

Machine Learning 1.01
In this series, we will look at the basics of machine learning, from statistics to running a few models and applying them to actual data.

Let's start with simple linear regression in Python:
Case: We will build a regression that predicts a student's GPA from their SAT score.

Sample Data (csv): 

SAT GPA
1714 2.4
1664 2.52
1760 2.54
1685 2.74
1693 2.83
1670 2.91
1764 3
1764 3
1792 3.01
1850 3.01
1735 3.02
1775 3.07

1. Importing the libraries (these are the most common libraries, which you will need most of the time: pandas, numpy, matplotlib, seaborn, and LinearRegression from the sklearn package)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

2. Loading the data

data = pd.read_csv('C:/Users/abc/Downloads/1.01. Simple linear regression.csv')
data.head()
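
If you do not have the CSV file at hand, one option is to build the same kind of DataFrame from the twelve sample rows listed above. This is only a sketch: the name `sample_csv` is made up for illustration, and the full file may contain more rows than the sample shows.

```python
import io
import pandas as pd

# The twelve sample rows from this post, written out as CSV text.
# Assumption: the real file uses the same two column names, SAT and GPA.
sample_csv = """SAT,GPA
1714,2.4
1664,2.52
1760,2.54
1685,2.74
1693,2.83
1670,2.91
1764,3
1764,3
1792,3.01
1850,3.01
1735,3.02
1775,3.07
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
data = pd.read_csv(io.StringIO(sample_csv))
print(data.head())
print(data.shape)
```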

The data has two variables, GPA and SAT. The idea is that SAT can predict a student's GPA. We can test this hypothesis and, if SAT turns out to be a good predictor, use it to predict GPA.



3. Setting up the model by defining the dependent variable GPA as y and the independent variable SAT as x. We can then write the regression line as y = b0 + b1*x, where b0 (the intercept) and b1 (the slope) are constants

x = data['SAT']
y = data['GPA']
x.shape
y.shape

4. The regression model in sklearn expects its inputs as a 2D array (one row per sample, one column per feature), while x is one-dimensional. So we will reshape x into 'x_matrix' and check its shape

x_matrix = x.values.reshape(-1,1)
x_matrix.shape
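
To see what the reshape does, here is a small sketch using only the first four SAT values from the sample data. The `-1` tells numpy to work out the number of rows on its own, and the `1` asks for a single column.

```python
import numpy as np

# First four SAT values from the sample data, as a 1D array.
x = np.array([1714, 1664, 1760, 1685])
print(x.shape)         # (4,)

# Reshape into 4 rows and 1 column, the layout sklearn expects.
x_matrix = x.reshape(-1, 1)
print(x_matrix.shape)  # (4, 1)
```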

5. The Regression - We assign a LinearRegression() object to a variable and fit the model to our x_matrix and y

reg = LinearRegression()
reg.fit(x_matrix,y)

Out: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
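
Once the model is fitted, the constants b0 and b1 from the regression line are available as `reg.intercept_` and `reg.coef_`. The sketch below fits on the twelve sample rows only (so the numbers will not match a fit on the full file) and compares sklearn's values with the usual closed-form least-squares formulas. The array names `sat` and `gpa` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The twelve sample rows listed in this post (the full CSV may hold more).
sat = np.array([1714, 1664, 1760, 1685, 1693, 1670,
                1764, 1764, 1792, 1850, 1735, 1775])
gpa = np.array([2.4, 2.52, 2.54, 2.74, 2.83, 2.91,
                3.0, 3.0, 3.01, 3.01, 3.02, 3.07])

reg = LinearRegression()
reg.fit(sat.reshape(-1, 1), gpa)

# Closed-form least-squares estimates for a single predictor:
# slope = sum of cross-deviations / sum of squared x-deviations.
b1 = np.sum((sat - sat.mean()) * (gpa - gpa.mean())) / np.sum((sat - sat.mean()) ** 2)
# intercept = mean of y minus slope times mean of x.
b0 = gpa.mean() - b1 * sat.mean()

print("sklearn:", reg.intercept_, reg.coef_[0])
print("by hand:", b0, b1)
```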

6. We can check the results to see if our model is a good fit or not (R-Squared)

reg.score(x_matrix,y)

which gives us an R-squared of about 0.40, which is not so bad! 
It means about 40% of the variability in GPA is explained by the model. Note that this score is computed on the same data the model was fitted on.
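
For a regressor, `reg.score` returns R-squared, which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand on the twelve sample rows only, so the value it prints will not necessarily be the 40% quoted for the full data. The array names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The twelve sample rows listed in this post (the full CSV may hold more).
sat = np.array([1714, 1664, 1760, 1685, 1693, 1670,
                1764, 1764, 1792, 1850, 1735, 1775])
gpa = np.array([2.4, 2.52, 2.54, 2.74, 2.83, 2.91,
                3.0, 3.0, 3.01, 3.01, 3.02, 3.07])
x_matrix = sat.reshape(-1, 1)

reg = LinearRegression().fit(x_matrix, gpa)

# Residual sum of squares: what the fitted line fails to explain.
gpa_hat = reg.predict(x_matrix)
ss_res = np.sum((gpa - gpa_hat) ** 2)
# Total sum of squares: variability of GPA around its mean.
ss_tot = np.sum((gpa - gpa.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, reg.score(x_matrix, gpa))
```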


7. All Done. Now we can simply predict the GPA for any SAT score (predict also expects a 2D input, hence the double brackets)

reg.predict([[1740]])

Out:  array([3.15593751])
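
The prediction is nothing more than the regression line evaluated at the new SAT score: b0 + b1*x. As a rough check, the sketch below fits on the twelve sample rows only, so its printed value will generally differ from the output above, and confirms that `reg.predict` agrees with the formula. The names `new_sat`, `pred_sklearn` and `pred_manual` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The twelve sample rows listed in this post (the full CSV may hold more).
sat = np.array([1714, 1664, 1760, 1685, 1693, 1670,
                1764, 1764, 1792, 1850, 1735, 1775])
gpa = np.array([2.4, 2.52, 2.54, 2.74, 2.83, 2.91,
                3.0, 3.0, 3.01, 3.01, 3.02, 3.07])

reg = LinearRegression().fit(sat.reshape(-1, 1), gpa)

new_sat = 1740
# predict takes a 2D input (one row, one feature) and returns a 1D array.
pred_sklearn = reg.predict([[new_sat]])[0]
# The same value from the line y = b0 + b1*x.
pred_manual = reg.intercept_ + reg.coef_[0] * new_sat

print(pred_sklearn, pred_manual)
```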

We will discuss multiple linear regression in the next post.
Thanks
