In this series, we will cover the basics of machine learning, from statistics to fitting a few models and applying them to real data.
Let's start with simple linear regression in Python:
Case: We will build a regression that predicts a student's GPA from their SAT score.
Sample Data (csv):
SAT | GPA |
1714 | 2.4 |
1664 | 2.52 |
1760 | 2.54 |
1685 | 2.74 |
1693 | 2.83 |
1670 | 2.91 |
1764 | 3 |
1764 | 3 |
1792 | 3.01 |
1850 | 3.01 |
1735 | 3.02 |
1775 | 3.07 |
1. Importing the libraries (these are the most common libraries, which you will need most of the time: pandas, numpy, matplotlib, seaborn, and LinearRegression from the sklearn package)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()
2. Loading the data
data = pd.read_csv('C:/Users/abc/Downloads/1.01. Simple linear regression.csv')
data.head()
The data has 2 variables: GPA and SAT. The idea is that SAT can predict a student's GPA. We can test this hypothesis and, if SAT turns out to be a good predictor, use it to predict GPA.
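If you do not have the CSV file at hand, here is a minimal sketch that builds the same kind of DataFrame from the twelve sample rows shown above. The variable name sample_csv is only an illustration, and the full file has more rows than this sample, so the results later in the post will not match exactly.

```python
import io

import pandas as pd

# The twelve sample rows from the table above, written as CSV text.
# This is only a stand-in for the full file loaded with pd.read_csv(path).
sample_csv = """SAT,GPA
1714,2.4
1664,2.52
1760,2.54
1685,2.74
1693,2.83
1670,2.91
1764,3
1764,3
1792,3.01
1850,3.01
1735,3.02
1775,3.07
"""

data = pd.read_csv(io.StringIO(sample_csv))
print(data.head())   # first five rows
print(data.shape)    # (12, 2): twelve students, two columns
```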
3. Setting up the model by defining the dependent variable GPA as y and the independent variable SAT as x. The regression line is then y = b0 + b1*x, where b0 (the intercept) and b1 (the slope) are constants
x = data['SAT']
y = data['GPA']
x.shape
y.shape
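To see where b0 and b1 come from, here is a rough sketch of the standard least-squares formulas, with SAT as x and GPA as y. It uses only the twelve sample rows from the table above (an assumption made to keep it self-contained), so the numbers will differ from what the full file gives.

```python
import numpy as np

# The twelve sample rows from the table above (not the full CSV).
sat = np.array([1714, 1664, 1760, 1685, 1693, 1670,
                1764, 1764, 1792, 1850, 1735, 1775], dtype=float)
gpa = np.array([2.40, 2.52, 2.54, 2.74, 2.83, 2.91,
                3.00, 3.00, 3.01, 3.01, 3.02, 3.07])

# Least-squares estimates for y = b0 + b1*x:
#   b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b0 = mean_y - b1 * mean_x
x_dev = sat - sat.mean()
y_dev = gpa - gpa.mean()
b1 = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
b0 = gpa.mean() - b1 * sat.mean()

print(f"slope b1 = {b1:.6f}, intercept b0 = {b0:.4f}")

# Cross-check against NumPy's own degree-1 polynomial fit.
slope_check, intercept_check = np.polyfit(sat, gpa, 1)
print(np.isclose(b1, slope_check), np.isclose(b0, intercept_check))
```

This is the same line that LinearRegression fits for a single predictor, just computed by hand.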
4. The regression model in sklearn expects its inputs as a 2D array (one row per sample, one column per feature). Thus we will convert x to 'x_matrix' and check the shape
x_matrix = x.values.reshape(-1,1)
x_matrix.shape
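A small sketch of what reshape(-1,1) does, using three made-up-length slices of the sample SAT values: the -1 tells NumPy to work out the number of rows itself, and the 1 asks for a single column.

```python
import numpy as np

# Three SAT values taken from the sample table, as a plain 1D array.
x = np.array([1714, 1664, 1760])
print(x.shape)         # (3,)

# -1 lets NumPy infer the row count; 1 means "one column (one feature)".
x_matrix = x.reshape(-1, 1)
print(x_matrix.shape)  # (3, 1)
print(x_matrix)
```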
5. The regression - we create a LinearRegression() object, assign it to a variable, and fit it to x_matrix and y
reg = LinearRegression()
reg.fit(x_matrix,y)
Out: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
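Once the model is fitted, the constants b0 and b1 can be read from the intercept_ and coef_ attributes. The sketch below assumes only the twelve sample rows from the table above, so its values will not match a fit on the full file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The twelve sample rows from the table above (not the full CSV).
sat = np.array([1714, 1664, 1760, 1685, 1693, 1670,
                1764, 1764, 1792, 1850, 1735, 1775], dtype=float)
gpa = np.array([2.40, 2.52, 2.54, 2.74, 2.83, 2.91,
                3.00, 3.00, 3.01, 3.01, 3.02, 3.07])

x_matrix = sat.reshape(-1, 1)   # 2D input: 12 rows, 1 feature
reg = LinearRegression()
reg.fit(x_matrix, gpa)

b0 = reg.intercept_   # the constant term of the line
b1 = reg.coef_[0]     # coef_ holds one slope per feature; we have one

print(f"GPA = {b0:.4f} + {b1:.6f} * SAT")
```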
6. We can check the results to see if our model is a good fit or not (R-Squared)
reg.score(x_matrix,y)
This gives us a score of about 40%, which is not so bad!
It means about 40% of the variability in GPA is explained by the model.
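For the curious, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that against reg.score, again assuming only the twelve sample rows from the table above, so the value it prints will not be the 40% obtained from the full file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The twelve sample rows from the table above (not the full CSV).
sat = np.array([1714, 1664, 1760, 1685, 1693, 1670,
                1764, 1764, 1792, 1850, 1735, 1775], dtype=float)
gpa = np.array([2.40, 2.52, 2.54, 2.74, 2.83, 2.91,
                3.00, 3.00, 3.01, 3.01, 3.02, 3.07])

x_matrix = sat.reshape(-1, 1)
reg = LinearRegression().fit(x_matrix, gpa)

# R-squared = 1 - SS_res / SS_tot
gpa_hat = reg.predict(x_matrix)
ss_res = ((gpa - gpa_hat) ** 2).sum()      # variability the line misses
ss_tot = ((gpa - gpa.mean()) ** 2).sum()   # total variability in GPA
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = reg.score(x_matrix, gpa)
print(r2_manual, r2_sklearn)   # the two values should agree
```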
7. All done. Now we can simply predict the GPA for any SAT score
reg.predict([[1740]])
Out: array([3.15593751])
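predict also accepts several SAT scores at once, as long as they are passed as a 2D array with one row per score. Here is a hedged sketch, fitted on the twelve sample rows only (so its predictions will differ from the output above), which also confirms that a prediction is just b0 + b1*x. The scores in new_sat are arbitrary examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The twelve sample rows from the table above (not the full CSV).
sat = np.array([1714, 1664, 1760, 1685, 1693, 1670,
                1764, 1764, 1792, 1850, 1735, 1775], dtype=float)
gpa = np.array([2.40, 2.52, 2.54, 2.74, 2.83, 2.91,
                3.00, 3.00, 3.01, 3.01, 3.02, 3.07])

reg = LinearRegression().fit(sat.reshape(-1, 1), gpa)

# Arbitrary example scores, one per row.
new_sat = np.array([[1700], [1740], [1800]])
predicted = reg.predict(new_sat)

# The same predictions computed directly from the fitted line.
manual = reg.intercept_ + reg.coef_[0] * new_sat.ravel()

for score, value in zip(new_sat.ravel(), predicted):
    print(f"SAT {score} -> predicted GPA {value:.2f}")
```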
We will discuss multiple linear regression in the next post.
Thanks