Teaching is a good way to understand and retain newly learned material. As part of my #100DaysOfCode challenge on Twitter, I am learning Machine Learning algorithms, and I will share what I learn on this blog in the hope of helping fellow beginners in this subject.
Linear Regression is an approach to modeling the relationship between a dependent variable and one or more independent variables by fitting a straight line to the observed data. It is a basic and commonly used type of predictive analysis for a continuous dependent variable using a given set of independent variables. In other words, Linear Regression is used for solving regression problems.
To better understand the definition:
Regression – a statistical method that determines the strength and character of the relationship between a dependent variable and one or more independent variables;
Continuous Variable – a variable that can take on an unlimited number of values between its minimum and maximum value (e.g. price, salary, length, etc.)
An example of a relationship between an independent variable and dependent variable is shown below:
This is a simple bivariate dataset (data involving two variables) showing the time between two eruptions and the duration of the second eruption for 10 eruptions of the geyser Old Faithful. Here y (the dependent variable) is the duration of the eruption and x (the independent variable) is the time between eruptions:
# x = Time between eruptions (in seconds)
# y = Duration of eruption (in seconds)
x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]
The regression line can be represented by the equation: y = mx + b
Where y and x are the variables describing a specific point on the graph, m is the slope of the line, and b the y-intercept describing where the line crosses the y-axis.
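To make the equation less abstract, here is a minimal sketch that works out m and b for the Old Faithful numbers using the standard least-squares formulas in plain Python. The helper name `fit_line` is my own invention for illustration and does not come from any library.

```python
# Old Faithful sample used in this post
x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]


def fit_line(xs, ys):
    """Return (m, b) of the least-squares line y = m*x + b (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    m = sxy / sxx
    # Intercept: the fitted line always passes through (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b


m, b = fit_line(x, y)
print(f"m = {m:.3f}, b = {b:.2f}")
```

For this sample the slope comes out at roughly 0.12 and the intercept at roughly 53.8, so each extra unit of x adds about 0.12 to the predicted y.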
We calculate the correlation coefficient r to check whether there is a linear relationship between the two variables, because if there is none, linear regression cannot be used for prediction. The value of r ranges from -1 to 1, where 0 means there is no linear relationship and -1 or 1 means a perfect linear relationship. R-squared, which for a single independent variable is simply the square of r, ranges from 0 to 1.
For the example above, r is about 0.76 (so R-squared is about 0.58), which shows that there is a relationship between the time between eruptions and the duration of eruption, albeit not a perfect one.
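The 0.76 figure can be reproduced with the Pearson correlation formula. The sketch below does this in plain Python, where `pearson_r` is a hypothetical helper name of my own. Squaring the result gives the coefficient of determination for this single-variable case.

```python
import math

# Old Faithful sample used in this post
x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    syy = sum((yi - mean_y) ** 2 for yi in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(x, y)
r_squared = r ** 2  # share of the variation in y explained by the fitted line
print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
```

Here r rounds to 0.76 and its square to 0.58, which is why it matters which of the two numbers is being quoted.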
The dataset that I will be using in this tutorial is USA_Housing from GitHub user huzaifsayed, and I will be using a Google Colaboratory notebook to write and execute the code.
import pandas as pd
url = 'https://raw.githubusercontent.com/huzaifsayed/Linear-Regression-Model-for-House-Price-Prediction/master/USA_Housing.csv'
dataset = pd.read_csv(url)
There are different ways to load a CSV file in a Google Colaboratory notebook; one of the easiest is to read it straight from a GitHub repository. Copy the link to the raw dataset and store it in a variable, then pass that variable to Pandas read_csv to get the dataframe.
The Pandas dataframe.info() method prints a summary of the dataframe (column names, non-null counts, and data types), which is useful when analyzing the data. As we are using linear regression, we will not include the ‘Address’ column: it holds text (object dtype), which is not useful in our linear regression model:
X = dataset[['Avg. Area Income', 'Avg. Area House Age',
             'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms',
             'Area Population']]
y = dataset['Price']
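If you would rather not read the column types off the info() output by eye, one option is to ask Pandas for the numeric and non-numeric columns directly. This is only a sketch on a tiny made-up dataframe: the column names mirror a few of those in USA_Housing, but the values are invented and `sample` is a hypothetical stand-in for the real dataset.

```python
import pandas as pd

# Tiny invented frame; NOT rows from the real USA_Housing file
sample = pd.DataFrame({
    "Avg. Area Income": [60000.0, 75000.0, 82000.0],
    "Area Population": [25000.0, 40000.0, 36000.0],
    "Price": [900000.0, 1200000.0, 1500000.0],
    "Address": ["addr A", "addr B", "addr C"],
})

# Columns with a numeric dtype are usable by a linear model as they are
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else (here, the text column) needs to be dropped or encoded first
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

On the real dataset the same two calls on `dataset` should single out the text column in the same way.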
Split Dataset into Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
We will use the train_test_split() function from sklearn's model_selection module. It splits arrays or matrices into random train and test subsets. (Documentation)
test_size – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples;
random_state – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
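To see what these two parameters do in practice, here is a small sketch on synthetic data (10 rows, not the housing dataset). The names `X_demo` and `y_demo` are placeholders of my own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Float test_size: 40% of the 10 rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
print(X_tr.shape, X_te.shape)

# Int test_size: exactly 3 rows go to the test set
_, X_te_int, _, _ = train_test_split(
    X_demo, y_demo, test_size=3, random_state=101
)
print(len(X_te_int))

# Same random_state: the shuffle, and therefore the split, is repeated exactly
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
print(np.array_equal(y_te, y_te_again))
```

The value 101 itself has no special meaning; any fixed integer makes the split reproducible.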
Training the Linear Regression Model
from sklearn.linear_model import LinearRegression
linmodel = LinearRegression()
# training the model using the training set
linmodel.fit(X_train, y_train)
Create a linear regression object and store it in a variable called linmodel, then train the model using the training set X_train and y_train. (Documentation)
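Once a model is fitted, the learned slope(s) and intercept are available as the `coef_` and `intercept_` attributes. The sketch below uses synthetic data built from a rule I chose myself (y = 3*x1 + 2*x2 + 5), so you can check that the model recovers it. `demo_model`, `X_demo` and `y_demo` are placeholder names, not part of the housing example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known rule with no noise: y = 3*x1 + 2*x2 + 5
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 5

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# One coefficient per feature (the "m" values) and a single intercept (the "b")
print(demo_model.coef_)
print(demo_model.intercept_)
```

On the housing model, `linmodel.coef_` would hold one value per column of X, in the same order as the columns.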
Prediction from Linear Regression Model
import matplotlib.pyplot as plt
predictions = linmodel.predict(X_test)
plt.scatter(y_test, predictions)
We now use the trained model to predict the outcome of the test set. To visualize the result, we plot the data points of y_test against predictions: the closer the points lie to a straight diagonal line, the better the predictions match the actual values.
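A scatter plot gives a visual impression; if you also want a number, scikit-learn's metrics module offers functions such as `mean_absolute_error` and `r2_score`. This is a sketch on synthetic, noisy data that I generated myself, not a result for the housing dataset, and all `demo` names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 4*x + 10 plus random noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 4 * X_demo[:, 0] + 10 + rng.normal(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
demo_model = LinearRegression().fit(X_tr, y_tr)
demo_pred = demo_model.predict(X_te)

# MAE: average absolute gap between actual and predicted values
mae = mean_absolute_error(y_te, demo_pred)
# R^2: share of the variation in the test targets explained by the model
r2 = r2_score(y_te, demo_pred)
print(f"MAE = {mae:.2f}, R^2 = {r2:.3f}")
```

The same two calls with `y_test` and `predictions` would score the housing model.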
The prediction from the test set:
Some other real-life applications of Linear Regression are predicting one’s salary based on experience, predicting the gas money needed for a road trip based on miles driven, predicting product sales based on past buying behavior, or even predicting the economic growth of a country or state. It is a simple yet very useful algorithm for making predictions.