Data Visualization Using Seaborn

Seaborn is a Python data visualization library that provides attractive and informative statistical graphics. In this article I will briefly discuss a few of the functions used for data visualization with seaborn:

  • seaborn.jointplot()
  • seaborn.distplot()
  • seaborn.boxplot()


The seaborn.jointplot() function displays the relationship between two variables (bivariate), x and y, along with the univariate distribution of each in the margins.

Univariate is a term used to describe data that observes only a single attribute or characteristic, while bivariate data observes two attributes that are usually related: for example, the number of tweets posted in a day versus the number of engagements on those tweets. If you observed only one of them, it would be considered univariate.

seaborn.jointplot() is intended to be a fairly lightweight wrapper [1].

There are a lot more parameters (see Seaborn official documentation) available but here are some of the important ones:

  1. x, y: vectors or keys in data
    • Variables that specify positions on the x and y axes
  2. data: pandas.DataFrame, numpy.ndarray, mapping or sequence
    • Input data structure. Either long-form or wide-form
  3. kind: {“scatter”, “kde”, “hist”, “hex”, “reg”, “resid”}
    • Kind of plot to draw

Here is an example of what a standard jointplot() looks like when data is plotted: a scatterplot with marginal histograms.

[Image: seaborn jointplot output — scatterplot with marginal histograms]

You can load a dataset of your own and pass its columns to the x and y parameters, but for this example, to show you what a jointplot() looks like, I generated values with randn(1000), stored them in the variables data1 and data2, and passed those variables to x and y.

You can change the kind of plot by passing one of {“scatter”, “kde”, “hist”, “hex”, “reg”, “resid”} to the kind parameter.


The seaborn.distplot() function displays univariate data as a histogram with a kernel density estimate (KDE) curve drawn over it. (Note that distplot() is deprecated in recent seaborn versions in favor of displot() and histplot().)

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions [2].

[Image: seaborn distplot output — histogram with KDE curve]

There are a lot more parameters (see the Seaborn official documentation) available, but for this example I have only passed data1 (our observed data) to the distplot() function to show you what it looks like.


The seaborn.boxplot() function displays how the values in the data are spread out. It divides the data into four sections, each containing approximately 25% of the data in the set.

A boxplot displays the distribution of data based on the five-number summary: minimum score, lower quartile (Q1), median, upper quartile (Q3), and maximum score [3].

  1. Minimum score: lowest score, left whisker
  2. Lower quartile (Q1): value between the minimum and the median
  3. Median: mid-point of the data
    • If the median is in the middle of the box and the whiskers are the same length on both sides, the distribution is symmetric;
    • If the median is closer to the bottom and the whisker is shorter on the lower end, the distribution is positively skewed;
    • If the median is closer to the top and the whisker is shorter on the upper end, the distribution is negatively skewed.
  4. Upper quartile (Q3): value between the median and the maximum
  5. Maximum score: highest score, right whisker

The longer the box, the more dispersed the data; the shorter the box, the less dispersed the data.

Important parameters (see Seaborn official documentation for more):

  • x, y, hue: names of variables in data or vector data, optional
  • data: DataFrame, array, or list of arrays, optional

In this example, the output of the boxplot() function shows the dispersion of the CombMPG data grouped by Mfr Name, taken from a DataFrame called df.
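The fuel-economy DataFrame itself is not shown in the article, so here is a sketch with a small made-up df standing in for it (the column names Mfr Name and CombMPG follow the text; the manufacturers and values are invented):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import seaborn as sns

# Hypothetical stand-in for the fuel-economy DataFrame used in the article
df = pd.DataFrame({
    "Mfr Name": ["Honda", "Honda", "Ford", "Ford", "BMW", "BMW"],
    "CombMPG":  [30, 34, 21, 24, 26, 28],
})

# One box per manufacturer, showing the spread of CombMPG values
ax = sns.boxplot(x="Mfr Name", y="CombMPG", data=df)
```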


[1] Seaborn.jointplot()

[2] Seaborn.distplot()

[3] Understanding Boxplots

Machine Learning: House Price Prediction Using Linear Regression

Teaching helps with understanding and retaining new material. As part of my #100DaysOfCode challenge on Twitter, while learning Machine Learning algorithms, I will relay what I learn here on this blog in the hope of helping fellow beginners in this subject.


Linear Regression is an approach to modeling the relationship between two variables by fitting a straight line to the observed data. It is a basic and commonly used type of predictive analysis that predicts a continuous dependent variable from a given set of independent variables. Thus, it can be said that Linear Regression is used for solving regression problems.

For better understanding of the definition:

Regression – a statistical method that determines the strength and character of the relationship between a dependent variable and an independent variable;

Continuous Variable – can take on an unlimited number of values between its minimum and maximum values (e.g. price, salary, length, etc.)

An example of a relationship between an independent variable and dependent variable is shown below:

[Image: scatterplot of the eruption data]

This is simple bivariate data (data involving two variables) showing, for 10 eruptions of the geyser Old Faithful, the time between two eruptions and the duration of the second eruption, with y (the dependent variable) being the duration of the eruption and x (the independent variable) being the time between eruptions:

#x = Time between eruptions (in seconds)
#y = Duration of eruption (in seconds)

x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]

The regression line can be represented by the equation y = mx + b

Where y and x are the variables describing a specific point on the graph, m is the slope of the line, and b the y-intercept describing where the line crosses the y-axis.
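For the eruption data above, m and b can be estimated with a least-squares fit. Here is a quick sketch using NumPy's polyfit (my choice of tool here, not something the article specifies):

```python
import numpy as np

x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]

# Degree-1 (straight line) least-squares fit: returns slope m and intercept b
m, b = np.polyfit(x, y, 1)
print(f"y = {m:.3f}x + {b:.2f}")
```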

We calculate the correlation coefficient r to check whether there is a relationship between the two variables, because if there is no relationship, linear regression cannot be used for prediction. The value of r ranges from -1 to 1, where 0 means there is no relationship and values near -1 or 1 mean there is a strong (negative or positive) relationship. R-squared is the square of r and ranges from 0 to 1.

We got 0.76 for r in the example above, which shows that there is a relationship between the time between eruptions and the duration of eruption, albeit not a perfect one.
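That 0.76 can be reproduced directly. A sketch using NumPy's corrcoef (this computes the correlation coefficient r; squaring it gives R-squared):

```python
import numpy as np

x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient r
print(round(r, 2))            # r  ≈ 0.76
print(round(r ** 2, 2))       # R² ≈ 0.58
```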


Getting Started

The dataset that I will be using in this tutorial is from GitHub user huzaifsayed’s USA_Housing, and I will be using a Google Colaboratory notebook to write and execute the code.


Importing Dataset

import pandas as pd

url = ''
dataset = pd.read_csv(url)

There are different ways to load a CSV file in a Google Colaboratory notebook; one of the easiest is to upload it from a GitHub repository. Copy the link of the raw dataset and store it in a variable, then pass that variable to pandas read_csv to get the DataFrame [1].
[Image: output of showing the DataFrame summary]

The pandas info() function is used to get a summary of the DataFrame, which can be useful when analyzing the data. Since we are using linear regression, we will not include the ‘Address’ column, because it is an object (text) column that will not be useful in our linear regression model:

X = dataset[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']]
y = dataset['Price']


Split Dataset into Train and Test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

We will use the train_test_split() method from sklearn's model_selection library. train_test_split splits arrays or matrices into random train and test subsets. (Documentation)

test_size – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples;

random_state – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. 


Training the Linear Regression Model

from sklearn.linear_model import LinearRegression

linmodel = LinearRegression()

# training the model using the training set, y_train)

Create a linear regression object, store it in a variable called linmodel, then train the model using the training sets X_train and y_train. (Documentation)


[Image: output of fitting the linear regression model]

Prediction from Linear Regression Model

import matplotlib.pyplot as plt

predictions = linmodel.predict(X_test)  
plt.scatter(y_test, predictions)

We will now use the trained model to predict the outcome of the test set. To visualize the result, we plot the data points of y_test against the predictions.

The prediction from the test set:

[Image: scatterplot of y_test vs. predictions]

Some other real-life applications of Linear Regression include predicting one’s salary based on experience, predicting gas money for a road trip based on miles driven, predicting product sales based on past buying behavior, or even predicting the economic growth of a country or state. It is a simple yet very useful algorithm for making predictions.

Full code of what I have done // Full code from the original



[1] Get Started: 3 Ways to Load CSV files into Colab

[2] Linear Regression Machine Learning Project for House Price Prediction

[3] Machine Learning Project 1: Predict Salary using Simple Linear Regression