Data Visualization Using Seaborn

Seaborn is a Python data visualization library that provides stunning and informative statistical graphics. In this article, I will briefly discuss a few functions used for data visualization with seaborn:

  • seaborn.jointplot()
  • seaborn.distplot()
  • seaborn.boxplot()

seaborn.jointplot()

The seaborn.jointplot() function displays the relationship between two variables (bivariate), x and y, along with the univariate distribution of each variable in the margins.

Univariate describes data that observes a single attribute or characteristic, while bivariate data observes two attributes that are usually related, for example the number of tweets posted in a day versus the number of engagements on those tweets. If you observe only one of them, the data is univariate.

seaborn.jointplot() is intended to be a fairly lightweight wrapper around the JointGrid class [1].

There are a lot more parameters (see Seaborn official documentation) available but here are some of the important ones:

  1. x, y: vectors or keys in data
    • Variables that specify positions on the x and y axes
  2. data: pandas.DataFrame, numpy.ndarray, mapping or sequence
    • Input data structure. Either long-form or wide-form
  3. kind: {“scatter”, “kde”, “hist”, “hex”, “reg”, “resid”}
    • Kind of plot to draw

Here is an example of what a standard jointplot() looks like when data is plotted: a scatterplot with marginal histograms.

seaborn jointplot

You can load a data set of your own and pass its values to the x and y parameters. In this example, to show what a jointplot() looks like, I assigned values generated by randn(1000) to the variables data1 and data2, then passed those variables to x and y.

You can change the kind of plot by passing one of {“scatter”, “kde”, “hist”, “hex”, “reg”, “resid”} to the kind parameter.


seaborn.distplot()

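The seaborn.distplot() function displays univariate data as a histogram with a kernel density estimate (KDE) curve drawn over it. Note that distplot() has been deprecated since seaborn 0.11 in favor of displot() and histplot().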

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions [2].

seaborn distplot displot

There are many more parameters available (see the official Seaborn documentation), but for this example I have only passed data1 (our observed data) to the distplot() function to show you what it looks like.


seaborn.boxplot()

seaborn.boxplot() function displays how the values in the data are spread out. It divides the data into sections that each contains approximately 25% of the data in a set.

A boxplot displays the distribution of data based on the five-number summary: minimum, lower quartile (Q1), median, upper quartile (Q3), and maximum [3].

  1. Minimum: lowest score, end of the lower (left) whisker
  2. Lower quartile (Q1): 25th percentile, the median of the lower half of the data
  3. Median: mid-point of the data
    • If the median is in the middle of the box and the whiskers are the same length on both sides, then the distribution is symmetric;
    • If the median is closer to the bottom of the box and the whisker is shorter on the lower end, then the distribution is positively skewed;
    • If the median is closer to the top of the box and the whisker is shorter on the upper end, then the distribution is negatively skewed.
  4. Upper quartile (Q3): 75th percentile, the median of the upper half of the data
  5. Maximum: highest score, end of the upper (right) whisker

The longer the box the more dispersed the data is, and the shorter the box the less dispersed the data is.
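To make the five-number summary concrete, here is a small sketch that computes it with numpy. The scores array is made up for illustration only.

```python
import numpy as np

# Hypothetical scores, chosen so the quartiles are easy to verify by hand
scores = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 0th, 25th, 50th, 75th and 100th percentiles = min, Q1, median, Q3, max
minimum, q1, median, q3, maximum = np.percentile(scores, [0, 25, 50, 75, 100])

# The length of the box is the interquartile range (IQR)
iqr = q3 - q1

print(minimum, q1, median, q3, maximum, iqr)
```

Keep in mind that seaborn, by default, draws whiskers only as far as 1.5 times the IQR from the box and marks points beyond that as outliers, so the whisker ends coincide with the minimum and maximum only when there are no outliers.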

Important parameters (see Seaborn official documentation for more):

  • x, y, hue: names of variables in data or vector data, optional
  • data: DataFrame, array, or list of arrays, optional

In this example, the output of boxplot() shows how the CombMPG values are spread out for each Mfr Name, using data taken from a DataFrame called df.
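A call along those lines might look like the sketch below. The column names Mfr Name and CombMPG come from the text, but the DataFrame contents here are invented placeholder values, not the original data set.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in a notebook

import pandas as pd
import seaborn as sns

# Hypothetical stand-in for df; the real data set is not shown in the article
df = pd.DataFrame({
    "Mfr Name": ["Maker A"] * 5 + ["Maker B"] * 5,
    "CombMPG": [22, 24, 25, 27, 30, 18, 19, 21, 21, 26],
})

# One box per manufacturer, showing the spread of combined MPG values
ax = sns.boxplot(x="Mfr Name", y="CombMPG", data=df)
```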



Resources:

[1] Seaborn.jointplot()

[2] Seaborn.distplot()

[3] Understanding Boxplots

Create Digital Clock Using Tkinter

digital clock using tkinter python

A Graphical User Interface (GUI) is an interface that displays objects on screen that users can interact with. It is more user-friendly than a text-based command-line interface because it uses objects such as icons, buttons, cursors, and other graphical elements to represent actions. There are many GUI toolkits that can be used with Python, such as wxPython and JPython, but for this tutorial, we will be creating a GUI application using Tkinter.

Tkinter is the standard GUI library for Python. It provides a variety of common GUI elements, or widgets, such as buttons, text boxes, labels, and frames, that can be used to build an interface. The following widgets are available in Tkinter [1]:

Containers: frame, label frame, top level, pane window.

Buttons: button, radio button, check button (checkbox), and menu button.

Text Widgets: label, message, text.

Entry Widgets: scale, scrollbar, list box, slider, spin box, entry (single line), option menu, text (multi line), and canvas (vector and pixel graphics).

In this tutorial, we will create a simple digital clock to get the hang of using Tkinter.

 

Getting Started

import tkinter as tk
import datetime
import time

Since we are creating a GUI application using Tkinter, we must import the tkinter module. We also import the datetime and time modules to work with the date and time.

But before going further, it is good practice to plan the design layout first, which will act as a blueprint as you code. This way, you already know where to put the widgets in the GUI application, and no time is wasted figuring out where to place them while coding.

digital clock design layout
 

Creating the Application Main Window

x = datetime.datetime.now()

window = tk.Tk()
window.title("Digital Clock")

canvas = tk.Canvas(window, height=200, width=500)
canvas.pack()

frame = tk.Frame(window, bg='#696969')
frame.place(relx=0, rely=0, relheight=1, relwidth=1)

#insert code here

window.mainloop()

Line 1: creates a datetime object containing the current date and time. Later on, we will call .strftime() on it to get strings representing parts of the date in a chosen format

Line 3: creates the GUI application main window

Line 4: sets the window title to “Digital Clock”

Line 6: creates the canvas, setting the height and width to height = 200, width = 500

Line 7: packs the canvas into the window

Line 9: creates the frame setting the background color to #696969

Line 10: places the frame in a specific position in the parent widget

  • relheight, relwidth − Height and width as a float between 0.0 and 1.0, as a fraction of the height and width of the parent widget [2]
  • relx, rely − Horizontal and vertical offset as a float between 0.0 and 1.0, as a fraction of the width and height of the parent widget [2]

Line 14: mainloop() starts the event loop when the GUI application is run; it keeps the window open and waits for events from the user
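To get a feel for the relative options of place(), the arithmetic can be sketched without opening a window. The helper relative_to_pixels below is hypothetical (it is not part of Tkinter) and only approximates the pixel box that place() would produce for a parent of the given size.

```python
def relative_to_pixels(parent_width, parent_height, relx, rely, relwidth, relheight):
    """Hypothetical helper: approximate (x, y, width, height) in pixels."""
    return (
        round(relx * parent_width),      # horizontal offset is a fraction of the width
        round(rely * parent_height),     # vertical offset is a fraction of the height
        round(relwidth * parent_width),
        round(relheight * parent_height),
    )

# The frame above fills its whole 500 x 200 parent
print(relative_to_pixels(500, 200, relx=0, rely=0, relwidth=1, relheight=1))

# A widget starting 5% from the left and 15% from the top,
# taking 70% of the width and 60% of the height
print(relative_to_pixels(500, 200, relx=0.05, rely=0.15, relwidth=0.7, relheight=0.6))
```

Because the values are fractions, widgets placed this way keep their proportions when the window is resized.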

 

Inserting Label Widget

#Displays the 24-hour clock 00:00 
clock = tk.Label(frame, fg="#8FBC8F", bg='#696969', font="Verdana 110", anchor="nw")
clock.place(relx=0.05, rely=0.15, relheight=0.6, relwidth=0.7)

#Displays the seconds in clock
second = tk.Label(frame, fg="#8FBC8F", bg='#696969', font="Verdana 30", anchor="nw")
second.place(relx=0.7, rely=0.55, relheight=0.3, relwidth=0.1)

#Label for month
month = tk.Label(frame, fg='#BDB76B', bg='#696969', text="MONTH", font="Verdana 15")
month.place(relx=0.790, rely=0.1, relheight=0.15, relwidth=0.2)

#Displays month name, short version (e.g. FEB)
b = tk.Label(frame, fg='#8FBC8F', bg='#696969', text=x.strftime("%b"), font="Verdana 25 bold")
b.place(relx=0.790, rely=0.230, relheight=0.15, relwidth=0.2)

#Label for date
date = tk.Label(frame, fg='#BDB76B', bg='#696969', text="DATE", font="Verdana 15")
date.place(relx=0.790, rely=0.380, relheight=0.15, relwidth=0.2)

#Displays day of month 
d = tk.Label(frame, fg='#8FBC8F', bg='#696969', text=x.strftime("%d"), font="Verdana 25 bold")
d.place(relx=0.790, rely=0.51, relheight=0.15, relwidth=0.2)

#Label for weekday
day = tk.Label(frame, fg='#BDB76B', bg='#696969', text="DAY", font="Verdana 15")
day.place(relx=0.790, rely=0.650, relheight=0.15, relwidth=0.2)

#Displays weekday, short version (e.g. Wed)
a = tk.Label(frame, fg='#8FBC8F', bg='#696969', text=x.strftime("%a"), font="Verdana 25 bold")
a.place(relx=0.790, rely=0.77, relheight=0.15, relwidth=0.2)

The Label widget in Tkinter is used to display text or an image on the screen. The label widget uses double buffering, so you can update the contents at any time, without annoying flicker [3].

Label(master=None, **options)

master refers to the parent widget. In our case, the master is the frame.

**options refers to the widget options, passed as keyword arguments.

One of the widget options we have used is called text, which sets the text displayed in the label. If you look at lines 14, 22, and 30, the text option is set to x.strftime("%b"), x.strftime("%d"), and x.strftime("%a"), which display the abbreviated month name, day of the month, and abbreviated weekday respectively in our GUI application.
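The three format codes can be tried on their own, without any GUI. This short sketch uses a fixed example date (my choice, not from the article) instead of datetime.now() so the result is predictable; the month and weekday names assume Python's default English ("C") locale.

```python
import datetime

# Fixed example date: 3 February 2021, 14:05:09
x = datetime.datetime(2021, 2, 3, 14, 5, 9)

month_short = x.strftime("%b")    # abbreviated month name -> "Feb"
day_of_month = x.strftime("%d")   # zero-padded day of month -> "03"
weekday_short = x.strftime("%a")  # abbreviated weekday name -> "Wed"

print(month_short, day_of_month, weekday_short)
```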

 

Adding Functions

def get_time():
    #current local time as HH:MM (24-hour clock)
    hour_min = time.strftime("%H:%M")
    clock.config(text=hour_min)
    #call get_time again after 200 milliseconds
    clock.after(200, get_time)

'''
clock = tk.Label(frame, fg="#8FBC8F", bg='#696969', font="Verdana 110", anchor="nw")
clock.place(relx=0.05, rely=0.15, relheight=0.6, relwidth=0.7)
'''

get_time()


def get_second():
    #seconds of the current local time, 00 to 59
    sec = time.strftime("%S")
    second.config(text=sec)
    #call get_second again after 200 milliseconds
    second.after(200, get_second)

'''
second = tk.Label(frame, fg="#8FBC8F", bg='#696969', font="Verdana 30", anchor="nw")
second.place(relx=0.7, rely=0.55, relheight=0.3, relwidth=0.1)
'''

get_second()

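The get_time() and get_second() functions display the hours and minutes, and the seconds, on their respective labels. Each function reschedules itself with after(200, ...), so the labels are refreshed every 200 milliseconds while mainloop() is running.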
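As a variation, both labels could be refreshed from a single callback. The sketch below is one possible arrangement under my own assumptions, not the article's code: format_clock(), run_clock(), and tick() are hypothetical names, and the layout is simplified to two packed labels.

```python
import time

def format_clock(now=None):
    """Return ("HH:MM", "SS") for a struct_time; defaults to the current local time."""
    if now is None:
        now = time.localtime()
    return time.strftime("%H:%M", now), time.strftime("%S", now)

def run_clock():
    """Open a small window and refresh two labels every 200 ms (needs a display)."""
    import tkinter as tk  # imported here so format_clock() also works without a display

    window = tk.Tk()
    clock = tk.Label(window, font="Verdana 40")
    second = tk.Label(window, font="Verdana 20")
    clock.pack()
    second.pack()

    def tick():
        hour_min, sec = format_clock()
        clock.config(text=hour_min)
        second.config(text=sec)
        # Reschedule instead of looping, so the GUI stays responsive
        window.after(200, tick)

    tick()
    window.mainloop()

# Call run_clock() to try the window; the formatting can be checked on its own:
example = format_clock(time.struct_time((2021, 2, 3, 14, 5, 9, 2, 34, 0)))
print(example)
```

Using after() rather than a while loop with sleep() matters here: a blocking loop would stop Tkinter from processing events, and the window would freeze.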

 

Resources:

[1] Tkinter (Wikipedia)

[2] Python – Tkinter place() method

[3] The Tkinter Label Widget

Machine Learning: House Price Prediction Using Linear Regression

Teaching helps with understanding and retaining newly learned material. As part of my #100DaysOfCode challenge on Twitter, while I am learning Machine Learning algorithms, I will relay what I learn here on this blog in the hope of helping fellow beginners in this subject.

 

Linear Regression is an approach to modeling the relationship between two variables by fitting a straight line to the observed data. It is a basic and commonly used type of predictive analysis for a continuous dependent variable using a given set of independent variables. Thus, it can be said that Linear Regression is used for solving regression problems.

For better understanding of the definition:

Regression – a statistical method that determines the strength and character of the relationship between a dependent variable and one or more independent variables;

Continuous Variable – can take on an unlimited number of values between its minimum and maximum value (e.g. price, salary, length, etc.)

An example of a relationship between an independent variable and dependent variable is shown below:

linear regression machine learning

This is simple bivariate data (data involving two variables) showing the time between two eruptions and the duration of the second eruption for 10 eruptions of the geyser Old Faithful, with y (the dependent variable) being the duration of the eruption and x (the independent variable) being the time between eruptions:

#x = Time between eruptions (in seconds)
#y = Duration of eruption (in seconds)

x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]

The regression line can be represented by the equation: y = mx + b

Where y and x are the variables describing a specific point on the graph, m is the slope of the line, and b the y-intercept describing where the line crosses the y-axis.

We calculate the correlation coefficient r to check whether there is a linear relationship between the two variables, because if there is no relationship between them, linear regression cannot be used for prediction. The value of r ranges from -1 to 1, where 0 means there is no linear relationship and -1 or 1 means a perfect linear relationship. (R-squared, the square of r, ranges from 0 to 1.)

For the example above, r is about 0.76 (an R-squared of about 0.58), which shows that there is a relationship between the time between eruptions and the duration of the eruption, albeit not a perfect one.
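The numbers for this example can be reproduced with a few lines of numpy. This is a sketch using np.polyfit and np.corrcoef, which may differ from the tool used to produce the figure above; the 0.76 quoted in the text matches the correlation coefficient r computed here, and squaring it gives R-squared.

```python
import numpy as np

# x = time between eruptions, y = duration of eruption (values from the text)
x = np.array([272, 227, 237, 238, 203, 270, 218, 226, 250, 245], dtype=float)
y = np.array([89, 79, 83, 82, 81, 85, 78, 81, 85, 79], dtype=float)

# Fit y = m*x + b by least squares (degree-1 polynomial)
m, b = np.polyfit(x, y, 1)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

print(f"slope m = {m:.3f}, intercept b = {b:.2f}")
print(f"r = {r:.2f}, R-squared = {r_squared:.2f}")

# Predicted duration for a hypothetical wait of 240 seconds
print(m * 240 + b)
```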

 

Getting Started

The dataset that I will be using in this tutorial is USA_Housing from GitHub user huzaifsayed, and I will be using a Google Colaboratory notebook to write and execute the code.

 

Importing Dataset

import pandas as pd

url = 'https://raw.githubusercontent.com/huzaifsayed/Linear-Regression-Model-for-House-Price-Prediction/master/USA_Housing.csv'
dataset = pd.read_csv(url)

There are different ways to load a CSV file in a Google Colaboratory notebook; one of the easiest is to read it directly from a GitHub repository. Copy the link to the raw dataset and store it in a variable, then pass that variable to Pandas read_csv to get the dataframe [1].

 
dataset.info()
dataset info linear regression machine learning

The Pandas dataframe.info() function gives a summary of the dataframe, which is useful when analyzing the data. Since we are using linear regression, we will not include the ‘Address’ column: it has the object (text) data type and cannot be used as a numeric feature in our linear regression model:

X = dataset[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']]
y = dataset['Price']
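An alternative to listing the feature columns by hand is to let pandas pick the numeric ones. The sketch below uses a tiny made-up frame with a few of the same column names; the values are placeholders, not rows from USA_Housing.

```python
import pandas as pd

# Hypothetical miniature version of the dataset (invented values)
sample = pd.DataFrame({
    "Avg. Area Income": [79545.0, 79248.0, 61287.0],
    "Area Population": [23086.0, 40173.0, 36882.0],
    "Price": [1059033.0, 1505890.0, 1058987.0],
    "Address": ["address one", "address two", "address three"],
})

# Keep only numeric columns, which drops the text column 'Address'
numeric = sample.select_dtypes(include="number")

X = numeric.drop(columns="Price")  # features
y = numeric["Price"]               # target

print(list(X.columns))
```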

 

Split Dataset into Train and Test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

We will use the train_test_split() function from the model_selection module of sklearn. train_test_split() splits arrays or matrices into random train and test subsets. (Documentation)

test_size – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples;

random_state – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. 
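The effect of the two parameters is easy to check on a toy array. This sketch uses made-up data with 10 samples; only train_test_split and its test_size and random_state arguments come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.4 puts 40% of the samples (4 of 10) in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101
)

# The same random_state gives the same split on every call
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.4, random_state=101
)

print(len(X_train), len(X_test))
print(np.array_equal(X_test, X_test2))
```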

 

Training the Linear Regression Model

from sklearn.linear_model import LinearRegression

linmodel = LinearRegression()

#training the model using training set
linmodel.fit(X_train, y_train)

Create a linear regression object and store it in a variable called linmodel, then train the model using the training set X_train and y_train. (Documentation)

Output:

training linear regression machine learning output
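After fitting, the learned line (or plane) can be inspected through the coef_ and intercept_ attributes. The sketch below uses synthetic data built from a known formula, which is my own example rather than the housing data, so it is clear what the model should recover.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features and a target that follows y = 3*x1 + 2*x2 + 5 exactly
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

model = LinearRegression()
model.fit(X, y)

# One coefficient per feature, plus the intercept
print(model.coef_)
print(model.intercept_)
```

On the housing data, the same two attributes of linmodel would give one coefficient for each of the five feature columns.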
 

Prediction from Linear Regression Model

import matplotlib.pyplot as plt

predictions = linmodel.predict(X_test)  
plt.scatter(y_test, predictions)

We now use the trained model to predict the outcome of the test set. To visualize the result, we plot the data points of y_test against predictions.

The prediction from the test set:

prediction from linear regression model
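Besides looking at the scatter plot, the predictions can be summarized with a couple of numbers. This is a hedged sketch using sklearn.metrics on small made-up arrays; in the tutorial, y_test and predictions would take the place of y_true and y_pred.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true values and predictions, for illustration only
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5, 7.5, 9]

# Average absolute difference between prediction and truth
mae = mean_absolute_error(y_true, y_pred)

# Share of the variance in y_true explained by the predictions (1.0 is perfect)
r2 = r2_score(y_true, y_pred)

print(mae, r2)
```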
 

Other real-life applications of Linear Regression include predicting one’s salary based on experience, predicting the gas money needed for a road trip based on miles driven, predicting product sales based on past buying behavior, or even predicting the economic growth of a country or state. It is a simple yet very useful algorithm for making predictions.

Full code of what I have done // Full code from the original

 

References:

[1] Get Started: 3 Ways to Load CSV files into Colab

[2] Linear Regression Machine Learning Project for House Price Prediction

[3] Machine Learning Project 1: Predict Salary using Simple Linear Regression