Machine Learning

Setup

DigitRecognition

recognition.py

Modules for Analysis:
pandas - for datasets
numpy - for arrays
svm from sklearn - for the SVM algorithm
metrics from sklearn - module for evaluating the performance of the model

Modules for Visuals:
matplotlib.pyplot - for graphs
seaborn - prettier graphs (we will also set the font for our visuals to be slightly bigger using sns.set(font_scale = 1.2))

New Module:
We will also import tensorflow here as tf.
TensorFlow is another machine learning library with many uses. For now we are only using it to load the MNIST dataset.
The MNIST dataset is a large set of images of handwritten digits, with 60,000 training images and 10,000 testing images.

import pandas as pd
import numpy as np
from sklearn import svm
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(font_scale=1.2)
import tensorflow as tf

The dataset is already clean and ready for use.

We can load the dataset straight into the training set and testing set, each with the features and labels separated.

load_data()

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

Displaying the Images

imshow()

plt.imshow(X_train[0], cmap='gray')
plt.show()

1. All the images are grayscale, meaning they only contain black, white and gray.
2. The images are 28 pixels by 28 pixels in size (28x28).
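
To make these two observations concrete, here is a small sketch that builds a stand-in image with NumPy instead of loading MNIST. The array name fake_image is made up for illustration. It assumes, as is the case for the copy of MNIST that Keras provides, that each pixel is a single 8-bit intensity from 0 (black) to 255 (white).

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28 rows x 28 columns,
# with one intensity value per pixel (no separate red/green/blue channels).
fake_image = np.zeros((28, 28), dtype=np.uint8)

# Draw a crude white vertical stroke, a bit like a handwritten "1".
fake_image[4:24, 13:15] = 255

print(fake_image.shape)                    # (28, 28)
print(fake_image.dtype)                    # uint8
print(fake_image.min(), fake_image.max())  # 0 255
```

Passing an array like this to plt.imshow() with cmap='gray' would draw the low values dark and the high values bright.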

print(X_train[0])

Fixing the Data

shape

print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)

Whenever we fit our model, we need to pass two arguments into the fit() function:

X: Training data of shape (n_samples, n_features)
y: Training label values of shape (n_samples,)

Whenever we predict with our model, we need to pass one argument into the predict() function:

X: Testing samples of shape (n_samples, n_features)
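
The shapes that fit() and predict() work with can be seen on a tiny example. This is only a sketch: the toy arrays and the names X_toy, y_toy, toy_model and X_new are invented for illustration and have nothing to do with MNIST.

```python
import numpy as np
from sklearn import svm

# Made-up toy data: 6 samples with 2 features each, so X has shape (6, 2).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
# One label per sample, so y has shape (6,).
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = svm.SVC()
toy_model.fit(X_toy, y_toy)

# predict() takes a 2D array of samples and returns one label per row.
X_new = np.array([[0.5, 0.5], [5.5, 5.5]])
toy_pred = toy_model.predict(X_new)

print(X_toy.shape, y_toy.shape, toy_pred.shape)  # (6, 2) (6,) (2,)
```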

Basically, supervised learning algorithms in scikit-learn expect the features to be stored in a two-dimensional array, with one row per sample.

Luckily, our labels in y_train and y_test need no changes. They are 1D arrays of shape (n_samples,), which is what the classifier expects for labels.

However, our features are still in 3 dimensions with a shape (n_samples, 28, 28). We need to reshape this to be only 2 dimensional.

We will do this by storing each image not as a 2D grid of height and width, but as one long array of all its pixels. In other words, 28 pixels by 28 pixels becomes 784 pixels (28 squared).
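
The flattening step can be tried on a small stand-in array before applying it to the real data. In this sketch the batch of 5 made-up "images" and the names images, flat and restored are assumptions for illustration. Using -1 asks NumPy to work out the 784 on its own.

```python
import numpy as np

# Made-up stand-in for a batch of images: 5 images of 28x28 pixels.
images = np.arange(5 * 28 * 28).reshape(5, 28, 28)

# Flatten each image into one row of 784 values.
# -1 lets NumPy compute the second dimension (28 * 28 = 784).
flat = images.reshape(len(images), -1)
print(flat.shape)  # (5, 784)

# Rows are laid out one after another: the pixel at (row r, column c)
# ends up at position r * 28 + c in the flat array.
assert flat[0, 3 * 28 + 7] == images[0, 3, 7]

# Nothing is lost: reshaping back to (28, 28) recovers the image.
restored = flat[0].reshape(28, 28)
print(np.array_equal(restored, images[0]))  # True
```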

X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)

Fitting the SVM Model

X_train = X_train[:100, :]
y_train = y_train[:100]
X_test = X_test[:100, :]
y_test = y_test[:100]

model = svm.SVC()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

index_to_compare = 0

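# The title shows the true label next to the model's prediction
title = f'True: {y_test[index_to_compare]}, Prediction: {y_pred[index_to_compare]}'

plt.title(title)
# Each row of X_test is a flat (784,) array, so restore the 28x28 grid first
plt.imshow(X_test[index_to_compare].reshape(28, 28), cmap='gray')
plt.grid(False)  # explicitly hide the grid lines added by the seaborn style
plt.axis('off')
plt.show()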

We create a title which displays both the true label of the image and our model's predicted label.

We pass the image into imshow() and set the cmap to gray. Here we also have to reshape the image data, which is currently (784,), back to (28, 28).

Finally, we turn off the grid and axis since we're not displaying a graph, and we show the image to the screen.

index_to_compare = 2

index_to_compare = 6

Evaluating the Model

metrics.accuracy_score()

acc = metrics.accuracy_score(y_test, y_pred)
print('\nAccuracy: ', acc)
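
With its default settings, accuracy_score() reports the fraction of predictions that match the true labels. The sketch below works that out by hand with NumPy on ten made-up labels. The arrays true_labels and pred_labels are illustrative only and are not output from our model.

```python
import numpy as np

# Made-up labels: ten true digits and ten predictions, two of them wrong.
true_labels = np.array([3, 8, 1, 0, 4, 1, 7, 9, 5, 2])
pred_labels = np.array([3, 8, 1, 0, 4, 1, 7, 4, 6, 2])

# Element-wise comparison gives True where the prediction is right.
correct = true_labels == pred_labels

# The mean of the True/False values is the fraction that are correct.
manual_accuracy = correct.mean()
print(manual_accuracy)  # 0.8
```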

Why is it so inaccurate?

There are many factors that affect the accuracy and performance of a machine learning model.

In this case, there is one major factor that is affecting the accuracy: the sample size.

We reduced the training set to just 100 images, which is very few for the model to learn from.

With only 100 images spread over 10 digits, there are on average just 10 images for each handwritten digit.

However, that's only an average. There may be, for example, only 1 image of one digit and 19 of another.
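
A quick way to see how uneven a small sample can be is to count how often each digit appears. This sketch uses twenty made-up labels (sample_labels is a hypothetical stand-in for a slice such as y_train[:100]) and NumPy's bincount().

```python
import numpy as np

# Made-up labels standing in for a small slice of the training labels.
sample_labels = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3,
                          5, 8, 9, 7, 9, 3, 2, 3, 8, 4])

# counts[d] is how many times digit d appears.
# minlength=10 keeps a slot for every digit, even one that never appears.
counts = np.bincount(sample_labels, minlength=10)

for digit, count in enumerate(counts):
    print(digit, count)

print(counts.mean())  # 2.0 images per digit on average
```

In this made-up sample the average is 2 images per digit, yet the digit 0 never appears at all and the digit 3 appears four times.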

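# Put the training labels in a named column so seaborn can refer to it
digits = pd.DataFrame({'Digit': y_train})

ax = sns.countplot(x='Digit', data=digits)

# These are the training labels, so the title says so
ax.set_title("Distribution of Digit Images in Training Set")
ax.set(xlabel='Digit', ylabel='Count')

plt.show()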

A confusion matrix (also known as an error matrix) is a table that compares predicted classifications to actual classifications.

A heatmap uses the intensity of colors to represent the size of each value in a table.
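
To show what the table contains, the sketch below builds a small confusion matrix by hand with NumPy. The three-class labels and the names actual, predicted and cm_manual are made up for illustration. It follows the same layout that metrics.confusion_matrix() uses: rows are the actual classes and columns are the predicted classes.

```python
import numpy as np

# Made-up labels for a 3-class problem (classes 0, 1 and 2).
actual = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2])

n_classes = 3
cm_manual = np.zeros((n_classes, n_classes), dtype=int)

# Add 1 to the cell at (actual class, predicted class) for every sample.
np.add.at(cm_manual, (actual, predicted), 1)

print(cm_manual)
# [[1 1 0]
#  [0 2 1]
#  [1 0 3]]

# Correct predictions sit on the diagonal.
print(np.trace(cm_manual), "of", len(actual), "correct")  # 6 of 9 correct
```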

metrics.confusion_matrix()

cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

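# plt.subplots() returns both the figure and its axes, so unpack the pair
fig, ax = plt.subplots(figsize=(9, 6))

# Draw the heatmap on those axes and write each count inside its cell
sns.heatmap(cm, annot=True, ax=ax)

ax.set_title("SVC Prediction Accuracy")
ax.set_xlabel("Predicted Digit")
ax.set_ylabel("True Digit")

plt.show()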

X_train = X_train[:500, :]
y_train = y_train[:500]
X_test = X_test[:100, :]
y_test = y_test[:100]

model = svm.SVC()
model.fit(X_train, y_train)
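
The effect of training on more images can be tried out without downloading anything. The sketch below is an assumption-laden stand-in: it swaps MNIST for the smaller 8x8 digits dataset bundled with scikit-learn (load_digits()), and the names X_all, y_all, X_eval, y_eval and scores are made up for illustration. The accuracies it prints will not match the ones from our MNIST model.

```python
from sklearn import datasets, metrics, svm

# Swapped-in dataset: scikit-learn's bundled 8x8 digit images, which come
# already flattened to shape (n_samples, 64). This is not MNIST.
digits_small = datasets.load_digits()
X_all, y_all = digits_small.data, digits_small.target

# Hold out the last 300 samples for testing.
X_eval, y_eval = X_all[-300:], y_all[-300:]

# Train one model on 100 samples and another on 500, then compare.
scores = {}
for n_train in (100, 500):
    clf = svm.SVC()
    clf.fit(X_all[:n_train], y_all[:n_train])
    scores[n_train] = metrics.accuracy_score(y_eval, clf.predict(X_eval))

print(scores)
```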
