Random Forest for Regression

In this lesson you will use the random forest algorithm with Scikit-Learn to solve a regression problem.

Setup



    Now that you have learned the basic idea of the random forest model, let's use it to solve a regression problem.

    This lesson will also introduce you to Scikit-Learn, a Python data science module that gives us access to various machine learning algorithms, such as random forest.

    The Regression Problem

    We want to predict the gas consumption (in millions of gallons) in 48 of the US states based on the petrol tax (in cents), per capita income (dollars), paved highways (in miles) and the proportion of the population with a driving license.


    Note: in the dataset we will use, gas is called petrol.

    First, create a new project in Spyder and save it as RandomForestRegression.

    Then, create a new file inside the project and save it as regression.py. You can also delete the text that's already in the file.


    Create a subfolder in your project called Datasets. Then download the dataset that we will be using and save it into the Datasets folder.

    petrol_consumption.csv - A CSV file containing information about petrol consumption.


    In your regression.py file, import pandas as pd and numpy as np.

    import pandas as pd
    import numpy as np
            


    Create a DataFrame using the CSV file 'petrol_consumption.csv'. Remember to create a relative reference to the file using 'Datasets/'.

    We will also set the Pandas options here so we can see all the columns in our dataset.

    import pandas as pd
    import numpy as np
    
    pd.options.display.max_columns = None
    
    df = pd.read_csv('Datasets/petrol_consumption.csv')
    


    Let's print the first 5 records of the dataset to see what it looks like.
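
    A minimal way to do this (the lesson's exact snippet is not shown, so here is one straightforward option) is with the DataFrame's head() method, which returns the first 5 rows by default:

    df = pd.read_csv('Datasets/petrol_consumption.csv')

    # Show the first 5 rows of the DataFrame
    print(df.head())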



    Remember, the first thing we need to do is prepare the data before we can use it. Let's do that now.

Splitting the Data



    Before we can teach our model how to predict, we need to divide the data into features and labels.

    The label (or target) is the value we want to predict, in this case the gas consumption. The features are all the columns the model uses to make a prediction. We will then divide the resulting data into training and test sets.

    As you may recall, when we do supervised machine learning, we have to divide the data into two sets:

    A training set - used to teach the model how to predict. This set contains input data (the features) that correspond to an output (the labels). The model then predicts results using the input of the training dataset and compares the results to the actual output. Based on the results of the comparison and the algorithm being used, the parameters of the model are adjusted.

    A testing set - used to evaluate the performance of the model. Once the model has finished with the training dataset and made its adjustments, it makes predictions using only the input data from the testing set. We can then compare those predictions against the known outputs to check how accurate our model was.


    Let's start by dividing the data into features and labels.

    As you may recall, we can use iloc[] for position-based indexing. We can index both axes with iloc[] by separating them with a comma. E.g. df.iloc[0, 1] returns the element in row 0, column 1.

    There are many different ways to use iloc for indexing which you can learn more about here.

    In our case we are going to combine iloc[] with slicing. We can use a colon (:) to select the entire row axis.

    We can then specify which columns we want to get using slicing.
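
    Before applying this to our dataset, here is a tiny standalone illustration using a made-up DataFrame (the values below are hypothetical, just to show the slicing pattern):

    # Hypothetical 3x3 DataFrame, only for demonstrating iloc[]
    demo = pd.DataFrame({'a': [1, 2, 3],
                         'b': [4, 5, 6],
                         'c': [7, 8, 9]})

    print(demo.iloc[0, 1])     # single element: row 0, column 1 -> 4
    print(demo.iloc[:, 0:2])   # all rows, columns 0 and 1 (index 2 is excluded)

    Applying the same idea to our dataset: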

    df = pd.read_csv('Datasets/petrol_consumption.csv')
    
    X = df.iloc[:, 0:4].values
    y = df.iloc[:, 4].values
            

    Here we use slicing to get the first 4 columns for X (indexes 0 up to, but not including, 4) and the fifth column for y (index 4).

    You may notice that we are using a capital letter for X (the features) and a lowercase letter for y (the labels). In mathematics, the capital letter (X) represents a matrix of data values and the lowercase letter (y) represents a vector of data values. This is a common naming convention for splitting data in machine learning.

    Let's look at what data we have in X (our features) and y (our labels).

    y = df.iloc[:, 4].values
    
    print("X")
    print()
    print(X)
    print()
    print("y")
    print()
    print(y)
            



    As you can see, X has all the values from our first 4 columns and y has all the values from the final column. The values in X are displayed in scientific notation because one of the columns contains float values, so the whole array is stored as floats. To read scientific notation, 3.5710e+03 just means 3.5710 x 10^3, which is equivalent to 3571.
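
    If you would rather see the plain numbers, you can optionally tell NumPy not to use scientific notation when printing (this is not required for the lesson):

    # Optional: print NumPy arrays in fixed-point rather than scientific notation
    np.set_printoptions(suppress=True)
    print(X)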

    Now that we have our features and our labels separated, we can divide the data into training and testing sets using Scikit-Learn. First we have to import train_test_split from sklearn.model_selection.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
            

    Before discussing train_test_split, you should know about Sklearn (or Scikit-learn). It is a Python library that offers various features for data processing that can be used for classification, clustering, and model selection.

    model_selection is Scikit-Learn's module for splitting data and setting up how a model will be trained and then evaluated on new data. Evaluating your model properly allows you to generate accurate results when making a prediction.

    To do that, you need to train your model by using a training dataset. Then, you test the model against a testing dataset.

    If you have one dataset, you'll need to split it by using the Sklearn train_test_split function first.


    Now we can use the train_test_split() method to split up our data.

    X = df.iloc[:, 0:4].values
    y = df.iloc[:, 4].values
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    

    This is a long line of code, so let's take a moment to understand each part:

    train_test_split() is a function in Sklearn model selection for splitting data arrays into two subsets: one for training data and another for testing data. With this function, you don't need to divide the dataset manually.

    By default, Sklearn's train_test_split() will randomly divide the dataset into two. However, you can also specify a random state for the operation.

    There are a few parameters in train_test_split that we need to look at:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.*, random_state=*)
    

    Output:

    X_train - variable to hold the features of the training dataset
    X_test - variable to hold the features of the testing dataset
    y_train - variable to hold the set of labels that correspond to the data in X_train
    y_test - variable to hold the set of labels that correspond to the data in X_test


    Parameters:

    X, y - the arrays you want to split: the features (X) and the labels (y).

    test_size - a proportion (represented as a float value between 0 and 1) that sets the size of the testing dataset. It defaults to 0.25 if neither test_size nor train_size is specified.

    random_state - if you do not pass a value into this parameter, it will perform a different random grouping of the training and testing splits every time. If you pass in a number, it will perform the same split every time the function is called.



      X = df.iloc[:, 0:4].values
      y = df.iloc[:, 4].values
    
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
      

    So as you can see from our usage, we split the data so that 20% of the data is randomly selected to go into the test set.

    This leaves 80% of the data to be randomly placed into the training set. The 80-20 ratio is fairly standard for train-test splitting, but once you finish setting up the code for this assignment, feel free to change the split amounts and see how that alters the accuracy of the machine learning algorithm.

Feature Scaling



    When you look at the dataset, you can see that the values aren't scaled very well. For example, the Petrol_tax column has values in the range of tens, while the Average_Income column has values in the thousands. We can fix this using feature scaling.

    Feature scaling is a method used to normalize the range of independent variables (features) of data. It is also known as data normalization.

    Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.

    The algorithms only take in the magnitude of features and ignore the units. Therefore results would vary greatly between different units, such as 5 kilograms and 5000 grams, even though they are the same amount. Features with high magnitudes will matter much more in the machine learning algorithm's distance calculations than features with low magnitudes.

    Therefore we use feature scaling to normalize our data into a fixed range.
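
    To see why this matters, here is a small sketch using two hypothetical data points (made-up values, not rows from our dataset) with a tax-like feature in the tens and an income-like feature in the thousands:

    import numpy as np

    # Hypothetical points: [tax in cents, income in dollars]
    a = np.array([9.0, 3600.0])
    b = np.array([8.5, 4400.0])

    # The distance is dominated by the income feature simply because its numbers are bigger
    print(np.linalg.norm(a - b))   # roughly 800, almost entirely from the income difference

    After scaling, both features contribute to the distance on comparable terms.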


    We can scale our data using Scikit-Learn's StandardScaler class from sklearn.preprocessing. First, let's go back to the top of our file and import it.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
          

    Now we can create a StandardScaler object and use its methods to scale our data.

    We only need to scale our X_train and X_test because these are our features. We do not need to scale our labels (y_train and y_test).

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    

    First we created a StandardScaler object and assigned it to sc.

    Then we used the sc.fit_transform() method on X_train, which both learns the mean and standard deviation of each feature and scales the training data.

    Finally we used the sc.transform() method on X_test to apply that same centering and scaling to the testing data.

    There is some underlying math happening here so don't worry if you don't fully understand it. Just know that these two methods are used to automatically normalize the data in the training set and the testing set, and that once you use fit_transform() on one data set, you use transform() on other data sets with the same columns.
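
    If you are curious, a standardized value is just the original value minus the column's mean, divided by the column's standard deviation. Here is a minimal sketch (using a small hypothetical array, not our dataset) showing that the manual calculation matches StandardScaler:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Hypothetical single-column feature, only for illustration
    data = np.array([[10.0], [20.0], [30.0], [40.0]])

    sc_demo = StandardScaler()
    scaled = sc_demo.fit_transform(data)

    # Manual standardization: (value - mean) / standard deviation
    manual = (data - data.mean(axis=0)) / data.std(axis=0)

    print(scaled.ravel())   # [-1.34164079 -0.4472136   0.4472136   1.34164079]
    print(manual.ravel())   # the same values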

    You can learn more about the StandardScaler here.

Training the Algorithm



    Finally, we can train our random forest algorithm and solve the regression problem.

    First we must create a RandomForestRegressor, the class used to solve regression problems with random forest. Because someone has already implemented the complicated math behind the machine learning algorithm, all we need to do is import the class that runs the machine learning algorithm and use it.

    As always, we must import this class at the top of our file. This class comes from the sklearn.ensemble library.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestRegressor
            

    Now we can create the RandomForestRegressor object.

    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    
    regressor = RandomForestRegressor(n_estimators=20, random_state=0)
        

    Here the n_estimators parameter is used to define the number of trees in the random forest. We will start with 20 and see how the model performs.

    We set the random_state parameter to 0 to make the algorithm deterministic. This way, every time we run the algorithm, the output will always be the same.

    There are other parameters you could use which you can learn more about here.

    Now we can actually run the algorithm on our dataset using the regressor.

    regressor = RandomForestRegressor(n_estimators=20, random_state=0)
    
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
        

    regressor.fit() takes in our training set as arguments and builds a forest of trees. Fit is just Scikit-Learn's name for training the model.

    regressor.predict() then makes predictions based on our testing set, and we store the predictions in y_pred.

Evaluating the Algorithm



    Now that we have trained the algorithm against our data sets and gotten back predictions using the testing set, we should check how well our model performed.

    For regression problems, there are 3 metrics commonly used to calculate how erroneous (wrong) an algorithm is, based on the predicted values and the actual values.

    These metrics used to evaluate an algorithm are:

    Mean Absolute Error: this is the average of the absolute differences between your predicted values and the actual values. For example, if you predicted 60 and the actual value was 63.5, you would have an absolute error of 3.5. - Learn more about mean absolute error.

    Mean Squared Error: This is the average of all the errors squared. By squaring the numbers, you assign greater weight to differences that are larger. For example, if you predicted 63 but the actual value was 60, you would have a squared error of 9. If you predicted 65 but the actual value was 60, you would have a squared error of 25. If you have a low mean absolute error but a high mean squared error, it means that there are outliers in the data set where the errors were unusually large, even though most of the predictions were accurate. - Learn more about mean squared error.

    Root Mean Squared Error: When you square each of the errors, it results in a number that is very large compared to the mean absolute error. By taking the square root, it will result in a number that is more easily directly comparable to the mean absolute error. - Learn more about root mean squared error.
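
    To make these definitions concrete, here is a small sketch (using hypothetical predicted and actual values, not our model's output) showing how each metric could be computed by hand with NumPy:

    import numpy as np

    # Hypothetical actual and predicted values, just for illustration
    actual = np.array([60.0, 70.0, 80.0])
    predicted = np.array([63.5, 68.0, 85.0])

    errors = predicted - actual

    mae = np.mean(np.abs(errors))   # Mean Absolute Error: average of |error|
    mse = np.mean(errors ** 2)      # Mean Squared Error: average of error^2
    rmse = np.sqrt(mse)             # Root Mean Squared Error: square root of the MSE

    print(mae, mse, rmse)

    In practice we let Scikit-Learn's metrics module do these calculations for us (adding the square root ourselves for the RMSE), as shown next.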


    Let's import the metrics module in order to calculate these values.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestRegressor
    from sklearn import metrics
            

    Now we can calculate the metrics using our testing values (y_test) and the predicted values (y_pred).

    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    


    The average consumption of gasoline is 576.77. But with only 20 trees, the root mean squared error is 64.93, which is greater than 10% of the average consumption.

    This could be for many reasons, but may mean that we did not use enough estimators (trees).

    Let's change the number of estimators to 200, run the program again, and see what happens.

    regressor = RandomForestRegressor(n_estimators=200, random_state=0)
    
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
        


    Great! The metrics show that increasing the estimators has in fact decreased the errors in our algorithm. You can change the number of estimators to be even higher, but that will cause the algorithm to run slower. Ultimately, you want to find the minimum number of estimators that gives you the best result for the algorithm based on the data set you have, so that your model is both fast to train and accurate in its results.
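
    One way to experiment with this (a rough sketch, assuming the variables and imports already defined in regression.py) is to loop over a few candidate forest sizes and compare the resulting root mean squared error:

    # Compare the error for a few different numbers of trees
    for n in [20, 50, 100, 200]:
        candidate = RandomForestRegressor(n_estimators=n, random_state=0)
        candidate.fit(X_train, y_train)
        candidate_pred = candidate.predict(X_test)
        rmse = np.sqrt(metrics.mean_squared_error(y_test, candidate_pred))
        print(n, 'trees -> RMSE:', rmse)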

    You can also change the random state integer and see if a different combination of the training and test data sets will yield more accurate results. However, use this technique sparingly. If you spend too much time trying to find the perfectly split dataset, you might overfit the data, which means the model will work well for your dataset but not for other data you ask it to predict on.

Activity: Analyze Datasets



    Now that you know how to use the random forest algorithm, take another look at the datasets that you found and cleaned in earlier lessons.

    Based on what you know about the random forest algorithm and supervised learning algorithms:
    1. Would this dataset be a good candidate for a supervised learning algorithm?
    2. Could there be links between features that could be used to make predictions?
    3. What type of cleanup would be needed to make the dataset ready for machine learning?


    If you find a dataset that you think could be used to make a prediction, try it out!