Machine Learning

This lesson will teach you about supervised learning algorithms and the Random Forest Model.

Supervised Machine Learning

Supervised Learning: Most machine learning algorithms are used for supervised learning. This is where you have input variables (independent data) that is used to predict an output variable (dependent data). This is useful when humans can reliably predict output based on inputs, such as labeling what types of objects are in a picture.

Unsupervised Learning: Sometimes, you may write programs that don't have exactly correct outputs that can be derived from inputs. Unsupervised Learning is commonly used for programs that have creative outputs, such as writing stories or songs.

Reinforcement Learning: You might develop an AI that continually learns over time, rather than receiving training only at one time. In the reinforcement learning approach, the model is continually updated as it learns about its environment. Think about an AI that learns how to navigate a maze. Every time the AI navigates the maze, you give it a reward based on how fast it traversed the maze. That way, it can learn the fastest path through the maze.

Supervised Learning

In supervised learning, we create a training set, which takes in a set of input variables with the matching output variable, and feed this set into the algorithm. The algorithm then learns by mapping the input to output so that it can later predict an output based on some input it hasn't seen yet.

Cannot load image

"Machine learning nutshell", by Epochfail, licensed under CC BY-SA 4.0

It is called supervised learning because we can think of the training dataset as a teacher that supervises the learning process.

The algorithm iteratively makes predictions on the training data and is corrected by the teacher who knows the answers. Learning stops when the algorithm reaches an acceptable level of predictive accuracy.

Supervised learning can be further split into two types of problems:

Classification: This is when the output variable is a category, e.g. red or blue or male or female. When you perform classification, all your outputs must be EXACTLY matched to a category in a set.

Regression: This is when the output variable is a value inside of a range, e.g. dollars or height. For example, you can have 0 dollars, 100 dollars, 1.5 dollars, or other decimal values for your output variable.

Some of the most popular examples of supervised learning algorithms are:

Linear Regression for regression problems.

Random Forest for classification and regression problems.

Support Vector Machines for classification problems.

We will start of by looking at the Random Forest Algorithm.

Random Forest Algorithm

ensemble

forest

Bootstrap Aggregation

bagging

"Random Forests Ensemble", by Hgkim0331, licensed under CC BY-SA 4.0

1. Pick N random records from the dataset.

2. Build a decision tree based on these N records.

3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.

4a. In regression problems, for a new record, make each tree predict the output for the record. The final output for that record is the average of the predicted outputs from each individual tree.

4b. In classification problems, each tree predicts the category that the new record belongs to. The new record is assigned to the category that has the majority of predictions.

Advantages of Random Forest	Disadvantages of Random Forest
Not biased since there are multiple trees, each trained on a subset of data.	Requires a lot of computational resources because of the complexity
Very stable algorithm since new data or outliers may impact one tree but is unlikely to impact all trees.	Takes a relatively long time to train large amounts of data due to the complexity.
Works well with both categorical and numerical features.

Supervised Learning Algorithms

Supervised Machine Learning

Random Forest Algorithm