Machine Learning Classification

In this lesson, we'll get a taste of using Python and machine learning to classify video game reviews and figure out whether the reviewer liked or disliked a game.

Classification



    Machine learning has many different applications. One common application is a text classifier.

    For example, here are a few snippets of reviews for video games. Try to guess which of these games received a positive review score and which received a negative one.

    1. The Elder Scrolls IV: Oblivion - The Elder Scrolls IV: Oblivion is simply one of the best role-playing games ever made.
    2. Dragons and Titans - Dragons and Titans is a halfhearted and frustrating online arena that doesn't deserve your time.
    3. Ratchet & Clank - Ratchet & Clank skillfully avoids most of the traps that hold back the majority of modern platform games and presents a fantastic, well-balanced, story-driven adventure.
    4. Gas Guzzlers Combat Carnage - Poor design decisions and lots of annoying little things mar this otherwise decent racer.

    Based on these reviews, which games do you think received a positive review? Which games received a negative review?

    Positive Reviews: The Elder Scrolls IV: Oblivion, Ratchet & Clank
    Negative Reviews: Dragons and Titans, Gas Guzzlers Combat Carnage


    Were your guesses right? Think for a minute about how you came to that decision. After reading the descriptions, you might have picked out words that informed you about the quality of the game. Here are some examples of words that convey positive and negative sentiment.

    Positive Words: Best, Fantastic, Well-balanced
    Negative Words: Halfhearted, Frustrating, Poor, Annoying

    As people, we can read through text like this and understand whether the writer thinks positively or negatively about the topic. In this lesson, we're going to use classification and machine learning to teach a program to read text in the same way (mostly!).

Data Setup



    We will be using the Scikit-Learn module to perform our machine learning functions. This is an additional module that does not come with Python by default (similar to Pandas and Matplotlib).

    You should already have Scikit-Learn installed on your computer (if not, you can install it with pip install scikit-learn).

    Create a new Python file called machineLearningClassification.py.

    Download the Gamespot Game Reviews dataset and place it in the same location as your python file.

    At the top of your file, we will import the modules related to the text classification process.



    Also read in the CSV file using the .read_csv() function.

    import numpy as np
    import pandas as pd
    
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    
    dataframe = pd.read_csv("gamespot_game_reviews.csv")
          

    The first time you read in the CSV file, it may take a little while, because your computer needs to load the data from your hard drive or solid-state drive into RAM. After that first read, you may notice the code running faster, until you exit Python entirely and run the code again.

    Next, we'll start to look at the data set and structure it in a way we can use it for our machine learning purposes.



Data Analysis



    Let's start by looking at all the columns in the dataset.

    dataframe = pd.read_csv("gamespot_game_reviews.csv")
    
    print(dataframe.columns)
                    



    There are a lot of columns here, but we only need a few of them to create a machine learning algorithm that determines whether a game review is positive or negative.

    For the purposes of our classification, we're going to use two columns: 'tagline' and 'classifier'. We'll create a new dataframe that has just those columns.

    dataframe = pd.read_csv("gamespot_game_reviews.csv")
    
    reviews_dataframe = dataframe[['tagline', 'classifier']]
    print(reviews_dataframe)
                    



    Based on the data, we can see that the tagline is a short piece of text describing the game, and the classifier is one of two values, pos or neg, indicating a positive or negative review.
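
    If you want to double-check those two values, pandas' value_counts() function gives a quick count of how many rows fall into each class. This is just an optional sanity check, using the reviews_dataframe created above.

    # Optional: count how many reviews are labeled pos and how many are labeled neg
    print(reviews_dataframe['classifier'].value_counts())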

    We will build the machine learning model to determine if a video game review is positive or negative based on the information in the tagline. That way, if we find any more game reviews, we can predict whether the review is positive or negative about the game.

    You could use this algorithm to read through plain text on social media that does not include a specific positive or negative score and determine whether people are speaking positively or negatively about a game.

    Next, we're going to need to break down the tagline field to get at the words which really matter for the classifier.

Understanding Text



    You might have noticed two of our imports, CountVectorizer and TfidfTransformer. Each of these classes solves a specific problem for interpreting text.

    First, we need a way to count how often words appear inside of a 'tagline' field. The CountVectorizer will count how many times a word appears in a specific line.

    We will take all the text in the tagline column of the dataframe and turn it into a list of strings, one per review, which the CountVectorizer will then break down into individual words.

    reviews_dataframe = dataframe[['tagline', 'classifier']]
    
    tagline_list = reviews_dataframe['tagline'].tolist()
    count_vect = CountVectorizer()
    words_train_counts = count_vect.fit_transform(tagline_list)
    
    print(words_train_counts)
                  



    By using the CountVectorizer, we take the list of text strings and count how many times each word appears in each string.

    In order for the machine learning to occur, all of the strings are transformed into numbers. For example, the lines starting with (0, each represent a different word in the first review in the dataset, which is the following sentence:

    Chrono Cross may not have had the largest budget, but it has the largest heart.

    The result shows a tuple column and an integer column. The tuple contains two elements: the first is the index of the tagline (which review we're looking at), and the second is the word, represented as a number. The integer column shows how many times that word appeared in that tagline.

    You can see in the data that the words numbered 3663 and 2056 show up twice. By reading the tagline closely, you can see that only two words appear twice: largest and the. This is how CountVectorizer breaks down each tagline of the review data.
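
    If you'd like to see which word a number like 3663 stands for, the CountVectorizer keeps a vocabulary you can look into. The sketch below is optional; it assumes the count_vect object from above, and get_feature_names_out() is the name used in newer versions of scikit-learn (older versions call it get_feature_names()).

    # Optional: map between word numbers and the actual words
    vocabulary = count_vect.get_feature_names_out()
    print(vocabulary[3663])                    # one of the repeated words from the first tagline
    print(count_vect.vocabulary_['largest'])   # reverse lookup: word -> number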

    Next, we're going to give each word a weight based on how often, and where, it appears in the text. We'll do this using TF-IDF, or Term Frequency-Inverse Document Frequency.

    TFIDF will give each word an importance score. It uses a strategy of finding words that are shared across multiple taglines, but don't appear everywhere. For example, it will give a low weight to the word the because it appears everywhere. But it will give a high weight to a word like amazing, which isn't common to every single tagline, but is shared by 15 different reviews. That means it's an important word these reviews use to describe a particular quality of the game.

    TFIDF determines a word's importance inside of a dataset, but it doesn't tell us whether the word means the game is good or bad. In fact, the word terrible also appears in 15 reviews, so TFIDF will give the words amazing and terrible the exact same weight. Our machine learning algorithm will use these weights to determine which of these words indicates a positive or negative review.
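
    To see this weighting in action on something tiny, here's a self-contained sketch using three made-up sentences (separate from the review data). Words that appear in every sentence, like the and is, end up with the lowest importance scores.

    # CountVectorizer and TfidfTransformer are already imported at the top of the file
    toy_sentences = [
        "the game is amazing",
        "the game is terrible",
        "the story is amazing and the combat is fun",
    ]
    
    toy_vect = CountVectorizer()
    toy_counts = toy_vect.fit_transform(toy_sentences)
    toy_tfidf = TfidfTransformer()
    toy_tfidf.fit(toy_counts)
    
    # Pair each word with its inverse document frequency score: the and is appear
    # in every sentence, so they receive the lowest scores
    for word, idf in zip(toy_vect.get_feature_names_out(), toy_tfidf.idf_):
        print(word, round(idf, 2))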



    words_train_counts = count_vect.fit_transform(tagline_list)
    
    tfidf = TfidfTransformer()
    words_train_tfidf = tfidf.fit_transform(words_train_counts)
    
    #This will show us how many rows and how many columns the dataset has
    print(words_train_tfidf.shape)
    
    ##Result
    ##(1184, 4139)
    
    #This will show us an example of what the columns look like
    print(words_train_tfidf.data)
    
    ##Result
    ##[0.12666006 0.18338277 0.21286579 ... 0.16532873 0.27479679 0.39429255]
                


    By looking at the shape of the data, we can see that the data set now has 1184 rows and 4139 columns. Those 4139 columns represent every single word that was found in any of the taglines.

    The values printed by words_train_tfidf.data are the TF-IDF weights themselves: a higher number means the word carries more importance within that particular tagline. Remember that these weights don't yet tell us whether a word points to a positive or negative review; that is what the classifier will learn next.
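
    If you're curious which words carry the most weight in a particular tagline, the optional sketch below pairs the first row of weights with the CountVectorizer's vocabulary. It assumes count_vect and words_train_tfidf from the steps above, and uses get_feature_names_out() from newer versions of scikit-learn.

    # Optional: show the five highest-weighted words in the first tagline
    # (np comes from the import numpy as np at the top of the file)
    first_row = words_train_tfidf[0].toarray()[0]
    vocabulary = count_vect.get_feature_names_out()
    
    for index in np.argsort(first_row)[::-1][:5]:
        print(vocabulary[index], round(first_row[index], 3))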

    With our data ready to go, we can now start to do our machine learning on this data set.

Train, Test, Predict



    It's one thing to train a machine learning model and use it to make predictions; it's another to know whether those predictions have any merit. Machine learning isn't an exact science, but instead is based on reasonable predictions. We don't expect the computer to perfectly classify everything we give it.

    In order to make sure that our machine learning model is working correctly, we will use a method called splitting the data set.

    We won't train our model on the entire data set at once. Instead, we'll use a portion of the data set and leave the rest of the data alone. Why do we do this? After we train our model, we can try to predict the results of the remainder of the data set. Then, we can determine how correct our model is.

    We're going to use the train_test_split() function to split our data into different parts. This function takes in three arguments: the data we want to learn from (our list of words and their weights), the results we want to predict (whether each review was positive or negative), and the test size (what percentage of the data we want to reserve for testing at the end).

    Picking the test size matters; a poor choice can lead to a larger margin of error in your classifications.

    If the test size is too large, you won't have enough data left to train your model. If it's too small, randomness in the handful of test rows can dominate your results, and you won't be able to tell whether your model is accurate or not.

    words_train_tfidf = tfidf.fit_transform(words_train_counts)
    
    split_results = train_test_split(words_train_tfidf, np.array(reviews_dataframe['classifier']), test_size=0.3)
    train_words = split_results[0]
    test_words = split_results[1]
    train_reviews = split_results[2]
    test_reviews = split_results[3]
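
    As a quick optional check, you can print the size of each piece to confirm that roughly 70% of the rows went to training and 30% to testing.

    # Optional: confirm the 70/30 split of the 1184 rows
    print(train_words.shape, len(train_reviews))
    print(test_words.shape, len(test_reviews))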
                


    The last step is to actually apply the machine learning model, and perform predictions using our test data.

    To create the model, you give MultinomialNB's fit() method both the words that will be used to make the prediction and the classifications that have already been made.

    Then you run the classification model's predict() function on our test words. Since we already know the correct review results for those rows, we can check whether the predictions line up with the correct results.

    test_reviews = split_results[3]
    
    classification_model = MultinomialNB().fit(train_words, train_reviews)
    testing_score = classification_model.predict(test_words)
    
    print(testing_score)
    			



    In the output for testing_score we see the results of the prediction. We can compare these against the test data set to determine the accuracy of our model.

    testing_score = classification_model.predict(test_words)
    
    number_right = 0
    for i in range(len(testing_score)):
        if testing_score[i] == test_reviews[i]:
            number_right +=1
    
    print("Accuracy for tagline classify: %.2f%%" % ((number_right/float(len(test_reviews)) * 100)))
                



    You should see a result around 85% accuracy. It may be slightly different every time you run the program, since the train/test split may split the data up differently each time.
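
    As a side note, scikit-learn also includes a helper that computes the same percentage, which you could use in place of the counting loop above.

    from sklearn.metrics import accuracy_score
    
    # accuracy_score compares the true labels with the predictions and returns a fraction
    print("Accuracy for tagline classify: %.2f%%" % (accuracy_score(test_reviews, testing_score) * 100))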

Additional Challenges



Improving Accuracy

We didn't perform any data cleanup on this dataset. Sometimes it can be worth cleaning up the data before putting it into the machine learning algorithm to improve its accuracy.

Try to clean up the dataset by performing some of the steps listed in the Natural Language Processing Lesson, such as removing stop words and stemming the text.

In addition, some of the rows are duplicates, because the exact same review is given for two different consoles. Removing these duplicates might improve your results.
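
If you want to try removing the duplicates, pandas' drop_duplicates() function is one option. The sketch below keeps the first copy of each tagline; you could also drop only the rows that match on every column.

    # Keep just the first copy of each tagline before building the word counts
    reviews_dataframe = reviews_dataframe.drop_duplicates(subset='tagline')
    print(reviews_dataframe.shape)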

Custom Input

1. Create a list with your own game reviews.
2. Use the transform() function from count_vect instead of the fit_transform() function on your list.
3. Use the transform() function from tfidf instead of the fit_transform() function on the count vectorizer's results.
4. Insert those results into the classification_model.predict() function.
5. Print your results. Did the model predict correctly? Try writing some reviews that will be confusing for it to interpret.
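
Here is a minimal sketch of those five steps, using two made-up reviews; swap in your own. It assumes the count_vect, tfidf, and classification_model objects from earlier in the lesson.

    my_reviews = [
        "A fantastic, well-balanced adventure with a huge heart.",
        "A frustrating, halfhearted mess that doesn't deserve your time.",
    ]
    
    # transform() reuses the vocabulary and weights already learned from the
    # Gamespot data; fit_transform() would start over from scratch
    my_counts = count_vect.transform(my_reviews)
    my_tfidf = tfidf.transform(my_counts)
    
    print(classification_model.predict(my_tfidf))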



User Input

Have your program take input from the user and display the results of the prediction back to them.
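
One possible shape for this, again assuming count_vect, tfidf, and classification_model already exist:

    # Ask the user for a review, run it through the same transform steps,
    # and show the predicted classification back to them
    user_review = input("Type a short game review: ")
    
    user_counts = count_vect.transform([user_review])
    user_tfidf = tfidf.transform(user_counts)
    
    print("The model thinks this review is:", classification_model.predict(user_tfidf)[0])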



Additional Data Sets

In the Dataset Analysis lesson, you found a dataset that could be used as input for a classification algorithm. Try to implement a text classification process for the dataset you picked out, or look through the Kaggle Datasets to see if there are any other datasets that would make a good test for training a machine learning model.