Text Generation Model
In this lesson, you will learn how to create a machine learning text generation model.
The Model Training Process
Before going into code, take a step back and understand how the model training process works.
In this lesson, you will not be using any advanced machine learning packages to create a machine learning text generation model. This model will be simple, and perhaps very inefficient compared to advanced machine learning packages you could use, but it will show you how the machine learning process works.
The machine learning process that our model uses will follow these steps:
- Create a Model Object
- Clean up unstructured text input from a file
- Create the learning model using a custom data structure
- Generate text based on the created model
- Save the model in a file so that it can be used again in the future without retraining
This text generation model can use any text file as input, as long as Python can interpret the text (Python may not be able to interpret some special characters in files by default, such as foreign language characters).
If you need a file to test with, you can use this text file: sherlockholmes.txt. This is The Adventures of Sherlock Holmes; you can open the box below to see an example of the text. Our text generation algorithm will try to use the same words and flow that this text uses.
The learning model that we will use will be very simple. After you reach the end of the lesson, you are welcome to create a model that works differently and see if it generates text more effectively. The model works like this:
- Training:
  - For each word in the text, remember which word comes AFTER that word
- Generation:
  - Start with a random word
  - Randomly pick a word that came after that word
  - Using the data structure, randomly pick a word that comes after the word we just picked
  - Loop until the desired number of words is generated
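The steps above amount to a tiny Markov chain, and they can be sketched with a plain dictionary before we build the full class. The sentence here is a made-up example, not part of the lesson's data:

```python
import random

# A made-up training sentence.
words = "the cat sat on the mat".split()

# Training: for each word, remember which word comes AFTER it.
model = {}
for current, following in zip(words, words[1:]):
    model.setdefault(current, []).append(following)

# model is now {'the': ['cat', 'mat'], 'cat': ['sat'], 'sat': ['on'], 'on': ['the']}

# Generation: start with a random word, then repeatedly pick a random
# follower until we have the desired number of words. A word with no
# recorded follower falls back to a period.
word = random.choice(list(model.keys()))
generated = [word]
for _ in range(4):
    word = random.choice(model.get(word, ['.']))
    generated.append(word)
print(' '.join(generated))
```

The full model in this lesson follows the same shape; it just wraps each list of followers in an object.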
Cleaning Text Input
Create a new python file called textGeneratorModel.py, and create the ModelGenerator class at the top of the file.
class ModelGenerator:
    def __init__(self):
        self.model = {}
In this class's init function, it creates a new dictionary. This dictionary will be used to organize the data structure for the model. The keys of the dictionary will be strings, such as the, that, or absolutely, and the values will be objects. Each of those objects will contain the logic that determines how the model generates text, based on the text it is given.
import re

class ModelGenerator:
    def __init__(self):
        self.model = {}

    def get_tokenized_text_from_formatted_file(self, file_path):
        text_file = open(file_path)
        all_lines = text_file.readlines()
        as_one_line = ''.join(all_lines)
        split_tokens = re.split(' |\n', as_one_line)
        return split_tokens
This is a basic cleanup function that takes the text from the file and turns it into a list. That list can then be interpreted by the machine learning model, whereas the raw text file cannot be read and learned from directly. This is an example of tokenizing the text by words.
This cleanup is not perfect. It doesn't handle punctuation very well. If you wanted to improve the machine learning model, you would also need to teach it how to understand punctuation. For now, it just understands spaces, and makes sure that paragraphs stick together.
If you want to make your code more functional-oriented, you can condense all those lines into one, as below. This one-line function performs all the actions of the cleanup at once.
Note: The re stands for Regular Expressions (RegEx). This is a complicated way of parsing text data, and understanding how RegEx truly functions is beyond the scope of this course.
def get_tokenized_text_from_formatted_file(self, file_path):
    return re.split(' |\n', ''.join(open(file_path).readlines()))
Next, check to make sure the output of your code is accurate. You can get a section of the tokenized text by using the code below, and print it out to the console to see what its output looks like.
You can see it just attaches commas and apostrophes to the words they appear next to. It's not the prettiest tokenizing job, since it doesn't handle punctuation elegantly, but it will do for the bare minimum of the model for now.
class ModelGenerator:
    def __init__(self):
        self.model = {}

    def get_tokenized_text_from_formatted_file(self, file_path):
        text_file = open(file_path)
        all_lines = text_file.readlines()
        as_one_line = ''.join(all_lines)
        split_tokens = re.split(' |\n', as_one_line)
        return split_tokens

model_generator = ModelGenerator()
tokenized_text = model_generator.get_tokenized_text_from_formatted_file('sherlockholmes.txt')
print(tokenized_text[2000:2050])

# Output
# ['there', 'was', 'the', 'sharp', 'sound', 'of', "horses'",
# 'hoofs', 'and', 'grating', 'wheels', 'against', 'the', 'curb,',
# 'followed', 'by', 'a', 'sharp', 'pull', 'at']
TextToken Object
In order for the model to function in a sensible way, we need to create a data structure that will determine the answer to the following question:
Based on the current text token that I have, what should be the next text token that I generate?
This object will need two functions that the model will use: one to record which text parts come after the object's specific text part, and one to return a next part after the model has been trained.
import random
import re

class TextToken():
    def __init__(self, token_string):
        self.token_string = token_string
        self.next_tokens = []

    def add_token(self, new_token_string):
        self.next_tokens.append(new_token_string)

    def pick_next_token(self):
        if len(self.next_tokens) > 0:
            return random.choice(self.next_tokens)
        else:
            return '.'

class ModelGenerator:
In this implementation, the object knows what string it is through the token_string variable. In addition, there is a list of next_tokens that will hold all the different text parts that could potentially come after this token.
As the model learns from the text, it will create these objects and add the text that appears after each token to its list. That way, each token knows which words can come after it.
When it comes time to generate text, the model will pick a next token from all the potential choices in the object's next_tokens list. If nothing comes after this word, it will return a period. This ensures that the last text token in the model's training can still return an output, even though the model hasn't learned what comes after the final token in the text.
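You can verify this behavior with a quick standalone check. The class here is reproduced from the lesson code above, and the token strings are made-up examples:

```python
import random

class TextToken():
    def __init__(self, token_string):
        self.token_string = token_string
        self.next_tokens = []

    def add_token(self, new_token_string):
        self.next_tokens.append(new_token_string)

    def pick_next_token(self):
        if len(self.next_tokens) > 0:
            return random.choice(self.next_tokens)
        else:
            return '.'

# A token that has seen two followers picks one of them at random.
token = TextToken('the')
token.add_token('cat')
token.add_token('dog')
print(token.pick_next_token())  # prints 'cat' or 'dog'

# A token that has never seen a follower falls back to a period.
print(TextToken('end').pick_next_token())  # prints '.'
```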
Next, the ModelGenerator will need to use the tokens to generate the model.
class ModelGenerator:
    def __init__(self):
        self.model = {}

    def get_tokenized_text_from_formatted_file(self, file_path):
        text_file = open(file_path)
        all_lines = text_file.readlines()
        as_one_line = ''.join(all_lines)
        split_tokens = re.split(' |\n', as_one_line)
        return split_tokens

    def update_model(self, text_file):
        list_of_tokens = self.get_tokenized_text_from_formatted_file(text_file)
        index = 1
        while index < len(list_of_tokens):
            current_token = list_of_tokens[index - 1]
            next_token = list_of_tokens[index]
            if current_token in self.model.keys():
                self.model[current_token].add_token(next_token)
            else:
                self.model[current_token] = TextToken(current_token)
                self.model[current_token].add_token(next_token)
            index += 1
In update_model, the function takes in a file path, and the file's contents are then tokenized so the model can learn from them.
To loop through all the tokens, we use a while loop starting at index 1, because we need to look at the token that comes AFTER the previous one. The loop ends when it reaches the final token in the list.
If there already is an object for the particular string, you don't need to make a new token object for it. Otherwise, a new token object is created, and the token is linked to the next token through its list.
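The pairing of each token with its successor can also be expressed without index arithmetic, using zip. This is a sketch of an alternative to the while loop, with a short made-up token list:

```python
list_of_tokens = ['there', 'was', 'the', 'sharp', 'sound']

# zip pairs each token with the token that follows it, so the loop
# body sees (current_token, next_token) directly, with no indexing.
pairs = list(zip(list_of_tokens, list_of_tokens[1:]))
print(pairs)
# [('there', 'was'), ('was', 'the'), ('the', 'sharp'), ('sharp', 'sound')]
```

Either style builds the same set of (current, next) pairs; the index-based loop in the lesson just makes the positions explicit.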
Generating Text
To do something useful with our created model, we can generate text based on the model's learned information.
This will occur in the generate_text function, which will take in the number of tokens that you want to generate for your string.
    def update_model(self, text_file):
        list_of_tokens = self.get_tokenized_text_from_formatted_file(text_file)
        index = 1
        while index < len(list_of_tokens):
            current_token = list_of_tokens[index - 1]
            next_token = list_of_tokens[index]
            if current_token in self.model.keys():
                self.model[current_token].add_token(next_token)
            else:
                self.model[current_token] = TextToken(current_token)
                self.model[current_token].add_token(next_token)
            index += 1

    def generate_text(self, token_count):
        start_selection_string = random.choice(list(self.model.keys()))
        token_object_selection = self.model[start_selection_string]
        generated_string = ''
        for current_token_number in range(token_count):
            string_to_add = token_object_selection.token_string
            generated_string += " " + string_to_add
            next_string = token_object_selection.pick_next_token()
            if next_string not in self.model:
                # The final token of the text (or the '.' fallback) may never
                # appear as a key, so restart from a random token instead of
                # crashing with a KeyError.
                next_string = random.choice(list(self.model.keys()))
            token_object_selection = self.model[next_string]
        return generated_string
The generate_text function starts with a random string of text from the model dictionary, and then continually selects one string after another by walking through the objects in the model. Each object knows which tokens can come after it, so by picking the next token and saving that result, we can keep picking new tokens one after another until we reach the requested number of tokens.
Finally, at the bottom of the file, create your model, update it, and generate some sample text.
model_generator = ModelGenerator()
model_generator.update_model('sherlockholmes.txt')
generated_text = model_generator.generate_text(100)
print(generated_text)

# Output will be a random collection of strings that might look like this:
# Then I was. Then, suddenly snapped, and I thought seized my approaching it,
# sir, I expected obedience on to look." "Yes," said she.
It may be using the words from the text, but as the sentences get longer, the text seems to go more off-topic.
Saving and Loading
It would be inconvenient if you had to retrain your model every time you ran the program. Instead, we will use the python module pickle, which allows you to store objects on your hard drive.
WARNING: All of the code that you have written in python so far has been relatively safe code, with little that it could do to harm your computer. Because you will be saving files to the file system, it is up to you to make sure that you have enough hard drive space to perform this activity.
We recommend that you have at least 1 GB of free hard drive space. However, if you create bigger objects, they will take up more of your hard drive space. As you proceed through the lesson, pay attention to the file size of the .p file that is generated by this code.
There are numerous ways that you can optimize your saved python objects using pickle. We won't cover any of them as part of this lesson, but if you want to create a gigantic learning model, you might want to learn about them.
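As a taste of one such option (not covered further in this lesson), the pickle stream can be compressed with gzip, which often shrinks the saved file considerably. The file name and model contents here are made-up examples:

```python
import gzip
import pickle

# A made-up stand-in for a trained model dictionary.
model = {'the': ['cat', 'dog'], 'cat': ['sat']}

# gzip.open works anywhere open() would, so pickle can write
# straight into a compressed file.
with gzip.open('reading_model.p.gz', 'wb') as f:
    pickle.dump(model, f)

# Load it back the same way.
with gzip.open('reading_model.p.gz', 'rb') as f:
    loaded = pickle.load(f)

print(loaded == model)  # prints True
```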
Import pickle into your python file.
import random
import re
import pickle
We will update the init method of the ModelGenerator class to check if the model file already exists. We will also create helper methods that will save and load to and from the file.
class ModelGenerator:
    # Note: os.path.exists requires import os at the top of the file.
    def __init__(self, model_path = ''):
        if model_path != '' and os.path.exists(model_path):
            self.model = self.load_model(model_path)
        else:
            self.model = {}
        self.model_path = model_path

    def load_model(self, initial_model):
        with open(initial_model, "rb") as model_file:
            return pickle.load(model_file)

    def save_model(self):
        if self.model_path != '':
            with open(self.model_path, "wb") as model_file:
                pickle.dump(self.model, model_file)
If there is no model file name input when the ModelGenerator object is created, it assumes that you are creating a new model, and that new model won't be saved.
However, if there is a model file path name, it checks to see if the file already exists. If it does, it loads that model into the model variable.
We will use the save_model method later, when we want to save the model inside the update_model method.
We will update the update_model method to save the model to disk after it has finished learning from an input file.
    def update_model(self, text_file):
        list_of_tokens = self.get_tokenized_text_from_formatted_file(text_file)
        index = 1
        while index < len(list_of_tokens):
            current_token = list_of_tokens[index - 1]
            next_token = list_of_tokens[index]
            if current_token in self.model.keys():
                self.model[current_token].add_token(next_token)
            else:
                self.model[current_token] = TextToken(current_token)
                self.model[current_token].add_token(next_token)
            index += 1
        self.save_model()
Next, at the bottom of the file, where you create the ModelGenerator object, you can input the optional parameter for the file name to store the model.
model_generator = ModelGenerator('reading_model.p')
model_generator.update_model('sherlockholmes.txt')
generated_text = model_generator.generate_text(100)
You can now run your code to generate text without performing any update_model calls, since the model is stored and will be automatically loaded when the program runs.
model_generator = ModelGenerator('reading_model.p')
generated_text = model_generator.generate_text(100)
More Runs, More Files
To fine-tune a model further, you can have it run multiple times over the same text, or run it over several different texts to see what you come up with.
You can simply update your model multiple times over the same text by using a for loop.
for update_run in range(1, 10):
    print("Reading Sherlock Holmes, iteration: ", update_run)
    model_generator.update_model('sherlockholmes.txt')
    print(model_generator.generate_text(100))
To write code that works on the filesystem, you need to add another import at the top of your file: import os.
import random
import re
import pickle
import os
After you have imported os, you can use code that will allow you to find all files in a directory that you specify.
files_path = 'all_texts/'
runs_per_file = 10

files = os.listdir(files_path)
for file in files:
    full_path = os.path.join(files_path, file)
    if os.path.isfile(full_path):
        for run in range(runs_per_file):
            print("Model Reading:", file, ", iteration:", run)
            model_generator.update_model(full_path)
            print(model_generator.generate_text(100))
The code above checks for all the files in the files_path variable's path. The way this code is written, that directory should be created at the same level as the python file that you are running, and it must be called all_texts. The code then updates the model for each run, printing which file it is reading as it works through the files.
Where can you get text to train your model on? One good site is Gutenberg.org's list of top 100 books.
If you run into encoding errors with your text files, try to copy and paste the UTF-8 versions of the text into a new text file, before you attempt to open them in python.
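You can also often sidestep these errors in code by opening the file with an explicit encoding; errors='replace' substitutes a placeholder character for any undecodable bytes instead of crashing. The file name and contents below are made up for illustration:

```python
# Create a small sample file to demonstrate (made-up name and content).
with open('sample.txt', 'w', encoding='utf-8') as f:
    f.write('Émile said “hello”')

# Reading with an explicit encoding avoids relying on the platform
# default, which is a common source of UnicodeDecodeError.
with open('sample.txt', encoding='utf-8', errors='replace') as f:
    text = f.read()

print(text)
```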
Additional Features
Interested in taking this further? Here are some things you can do to improve the project.
Add Punctuation Handling
Right now, any time that punctuation occurs, it simply gets stored connected to the text token. You might choose to separate out punctuation as a separate token, or try to add logic related to how punctuation should work in the generated text.
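One hedged sketch of separating punctuation into its own tokens uses re.findall; the regex and example sentence here are illustrative assumptions, not part of the lesson code:

```python
import re

def tokenize_with_punctuation(text):
    # \w+ matches runs of word characters; [^\w\s] matches any single
    # character that is neither a word character nor whitespace, so
    # each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize_with_punctuation("Holmes laughed. Simple, he said."))
# ['Holmes', 'laughed', '.', 'Simple', ',', 'he', 'said', '.']
```

With punctuation as standalone tokens, the model could then learn which words tend to follow a period or comma, instead of treating "curb," and "curb" as unrelated tokens.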
Combination Tokens
Right now, every single token that gets generated has a very simple logic. You can add in additional logic related to expanding the token size and changing it over time.
If a certain token always appears after another token, why not combine them into one token? Then you can more easily get stretches of text that make sense next to each other. For example, the token he followed by said could be combined into a single token, he said.
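One way to sketch this idea: record every word seen after each token, then merge any pair where the first word is always followed by the same second word. The sentence below is a made-up example, and on such a tiny sample almost every pair qualifies, which is why real merging logic would also want frequency thresholds:

```python
words = "he said hello and he said goodbye".split()

# Collect every word seen after each token.
followers = {}
for current, nxt in zip(words, words[1:]):
    followers.setdefault(current, set()).add(nxt)

# A pair qualifies for merging when the first word is ALWAYS
# followed by the same second word.
always_pairs = {w: next(iter(s)) for w, s in followers.items() if len(s) == 1}

# Rebuild the token stream, combining qualifying pairs into one token.
merged = []
i = 0
while i < len(words):
    word = words[i]
    if i + 1 < len(words) and always_pairs.get(word) == words[i + 1]:
        merged.append(word + ' ' + words[i + 1])
        i += 2
    else:
        merged.append(word)
        i += 1

print(merged)
# ['he said', 'hello and', 'he said', 'goodbye']
```

A model trained on these merged tokens would then generate "he said" as a unit, producing locally coherent phrases more often.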