Processing Webpage Content

This lesson will teach you how to process webpage content using the Natural Language Toolkit (NLTK).

Setup



    Now that you're able to scrape data from webpages, let's learn how to process this data for machine learning applications.

    If you recall, natural language processing (NLP) is the field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way.

    First, create a new project in Spyder and save it as NaturalLanguageProcessing.

    Then, create a new file inside the project and save it as processing.py. You can also delete the text that's already in the file.


    Now, let's import our modules:

    Requests

    BeautifulSoup 4


    import requests
    from bs4 import BeautifulSoup
            

    We still need to import one more module which will provide us with the necessary tools to actually process natural language.

    This module is NLTK (Natural Language Toolkit), one of the most popular libraries for natural language processing.

    import requests
    from bs4 import BeautifulSoup
    import nltk
            

    NLTK comes with many corpora (collections of written texts), NLP models, and other data packages.

    We can download this data into our project using the nltk.download() method.

    import nltk
    
    nltk.download()
            

    Now run your program and another window should pop up. This is the NLTK downloader which allows you to choose which packages you want to download from the NLTK library.

    Since the packages are small, we are just going to install all of them.

    The downloads may take a bit of time so wait for them to finish. You will see downloads in process highlighted in yellow and completed downloads highlighted in green.
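
    If you would rather not use the GUI (for example, when working in a plain console), you can also pass package identifiers straight to nltk.download(). Here is a minimal sketch that grabs just the packages this lesson relies on:

    import nltk

    # Download only what this lesson needs, without opening the GUI:
    # 'punkt' powers the tokenizers, 'stopwords' and 'wordnet' are corpora.
    for package in ['punkt', 'stopwords', 'wordnet']:
        nltk.download(package)

    # nltk.download('all') would grab everything, like choosing "all" in the GUI.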



    We also need to get a webpage so that we can process its content. We will be processing a Wikipedia page about Nintendo.

    Just like before, we will use Requests to download the page.

    import nltk
    
    nintendo_wiki = 'https://en.wikipedia.org/wiki/Nintendo'
    page = requests.get(nintendo_wiki)
    
    print(page.text)
            


    As you can see, we successfully downloaded the page. Now let's use BeautifulSoup to parse and clean it.

    Remember, you can clear the console when it gets filled up by right-clicking in the window and selecting Clear console.

    Alternatively, you can left-click in the console and then press CTRL + L on your keyboard.

    nintendo_wiki = 'https://en.wikipedia.org/wiki/Nintendo'
    page = requests.get(nintendo_wiki)
    
    soup = BeautifulSoup(page.text, 'html.parser')
            

    Last time we wanted to keep the tags so that we could locate specific parts of the page. This time we are only going to extract the text for processing because the tags are not natural language (human language).

    We can do this using the soup.get_text() method to get just the text.

    soup = BeautifulSoup(page.text, 'html.parser')
    text = soup.get_text()
    
    print(text)
    


    As you can see, we now only have the text from the page without all the tags.

    Although it is nice and easy for us to read, a computer doesn't need the text to be separated this much. All the computer needs is a space between each word so that they can be differentiated as separate words.

    Therefore we can remove the extra whitespace from the text by setting the strip parameter to True, while keeping a space between each individual word by setting the separator parameter to ' ' (a space).

    soup = BeautifulSoup(page.text, 'html.parser')
    text = soup.get_text(separator = ' ', strip = True)
    
    print(text)
    


    Now we have only the text without all the unnecessary whitespace.

    Finally, it's time to start natural language processing.

Python Tokenization



    Tokenization is when you divide a piece of text into tokens. These tokens can be words, sentences, paragraphs or whatever you want.

    Tokenization is usually the first task performed in natural language processing, so that's what we will do first.

    As you may recall, tokenization can be performed at two levels: word-level and sentence-level.

    Word-level

    At the word-level, tokenization returns a list of words from a sentence.

    For example, tokenizing the sentence My name is David returns the following list of words:

      words = ['My', 'name', 'is', 'David']
                        

    Sentence-level

    At the sentence-level, tokenization returns a list of sentences from a document.

    For example, tokenizing a document containing the text My name is David. I am 21 years old. I live in the UK. returns the following list of sentences:

    S1 = My name is David.
    S2 = I am 21 years old.
    S3 = I live in the UK.
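
    To make the sentence-level idea concrete before we meet NLTK's own tokenizers later in this lesson, here is a rough sketch using only plain string methods (it naively splits on '. ', so it would break on abbreviations such as 'Mr.'):

    document = 'My name is David. I am 21 years old. I live in the UK.'

    # Naive sentence-level tokenization: split on '. ' and restore the periods.
    sentences = [s if s.endswith('.') else s + '.' for s in document.split('. ')]

    print(sentences)
    # ['My name is David.', 'I am 21 years old.', 'I live in the UK.']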

    For this example, we are going to do word-level tokenization. We can do this using the Python string split() method.

    This method splits a string into a list where each word is a list item. By default, the method will split words separated by whitespace.

    soup = BeautifulSoup(page.text, 'html.parser')
    text = soup.get_text(separator = ' ', strip = True)
    
    words = text.split()
    
    print (words)
      


    As you can see, we now have a long list of all the words from the Wikipedia article. Let's see how many words we have in total.

    words = text.split()
    
    print (len(words))
      


    In this example, the whole page has 18,785 words in total. (Your count may be a little different if the page has changed, but it should be close to that number.) However, keep in mind that these are not 18,785 unique words. You will also soon find out that some of the things counted are not actually words.

Frequency Distribution



    In NLP, it is common to count the frequency of words in text in order for the program to understand what words are important in the context of the document. NLTK allows us to do this easily with its FreqDist() function.

    FreqDist is short for frequency distribution which, in this context, refers to the distribution of words in a text based on the number of times they appear.

    words = text.split()
    
    freq = nltk.FreqDist(words)
    

    This returns a FreqDist object that contains a dictionary which stores the tokens as keys and the counts of each token as values.
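
    Because it behaves like a dictionary, you can look up the count of a single token directly, and FreqDist also provides a most_common() method. Here is a small sketch with a made-up word list:

    import nltk

    sample_words = ['game', 'nintendo', 'game', 'console', 'nintendo', 'nintendo']
    freq = nltk.FreqDist(sample_words)

    print(freq['nintendo'])       # 3 - dictionary-style lookup of a single token
    print(freq.most_common(2))    # [('nintendo', 3), ('game', 2)] - most frequent first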

    We can display the frequency distribution using the items() method, which returns the (word, frequency) pairs as tuples.

    freq = nltk.FreqDist(words)
    
    print(freq.items())
    


    As you can see, we get a long list of tuples containing each word from the text along with the number of times it appears (the frequency). We can actually plot a graph that displays the frequency distribution of each word. This is called a Frequency Distribution Graph.

    We can plot this using the FreqDist object's plot() method and pass in an integer argument that specifies how many tokens to display.

    By default, the plot displays tokens in the order of descending frequency. Let's display the 20 most common words from the text. We can also specify the title of the graph with the title parameter.

    freq = nltk.FreqDist(words)
    
    freq.plot(20, title = 'Frequency Distribution')
    


    Open the plots tab and you will see that the graph shows the words that are used most in the text, from highest to lowest frequency. However, it also shows some problems with our frequency distribution that we need to fix.

    1. The same words appear multiple times with different capitalization, such as 'game' and 'Game'.

    2. Punctuation and special characters are also counted.

    3. Words that do not provide useful information are also counted, such as 'the', 'of', 'in', 'and', etc.

    1. Changing Case:

    The program counts words by selecting a word and counting how many times the exact same word reappears. Unfortunately, this means that if the same words appear with both uppercase and lowercase letters, they will be considered different because case is taken into account.

    Therefore, we need to change all the words to the same case so that we do not get multiple different frequencies for the same word.

    To do this, we can use Python's lower() method, which returns a copy of the string with all the characters in lowercase. Let's go back and make our text lowercase before we split it up.

    soup = BeautifulSoup(page.text, 'html.parser')
    text = soup.get_text(separator = ' ', strip = True)
    
    clean_text = text.lower()
    
    words = clean_text.split()
    

    Here we create a new string called clean_text, which is just our original text in all lowercase. Make sure to split the new string using clean_text.split() instead of the old text.split().

    Now if we run the program again and look at the new graph, the duplicate words are removed.


    2. Removing Punctuation:

    Recall from before, when we displayed the word count of 18,785, it was mentioned that some of the elements counted were not actually words. It turns out that the punctuation in the text was counted as separate words.

    We don't want to analyze punctuation marks or count them in our frequency distribution. We can remove punctuation using Python's maketrans() and translate() string methods.


    To translate means to convert or change, and that is exactly what the translate() method does: it changes a string by altering the characters in it.

    A string can be translated with the translate() method by:

    1. Inserting characters into the string
    2. Replacing characters in the string
    3. Deleting characters in the string

    How the translate() method changes the string depends on a translation table which we will create with the maketrans() method.

    A translation table is just a dictionary that gives the translation for each character in a string. We can use the maketrans() method to create a translation table using the following syntax:

        str.maketrans(x, y, z)
        # y and z are optional
                

    There are 3 different ways to use this method:

    Passing in 1 argument (x)
    Passing in 2 arguments (x, y)
    Passing in 3 arguments (x, y, z)

    One Argument

    If you pass in 1 argument, it will have to be a dictionary. In the dictionary, the keys must be characters and the values must be characters or strings.

    The value should be what we want to replace the key with.

        example = 'abcde'
        dict1 = {'a':'1', 'b':'2'}
        table = str.maketrans(dict1)
                

    Here, maketrans() returns a translation table which is stored in table.

    This table shows how characters should be replaced when we use the translate() method. E.g. 'a' should be replaced with '1' and 'b' should be replaced with '2'.
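
    To see this table in action, here is a quick sketch that applies it with the translate() method (covered in more detail below):

        example = 'abcde'
        table = str.maketrans({'a': '1', 'b': '2'})

        print(example.translate(table))
        # output: '12cde' - 'a' and 'b' replaced, the other characters untouched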

    Two Arguments

    If you pass in 2 arguments, then both arguments should be strings of the same length (same number of characters).

        example = 'howdy boys and girls of the class'
        str1 = 'abcde' #length of 5
        str2 = '12345' #also length of 5
        table = str.maketrans(str1, str2)
                

    The first string should contain the old characters to replace, and the second string should contain the new characters that will replace the old ones. Again, maketrans() returns a translation table which is stored in table.

    This table shows how characters should be replaced when we use the translate() method, based on the two strings passed in. E.g. 'a' should be replaced with '1' and 'b' should be replaced with '2'.

    Three Arguments

    If you pass in 3 arguments, then the first 2 arguments should be strings of the same length where the characters in the second string should replace the characters in the first string (the same as if we only passed in 2 arguments).

    However with 3 arguments, the third argument must be a string of all the characters you want to remove.

        example = '((howdy boy$s& and girls of the class%'
        str1 = 'abcde' #length of 5
        str2 = '12345' #also length of 5
        str3 = '($&%'
        table = str.maketrans(str1, str2, str3)
                

    Once again, maketrans() returns a translation table.

    This table shows how characters should be replaced and which characters should be removed when we use the translate() method, based on the three strings passed in. E.g. 'a' should be replaced with '1' and 'b' should be replaced with '2' etc. Also, all the characters in '($&%' should be removed.

    translate()

    After we have created the translation table, we can pass it into the translate() method called on the string we want to translate.

    The translate() method will return a copy of the string with each character translated according to the translation table.

        example = '((howdy boy$s& and girls of the class%'
        str1 = 'abcde' #length of 5
        str2 = '12345' #also length of 5
        str3 = '($&%'
        table = str.maketrans(str1, str2, str3)
    
        newstr = example.translate(table)
        print(newstr)
    
        #output: 'how4y 2oys 1n4 girls of th5 3l1ss'
                

    In this example we translated the example string by replacing and deleting characters based on the translation table.



    Now let's actually use these methods to remove the special characters from our list of words.

    Since we are only going to remove characters and not replace any, we can pass three arguments into the maketrans() function, where the first two arguments are empty strings ('') and the third argument is a string of all the punctuation characters we want to remove.

    Luckily, Python's string module provides a constant called string.punctuation, which is a string of common punctuation characters.

    First we have to import the string module, so go back up to the top of the file and add this line.

    import nltk
    import string
          

    The next step is to create our translation table.

    clean_text = text.lower()
    
    table = str.maketrans('', '', string.punctuation)
    

    Here we pass in two empty strings since we will make no replacements. We then pass in string.punctuation, which is what we will remove.
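
    If you are curious which characters string.punctuation actually contains, you can simply print it:

    import string

    print(string.punctuation)
    # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~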

    Finally we can call the translate() method on our string of words to remove the punctuation and special characters.

    clean_text = text.lower()
    
    table = str.maketrans('', '', string.punctuation)
    clean_text = clean_text.translate(table)
    
    words = clean_text.split()
    

    Here we use translate() on our clean_text, passing in the translation table as an argument. We then update the original clean_text string with the newly translated clean_text string.

    Now if we run the program again and look at the new graph, the special characters are removed.


    3. Stop Words:

    The last thing we need to do is remove stop words.

    Stop words are words that do not provide useful information in the given context. For example, 'the', 'of', 'in', 'and', etc.

    We can remove these words by creating a new list that doesn't contain any stop words.

    To do this, we can loop through all the words in our words list, check each word to see if it is a stop word, and only append the words to the new list that are not stop words.

    Luckily, nltk has a handy list of common stop words so we can import that and use it to check the words in our list.

    import nltk
    import string
    from nltk.corpus import stopwords
          

    Next, we can create a new list that we will be appending our useful words to.

    words = clean_text.split()
    
    clean_words = []
    
    freq = nltk.FreqDist(words)
    

    Now, we can create a loop that goes through every item in our words list and checks to make sure that the item is not also in the stopwords list.

    We will specifically be checking for English stop words, so we will use the stopwords.words('english') list.

    words = clean_text.split()
    
    clean_words = []
    
    for token in words:
        if token not in stopwords.words('english'):
    
    freq = nltk.FreqDist(words)
    

    Finally, if the word is not in the stopwords list, then we will append it to the clean_words list.

    words = clean_text.split()
    
    clean_words = []
    
    for token in words:
        if token not in stopwords.words('english'):
            clean_words.append(token)
    
    freq = nltk.FreqDist(words)
    

    Now we have our new list which does not contain any stop words.

    Let's now pass this list into the nltk.FreqDist() function to create a plot for this list of words instead of the old list.

    words = clean_text.split()
    
    clean_words = []
    
    for token in words:
        if token not in stopwords.words('english'):
            clean_words.append(token)
    
    freq = nltk.FreqDist(clean_words)
    

    Now run the program again. It will take a slightly longer time to load because the program has to loop through and check every word in our list (over 18,000) for stopwords.

    When the plot loads, you will see that all of the stopwords are removed.


    After cleaning up our data, it is now much easier to see a meaningful distribution of the most used words in the Wikipedia article.

    Unsurprisingly, the most used word in the article (ignoring stopwords) is Nintendo with around 600 uses!

    Reminder: you can also write this for loop in one line by using a filter and a lambda function.

    clean_words = list(filter(lambda word: word not in stopwords.words('english'), words))
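
    One note on speed: checking whether each of our 18,000+ tokens appears in a list is slow, because every lookup scans the whole list. A small sketch of a common optimization is to build a set of stop words once and check against that instead:

    from nltk.corpus import stopwords

    # Build the set once; set membership checks are much faster than list lookups.
    stop_words = set(stopwords.words('english'))

    # 'words' is the list we created earlier with clean_text.split()
    clean_words = [token for token in words if token not in stop_words]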


NLTK Tokenization



    Earlier you learned how to split the text into tokens (tokenization) using the split() function. Now let's learn how we can use NLTK to tokenize text.

    NLTK includes both a sentence tokenizer (splitting text into sentences) and a word tokenizer (splitting text into words).

    First, let's comment out our FreqDist creation and our plotting function since we don't need to create a new graph every time we run the program.

    #freq = nltk.FreqDist(clean_words)
    
    #freq.plot(20, title = 'Frequency Distribution')
    

    We can also comment out the loop here. We only needed it to get clean words for our frequency distribution, and since we aren't plotting the distribution anymore, there's no reason to keep running this slow step.

    #for token in words:
    #    if token not in stopwords.words('english'):
    #        clean_words.append(token)
    
    #freq = nltk.FreqDist(clean_words)
        

    Now, let's go back to the top of the script and import sent_tokenize and word_tokenize from nltk.tokenize.

    These are NLTK's sentence and word tokenizers.

    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize
        

    Now, let's tokenize sentences and words from the original text. We will use the original text instead of clean_text because we want to keep the uppercase letters and punctuation for our sentences.

    To tokenize a string into sentences, we simply use the sent_tokenize() method and pass in the string we want to tokenize as an argument.

    #freq.plot(20, title = 'Frequency Distribution')
    
    sentences = sent_tokenize(text)
    
    print(sentences)
    

    Now we have a list of all the sentences from the text.

    Print the 4th sentence in the list (index 3) to see what it says.

    Note: The result you receive may be different if the web page has changed. You will still see a sentence, but it will not be the same sentence as shown in this example.

    sentences = sent_tokenize(text)
    
    print(sentences[3])
    


    Great, we have a sentence! If you look at the Wikipedia page, you will notice that this is actually the second sentence of the first paragraph.


    NLTK does a good job of tokenizing sentences based on certain punctuation, such as periods, exclamation marks, and question marks. However, this means that wherever the text isn't separated by such punctuation, sentences will not be split properly.

    Index 6 of the sentences list contains the actual first sentence of the paragraph.

    sentences = sent_tokenize(text)
    
    print(sentences[6])
    


    As you can see, we have the first sentence, but there is also a lot of other text with it. This text is actually from the Wikipedia information box, which does not have the necessary punctuation for NLTK to differentiate it.


    Problems like this would need to be solved by manually splitting the long sentence up into the actual sentences we want.

    For example, we could take this long sentence, split it up into words, then take only the words from the sentence we want. We could then concatenate these words back together, with a space in between each one to recreate the desired sentence.

    sentences = sent_tokenize(text)
    
    sentence_words = sentences[6].split()
    first_sentence = ' '.join(sentence_words[-18:])
    

    Here we split the long sentence (sentences[6]) into a list of words and store them in sentence_words.

    If we read the actual first sentence of the Wikipedia page, we can count that there are 18 words (including the final period) in the sentence.

    Therefore we use slicing to get the last 18 words of the long sentence (the actual words we want) and join them together with a space (' ') in between each word. This newly constructed sentence is then stored in the first_sentence variable.

    Let's print this sentence to check that it is correct.

    sentence_words = sentences[6].split()
    first_sentence = ' '.join(sentence_words[-18:])
    
    print(first_sentence)
    


    You may have also noticed that when we originally tokenized the sentences, words like 'Co.' or 'Ltd.' did not split up the sentences even though they end with a period.

    This is why NLTK tokenization is so useful. It can clearly differentiate between punctuation used to end a sentence, and punctuation used in words.

    Now let's try tokenizing our text into words using word_tokenize and passing in the original text as an argument.

    sentence_words = sentences[6].split()
    first_sentence = ' '.join(sentence_words[-18:])
    
    nltk_words = word_tokenize(text)
    
    print(nltk_words)
    


    Just like Python's split() method, word_tokenize() splits our text into a list of individual words.

    Whether you use split() or NLTK to tokenize words depends on your needs and preferences. For tokenizing sentences, however, NLTK is much preferred.
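
    One practical difference worth seeing: word_tokenize() splits punctuation off into its own tokens, while split() leaves it attached to the neighboring word. A quick sketch:

    from nltk.tokenize import word_tokenize

    sample = "Hello, world! Nintendo was founded in 1889."

    print(sample.split())
    # ['Hello,', 'world!', 'Nintendo', 'was', 'founded', 'in', '1889.']

    print(word_tokenize(sample))
    # ['Hello', ',', 'world', '!', 'Nintendo', 'was', 'founded', 'in', '1889', '.']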

    Another benefit of NLTK is that it can easily tokenize text from non-English languages. All you have to do is pass in the language of the text as another parameter.

    nltk_words = word_tokenize(text)
    
    french_text = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
    french_sentences = sent_tokenize(french_text,"french")
    
    print(french_sentences)
    


WordNet



    When we installed NLTK packages using nltk.download(), one of those packages was WordNet.

    WordNet is a database built for natural language processing that includes words with groups of synonyms and brief descriptions. It is basically a large dictionary for the English language, specifically designed for natural language processing.

    Let's start by importing WordNet into the program.

    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import wordnet
        

    Now you can look up words using the wordnet.synsets() function by passing in a word as an argument.

    This returns a list of Synset objects. Each Synset represents one meaning of the word passed in, together with its synonyms, so a word with multiple meanings will have multiple Synsets.

    When you pass in a word, NLTK will look up that word in WordNet. If a word cannot be found, the list will be empty. Some words only have one Synset, and some words have multiple.

    french_sentences = sent_tokenize(french_text,"french")
    
    syn = wordnet.synsets('boy')
    

    Let's check how many Synsets were found for the word 'boy'. We can also print the list to see what was found.

    syn = wordnet.synsets('boy')
    
    print(len(syn))
    print(syn)
    


    As you can see, there are 4 Synsets for the word 'boy'.

    We can use these Synset objects for many things. For example:

    We can get the name of a Synset using the name() method.

    We can get a definition of the word using the definition() method.

    We can also get a list of examples of the word using the examples() method.


    Remember that since the Synset objects are in a list, we have to index the Synset that we want.

    syn = wordnet.synsets('boy')
    
    print('Name: ', syn[0].name())
    print('\nDefinition: ', syn[0].definition())
    print('\nExamples: ', syn[0].examples())
    


    Now let's take a look at the second Synset for the word 'boy' (index 1).

    syn = wordnet.synsets('boy')
    
    print('Name: ', syn[1].name())
    print('\nDefinition: ', syn[1].definition())
    print('\nExamples: ', syn[1].examples())
    


    As you can see, this is still the word 'boy', but it has a different meaning.

    When two words have the same spelling and pronunciation but have different meanings, they are called homonyms.

    WordNet can also be used to get synonymous words using the lemmas() method. Let's find and print the synonyms for the word 'computer'.

    synonyms = []
    
    for syn in wordnet.synsets('computer'):
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())
    
    print(synonyms)
    


    The lemmas() method returns a list of Lemma objects. (We will look more at lemmas in the lemmatization section.)

    In this example, we looped through all the lemmas in each Synset for 'computer' and appended lemma.name() (the actual synonym) to the synonyms list.

    We can also get antonyms of words in a similar way. We just have to check each lemma to see if it has any antonyms before appending them to the list. We can do this with the Lemma object's antonyms() method.

    antonyms = []
    
    for syn in wordnet.synsets('small'):
        for lemma in syn.lemmas():
            if lemma.antonyms():
                antonyms.append(lemma.antonyms()[0].name())
    
    print(antonyms)
    


    As you can see, we get a list of antonyms for the word 'small'. Sometimes we will get duplicate antonyms or synonyms because different Synsets may still have the same antonyms or synonyms.
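
    If the duplicates bother you, one simple fix is to pass the list through set(), which keeps only the unique entries. A small sketch building on the synonyms list above:

    # Keep only the unique synonyms; note that a set has no particular order.
    unique_synonyms = set(synonyms)

    print(unique_synonyms)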

Stemming



    Stemming is the process of removing suffixes from words in order to normalize them and reduce them. What we are left with is the root or stem of the word.

    For example, stemming the words computational, computed, and computing would all result in comput since this is the non-changing part of the word.

    There are many different stemming algorithms, and we can access them through NLTK. The most widely used is the Porter stemming algorithm.

    As always, we first have to import this module into our program.

    from nltk.corpus import wordnet
    from nltk.stem import PorterStemmer
        

    First we have to create a PorterStemmer object; then we can use its stem() method to stem any word that we pass into it.

    stemmer = PorterStemmer()
    
    print(stemmer.stem('eating'))
    


    As you can see, stemming the word 'eating' returns the word 'eat'.
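
    You can also check the earlier claim that computational, computed, and computing all share the same stem. A quick sketch:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    # All three related forms should reduce to the same stem, 'comput'.
    for word in ['computational', 'computed', 'computing']:
        print(word, '->', stemmer.stem(word))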

    Another algorithm we can use is the Snowball stemming algorithm. The SnowballStemmer supports over a dozen languages besides English.

    Let's import this module as well.

    from nltk.stem import PorterStemmer
    from nltk.stem import SnowballStemmer
        

    We can check the supported languages using the languages attribute.

    print(SnowballStemmer.languages)
    


    To be able to stem with the SnowballStemmer, we have to create a SnowballStemmer object.

    We can also specify what language we want to stem by passing it in as an argument.

    french_stemmer = SnowballStemmer('french')
    
    print(french_stemmer.stem("manger"))
    


    Here we create a French SnowballStemmer and stem the word 'manger' (French for 'eat').

    The word 'mang' is returned as the stem of 'manger'.

Lemmatization



    We previously mentioned lemmas and lemmatization, so now let's explain exactly what they mean.

    Lemmatization is similar to stemming but it takes context into account when creating the stems. It is more complex as it needs to look up and fetch the exact word from a dictionary to get its meaning before it can create the stem.

    A lemma is just the root/dictionary form of a word. Stemming just returns the word with the suffix removed (whether the result is a word or not) while lemmatization returns the actual lemma (root) of the word.

    As always, let's import the module that will be used. In this case, we can use WordNetLemmatizer from nltk.stem.

    from nltk.stem import SnowballStemmer
    from nltk.stem import WordNetLemmatizer
        

    Now, let's create a WordNetLemmatizer object and use its lemmatize function on a word.

    We can also compare this to the regular stemming of the same word.

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    
    print('Stemming: ', stemmer.stem('increases'))
    print('Lemmatizing: ', lemmatizer.lemmatize('increases'))
    


    As you can see, stemming returns 'increas' by just removing the suffix 'es', whereas lemmatization returns 'increase' because it looks up the word and finds the lemma (dictionary form).

    Lemmatization might also return a synonym or a different word with the same meaning. Sometimes, if you try to lemmatize a word such as 'playing', you will just get the same word back.

    lemmatizer = WordNetLemmatizer()
    
    print(lemmatizer.lemmatize('playing'))
    


    This is because the default part of speech is nouns. Remember, a part of speech (POS) is just the role of a word in a sentence. This can be:

    Nouns - n
    Verbs - v
    Adjectives - a
    Adverbs - r

    Since 'playing' is a verb, we have to specify that it is a verb when lemmatizing it. We can do this by setting the pos parameter to 'v'.

    lemmatizer = WordNetLemmatizer()
    
    print(lemmatizer.lemmatize('playing', pos='v'))
    


    Perfect! The word has now been lemmatized correctly as a verb.
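
    As a side note, if you would rather not remember the single-letter codes, the wordnet module we imported earlier exposes them as named constants (they are just these same strings). A small sketch:

    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    # wordnet.VERB is simply the string 'v', so this is equivalent to pos='v'
    print(lemmatizer.lemmatize('playing', pos=wordnet.VERB))   # play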

Additional Data Sources



    Now that you know how to clean up text data, you can find other data sources to use to perform this type of analysis.

    Here are some data sources that you could use, but also feel free to look up your own!
    1. Project Gutenberg has public-domain fiction and non-fiction texts.
    2. Kaggle has text datasets that can go through this type of cleanup too.


    Looking for something a little more fun? There are also television scripts that you could parse for their content. You can parse the text and create datasets that include all of the dialogue from a television show, to later use in a text-generation model.

    Because these television shows have a limited cast of characters, you could potentially give each character its own text-generation model that responds based on what the previous character said.