Natural Language Processing

This lesson will introduce you to human-computer interaction with natural language processing.

What is NLP?

    Natural Language Processing (or NLP) is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way.

    NLP can be utilized for understanding what people are writing about, translating text from one language to another, recognizing spoken words and turning them into text, and much more.

    Now let's take a look at some of the most common NLP concepts.


Tokenization

    Tokenization is when you divide a sentence into chunks of words. Usually it is the first task performed in natural language processing.

    For example, take the sentence My name is David. Does this sentence contain a name? Does it contain a country? You might have a separate list of names that you can check each word against to see if it is a name. How many nouns does the sentence have? Does it have any verbs? If you want to understand a sentence, you need to break it down into its words.

    Tokenization can be performed at two levels: word-level and sentence-level.

    Word-level Tokenization

    Word-level tokenization returns a set of words from a sentence.

    E.g. tokenizing the sentence My name is David returns the following set of words:

    words = ['My', 'name', 'is', 'David']

    From this breakdown, it would be easy for the computer to look through the list to find David and know that the sentence contains a name.
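A word-level tokenizer can be sketched in a few lines of Python. This is a deliberately minimal version using the standard library's `re` module; real tokenizers (such as those in NLP libraries) handle punctuation, contractions, and edge cases much more carefully.

```python
import re

def word_tokenize(sentence):
    # Keep runs of letters, digits, and apostrophes as tokens.
    # A naive sketch: real tokenizers treat punctuation more carefully.
    return re.findall(r"[A-Za-z0-9']+", sentence)

words = word_tokenize("My name is David")
print(words)             # ['My', 'name', 'is', 'David']
print('David' in words)  # True - the sentence contains a known name
```

With the sentence split into a list, checking for a name becomes a simple membership test.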

    Sentence-level Tokenization

    At the sentence level, tokenization returns a set of sentences from a document.

    E.g. tokenizing the document with the text: My name is David. I am 21 years old. I live in the UK. returns the following set of sentences:

    S1 = My name is David.
    S2 = I am 21 years old.
    S3 = I live in the UK.

    By breaking this text down into sentences, you can make decisions about each sentence individually. For example, sentence 1 is 4 words long, while sentences 2 and 3 are both 5 words long. Breaking text down into sentences also allows you to break each sentence down into its words. By breaking this text into sentences and each sentence into words, you can tell that sentence 1 contains a name and sentence 3 contains a country.
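A sentence-level tokenizer can be sketched with a regular expression split. This naive version assumes sentences end with `.`, `!`, or `?` followed by a space; real sentence splitters must also handle abbreviations such as "Dr." or "U.S.".

```python
import re

def sent_tokenize(document):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Naive: abbreviations like "Dr." would be wrongly split.
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]

doc = "My name is David. I am 21 years old. I live in the UK."
sentences = sent_tokenize(doc)
for s in sentences:
    print(len(s.rstrip(".").split()), "words:", s)
```

Running this prints the per-sentence word counts from the example above: 4 words for sentence 1 and 5 words each for sentences 2 and 3.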

Stop Word Removal

    Stop words are words that do not provide any useful information for your data analysis.

    E.g. if you are developing an emotion detection software, words such as is, am and the do not give information related to emotions.

    If you have a sentence that says I am feeling happy today, I and am do not give emotion-related information. However, I may still be important for identifying who the subject of the sentence is.

    There is no universal list of stop words to remove; which words count as stop words depends on your application.

    In NLP, every word needs processing. Removing stop words saves processing time when you have a lot of words to process. It also prevents you from incorrectly training your machine learning algorithm.

    For example, when you are writing the emotion detection software referenced above and train a machine learning algorithm on a set of text, the algorithm might mistakenly identify sentences that use the word the as having a particular emotion, even though you know the has no effect on a sentence's emotion.

    In addition, it will take up the machine learning algorithm's processing power to try to interpret which emotion the is supposed to represent. Before you start performing long-running programs on text data, it is usually a good idea to find the top 50 or 100 words in the document, and see if any of them would be good to remove from your analysis and algorithm training.
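Both ideas, filtering against a stop-word list and checking the most frequent words first, fit in a short sketch. The stop-word set below is a small illustrative example, not a standard list; as noted above, the right list depends on your application.

```python
from collections import Counter

# A small example stop-word list (an assumption for illustration;
# there is no universal list).
STOP_WORDS = {"i", "am", "is", "are", "the", "a", "an", "to", "and"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "I am feeling happy today and the weather is great".split()
print(remove_stop_words(tokens))  # ['feeling', 'happy', 'today', 'weather', 'great']

# Inspect the most frequent words to decide what else to remove.
print(Counter(t.lower() for t in tokens).most_common(3))
```

The `most_common` check is a quick way to do the "top 50 or 100 words" review suggested above before training anything.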


Stemming

    Stemming is the process of removing suffixes from words in order to normalize and reduce them.

    E.g. stemming the words computational, computed, and computing would all result in comput since this is the non-changing part of the word.

    Why would you want to do this? This enables you to summarize your data and put words into common groups. This can be helpful for machine learning algorithms. If you wanted a machine learning algorithm to detect whether a person was talking about computers, the algorithm would be strongly reinforced by comput, since that is a common stemmed word across different texts about computing.

    If you don't stem the word, the algorithm will believe that computational, computed, and computing all represent completely different things, even though you as a person know they are very similar.

    When thinking about which words to stem, you want to look at a word list that is the complete opposite of your stop word list. Instead of finding the most frequently used words, find words that only appear once or twice inside of your text. If stemming those words won't change how you think about the meaning of the text for your analysis, stemming them will improve the computer's ability to interpret the text.
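The idea of suffix stripping can be sketched as below. This is a toy rule set chosen for the example words above; real stemmers, such as the well-known Porter stemmer, apply many ordered rules with careful length and vowel checks.

```python
def stem(word):
    # Strip the first matching suffix, but only if a reasonably
    # long stem remains. A toy sketch, not a real stemming algorithm.
    for suffix in ("ational", "ation", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("computational", "computed", "computing"):
    print(w, "->", stem(w))  # all three map to 'comput'
```

Because all three words map to the same stem, an algorithm counting word frequencies now sees one strong signal instead of three weak ones.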


Lemmatization

    Lemmatization is similar to stemming, but it takes context into account while reducing the words.

    It is more complex as it needs to look up and fetch the exact word from a dictionary to get its meaning.

    E.g. for the word worse, lemmatization returns bad as the context of the word is taken into account. Therefore it knows that worse is an adjective and is the comparative form of the word bad.

    However, stemming will return the word worse as it is.
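The dictionary-lookup idea behind lemmatization can be illustrated with a tiny hand-built table. The entries below are an assumption for the example; real lemmatizers look words up in a full dictionary (along with their part of speech) rather than a handful of pairs.

```python
# A tiny, hand-built lemma dictionary for illustration only.
LEMMAS = {
    "worse": "bad", "worst": "bad",
    "better": "good", "best": "good",
    "mice": "mouse", "ran": "run",
}

def lemmatize(word):
    # Return the dictionary form if known, otherwise the word itself.
    return LEMMAS.get(word.lower(), word)

print(lemmatize("worse"))  # 'bad'
print(lemmatize("mice"))   # 'mouse'
```

Note the contrast with the stemming sketch earlier: a stemmer can only strip suffixes, so it would leave worse unchanged, while the lookup maps it to bad.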

    Both stemming and lemmatization are useful for finding the semantic similarity between different pieces of text.

    There are a huge number of words in the English language. If your machine learning algorithm needed to know over 100,000 different words and try to summarize texts, it would be very difficult to process and draw conclusions.

    If you instead simplify the English language down into basic words, your program will be able to process information much faster. To see a basic list of words, you can take a look at the Wikipedia Basic Words List.

Parts of Speech

    Each word in a sentence has a specific role. E.g. boy is a noun and eat is a verb. These are the parts of speech (or POS).

    An important NLP task is assigning parts of speech tags to the words.

    POS tagging helps to construct grammatically correct sentences and identify contexts.

    A POS tagger labels words with their corresponding parts of speech.

    For instance, laptop, mouse, and keyboard are tagged as nouns. eating and playing are verbs while good and bad are tagged as adjectives.

    This is a hard task to perform because many words have multiple meanings and so could be different parts of speech depending on context.
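A toy lookup-based tagger shows the input and output of the task. Because it ignores context entirely, it is exactly the kind of tagger the paragraph above warns about; real taggers use statistical models trained on tagged text.

```python
# Fixed word-to-tag table for illustration; a real tagger must
# resolve ambiguous words (e.g. 'play' as noun vs. verb) from context.
TAGS = {
    "laptop": "noun", "mouse": "noun", "keyboard": "noun",
    "eating": "verb", "playing": "verb",
    "good": "adjective", "bad": "adjective",
}

def pos_tag(tokens):
    return [(t, TAGS.get(t.lower(), "unknown")) for t in tokens]

print(pos_tag(["laptop", "eating", "good"]))
# [('laptop', 'noun'), ('eating', 'verb'), ('good', 'adjective')]
```

The output format, a list of (word, tag) pairs, is the conventional result of POS tagging.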

Named Entity Recognition

    Named entity recognition refers to the process of classifying entities into predefined categories such as person, location, organization, vehicle, etc.

    For instance, take the sentence Mark Zuckerberg is the CEO of Facebook. A typical named entity recognizer will return the following information about this sentence:

    Mark Zuckerberg -> Person
    CEO -> Position
    Facebook -> Organization

    Named entity recognition is important for topic modeling where a program can automatically detect the topic of a document based on the entities within.
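A minimal sketch of the idea is a gazetteer lookup: scan the sentence for entities from a known list. The list below is an assumption for this one example; production recognizers use trained models so they can label entities they have never seen before.

```python
# A fixed entity list (a "gazetteer") for illustration only.
ENTITIES = {
    "Mark Zuckerberg": "Person",
    "CEO": "Position",
    "Facebook": "Organization",
}

def recognize(sentence):
    # Return (entity, label) pairs for every known entity found.
    return [(e, label) for e, label in ENTITIES.items() if e in sentence]

for entity, label in recognize("Mark Zuckerberg is the CEO of Facebook."):
    print(f"{entity} -> {label}")
```

Collecting these labeled entities per document is the first step toward the topic modeling use case described above.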

NLP Applications

    NLP is an exciting area for machine learning because it is still so challenging to write programs that can accurately interpret text. Reading comes naturally to humans after years of practice, but we ask computers to interpret text in a couple of minutes and come back with answers.

    Text Classification

    If you have an email account, you might occasionally get unwanted emails known as spam. These emails may try to sell you something you don't want or promise you money if you submit payment to an address.

    Below is an example of part of a spam email, from a Kaggle Spam Email Dataset.

    The MAJOR PLAYERS are on This ONE
    For ONCE be where the PlayerS are
    This is YOUR Private Invitation

    Leverage $1,000 into $50,000 Over and Over Again

    Okay, this message is full of weird capitalized text and claims of letting you make lots of money. Natural language processing can be used to look through hundreds of thousands of emails like this so that they can be categorized as spam before you ever have to read them.

    By classifying text with computers, you free up people to read what they want to read, and ignore emails not worth paying attention to.
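A crude keyword-based check conveys the idea of text classification. The signal words below are assumptions drawn from the example email; real spam filters are trained on many thousands of labeled emails (for example with naive Bayes classifiers) rather than a hand-picked list.

```python
# Hand-picked signal words for illustration only.
SPAM_SIGNALS = {"leverage", "private", "invitation", "$", "players"}

def looks_like_spam(text, threshold=2):
    # Count words containing any signal; flag the email if the
    # count reaches the threshold. A toy sketch, not a real filter.
    words = text.lower().split()
    score = sum(1 for w in words if any(s in w for s in SPAM_SIGNALS))
    return score >= threshold

email = ("The MAJOR PLAYERS are on This ONE. For ONCE be where the PlayerS are. "
         "This is YOUR Private Invitation. Leverage $1,000 into $50,000.")
print(looks_like_spam(email))                                   # True
print(looks_like_spam("Hi David, see you at lunch tomorrow."))  # False
```

Lowercasing the text first means the weird capitalization in the example email doesn't help it evade the check.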

    Text Generation

    Text generation takes in source content, such as a book or a television or movie script, and then generates text based on the words present in that text.

    For example, a text generation algorithm generated a script for a fake episode of the TV show The X-Files by reading text from existing scripts.

    [Image: an excerpt from the computer-generated X-Files script]

    While this script isn't going to win any major writing awards, the text follows a basic logic, including both character dialogue and scene descriptions.

    Text generation tends to work best at smaller scales, since computers have a hard time thinking about the broader themes in writing.
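One classic way to generate text from source content is a Markov chain: record which words follow each word in the source, then walk the chain. This sketch uses a one-line example sentence as its "script"; it also shows why such generators lose the plot at larger scales, since each word is chosen only from its immediate predecessor.

```python
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words that follow it in the source.
    chain = defaultdict(list)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=8, seed=0):
    random.seed(seed)  # fixed seed so the sketch is reproducible
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: no word ever followed this one
        out.append(random.choice(followers))
    return " ".join(out)

source = "the truth is out there and the truth is strange"
chain = build_chain(source)
print(generate(chain, "the"))
```

Every generated word is locally plausible (it really did follow the previous word somewhere in the source), but nothing enforces a broader theme, which matches the limitation described above.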