Natural Language Processing

This lesson will introduce you to human-computer interaction with natural language processing.

What is NLP?

    Natural Language Processing (or NLP) is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way.

    NLP can be utilized for understanding what people are writing about, translating text from one language to another, recognizing spoken words and turning them into text, and much more.

    Now let's take a look at some of the most common NLP concepts.


Tokenization

    Tokenization is when you divide a sentence into chunks of words. Usually it is the first task performed in natural language processing.

    For example, take the sentence My name is David. Does this sentence contain a name? Does it contain a country? You might have a separate list of names that you can check each word against to see if it is a name. How many nouns does the sentence have? Does it have any verbs? If you want to understand a sentence, you need to break it down into its words.

    Tokenization can be performed at two levels: word-level and sentence-level.

    Word-level Tokenization

    Word-level tokenization returns a set of words from a sentence.

    E.g. tokenizing the sentence My name is David returns the following set of words:

    words = ['My', 'name', 'is', 'David']

    From this breakdown, it would be easy for the computer to look through the list to find David and know that the sentence contains a name.
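A word-level tokenizer can be sketched in a few lines of Python. This is a deliberately minimal version using the standard library's `re` module; real tokenizers (such as those in NLP libraries) handle punctuation, contractions, and edge cases much more carefully.

```python
import re

def word_tokenize(sentence):
    # Keep runs of letters, digits, and apostrophes as tokens.
    # A naive sketch: real tokenizers treat punctuation more carefully.
    return re.findall(r"[A-Za-z0-9']+", sentence)

words = word_tokenize("My name is David")
print(words)             # ['My', 'name', 'is', 'David']
print('David' in words)  # True - the sentence contains a known name
```

With the sentence split into a list, checking for a name becomes a simple membership test.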

    Sentence-level Tokenization

    At the sentence level, tokenization returns a set of sentences from a document.

    E.g. tokenizing the document with the text: My name is David. I am 21 years old. I live in the UK. returns the following set of sentences:

    S1 = My name is David.
    S2 = I am 21 years old.
    S3 = I live in the UK.

    By breaking this text down into sentences, you can make decisions about each sentence individually. For example, sentence 1 is 4 words long, while sentences 2 and 3 are both 5 words long. Breaking text down into sentences also allows you to break each sentence down into its words. By breaking this text into sentences and each sentence into words, you can tell that sentence 1 contains a name and sentence 3 contains a country.
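A sentence-level tokenizer can be sketched with a regular expression split. This naive version assumes sentences end with `.`, `!`, or `?` followed by a space; real sentence splitters must also handle abbreviations such as "Dr." or "U.S.".

```python
import re

def sent_tokenize(document):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Naive: abbreviations like "Dr." would be wrongly split.
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]

doc = "My name is David. I am 21 years old. I live in the UK."
sentences = sent_tokenize(doc)
for s in sentences:
    print(len(s.rstrip(".").split()), "words:", s)
```

Running this prints the per-sentence word counts from the example above: 4 words for sentence 1 and 5 words each for sentences 2 and 3.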

Stop Word Removal

    Stop words are words that do not provide any useful information for your data analysis.

    E.g. if you are developing an emotion detection software, words such as is, am and the do not give information related to emotions.

    If you have a sentence that says I am feeling happy today, I and am do not give emotion-related information. However, I may still be important for identifying who the subject of the sentence is.

    There is no universal list of stop words to remove; which words count as stop words depends on your application.

    In NLP, every word needs processing. Removing stop words saves processing time when you have a lot of words to process. It also prevents you from incorrectly training your machine learning algorithm.

    For example, when you are writing the emotion detection software referenced above and train a machine learning algorithm on a set of text, the algorithm might mistakenly identify sentences that use the word the as having a particular emotion, even though you know the has no effect on a sentence's emotion.

    In addition, it will take up the machine learning algorithm's processing power to try to interpret which emotion the is supposed to represent. Before you start performing long-running programs on text data, it is usually a good idea to find the top 50 or 100 words in the document, and see if any of them would be good to remove from your analysis and algorithm training.
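Both ideas, filtering against a stop-word list and checking the most frequent words first, fit in a short sketch. The stop-word set below is a small illustrative example, not a standard list; as noted above, the right list depends on your application.

```python
from collections import Counter

# A small example stop-word list (an assumption for illustration;
# there is no universal list).
STOP_WORDS = {"i", "am", "is", "are", "the", "a", "an", "to", "and"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "I am feeling happy today and the weather is great".split()
print(remove_stop_words(tokens))  # ['feeling', 'happy', 'today', 'weather', 'great']

# Inspect the most frequent words to decide what else to remove.
print(Counter(t.lower() for t in tokens).most_common(3))
```

The `most_common` check is a quick way to do the "top 50 or 100 words" review suggested above before training anything.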


Stemming

    Stemming is the process of removing suffixes from words in order to normalize and reduce them.

    E.g. stemming the words computational, computed, and computing would all result in comput since this is the non-changing part of the word.

    Why would you want to do this? This enables you to summarize your data and put words into common groups. This can be helpful for machine learning algorithms. If you wanted a machine learning algorithm to detect whether a person was talking about computers, the algorithm would be strongly reinforced by comput, since that is a common stemmed word across different texts about computing.

    If you don't stem the word, the algorithm will believe that computational, computed, and computing all represent completely different things, even though you as a person know they are very similar.

    When thinking about which words to stem, you want to look at a word list that is the complete opposite of your stop word list. Instead of finding the most frequently used words, find words that only appear once or twice inside of your text. If stemming those words won't change how you think about the meaning of the text for your analysis, stemming them will improve the computer's ability to interpret the text.
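The idea of suffix stripping can be sketched as below. This is a toy rule set chosen for the example words above; real stemmers, such as the well-known Porter stemmer, apply many ordered rules with careful length and vowel checks.

```python
def stem(word):
    # Strip the first matching suffix, but only if a reasonably
    # long stem remains. A toy sketch, not a real stemming algorithm.
    for suffix in ("ational", "ation", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("computational", "computed", "computing"):
    print(w, "->", stem(w))  # all three map to 'comput'
```

Because all three words map to the same stem, an algorithm counting word frequencies now sees one strong signal instead of three weak ones.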


Lemmatization

    Lemmatization is similar to stemming, but it takes context into account while reducing the words.

    It is more complex as it needs to look up and fetch the exact word from a dictionary to get its meaning.

    E.g. for the word worse, lemmatization returns bad as the context of the word is taken into account. Therefore it knows that worse is an adjective and is the comparative form of the word bad.

    However, stemming will return the word worse as it is.
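The dictionary-lookup idea behind lemmatization can be illustrated with a tiny hand-built table. The entries below are an assumption for the example; real lemmatizers look words up in a full dictionary (along with their part of speech) rather than a handful of pairs.

```python
# A tiny, hand-built lemma dictionary for illustration only.
LEMMAS = {
    "worse": "bad", "worst": "bad",
    "better": "good", "best": "good",
    "mice": "mouse", "ran": "run",
}

def lemmatize(word):
    # Return the dictionary form if known, otherwise the word itself.
    return LEMMAS.get(word.lower(), word)

print(lemmatize("worse"))  # 'bad'
print(lemmatize("mice"))   # 'mouse'
```

Note the contrast with the stemming sketch earlier: a stemmer can only strip suffixes, so it would leave worse unchanged, while the lookup maps it to bad.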

    Both stemming and lemmatization are useful for finding the semantic similarity between different pieces of text.

    There are a huge number of words in the English language. If your machine learning algorithm needed to know over 100,000 different words and try to summarize texts, it would be very difficult to process and draw conclusions.

    If you instead simplify the English language down into basic words, your program will be able to process information much faster. To see a basic list of words, you can take a look at the Wikipedia Basic Words List.

Parts of Speech

    Each word in a sentence has a specific role. E.g. boy is a noun and eat is a verb. These are the parts of speech (or POS).

    An important NLP task is assigning parts of speech tags to the words.

    POS tagging helps to construct grammatically correct sentences and identify contexts.

    A POS tagger labels words with their corresponding parts of speech.

    For instance, laptop, mouse, and keyboard are tagged as nouns. eating and playing are verbs while good and bad are tagged as adjectives.

    This is a hard task to perform because many words have multiple meanings and so could be different parts of speech depending on context.
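A toy lookup-based tagger shows the input and output of the task. Because it ignores context entirely, it is exactly the kind of tagger the paragraph above warns about; real taggers use statistical models trained on tagged text.

```python
# Fixed word-to-tag table for illustration; a real tagger must
# resolve ambiguous words (e.g. 'play' as noun vs. verb) from context.
TAGS = {
    "laptop": "noun", "mouse": "noun", "keyboard": "noun",
    "eating": "verb", "playing": "verb",
    "good": "adjective", "bad": "adjective",
}

def pos_tag(tokens):
    return [(t, TAGS.get(t.lower(), "unknown")) for t in tokens]

print(pos_tag(["laptop", "eating", "good"]))
# [('laptop', 'noun'), ('eating', 'verb'), ('good', 'adjective')]
```

The output format, a list of (word, tag) pairs, is the conventional result of POS tagging.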

Named Entity Recognition

    Named entity recognition refers to the process of classifying entities into predefined categories such as person, location, organization, vehicle, etc.

    For instance, take the sentence Mark Zuckerberg is the CEO of Facebook. A typical named entity recognizer will return the following information about this sentence:

    Mark Zuckerberg -> Person
    CEO -> Position
    Facebook -> Organization

    Named entity recognition is important for topic modeling where a program can automatically detect the topic of a document based on the entities within.
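A minimal sketch of the idea is a gazetteer lookup: scan the sentence for entities from a known list. The list below is an assumption for this one example; production recognizers use trained models so they can label entities they have never seen before.

```python
# A fixed entity list (a "gazetteer") for illustration only.
ENTITIES = {
    "Mark Zuckerberg": "Person",
    "CEO": "Position",
    "Facebook": "Organization",
}

def recognize(sentence):
    # Return (entity, label) pairs for every known entity found.
    return [(e, label) for e, label in ENTITIES.items() if e in sentence]

for entity, label in recognize("Mark Zuckerberg is the CEO of Facebook."):
    print(f"{entity} -> {label}")
```

Collecting these labeled entities per document is the first step toward the topic modeling use case described above.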

NLP Applications

    NLP is an exciting area for machine learning because it is still so challenging to write programs that can accurately interpret text. Reading comes naturally to humans after years of practice, but we ask computers to interpret text in a couple of minutes and come back with answers.

    Text Classification

    If you have an email account, you might occasionally get unwanted emails known as spam. These emails may try to sell you something you don't want or promise you money if you submit payment to an address.

    Below is an example of part of a spam email, from a Kaggle Spam Email Dataset.

    The MAJOR PLAYERS are on This ONE
    For ONCE be where the PlayerS are
    This is YOUR Private Invitation

    Leverage $1,000 into $50,000 Over and Over Again

    Okay, this message is full of weird capitalized text and claims of letting you make lots of money. Natural language processing can be used to look through hundreds of thousands of emails like this so that they can be categorized as spam before you ever have to read them.

    By classifying text with computers, you free up people to read what they want to read, and ignore emails not worth paying attention to.
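A crude keyword-based check conveys the idea of text classification. The signal words below are assumptions drawn from the example email; real spam filters are trained on many thousands of labeled emails (for example with naive Bayes classifiers) rather than a hand-picked list.

```python
# Hand-picked signal words for illustration only.
SPAM_SIGNALS = {"leverage", "private", "invitation", "$", "players"}

def looks_like_spam(text, threshold=2):
    # Count words containing any signal; flag the email if the
    # count reaches the threshold. A toy sketch, not a real filter.
    words = text.lower().split()
    score = sum(1 for w in words if any(s in w for s in SPAM_SIGNALS))
    return score >= threshold

email = ("The MAJOR PLAYERS are on This ONE. For ONCE be where the PlayerS are. "
         "This is YOUR Private Invitation. Leverage $1,000 into $50,000.")
print(looks_like_spam(email))                                   # True
print(looks_like_spam("Hi David, see you at lunch tomorrow."))  # False
```

Lowercasing the text first means the weird capitalization in the example email doesn't help it evade the check.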

    Text Generation

    Text generation takes in source content, such as a book or a television or movie script, and then generates text based on the words present in that text.

    For example, a text generation algorithm generated a script for a fake episode of the TV show The X-Files by reading text from existing scripts.

    [Image: an excerpt from the computer-generated X-Files script]

    While this script isn't going to win any major writing awards, the text follows a basic logic, including both character dialogue and scene descriptions.

    Text generation tends to work best at smaller scales, since computers have a hard time thinking about the broader themes in writing.
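One classic way to generate text from source content is a Markov chain: record which words follow each word in the source, then walk the chain. This sketch uses a one-line example sentence as its "script"; it also shows why such generators lose the plot at larger scales, since each word is chosen only from its immediate predecessor.

```python
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words that follow it in the source.
    chain = defaultdict(list)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=8, seed=0):
    random.seed(seed)  # fixed seed so the sketch is reproducible
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: no word ever followed this one
        out.append(random.choice(followers))
    return " ".join(out)

source = "the truth is out there and the truth is strange"
chain = build_chain(source)
print(generate(chain, "the"))
```

Every generated word is locally plausible (it really did follow the previous word somewhere in the source), but nothing enforces a broader theme, which matches the limitation described above.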