Day 3 AM Session Plan

This session introduces students to natural language processing with web content.

Presentation Slides

Session Introduction


    What Are We Doing?

    Getting Data from Websites
    Identifying HTML Elements
    Tokenizing and Processing Text Content

Web Scraping


    Students will learn:

    What web scraping is
    Looking at website code
    Downloading web pages using code
    Extracting web page information through tags
    Extracting web page information with classes and IDs
    Extracting web page information from tables
    Performing additional cleaning actions on the dataset so that it can be used for data analysis
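
The extraction topics above (tags, classes and IDs, tables) can be sketched with Python's standard-library html.parser module. This is only an illustration under stated assumptions: the plan does not name a scraping library, and the sample page, the "title" ID, the "intro" class, and the PageExtractor class are all hypothetical. A real session would download the page first and would likely use a dedicated parsing library.

```python
from html.parser import HTMLParser

# Hypothetical page content, used in place of a live download so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">City Populations</h1>
  <p class="intro">Figures are illustrative only.</p>
  <table>
    <tr><th>City</th><th>Population</th></tr>
    <tr><td>Springfield</td><td>30000</td></tr>
    <tr><td>Shelbyville</td><td>25000</td></tr>
  </table>
</body></html>
"""


class PageExtractor(HTMLParser):
    """Collects text selected by ID, by class, and from table cells."""

    def __init__(self):
        super().__init__()
        self.title = ""      # text of the element with id="title" (assumed ID)
        self.intro = ""      # text of elements with class "intro" (assumed class)
        self.rows = []       # one list of cell strings per table row
        self._target = None  # which field the current text belongs to
        self._row = None     # cells of the row currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "title":
            self._target = "title"
        elif "intro" in (attrs.get("class") or "").split():
            self._target = "intro"
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._target = "cell"

    def handle_data(self, data):
        text = data.strip()
        if not text or self._target is None:
            return
        if self._target == "title":
            self.title += text
        elif self._target == "intro":
            self.intro += text
        elif self._target == "cell" and self._row is not None:
            self._row.append(text)

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        if tag in ("h1", "p", "td", "th"):
            self._target = None


extractor = PageExtractor()
extractor.feed(SAMPLE_HTML)
print(extractor.title)
print(extractor.intro)
for row in extractor.rows:
    print(row)
```

The first table row holds the header cells, so a cleaning step would typically separate it from the data rows and convert the population strings to integers before analysis.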

Processing Webpage Content


    Students will learn:

    Setting up the project
    Tokenizing text with Python
    Counting word frequency using frequency distribution functions
    Using NLTK tokenization to tokenize sentences and words
    Using the WordNet database to find synonyms and antonyms
    Performing stemming to clean the words in the dataset
    Performing lemmatization to clean the words in the dataset
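
As a rough illustration of the tokenizing and word-frequency topics above, the sketch below uses only Python's standard library. It is a stand-in, not the NLTK workflow the session covers: the regular expressions are simplified assumptions, collections.Counter plays the role of a frequency distribution, and the sample text and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for content scraped from a web page.
TEXT = "The cat sat on the mat. The dog sat too."


def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (a simplification)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize_words(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


sentences = split_sentences(TEXT)
tokens = tokenize_words(TEXT)
frequencies = Counter(tokens)  # maps each word to its number of occurrences

print(sentences)
print(tokens)
print(frequencies.most_common(2))
```

Lowercasing before counting means "The" and "the" are treated as the same word. Stemming and lemmatization go further by also merging inflected forms of a word.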