Day 3 AM Session Plan
This session introduces students to natural language processing with web content.
Session Introduction
TEACHER LED
What Are We Doing?
Getting Data from Websites
Identifying HTML Elements
Tokenizing and Processing Text Content
Web Scraping
SELF-PACED
- Students can perform additional cleaning actions on the dataset so that it can be used properly for data analysis.
Students will learn:
What web scraping is
Looking at website code
Downloading web pages using code
Extracting web page information through tags
Extracting web page information with classes and IDs
Extracting web page information from tables
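The extraction topics above can be sketched with Python's standard library alone. This is only an illustration, not the session's actual material: the plan does not say which scraping library students use, and the names `download_page`, `TextCollector`, `extract`, and `SAMPLE_HTML` are hypothetical. The sample page is an inline string, so no network access is needed.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag; skipped when tracking nesting.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def download_page(url):
    """Download a page and return its HTML as text (not called below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class TextCollector(HTMLParser):
    """Collect the text inside elements matching a tag, class, or id."""

    def __init__(self, tag=None, css_class=None, element_id=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.element_id = element_id
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        if self.tag and tag != self.tag:
            return False
        classes = (attrs.get("class") or "").split()
        if self.css_class and self.css_class not in classes:
            return False
        if self.element_id and attrs.get("id") != self.element_id:
            return False
        return True

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a match
        elif self._matches(tag, attrs):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(html, **criteria):
    """Return the stripped text of every element matching the criteria."""
    parser = TextCollector(**criteria)
    parser.feed(html)
    return [text.strip() for text in parser.results]


# Hypothetical sample page, assumed to be well-formed HTML.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Fruit Prices</h1>
  <p class="intro">Prices are per kilogram.</p>
  <p>Updated weekly.</p>
  <table>
    <tr><th>Fruit</th><th>Price</th></tr>
    <tr><td>Apple</td><td>3.50</td></tr>
    <tr><td>Banana</td><td>1.20</td></tr>
  </table>
</body></html>
"""

paragraphs = extract(SAMPLE_HTML, tag="p")           # by tag
intro = extract(SAMPLE_HTML, css_class="intro")      # by class
title = extract(SAMPLE_HTML, element_id="title")     # by ID
cells = extract(SAMPLE_HTML, tag="td")               # table cells
rows = [cells[i:i + 2] for i in range(0, len(cells), 2)]  # two columns per row
```

To try this on a live page, pass the text returned by `download_page` to `extract` in place of `SAMPLE_HTML`. A dedicated scraping library would likely handle malformed HTML more gracefully than this sketch.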
Processing Webpage Content
SELF-PACED
Students will learn:
Setting up the project
Tokenizing text with Python
Counting word frequency using frequency distribution functions
Using NLTK tokenization to tokenize sentences and words
Using the WordNet database to find synonyms and antonyms
Performing stemming to clean the words in the dataset
Performing lemmatization to clean the words in the dataset
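The tokenizing and frequency-counting topics above can be previewed with a minimal sketch that uses only the standard library. It is a simplified stand-in, not the session's NLTK code: the regular-expression rules are naive assumptions, and the names `TEXT`, `split_sentences`, and `split_words` are hypothetical. NLTK's `sent_tokenize`, `word_tokenize`, and `FreqDist` cover the same steps and handle many more edge cases.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for scraped page content.
TEXT = (
    "Web scraping collects text from web pages. "
    "The text is then split into tokens. "
    "Tokens are counted to find frequent words."
)


def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentences = split_sentences(TEXT)
words = split_words(TEXT)

# Counter maps each word to the number of times it appears.
frequencies = Counter(words)

for word, count in frequencies.most_common(3):
    print(word, count)
```

Lowercasing before counting means "Tokens" and "tokens" are treated as the same word, which is usually what a frequency table needs.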