Web Scraping

This lesson will teach you how to extract data from websites using web scraping.


This lesson uses code that accesses live data from a website. Because the website might change over time, the example results displayed in this lesson may differ slightly from the results you get when you run the code yourself.

What is Web Scraping?



    In machine learning, there will be times when you cannot find a dataset with the data that you need. In this case, you will need to create your own datasets from information you find online.

    However, manually researching and typing all the data into a CSV file would be time-consuming and prone to mistakes, especially if you want to create a dataset with thousands of entries.

    For example, Wikipedia has a great dataset about cats. However, it's not in a CSV format, so we can't just feed this webpage into our program directly if we want a dataset about cats.



    The easiest way to get this data efficiently is through APIs (application programming interfaces). Large websites like Google, Facebook, and Twitter provide APIs to access their data in a more structured way.

    Unfortunately, not all websites have an API. This is where web scraping comes in.

    Web Scraping is a computer software technique that allows us to extract information from websites. It mostly involves transforming unstructured data from the web into structured data in the form of databases or spreadsheets.

    As one of the leading languages for data science, Python has many libraries and frameworks for web scraping. We will be using two popular modules:

    Requests - an HTTP library that allows your program to access a website and its data. It is very easy to use while still offering a lot of capability.

    BeautifulSoup 4 - a parsing library that can use different parsers. A parser is a program that extracts data from HTML and XML documents (webpages). It allows us to get the specific parts of a webpage that we want, such as table data or links.
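
    If these modules are not already installed (the Anaconda distribution that includes Spyder typically ships with all three), they can usually be installed from the command line with pip. Note that BeautifulSoup 4 is imported as bs4 but installed under the package name beautifulsoup4:

    pip install requests beautifulsoup4 pandas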

Understanding Websites



    When performing web scraping, we need to know what we want to extract from a page. This means that we need to know how websites are written.

    Websites are written in HTML (Hypertext Markup Language) which is what describes the structure of a webpage. Even this page is written in HTML!

    Tags are what HTML uses to describe and structure the components of a page. Every tag has a tag name surrounded by a less-than symbol and a greater-than symbol, in the form <tag-name>.

    Many times an opening tag <tag-name> will be paired with a closing tag </tag-name>. (The difference is that the closing tag includes a forward slash.) These tags will usually have text or more tags in between them, e.g. <title>My Webpage</title>.

    Here is what a basic HTML document looks like:

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="utf-8">
        <title>My Webpage</title>
      </head>
      <body>
    
        <h1>Hello</h1>
    
        <p>This is a webpage written in HTML</p>
    
        <a href="http://google.com">A link to Google</a>
    
      </body>
    </html>
    

    We can actually write this code in a text file and save it as a .html file. Then we can open the document as a webpage. This is how this document would be displayed:

    [Image: the rendered webpage, showing the Hello header, the paragraph, and the link]

    As you can see, we have a simple webpage that only has the text we wrote, and the tags structured the text accordingly.

    Each part of this code plays a role in displaying the text on the webpage.

    1. The <!DOCTYPE html> declaration specifies that this is an HTML document.

    2. The <body> tags contain the visible part of the HTML document.

    3. The <title> tags set the text My Webpage in the browser tab.

    4. The <h1> tags created a header with the text Hello.

    5. The <p> tags created a paragraph with the text This is a webpage written in HTML.

    6. The <a> tags created a link to google.com with the text A link to Google. The href attribute in the <a> tag contains the URL that the link should take you to.


    You can learn more about HTML here.

    Now that you understand HTML, we can use tags to specify what we want BeautifulSoup to extract from a website.

Downloading Webpages



    Let's use Requests and BeautifulSoup to scrape data from a website. We will be using this Wikipedia page about cat breeds.

    Our final goal is to extract the list of cat breeds as well as details about each breed such as country, origin, pattern, etc.

    First create a new project in Spyder and save it as WebScraping.

    Then, create a new file inside the project and save it as cats.py. You can also delete the text that's already in the file.

    [Image: the new cats.py file inside the WebScraping project in Spyder]

    First, let's import our modules:

    Requests

    BeautifulSoup 4

    Pandas


    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
          

    Now, let's specify the URL that we want requests to open. The URL is just the address of the webpage.

    We can use the requests.get() function. This function takes in a URL as an argument, downloads the page, and returns a Response object.

    The Response object stores all the information about the webpage. We will call the object page.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    cats_wiki = 'https://en.wikipedia.org/wiki/List_of_cat_breeds'
    page = requests.get(cats_wiki)
          

    We can check if the page downloaded successfully by printing the status_code attribute of the Response object.

    cats_wiki = 'https://en.wikipedia.org/wiki/List_of_cat_breeds'
    page = requests.get(cats_wiki)
    
    print(page.status_code)
          

    If it returns a code that begins with 2 (e.g. 200), then the download was a success. Codes that begin with 4 or 5 indicate that an error occurred.

    You can also just use print(page) which will show the status code.

    [Image: console output showing a successful status code]
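
    As an optional aside (we will not need it in this lesson), requests Response objects also have a raise_for_status() method, which makes the program raise an exception automatically when a download fails:

    # Raises an HTTPError if the status code is 4xx or 5xx; does nothing on success
    page.raise_for_status()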

    Success! We now have all the contents of the wiki page stored in this object. Let's print the actual HTML text of the page.

    Before we print, we need to change the preferences for the IPython console in Spyder.

    By default, the console will only print a maximum of 500 lines. However, most webpages are thousands of lines long. This Wikipedia page is over 8000 lines, so we will change the settings to print a maximum of 10000 lines.

    To do this, click on Tools then Preferences. Then in the preferences window, go to the IPython console settings and under Source code increase the Buffer value to 10000 (ten thousand).



    Now restart Spyder so the changes take effect.

    When you close and reopen Spyder, it should open up your most recent project. If it does not reopen the project, go to Project -> Recent Projects and select your project to open it.

    Now we can print the HTML to the console using the text attribute of the Response object.

    cats_wiki = 'https://en.wikipedia.org/wiki/List_of_cat_breeds'
    page = requests.get(cats_wiki)
    
    print(page.text)
          

    [Image: the raw HTML of the wiki page printed to the console]

    Here you can see the full text of the page, with all of its HTML tags. However, it is hard to read because there is no spacing like we saw with our example HTML document.

    This is where BeautifulSoup comes in to make the text more human-friendly.

    First, let's clear the console of all this text. Remember that you can do this by right-clicking in the IPython console pane and selecting Clear console. Alternatively, you can click in the console pane and press CTRL + L on your keyboard.



    Now, let's use BeautifulSoup to parse the HTML and store it in BeautifulSoup format.

    cats_wiki = 'https://en.wikipedia.org/wiki/List_of_cat_breeds'
    page = requests.get(cats_wiki)
    
    soup = BeautifulSoup(page.text, 'html.parser')
          

    This creates a BeautifulSoup object (which we call soup) from the page text using Python's built-in html.parser.

    Now we have all the HTML contents of the wiki page in a nice, readable format. Let's take a look at the newly formatted HTML of the page using soup.prettify().

    soup = BeautifulSoup(page.text, 'html.parser')
    
    print(soup.prettify())
          

    As you can see, the new format is much easier to read. There is only one tag per line and the tags are nested nicely.

    Clear your console again and we will start getting some specific data from this text.

Finding Tags



    We can extract a single type of tag from a page by using BeautifulSoup's find() function. This will return the first element of a certain tag within the page.

    Let's get the first <a> tag from the page and store it in a variable called link. We can then print it to the screen.
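
    In code, that looks like this (continuing from the soup object we created earlier):

    link = soup.find('a')
    print(link)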


    [Image: the first <a> tag printed to the console]

    We can also use the find_all() function. This will return a list of all elements with a certain tag within the page.

    Let's get all the <a> tags from the page and store them in a list called links.

    We can then print the length of the list with the len() function.
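
    In code:

    links = soup.find_all('a')
    print(len(links))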


    [Image: console output showing the number of links on the page]

    You might get a slightly different number than the one displayed here. This is because updates to the page may change the number of links, but there should be around 838 links on the page.

    Print index 33 from the links list, which will be the 34th link on the page.

    soup = BeautifulSoup(page.text, 'html.parser')
    
    links = soup.find_all('a')
    print(links[33])
          

    [Image: the link element for the Abyssinian breed printed to the console]

    This is an example of a result that you might get back. If the page has changed, your exact result may differ slightly, but it will still be a link. In this example, the link has the text Abyssinian, and the href attribute specifies that the link redirects to another wiki page with the URL /wiki/Abyssinian_cat.

    This link example is the first link in the table of cat breeds.

    [Image: the Abyssinian link in the cat breeds table on the wiki page]

    Another thing we can do is get the specific URL inside the <a> tag. We can do this using the get() function, which returns the contents of a specific attribute within a tag.

    soup = BeautifulSoup(page.text, 'html.parser')
    
    links = soup.find_all('a')
    print(links[33].get('href'))
          

    [Image: the URL /wiki/Abyssinian_cat printed to the console]

    As you can see, we now get only the contents of the href attribute (the URL of the link), which is useful if that is the data we want for our dataset.
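
    As a related aside (not needed for the rest of the lesson), if you want the visible text of a link rather than its URL, tags also have a get_text() method:

    # Prints the text between the tags, e.g. 'Abyssinian' in the example above
    print(links[33].get_text())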

Classes and IDs



    Another way we can extract tags is using classes and IDs.

    Classes and IDs are used in HTML to modify tags, for example by changing the look of a tag's contents using CSS (Cascading Style Sheets).

    Classes are used to identify a group of tags while IDs are used to identify one specific instance of a tag.

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="utf-8">
        <title>My Webpage</title>
      </head>
      <body>
    
        <h1 class="big-title" id="main-title">Hello</h1>
    
        <p class="information">This is a webpage written in HTML</p>
        <p class="information">Our paragraphs have classes</p>
    
        <h1 class="big-title" id="second-title">Links</h1>
    
        <a href="http://google.com" class="extra-info">A link to Google</a>
        <a href="http://youtube.com" class="extra-info">A link to YouTube</a>
    
      </body>
    </html>
        

    As you can see in this example, the <h1>, <p>, and <a> tags all have classes. The <h1> tags also have IDs.

    Multiple tags can have the same class e.g. the <p> tags in this example are both grouped with the information class.

    However, an ID can only be assigned to one specific tag. The <h1> tags in this example both have the big-title class but have separate IDs of main-title and second-title.

    To locate all tags with a certain class, we can use the keyword argument class_ in the find_all() function. To locate a tag with a certain ID, we can use the keyword argument id in the find_all() function.

    info = soup.find_all(class_='information')
    main_title = soup.find_all(id='main-title')
    

    Here we get all the elements with the class 'information' and the element with the ID 'main-title'.

    We can also search for a class or id only within certain tags.

    info = soup.find_all('p', class_='information')
    main_title = soup.find_all('h1', id='main-title')
    

    Now that you understand IDs and classes, we can use them to extract specific tags from the webpage.

    Most of the time when we scrape websites for data to be used in machine learning, we look for tables of data. Let's find the table about cat breeds. This table has the class wikitable.

    soup = BeautifulSoup(page.text, 'html.parser')
    
    cat_table = soup.find('table', class_='wikitable')
    print(cat_table)
                

    [Image: the HTML of the cat breeds table printed to the console]

    How did we know that the table had this class?

    We can find out the tags, classes, IDs, and other information of elements by inspecting a webpage.

    This can be done by pressing F12, which will bring up the developer window. Depending on the web browser that you use, this developer window will look slightly different. In the video example below, Google Chrome is used, so you will need to use Google Chrome if you want to follow along with the video.

    If pressing F12 does not bring up the developer window, try holding down the FN key while pressing F12.

    Then, in the Elements tab, click the inspect button, which allows you to hover over elements on the page and click them so they display in the inspector.



    Some browsers also allow inspecting an element directly by right-clicking on a webpage and selecting Inspect (Google Chrome) or Inspect element (other browsers).



    Great! Now you know how to find out the information of webpage elements.

    The next step is to actually get the data from the table so we can use it.

Extracting from Tables



    We have our table of data, and now we need to extract the data from it so we can create our own DataFrame.

    To do this, we need to go through each table row (<tr> tags) and then create a list of the table data (<td> tags) from each table row.

    As you can see in the HTML code, the first table row contains 7 table headers (<th> tags).

    [Image: the HTML of the first table row, containing seven <th> tags]

    This is the top row of the table which shows the headers of each column.

    [Image: the header row of the cat breeds table]

    The second table row contains 1 table header (<th>) and 6 table data elements (<td> tags).

    [Image: the HTML of the second table row, containing one <th> tag and six <td> tags]

    This is the second row in the table, which is actually the first row with data of cat breeds. In this example, this data is for the Abyssinian breed.

    The <th> tag is the header/identifier of the row (the breed name). Each <td> tag contains data for one of the other columns of the table.

    [Image: the Abyssinian row of the cat breeds table]

    Now let's create the loop that will go through each table row and extract the data.

    First let's declare 7 lists to hold the data for each of the 7 columns.

    soup = BeautifulSoup(page.text, 'html.parser')
    
    cat_table = soup.find('table', class_='wikitable')
    
    breed = []
    country = []
    origin = []
    body_type = []
    coat_length = []
    pattern = []
    images = []
          

    Next we must create a for loop that goes through each row in cat_table.

    However, if you look at the HTML code, you will see that the contents of the table are wrapped in <tbody></tbody> tags.

    [Image: the table contents wrapped in <tbody> tags]

    Therefore we first have to get this tag element using cat_table.find('tbody').

    We can then find all the table rows to loop through by adding .find_all('tr').

    The first thing we will do in the loop is store a list of all the <td> elements of the current row in a variable called breed_info.

    breed = []
    country = []
    origin = []
    body_type = []
    coat_length = []
    pattern = []
    images = []
    
    for row in cat_table.find('tbody').find_all('tr'):
        breed_info = row.find_all('td')
          

    However, this will only get 6 elements because there are 6 <td> tags (breed information) and 1 <th> tag (breed name) for each row.

    Therefore, we also need to find the <th> tag from the row and store it in another variable called breed_name.

    for row in cat_table.find('tbody').find_all('tr'):
        breed_info = row.find_all('td')
        breed_name = row.find('th')
          

    The next step is to append the breed_name to the breed list and to append each element from breed_info to their respective lists.

    However, if you recall, the very first row of the table contains the column headers and we do not want to append these.

    We can avoid appending these by only appending to the lists if the length of breed_info is equal to 6. Since breed_info only finds <td> tags, we will know that we can append the data if there are 6 table data elements inside it.

    for row in cat_table.find('tbody').find_all('tr'):
        breed_info = row.find_all('td')
        breed_name = row.find('th')
    
        if len(breed_info) == 6:
            breed.append(breed_name.find(text = True))
            country.append(breed_info[0].find(text = True))
            origin.append(breed_info[1].find(text = True))
            body_type.append(breed_info[2].find(text = True))
            coat_length.append(breed_info[3].find(text = True))
            pattern.append(breed_info[4].find(text = True))
          


    Unfortunately, not all records have an image for the cat breed. Therefore, we first need to check whether the current row actually has an image by looking for the <img> tag.

    If the row does have an image, then we can extract the source attribute from the image. The source attribute (src) holds the actual URL to the image.

    If the row does not have an image, then we can just append the string No Image to the images list.

    for row in cat_table.find('tbody').find_all('tr'):
        breed_info = row.find_all('td')
        breed_name = row.find('th')
    
        if len(breed_info) == 6:
            breed.append(breed_name.find(text = True))
            country.append(breed_info[0].find(text = True))
            origin.append(breed_info[1].find(text = True))
            body_type.append(breed_info[2].find(text = True))
            coat_length.append(breed_info[3].find(text = True))
            pattern.append(breed_info[4].find(text = True))
    
            if breed_info[5].find('img'):
                images.append(breed_info[5].find('img').get('src'))
            else:
                images.append('No Image')
    

    Now our loop is complete! Notice that we use the breed_name to append to the breed list.

    Then, we use the 6 elements in breed_info to append to each of their respective lists. For example, breed_info[0] is the first table data element in the row, which belongs to the country column, so it gets appended to the country list.

    We also use the find(text = True) method on each element, which returns only the element's text while ignoring the tags, so only the text gets appended to each list.
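
    To see what find(text = True) does on its own, here is a small standalone sketch using a made-up snippet of HTML (the cell contents are hypothetical, not taken from the wiki page):

    from bs4 import BeautifulSoup
    
    # A made-up table cell containing a link
    cell = BeautifulSoup('<td><a href="/wiki/Burmese_cat">Burmese</a></td>',
                         'html.parser').td
    
    # find(text = True) returns the first piece of text, ignoring the tags
    print(cell.find(text = True))  # prints: Burmese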

    Now that we have extracted all the data, let's print a sample column to make sure everything is correct.

    for row in cat_table.find('tbody').find_all('tr'):
        breed_info = row.find_all('td')
        breed_name = row.find('th')
    
        if len(breed_info) == 6:
            breed.append(breed_name.find(text = True))
            country.append(breed_info[0].find(text = True))
            origin.append(breed_info[1].find(text = True))
            body_type.append(breed_info[2].find(text = True))
            coat_length.append(breed_info[3].find(text = True))
            pattern.append(breed_info[4].find(text = True))
    
            if breed_info[5].find('img'):
                images.append(breed_info[5].find('img').get('src'))
            else:
                images.append('No Image')
    
    print(breed)
          

    [Image: the breed list printed to the console]

    As you can see, the breed list has all the names of the breeds from the first column of the table.

    Let's print the length of the list to see how many records we have.

    print(len(breed))
          

    [Image: console output showing the length of the breed list]

    In this screenshot, there were 96 records. That's 96 different cat breeds! You might get a slightly different number, since the website may have changed, but it should be around 96.

    The last thing we need to do is use the extracted data to create a DataFrame.

    We can do this by creating a DataFrame object and passing in the data for each column as well as the title we want to give each column.

    We will provide this data in the form of a dictionary, where the column title is the key, and the list of data for that column is the value.

    cat_breed_df = pd.DataFrame(
        {'Breed': breed,
         'Country': country,
         'Origin': origin,
         'Body Type': body_type,
         'Coat Length': coat_length,
         'Pattern': pattern,
         'Images': images
        })
          

    Let's print the DataFrame to see what it looks like. We will also set the display.max_columns option to None so that we can see all of our columns.
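
    Here is what that looks like:

    pd.set_option('display.max_columns', None)
    
    print(cat_breed_df)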


    [Image: the completed DataFrame printed to the console]

    Fantastic! We now have our completed custom DataFrame made from data that we scraped from the Internet.

    However, extracting data is just the first step. The DataFrame still needs to be cleaned before we can use it.

    For example, the data in the Body Type column has a \n (a newline character left over from the HTML source) at the end. This should be removed from the data.

    Also, since we know that the Breed column has unique names for each row, we could use this as our index column instead of the default indexing.

    Take another look at the dataset cleaning guide to see how you can clean our new dataset.

    1. Try to remove the \n from the data in the Body Type column.

    2. Make the Breed column the index of the dataset.
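
    If you get stuck, here is one possible approach, sketched with pandas string methods (the cleaning guide may show other ways to do the same thing):

    # 1. Strip the trailing newline characters from the Body Type column
    cat_breed_df['Body Type'] = cat_breed_df['Body Type'].str.strip()
    
    # 2. Use the unique breed names as the index instead of the default numbers
    cat_breed_df = cat_breed_df.set_index('Breed')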


Activity: Challenges



    Now that you know how to get data from websites, there is a lot that you could do.

    Run Analysis on the Dataset

    The cat dataset that you have put together has all the information you need for analysis. Try running some functions to group cats by different types and characteristics.

    Dog Breed Challenge

    Wikipedia has a great table layout for all dog breeds. However, sometimes you may need to do additional cleanup on some web data to turn it into a usable DataFrame.

    The Encyclopedia Britannica Dog Breed List lists dogs, but puts them in separate tables by type.

    Can you combine all the different tables into one dataset that is easy to use?

    Go Random!

    Wikipedia has a fun tool that allows you to select a random page: Wikipedia Random Article.

    You could create your own fake article by mashing up different articles and seeing what results:
    1. Get a few random pages using Wikipedia's random page feature and save them into different variables.
    2. Get the paragraph content from those articles, and pick specific words that appear in multiple articles.
    3. Start with one article; when you reach a word that also appears in another article, switch to displaying the content from that other article.
    4. Now you have a weird mashup of articles which might seem like they have nothing to do with each other...or do they?