Dataset Cleaning

In this lesson you will clean datasets using Pandas and Numpy.

Setup



    Now that you have learned the different steps of data cleaning, let's apply them to some datasets.

    We have already done some cleaning in the 'Intro to Python Course' with Pandas, but we will do some more detailed cleaning here.

    First create a new project in Spyder and save it as DataCleaning.



    Then, create a new file inside the project and save it as cleaning.py.



    You can also delete the text that's already in the file.

    Next, create a sub-folder in DataCleaning and name it Datasets. This is where we will store our dataset files.

    You can create a new sub-folder by right-clicking the original folder > New > Folder... and then naming the folder.



    Now download the datasets that we will be using for this project and save them into the Datasets folder.

    BL-Flickr-Images-Book.csv - A CSV file containing information about books from the British Library.

    university_towns.txt - A text file containing names of college towns in every US state.

    olympics.csv - A CSV file summarizing the participation of all countries in Summer and Winter Olympics.


    In your cleaning.py file, import pandas as pd and numpy as np.

    import pandas as pd
    import numpy as np
            

    Spyder may give you warning symbols before each line number. It is just letting you know that the modules are imported but not yet used, so there is no need to worry.


    Now we're ready to start cleaning data.

Dropping Columns



    When cleaning datasets, we need to remove information that isn't useful.

    For example, in the dataset about books, there are columns such as Corporate Author, Engraver, Issuance Type, etc. which do not describe the books themselves.

    You might want to keep these columns if you need those details, but when cleaning data you should keep only the columns you intend to use for your analysis. In our case, we will remove the columns that refer to those intricate details of the book and keep only the columns useful for our needs, such as Title, Author, Publisher, etc.

    First, create a DataFrame using the CSV file 'BL-Flickr-Images-Book.csv'.

    import pandas as pd
    import numpy as np
    
    df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
            

    Here we use a relative path to find the file: we go into the Datasets folder (using / as the separator) and then name the CSV file.

    Then print df.head(n), which gets the first n rows of the dataframe.

    In this case, we will not pass in a value for n so df.head() will return the first 5 rows by default.
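
    Add the print call under the line that reads the CSV file:

    df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
    print(df.head())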


    Run the program and you should see data for the first 5 rows in the IPython Console.


    By default, Pandas will only show a few columns just in case you have a huge file.

    In our case, we want to see all of our columns. We can do this using the Pandas options attribute.

    Write this at the top of your file (but under the imports) to have no limit on the maximum number of columns to display.

    import numpy as np
    
    pd.options.display.max_columns = None
    
    df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
    print(df.head())

    Now run the program again and you should see the first five rows for every column.



    Looking at the first five entries, you can see that some columns provide extra information useful in some cases, but not very descriptive of the books themselves.

    These columns are: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type and Shelfmarks.

    Since we do not need these columns, let's drop them!

    First, create a list of the names of the columns we want to drop.

    df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
    
    to_drop = ['Edition Statement',
               'Corporate Author',
               'Corporate Contributors',
               'Former owner',
               'Engraver',
               'Contributors',
               'Issuance type',
               'Shelfmarks']
    

    Then use the drop() function on our dataframe, passing in the list of columns to drop.

    We also set the inplace parameter to True and the axis parameter to 1.

    The inplace parameter means that the dataframe will be modified when the drop action occurs. If you wanted to save your original dataframe and not change it, you could set inplace=False and then assign the output of the drop function to a new variable.

    The axis parameter determines whether the drop will be occurring on rows, or columns. If you set axis=0, it will drop rows with those label names. In our case, because we want to drop columns with these names, we use axis=1.

    This tells Pandas that we want the changes to be made directly to our dataframe and it should look for the values to be dropped in the columns (not the rows) of the dataframe.

    df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
    
    to_drop = ['Edition Statement',
               'Corporate Author',
               'Corporate Contributors',
               'Former owner',
               'Engraver',
               'Contributors',
               'Issuance type',
               'Shelfmarks']
    
    df.drop(to_drop, inplace=True, axis=1)
    
    print(df.head())
    

    Now print the df.head() again and you will see the unwanted columns are gone.


    Another way to use the drop() method is to pass in the list to the columns parameter so you don't need to specify the axis.

    df.drop(columns=to_drop, inplace=True)
    

    This syntax is more intuitive and readable in some cases, but it is good to know both methods for the data cleanup code you might read in the future.

Changing the Index



    When we search for data in a Pandas dataframe, we use indexing. Pandas indexing extends the indexing behaviour of NumPy arrays.

    Instead of just using 0-based integer indexing, it is better to use a unique field of the data as an index.

    For the books dataframe, we will want to use the Identifier column to index each entry.

    First, let's check that each value in the Identifier column is actually unique.

    print(df['Identifier'].is_unique)
    
    # Output: True
    

    Now that we know it is a unique field, you can delete this print statement.

    Let's replace the existing index with this column using the set_index() method.

    df = df.set_index('Identifier')
    

    By default, the set_index() method doesn't change the object directly; instead, it creates a modified copy. Therefore we reassign the df variable to the copy, which has the Identifier column as the index.

    Instead of reassigning the variable, we can make the function directly modify the object by setting the inplace parameter.

    df.set_index('Identifier', inplace=True)
    

    Once again, print df.head() to see the first 5 rows of the dataframe.

    df = df.set_index('Identifier')
    
    print(df.head())
    


    As you can see, the Identifier column has moved to become the new index column.

    Now we can access each record with .loc[].

    This allows us to do label-based indexing, which means selecting a row or record by its label instead of its position.

    df = df.set_index('Identifier')
    
    print(df.loc[206])
    


    In this example, 206 is the label used to access the first record.

    To access it by position instead, we can use iloc[0], which does position-based indexing.

    df = df.set_index('Identifier')
    
    print(df.iloc[0])
    

    Indexes help code to run faster over large sets of data. Right now, our datasets are fairly small. Once you try to run code over datasets with millions of rows, you will get a substantial performance improvement if you set up indexes properly.

Tidying Fields



    It's time to tidy up our columns so that they have a consistent format.

    The columns we will edit are: Date of Publication and Place of Publication.

    At the moment, all of the data types in the dataframe are the object dtype. The Pandas object dtype is roughly equivalent to str (string) in native Python.

    The object dtype is used for any field that isn't specified as numerical or categorical data.

    Since the fields were read in from the CSV file as text, they all have the object data type.

    print(df.dtypes.value_counts())
    
    # Output:
    # object     6
    

    We can start by changing the dtype of the date of publication field so that it has numeric values.

    This will allow us to do calculations with this data later.

    Let's see what this field currently looks like.

    
    print(df['Date of Publication'].head(25))
    
    

    Here we get the first 25 entries of the dataframe and only show the Date of Publication field.


    As you can see, there are a few things in the data that stop it from being purely numerical data.

    Here is how we will fix the data:

    1. Remove the extra dates in square brackets, e.g. 1879 [1878].

    2. Convert date ranges to their start date, e.g. 1839, 38-54; 1860-63.

    3. Completely remove dates we are not certain about and replace them with NumPy's NaN (Not a Number), e.g. [1897?].

    4. Convert the string NaN to NumPy's NaN value.


    We can create a regular expression to make sure we only get the first four digits for the date.

    
    regex = r'^(\d{4})'
    
    

    A regular expression (or regex) is just a sequence of characters that forms a search pattern.

    We can use the regular expression to check if a string contains the specified search pattern.

    This regex finds any four digits at the beginning of a string. You can see how it works here.

    The \d represents any digit.

    The {4} repeats this rule four times i.e. four digits.

    The caret symbol ( ^  ) matches the start of a string.

    The parentheses () signify a capturing group, which tells Pandas that we want to extract that part of the regex.

    All of this is to make sure we get the first 4 digits of a string and ignore cases where [ starts off the string.
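
    If you want to test the pattern on its own before applying it to the DataFrame, you can try it with Python's built-in re module on two sample values from the column (a quick standalone sketch, not part of the cleaning script):

    import re

    # A leading four-digit year is captured by the group...
    print(re.match(r'^(\d{4})', '1879 [1878]').group(1))  # 1879

    # ...but a value starting with '[' does not match at all
    print(re.match(r'^(\d{4})', '[1897?]'))  # None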

    You can learn more about regular expressions here.


    Now we can use the str.extract() method to extract the data that matches the regular expression.

    regex = r'^(\d{4})'
    
    extr = df['Date of Publication'].str.extract(regex, expand=False)
    print(extr.head(25))
    
    

    Note that we can also just write the regular expression directly into the method without needing a variable:

    extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
    print(extr.head(25))
    
    


    As you can see, we now only have a single date for each record. Any value that does not start with a four-digit number now has a value of NaN.

    You can also see that the dtype is still object, but we can get the numerical version using pd.to_numeric().

    extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
    
    df['Date of Publication'] = pd.to_numeric(extr)
    print(df['Date of Publication'].dtype)
    
    # Output:
    # float64
        

    Keep in mind that any date that didn't start with a four-digit number now has a value of NaN.

    However, it is still better to leave these values out so that we can perform computations on the valid values, so we should check what percentage of rows are NaN.

    extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
    
    df['Date of Publication'] = pd.to_numeric(extr)
    
    print(df['Date of Publication'].isnull().sum() / len(df))
    
    # Output:
    # 0.11717147339205986
       

    Here we find the percentage of null values in the field by checking with .isnull() and counting with .sum().

    Then we divide by the total number of records using len(df).

    As you can see, about 11% of the values are missing after the extraction. Is this a high percentage of missing values? Whether this number is high or low depends on how you intend to use the data. If your major talking point for the dataset was to show that, as years went by, more books were published, other researchers might think that 11% is too high a missing-data rate to make accurate judgments.

    You could go back and try to improve the data cleaning further, attempting to capture more of the variants in dates and clean them up. However, researchers might then criticize you for making assumptions about the data that weren't accurate. You will need to decide for yourself how best to clean up your data, and be ready to explain your cleanup methods to anyone who asks about them later.

Fixing Strings



    As you saw before with df['Date of Publication'].str, we can use the .str attribute to access string operations in Pandas.

    These are similar to the native Python string operations such as .split(), .replace(), and .capitalize().
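
    For example, here is a small illustration of the .str accessor using the Title column (just to show the pattern; it assumes the Title values are plain strings, and we won't keep this change):

    # .str applies the string operation to every value in the Series
    print(df['Title'].str.lower().head())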

    To clean the Place of Publication field, we can use Pandas .str methods with NumPy's np.where function. We will use .str to check the contents of strings, and np.where to replace strings with ones that are cleaned.

    Let's take a look at the Place of Publication field.

    print(df['Place of Publication'].head(10))
            


    As you can see, similar to 'Date of Publication', we have some information included that we don't need.

    We just want to keep a single name of the location.

    Take a look at these two specific entries:

    print(df.loc[4157862])
    print(df.loc[4159587])
          


    Both books were published in the same place, but one uses hyphens in the name while the other does not.

    We can clean this column all at once using str.contains() to get a boolean mask.

    pub = df['Place of Publication']
    
    london = pub.str.contains('London')
    
    print(london[:5])
        

    Here we create a variable for the 'Place of Publication' field called pub.

    We then create a boolean Series that is True for each record in pub containing the string London.

    The print statement uses slicing to get the first 5 elements of the Series.


    Now create a similar boolean Series for the records in pub containing the word Oxford.

    pub = df['Place of Publication']
    
    london = pub.str.contains('London')
    
    oxford = pub.str.contains('Oxford')
        

    We can then use np.where() to combine the conditions.

    np.where is basically an if statement for data and it has the following syntax:

    np.where(condition, then, else)

    condition - an array-like object or a boolean mask.

    then - the value to be used if the condition evaluates to True.

    else - the value to be used if the condition evaluates to False.
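
    For instance, here is np.where() on its own with a small made-up array of years (a standalone sketch, unrelated to our dataset):

    years = np.array([1850, 1901, 1875])

    # For each element: 'old' where the condition is True, 'new' where it is False
    print(np.where(years < 1900, 'old', 'new'))

    # Output: ['old' 'new' 'old']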

    We can also nest the np.where() functions in order to check multiple conditions.

    np.where(condition1, x1,
            np.where(condition2, x2,
                np.where(condition3, x3, ...))) 
                            


    oxford = pub.str.contains('Oxford')
    
    df['Place of Publication'] = np.where(london, 'London',
                                          np.where(oxford, 'Oxford',
                                                   pub.str.replace('-', ' ')))
    

    The first use of np.where() uses the london Series to check whether the string contains the word London.

    If it does, then the string is set to London. If the condition evaluates to False, then the next use of np.where() does the same thing, but with Oxford.

    Finally, if the Oxford condition also evaluates to False, the value falls through to the last argument, which takes the original string and replaces any hyphens with a space using str.replace('-', ' ').

    This is an example of functional programming when performing data cleanup. You could perform these three actions by changing the value of the Place of Publication series three times, but in this case the code uses nested functions to accomplish the same goal.

    df['Place of Publication'] = np.where(london, 'London',
                                          np.where(oxford, 'Oxford',
                                                   pub.str.replace('-', ' ')))
    
    print(df['Place of Publication'].head())
    

    Print the head of the field to see the results.


    There are still other columns that could use cleaning, but the two we have cleaned already are a good start.

    Let's look at the first 5 entries to see how our dataset looks now.

    df['Place of Publication'] = np.where(london, 'London',
                                          np.where(oxford, 'Oxford',
                                                   pub.str.replace('-', ' ')))
    
    print(df.head())
    


Cleaning All At Once



    Sometimes we don't just want to clean columns one by one. Instead we will want to apply the same cleaning functionality to all elements in the dataset.

    This can be done using the Pandas .applymap() function.

    This function is similar to the regular Python map() function as it takes in a function and applies it to all elements in a set (in our case, a dataset).
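
    As a quick illustration (with a tiny made-up DataFrame, not one of our datasets), applymap() runs the given function on every single element:

    demo = pd.DataFrame({'a': [' x ', ' y '], 'b': [' z ', ' w ']})

    # str.strip is applied to each of the four elements, removing the spaces
    print(demo.applymap(str.strip))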

    Let's look at the university_towns.txt file. Double click on it in your Datasets folder to open it.



    As you can see, the text file lists each state name followed by the university towns in that state, in the format StateA TownA1 TownA2 ... StateB TownB1 TownB2 ..., e.g. Alabama Auburn (Auburn University) Florence (University of North Alabama) ... Alaska Fairbanks (University of Alaska Fairbanks) ...

    You may also notice that each state (e.g. Alabama, Alaska, etc.) in the file has the [edit] string at the end.

    We can use this pattern to create a list of (state, city) tuples and use that list to create a DataFrame.

    Close the text file, go back to your cleaning.py file and write the following code.

    university_towns = []
    
    # Open the text file and read each line
    with open('Datasets/university_towns.txt') as file:
        for line in file:
            if '[edit]' in line:
                # Store this state until another one is found
                state = line
            else:
                # Otherwise, we have a city so create a tuple with the stored state and this city.
                # Then store the tuple in the university_towns list
                university_towns.append((state, line))
        

    Here we read in the university_towns.txt file, and we loop through each line.

    We then create a list of tuples, each containing a state and a city.

    Print the first 5 elements of the list to see what the tuples look like.

    university_towns = []
    
    # Open the text file and read each line
    with open('Datasets/university_towns.txt') as file:
        for line in file:
            if '[edit]' in line:
                # Store this state until another one is found
                state = line
            else:
                # Otherwise, we have a city so create a tuple with the stored state and this city.
                # Then store the tuple in the university_towns list
                university_towns.append((state, line))
    
    for x in range(5):
        print(university_towns[x])
        


    We can now use this list of tuples to create a DataFrame and set the columns as State and RegionName.

    towns_df = pd.DataFrame(university_towns, columns=['State', 'RegionName'])
    
    print(towns_df.head())
        


    Now that we have a DataFrame, we can use applymap() to clean each element.

    First we have to create a function that will be applied to the elements.

    towns_df = pd.DataFrame(university_towns, columns=['State', 'RegionName'])
    
    def get_citystate(item):
        if ' (' in item:
            return item[:item.find(' (')]
        elif '[' in item:
            return item[:item.find('[')]
        else:
            return item
        

    Here we define a function called get_citystate which takes in one argument (this will be a DataFrame element).

    We then check if the item contains ' (' or contains '['. If it does, we use slicing to return only the part of the element up to the brackets.

    If it does not contain any brackets, then we just return the item as it is.
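
    You can check the function by hand with a couple of values in the same shape as the file's lines (a state line ending in [edit] and a town line with the university in parentheses):

    print(get_citystate('Alabama[edit]\n'))               # Alabama
    print(get_citystate('Auburn (Auburn University)\n'))  # Auburn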

    Now let's apply this function to the DataFrame with applymap()

    def get_citystate(item):
        if ' (' in item:
            return item[:item.find(' (')]
        elif '[' in item:
            return item[:item.find('[')]
        else:
            return item
    
    towns_df = towns_df.applymap(get_citystate)
    
    print(towns_df.head())
        


Renaming Columns and Skipping Rows



    Sometimes when we get a dataset, the column names are not easy to use/understand or some of the rows will have unimportant information.

    In these cases, we may want to rename columns and skip some rows.

    Let's take a look at the olympics.csv dataset as an example.

    olympics_df = pd.read_csv('Datasets/olympics.csv')
    
    print(olympics_df.head())
        


    As you can see, the above data doesn't look very clean for the following reasons:

    The column headers are string forms of numbers starting at 0.

    The row which should be the header is instead the first row in the dataset records.

    You can look at the source data here and you will notice that the row which should be the header has bad values.

    For example NaN should really be Country, ? Summer should be Summer Games and 01 ! should be Gold etc.


    We need to fix these problems by:

    1. Skipping one row and setting the header to the row below it.

    2. Renaming the columns.


    We can skip rows and set the header when reading the CSV file by setting the header parameter.

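    Passing header=1 tells Pandas to use the row at index 1 of the file as the column headers (this is the same call we will reuse below):

    olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)

    print(olympics_df.head())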


    This removes the first row and sets the header row to index 1.


    Note that Pandas has changed the name of the column containing the countries from NaN to Unnamed: 0.

    Next we need to rename the columns. We can do this using the rename() method.

    We can use this to relabel an axis by mapping the original values to new values. We will use a dictionary for this.

    olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)
    
    new_names =  {'Unnamed: 0': 'Country',
                 '? Summer': 'Summer Olympics',
                 '01 !': 'Gold',
                 '02 !': 'Silver',
                 '03 !': 'Bronze',
                 '? Winter': 'Winter Olympics',
                 '01 !.1': 'Gold.1',
                 '02 !.1': 'Silver.1',
                 '03 !.1': 'Bronze.1',
                 '? Games': '# Games',
                 '01 !.2': 'Gold.2',
                 '02 !.2': 'Silver.2',
                 '03 !.2': 'Bronze.2'}
    

    Here we create a dictionary called new_names.

    Each item in the dictionary has a key and a value: the key is the current column name and the value is the new name.

    Now we can call the rename() function on the dataframe using the dictionary.

    olympics_df.rename(columns=new_names, inplace=True)
    
    print(olympics_df.head())
    

    As mentioned before, we set the inplace parameter to True so that our changes are made directly to the dataframe object.


    Perfect! Our dataset now has good, understandable headers for each column.

Activity: Data Cleaning



    In the last lesson, there was an activity to find other interesting datasets online.

    Based on what you learned in this lesson, download one of the datasets you found and try to clean up the data using the techniques you've learned.

    Think about the following questions when cleaning up the data:
    1. What do you want to learn from the dataset?
    2. What columns will you need to include? What columns can you remove?
    3. Do all the fields have valid data? Do any rows with incomplete data need to be removed?

    As you write your cleanup code, make sure to add comments to the code to explain why you are performing specific cleanup actions. If you are trying to use your code to persuade someone, they might want to look at how you cleaned up the data before you presented your ideas to them.