Data Collection and Treatment

This lesson will teach you how to collect and clean data to prepare it for machine learning.

What is Data Treatment?

    As you know, machine learning requires the use of lots of data. When we first collect data, we call it raw data. Raw data can have missing values, inconsistent formatting, malformed records and other problems, so we need to turn it into clean data before we can use it.

    The process of converting raw data to clean data is called dataset treatment (also known as dataset cleaning or data pre-processing).

    There are many different techniques for dataset treatment, but we will look at the most common steps.

Step 1: Collecting Data

    The most logical first step is to actually collect the raw data. What data you choose to use will depend on the problem you're trying to solve.

    Selecting good data is important: data that is already in good shape takes much less time to clean.

    There are many good resources online for collecting public and free datasets on almost any topic you can think of.

    Here are some good websites for finding datasets:
    1. Kaggle - The world's largest data science community, which hosts many useful resources, including a large and detailed dataset repository.
    2. Google Dataset Search - This doesn't host its own datasets, but it is a great resource for finding datasets from other websites.
    3. UCI Machine Learning Repository - This is another website that maintains over 490 datasets.

Step 2: Data Profiling

    Once you have collected the data, you need to check its condition by looking for patterns, outliers and exceptions, as well as incorrect, inconsistent, missing or skewed information.

    The process of checking and identifying these conditions is called data profiling.

    This is necessary because machine learning is only as good as the data that is provided. If the data provided to a machine learning model is bad, then the results will also be bad.

    Python and Pandas provide many functions for profiling data, which we will look at later.
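    As a minimal sketch, here is how a few common Pandas calls profile a small dataset (the columns and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical raw data with some typical problems:
# inconsistent capitalization and a missing price.
df = pd.DataFrame({
    "city": ["Boston", "boston", "NYC", None],
    "price": [120.0, 95.5, None, 300.0],
})

df.info()                         # column types and non-null counts
print(df.describe())              # summary statistics for numeric columns
print(df.isna().sum())            # missing values per column
print(df["city"].value_counts())  # spot inconsistent spellings
```

    Even these four calls reveal the kinds of problems profiling looks for: missing values, skewed statistics and inconsistent entries.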

Step 3: Formatting Data

    There are many different ways of formatting data, and each dataset may have a different format. Therefore it is important to make sure the format of the data is consistent.

    E.g. data for currency may be formatted as $ or USD. State names may be spelled out in one dataset and abbreviated in others.

    If the data is not standardized into one consistent format, our programs will produce errors.
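    As a sketch, the currency and state-name examples above might be standardized with Pandas like this (the mappings and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["California", "CA", "New York", "NY"],
    "price": ["$120", "95 USD", "$300", "250 USD"],
})

# Map spelled-out state names onto one abbreviation style
df["state"] = df["state"].replace({"California": "CA", "New York": "NY"})

# Strip both currency markers so every price becomes a plain number
df["price"] = (
    df["price"]
    .str.replace("$", "", regex=False)
    .str.replace("USD", "", regex=False)
    .str.strip()
    .astype(float)
)
print(df)
```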

Step 4: Feature Engineering

    The dataset you choose may have all the information you need but not exactly how you need it.

    For example, a dataset may have a column for dates but it may be more useful for your algorithm if it had data for days of the week.

    Therefore you can use the data for dates and transform it to create another column for days of the week.
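    A sketch of that transformation with Pandas, assuming a hypothetical date column:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-01", "2024-01-06"]})
df["date"] = pd.to_datetime(df["date"])

# Derive a new day-of-week column from the existing dates
df["day_of_week"] = df["date"].dt.day_name()
print(df)
```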

    This transforming of raw data to create more useful information is called feature engineering (or feature extraction).

    Usually this involves deconstructing data into multiple parts to give a more specific relationship.

    E.g. Imagine you have a dataset with a start and end date of hotel bookings. However, you may not be able to use non-numerical data in your machine learning model.

    Instead you can create a new feature for stay duration which would be the difference between the start date and end date of the bookings.
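    A sketch of that stay-duration feature with Pandas, using hypothetical booking dates:

```python
import pandas as pd

bookings = pd.DataFrame({
    "start": ["2024-03-01", "2024-03-10"],
    "end":   ["2024-03-04", "2024-03-15"],
})
bookings["start"] = pd.to_datetime(bookings["start"])
bookings["end"] = pd.to_datetime(bookings["end"])

# Numerical feature: nights between start and end of each booking
bookings["stay_days"] = (bookings["end"] - bookings["start"]).dt.days
print(bookings)
```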

Step 5: Splitting Data

    The final step is to split the data into two sets:

    Training Set: This set of data is used to train your machine learning algorithm.

    Testing Set: This set of data is used to test your algorithm and evaluate the outcome.

    Make sure the subsets do not overlap. A good rule of thumb is to split the data so 80% goes to the training set and 20% goes to the testing set.
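    A sketch of an 80/20 split using only Pandas (scikit-learn's train_test_split does the same job; the dataset here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100), "y": range(100)})

# Shuffle first so the split is not biased by row order,
# then take the first 80% for training and the rest for testing.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
split = int(len(shuffled) * 0.8)
train = shuffled.iloc[:split]
test = shuffled.iloc[split:]
print(len(train), len(test))  # 80 20
```

    Because the two subsets come from disjoint slices of the shuffled frame, they cannot overlap.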

Activity: Explore Data

    Now that you understand the process of collecting data and how to treat datasets, it's time to explore.

    Go to any of the recommended websites and have a look at the different datasets available.

    When looking at the datasets, try to think about what you could use it for and how machine learning could be applied with the data.
    If you can't find any that interest you on the websites, do your own research to search for fascinating datasets.

    The Internet is a huge database of all the data you could possibly want, so you are sure to find something that excites you, whether it's about video games, art, food or anything else.