Working with Files

This lesson will teach you how to read from and write to files in Python.

Reading Text Files



    Do you know what happens when you open a file, make a change, and save that file? You've undoubtedly done this hundreds of times with your computer, but you've done it from the computer's GUI.

    Data science programs require you to access files from code, and often you will need to write data to files, all without ever clicking on buttons on the screen or pressing any keys.

    This lesson shows you how to read and write files entirely from Python code, so that you can analyze their data from inside your program.

    Create a new Python file and save it as files.py.

    Then download this sample text file. Save it in the same folder as the Python file.

    Before we can read a file, we first need to open it. We can do this with the open() function.

    sample_file = open("sample.txt")
                    

    The file path in the parentheses must give the exact location and name of the file.

    If the file is in the folder your program runs from (here, the same folder as files.py), you can write just the name of the file. Otherwise, you have to write the full path.
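
    As a minimal sketch of the difference between a bare file name and a full path, the snippet below builds both with the standard pathlib module. The folder names data and lessons are hypothetical and only illustrate the idea. Nothing is opened, so the file does not need to exist.

```python
from pathlib import Path

# A bare file name is a relative path: Python looks for it in the
# folder the program is run from (the current working directory).
relative_path = Path("sample.txt")

# Hypothetical subfolders, used only for illustration.
nested_path = Path("data") / "lessons" / "sample.txt"

# Joining onto the working directory gives a full (absolute) path.
full_path = Path.cwd() / nested_path

print(relative_path.is_absolute())  # False
print(full_path.is_absolute())      # True
print(full_path.name)               # sample.txt
```

    Either kind of path can be passed to open(). The full path keeps working even when the program is started from a different folder.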

    If you run the code and do not see an error then the file was successfully opened.

    Now let's try and print the contents of the file.

    sample_file = open("sample.txt")
    
    print(sample_file)
    
    # Output:
    # <_io.TextIOWrapper name='sample.txt' mode='r' encoding='cp1252'>
    # (the encoding shown depends on your operating system)
                    

    The output shows that sample_file is a wrapper object around the sample.txt file, opened in read-only mode (mode='r'), which means you can't make any changes to the file. Printing the variable doesn't show the text inside the file. It shows information about the file object instead.

    If you give the wrong file path or file name then you will likely get the following error:

    sample_file505 = open("sample505.txt")
    
    print(sample_file505)
    
    # Output:
    # FileNotFoundError: [Errno 2] No such file or directory: 'sample505.txt'
                    

    Whenever you get Errno 2, either the file doesn't exist or you gave the wrong file path to the open() function.
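
    If you would rather handle this error than let the program stop, one possible approach is to catch it with try and except. The sketch below reuses the missing file name from the example above. The wording of the message is only an illustration.

```python
error_number = None
message = ""

try:
    # "sample505.txt" is the missing file from the example above.
    sample_file505 = open("sample505.txt")
except FileNotFoundError as error:
    # errno is 2 for "No such file or directory".
    error_number = error.errno
    message = f"Could not open {error.filename!r}: check the name and path."
else:
    # Only reached if the file does exist.
    sample_file505.close()

print(message)
```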

    To read the contents of the file rather than print the wrapper, we use the read() method.

    sample_file = open("sample.txt")
    
    print(sample_file.read())
    
    # Output:
    # Welcome to Natural Language Processing
    # It is one of the most exciting research areas as of today
    # We will see how Python can be used to work with text files.
                    

    If you call read() a second time on the same file object, nothing will be printed to the shell.

    This is because read() moves the cursor to the end of the text. When reading files, the cursor marks the position up to which the Python program has already read. Once the cursor is at the end, there is no more text to return, so a second read() gives back an empty string.

    This behavior is useful in data science when you want to read all the current values in a file, wait until more values are added, and then read again. Because the cursor stays where the last read ended, the next read() returns only the newly added text instead of starting over from the beginning.

    To read a file again from the beginning, we can call the seek() method and pass 0 as the argument.

    This will move the cursor back to the start of the text file.

    sample_file = open("sample.txt")
    
    print(sample_file.read())
    
    sample_file.seek(0)
    
    print(sample_file.read())
                    

    In the shell, you will see that the contents of the text file print twice.
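
    To make the cursor visible, the following sketch uses the tell() method, which reports the current position. It writes its own small throwaway file (cursor_demo.txt, a hypothetical name) so that it runs without sample.txt.

```python
import tempfile
from pathlib import Path

# Create a small throwaway file so the sketch runs on its own.
path = Path(tempfile.mkdtemp()) / "cursor_demo.txt"
path.write_text("first line\nsecond line", encoding="utf-8")

demo_file = open(path, encoding="utf-8")
start = demo_file.tell()        # 0: the cursor starts at the beginning
first_read = demo_file.read()   # reads everything; cursor is now at the end
second_read = demo_file.read()  # nothing left, so this is an empty string
demo_file.seek(0)               # move the cursor back to the start
third_read = demo_file.read()   # the full text again
demo_file.close()

print(start)                     # 0
print(repr(second_read))         # ''
print(third_read == first_read)  # True
```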

    Once you are done working with a file, it is important to close the file so that other applications can access the file.

    We can do this by using the close() method.

    sample_file = open("sample.txt")
    
    print(sample_file.read())
    
    sample_file.seek(0)
    
    print(sample_file.read())
    
    sample_file.close()
                    

    If you forget to close the file after you're finished reading it, other programs may not be able to access it. In addition, every open file holds on to system resources such as a file handle and buffers. It may not seem like a big deal with a 3-line text file, but it starts to matter when your program works with many files or with large data files.
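
    One way to make sure close() always runs, even if reading fails, is a try and finally block. The sketch below is only an illustration. It creates its own throwaway file (close_demo.txt, a hypothetical name) and then checks the closed attribute of the file object.

```python
import tempfile
from pathlib import Path

# Create a small throwaway file so the sketch runs on its own.
path = Path(tempfile.mkdtemp()) / "close_demo.txt"
path.write_text("some text", encoding="utf-8")

demo_file = open(path, encoding="utf-8")
try:
    text = demo_file.read()
finally:
    # Runs whether or not read() raised an error.
    demo_file.close()

was_closed = demo_file.closed  # True once close() has been called

# Reading from a closed file raises a ValueError.
read_error = ""
try:
    demo_file.read()
except ValueError as error:
    read_error = str(error)

print(was_closed)
print(read_error)
```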

Line by Line



    Instead of reading all contents of the file at once, we can also read the file contents line by line.

    We can do this using the readlines() method, which returns a list with one item for each line in the text file.

    Each line may end with \n. This is the newline character, which just tells a computer to start a new line.

    sample_file = open("sample.txt")
    
    print(sample_file.readlines())
    
    # Output:
    # ['Welcome to Natural Language Processing\n',
    # 'It is one of the most exciting research areas as of today\n',
    # 'We will see how Python can be used to work with text files.']
                      

    This can make the text easier to use, especially when it comes to natural language processing.
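
    In practice you will often want to drop the trailing \n before working with each line. The sketch below shows one way to do that with rstrip(). It uses a list literal that stands in for the result of readlines(), so it runs without the file.

```python
# Stand-in for the list returned by sample_file.readlines().
lines = [
    "Welcome to Natural Language Processing\n",
    "It is one of the most exciting research areas as of today\n",
    "We will see how Python can be used to work with text files.",
]

# rstrip("\n") removes a trailing newline if there is one.
clean_lines = [line.rstrip("\n") for line in lines]

# split() with no arguments breaks a line into words on whitespace.
word_counts = [len(line.split()) for line in clean_lines]

print(clean_lines[0])
print(word_counts)  # [5, 12, 13]
```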

    For example, we can use a for loop to print the first word in each line.

    We don't need to use readlines() here because a file object can be looped over directly, and each step of the loop gives you the next line.

    Essentially, a text file is just a list of lines and a line is just a list of words.

    sample_file = open("sample.txt")
    
    for line in sample_file:
        print(line.split()[0])
    
    # Output:
    # Welcome
    # It
    # We
                      

    First we use a for loop to loop through each line in the file.

    Then we use the split() method to split each line into a list of words.

    Finally, [0] takes the first element (index 0) of each list. This prints the first word of each line.
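
    One caveat: if a file contains a blank line, split() returns an empty list and [0] raises an IndexError. A possible guard is sketched below, again using a list of strings (made up for this example) in place of the file object.

```python
# Made-up lines, including a blank one, standing in for a file object.
lines = ["Welcome to NLP\n", "\n", "It is exciting\n"]

first_words = []
for line in lines:
    words = line.split()
    if words:  # skip lines that contain no words at all
        first_words.append(words[0])

print(first_words)  # ['Welcome', 'It']
```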

Writing to Text Files



    To write to a text file, we open it with the mode set to w or w+.

    w opens a file in write-only mode and w+ opens a file in both read and write mode.

    Using these modes will also create a file if it doesn't already exist.

    If you use these modes to open a file that already has text, all the existing content will be removed.

    sample_file = open("sample.txt", 'w+')
    
    print(sample_file.read())
    
    sample_file.write("This file has been rewritten.")
    
    sample_file.seek(0)
    
    print(sample_file.read())
    
    sample_file.close()
                      

    Running the code above will open the file and print its contents. The output is blank because opening a file in w+ mode erases its existing content.

    We then write to the file using the write() method.

    We then use seek(0) to return to the beginning, then read the file again to see the new text.
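
    If you want to see the create-and-erase behavior without touching sample.txt, the following sketch does the same thing in a temporary folder. The file name write_demo.txt is hypothetical.

```python
import tempfile
from pathlib import Path

# A path in a fresh temporary folder; the file does not exist yet.
path = Path(tempfile.mkdtemp()) / "write_demo.txt"
existed_before = path.exists()  # False

# Opening in w mode creates the file.
first = open(path, "w", encoding="utf-8")
first.write("original text")
first.close()

# Opening in w mode again erases what was there.
second = open(path, "w", encoding="utf-8")
second.write("new text")
second.close()

final_text = path.read_text(encoding="utf-8")
print(existed_before)  # False
print(final_text)      # new text
```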

    Often we don't want to replace a file's content but add to it instead. We can do this using a+ mode, which means append and read.

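    sample_file = open("sample.txt", 'a+')

    # a+ mode opens the file with the cursor at the end,
    # so move to the start before reading.
    sample_file.seek(0)

    print(sample_file.read())

    sample_file.write("\nThis file has been appended to.")

    sample_file.seek(0)

    print(sample_file.read())

    sample_file.close()


    This code first moves the cursor to the start of the file and prints the existing contents. The seek(0) call is needed because a+ mode opens the file with the cursor at the end, so calling read() right away would print nothing.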

    We then append more content to the file. \n tells the computer that this should be on a new line.

    We then seek() to the start of the file and read it again to see the original contents and the appended contents.

    Remember to close the file afterwards so other programs can use it.
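
    A detail worth checking for yourself is where the cursor sits when a file is opened in a+ mode. The sketch below works on a throwaway file (append_demo.txt, a hypothetical name) and shows that reading immediately after opening returns an empty string, because the cursor starts at the end.

```python
import tempfile
from pathlib import Path

# Create a throwaway file with some existing content.
path = Path(tempfile.mkdtemp()) / "append_demo.txt"
path.write_text("first entry", encoding="utf-8")

log_file = open(path, "a+", encoding="utf-8")
read_on_open = log_file.read()    # '' because the cursor starts at the end
log_file.write("\nsecond entry")  # added after the existing content
log_file.seek(0)                  # go back to the start to read everything
full_text = log_file.read()
log_file.close()

print(repr(read_on_open))  # ''
print(full_text)
```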

    We can also tell Python to close a file automatically by using the with statement.

    with open("sample.txt") as sample_file:
        print(sample_file.read())
                  

    Everything indented under with will be executed, then the file will automatically be closed.
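
    The with statement works for writing as well as reading. As an illustration, the sketch below writes two lines to a throwaway file (with_demo.txt, a hypothetical name), confirms that the file was closed when the block ended, and then reads the lines back.

```python
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "with_demo.txt"

with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("line one\n")
    out_file.write("line two\n")

# The file was closed automatically when the with block ended.
closed_after = out_file.closed

with open(path, encoding="utf-8") as in_file:
    lines = in_file.readlines()

print(closed_after)  # True
print(lines)         # ['line one\n', 'line two\n']
```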

Practice



    Now that you know how to open, read, and write to files, go back to the dataset that you found in the Dataset Analysis lesson. Download the dataset file to your computer.

    Note: You can read and write to any file type with Python, but .csv files and .txt files will be easier to read than .json files.

    Try to write a program to open the dataset and do some basic analysis. For example, you can count how many lines there are, or write your own data into the file.

    You can also try to count how often certain words appear, or break each line down into parts based on its columns.
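
    As a starting point, here is a rough sketch of counting lines and splitting each line into columns. It uses the standard csv and collections modules. The dataset, its column names, and the file name dataset.csv are all made up for this example, so adapt them to your own file.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

# Made-up dataset written to a throwaway file so the sketch runs on its own.
path = Path(tempfile.mkdtemp()) / "dataset.csv"
path.write_text(
    "name,city\nAda,London\nGrace,New York\nAlan,London\n",
    encoding="utf-8",
)

# csv.reader splits each line into a list of column values.
with open(path, newline="", encoding="utf-8") as data_file:
    rows = list(csv.reader(data_file))

header, records = rows[0], rows[1:]
line_count = len(rows)

# Count how often each value appears in the second column.
city_counts = Counter(record[1] for record in records)

print(line_count)             # 4
print(header)                 # ['name', 'city']
print(city_counts["London"])  # 2
```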

    When attempting to read from a file, you may run into a UnicodeDecodeError on your console. The error occurs because some bytes in the file cannot be decoded with the encoding that Python is using to read it.

    An encoding represents characters of human language (e.g., 'a', '!', '7') in digital form (e.g., binary). Not all encodings cover every character encountered in a file. So, a UnicodeDecodeError lets us know when the default encoding used by the open() function cannot translate the entire file to human language from machine language.

    You can specify the encoding of the input file when you first open it with the following code:

    data_file = open("filename.csv", "r", encoding="utf-8")

    Often, setting the encoding to UTF-8 allows the program to read the file successfully. However, other popular file encodings include Latin-1 and ASCII. You may need to try different encoding arguments to find the correct one for your file.
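
    Trying encodings one after another can be automated. The helper below, read_with_fallback, is a hypothetical function written for this lesson and not part of Python. It is a sketch that assumes the candidate encodings are listed in order of preference. Latin-1 goes last because it can decode any byte, so it never raises an error, even when it is not the correct encoding.

```python
import tempfile
from pathlib import Path


def read_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes the file."""
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as data_file:
                return data_file.read(), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"None of {encodings} could decode {path}")


# Throwaway file saved as Latin-1; the byte for "é" is not valid UTF-8 here.
path = Path(tempfile.mkdtemp()) / "encoded_demo.txt"
path.write_bytes("café".encode("latin-1"))

text, used_encoding = read_with_fallback(path)
print(text, used_encoding)
```

    A successful decode does not prove the encoding is the right one, so it is still worth checking that the text looks correct.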