Getting Started with Web Scraping

The internet is a gold mine of information, but how can we start extracting some of that data?

Fortunately, there are a number of tools within the Python ecosystem that we can leverage to achieve our data collection goals. In this blog post, you will learn about the different methods that can be used to extract data from online sources, a little about how the internet works, how HTML documents are laid out, and the power of browser automation. Once you've read through this article, you will be armed with the basic knowledge and tools to go out and set up any data collection pipeline you can think of!


This post assumes some basic knowledge of the Python programming language, as well as an understanding of file formats and a fundamental command of web technology. I reference and link to resources for learning more about specific topics as I go along.

Extracting with the requests library

The easiest way to get started with data extraction from the web is to use the requests library. This simple package allows the user to make HTTP requests to any resource on the internet and to do so in a Pythonic way.

While its power lies in its simplicity, it also comes complete with many useful features, such as support for query strings, cookies and much more. Have a look at the documentation for a more detailed explanation, or check the course resources.
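
For instance, query-string parameters and custom headers can be passed straight to requests.get. A minimal sketch (the URL, parameters and user-agent string below are placeholders for illustration only):

import requests

# query-string parameters are appended to the URL as ?q=web+scraping&page=1
response = requests.get(
    'https://example.com/search',               # placeholder URL
    params={'q': 'web scraping', 'page': 1},    # becomes the query string
    headers={'User-Agent': 'my-scraper/0.1'},   # a polite, identifiable user agent
)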


*The Anatomy of a Request:* As a quick refresher, a request is made by a client (in this case, us) to a named host that lives on a server (in this case, the website). The URL maps the request to the correct resource. For more information, check out MDN's explanation.
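
Python's standard library can show this mapping explicitly. As a quick illustration with urllib.parse, applied to the Wikipedia URL we'll scrape below:

from urllib.parse import urlparse

# breaking a URL down into its scheme, host and resource path
parts = urlparse('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
parts.scheme, parts.netloc, parts.path
> ('https', 'en.wikipedia.org', '/wiki/List_of_S%26P_500_companies')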


With that being said, let's have a look at what it looks like in action. As an example, let's say we're interested in scraping the list of companies that constitute the S&P 500 stock market index. Wikipedia maintains an up-to-date list on a dedicated page.

First, we're going to import the library and make a request with no additional parameters to the URL of interest:

import requests
response = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

Once that code has finished running, our response variable will contain a bunch of useful information about the result of the request. The first thing we'd want to know is whether the request was successful, which we can check by looking at the status code of the response. These are three-digit codes that are probably familiar to most internet users: 404 for a resource that wasn't found, 500 for a server error and 200 for a successful request.

# checking the status code of the response
response.status_code
> 200
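
Checking the code by hand works fine when exploring interactively, but in a script you would usually let requests flag failures for you. A minimal sketch using its raise_for_status helper:

# raises an HTTPError for 4xx/5xx responses and does nothing on success
response.raise_for_status()

# alternatively, branch on the boolean shortcut
if not response.ok:
    print('Request failed with status', response.status_code)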

Our request was successful! We might also be interested in the type of content that was returned. While HTTP headers are a whole topic unto themselves that won't be covered here, let's see what the server responded with in this case:

# checking the content-type headers
response.headers['content-type']
> 'text/html; charset=UTF-8'

Looks like we got back the HTML of the page, encoded in the UTF-8 character set. This is as standard as it gets for a normal web page. While it's useful to know the type of content that we received, what we're really interested in is *what* the content is.

For this, we're going to require a parser that can make sense of the raw code that we received.

Parsing with the BeautifulSoup library

As we saw above, the content that is contained in a request response comes in a raw textual form that is not easily amenable to analysis. To make sense of this data, we will use BeautifulSoup - a parsing library built for pulling data out of HTML and XML files. As their documentation states, it commonly saves programmers hours or days of work.


*What's in an HTML Document:* HTML documents are composed of text marked up with tags that give it semantic structure. In our analysis, we will focus on the tags that contain the information we want to extract. Since knowing HTML is important for web scraping, it's useful to refer to MDN's guidebook.
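
To make that concrete, here is a toy, made-up fragment shaped like the table we're about to scrape. Note how the tags nest, and how the id and class attributes label elements so we can find them later:

# an illustrative HTML fragment: tags nest, and attributes like id/class label them
toy_html = """
<table id="constituents" class="wikitable">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
</table>
"""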


Continuing our analysis, let's import the library and create a BeautifulSoup object from the response HTML.

from bs4 import BeautifulSoup
# parsing the raw HTML; naming the parser explicitly avoids a warning and keeps results consistent
doc = BeautifulSoup(response.text, 'html.parser')

The doc variable now contains all the relevant information about the web page in a much more accessible form. For instance, we can confirm the title of the web page simply by checking its attribute:

# checking the page title
doc.title
> <title>List of S&P 500 companies - Wikipedia</title>
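
The title attribute returns the whole <title> tag. If we just want the text inside it, every tag exposes a .text property:

# extracting only the text content of the tag
doc.title.text
> 'List of S&P 500 companies - Wikipedia'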

Moving on to the table, you might have noticed that there are actually two tables on the page. Here it's worth looking into tag attributes, specifically id and class, as these are most often used to find elements of interest. Since we're only interested in the index constituents (the first table) and not their changes (the second table), we will query the document for the table with the right identifier and assign it to a separate variable:

# finding the right table
table = doc.find('table', {'id': 'constituents'})
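
If you prefer CSS selectors, BeautifulSoup also supports them via select_one (and select for multiple matches); the query below is equivalent to the find call above:

# the same lookup expressed as a CSS selector
table = doc.select_one('table#constituents')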

Having found the table element, let's isolate the columns that are going to form the header of our table. These correspond to the <th> tags:

# finding and creating the columns of our table
columns = [x.text.strip() for x in table.find_all('th')]
columns
> ['Symbol',
 'Security',
 'SEC filings',
 'GICS Sector',
 'GICS Sub-Industry',
 'Headquarters Location',
 'Date first added',
 'CIK',
 'Founded']

These correspond to the columns we can see on Wikipedia as well. The final parsing step is to pull the values that will populate the rows of the table. Each row is represented by a <tr> tag, with the individual cell values held in <td> tags. Since the first row contains the headers, we'll skip it:

# finding and creating the rows of our table
rows = table.find_all('tr')

row_values = []
for row in rows[1:]:
    row_value = [x.text.strip() for x in row.find_all('td')]
    row_values.append(row_value)
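
Before assembling the final table, it's worth a quick sanity check that the parsing produced what we expect, for example by peeking at the row count and the first couple of values of the first row:

# quick sanity check on the parsed rows
print(len(row_values))
print(row_values[0][:2])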

Now that we have both the column names and the row values, it's time to put them together. Since pandas is the lingua franca of data analysis in Python, let's create a dataframe using information that we've just scraped:

# putting it together into a dataframe
import pandas as pd
df = pd.DataFrame(row_values, columns=columns)
df

And that gives us a nice and tidy dataframe that can be used for any number of analysis and data processing tasks that we may have.
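
From here it's a short step to persisting the result. As an illustrative example (the file name is arbitrary), we could write it out to CSV:

# saving the scraped table for later use
df.to_csv('sp500_constituents.csv', index=False)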

In this section, we've learned to traverse the elements of an HTML document to find what values we need and then stitched them together to create a usable table. Armed with this understanding, let's also consider a more direct alternative using pandas.

Simplifying with pandas

Understanding how data is structured on the internet is an integral part of the web scraping workflow. Sending a request, receiving back a raw HTML file, parsing and analyzing that file and finally extracting and wrangling the data into a usable shape constitute the essential loop of scraping.

As tooling has evolved around this workflow, new and easier ways of extracting data have been developed. Let's have a look at using the data analysis library pandas to directly obtain the data we're interested in.


Returning to our previous example of the constituents of the S&P 500 index, this task can easily be achieved by using the [pd.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) function provided by the library. I encourage you to read the linked documentation to familiarize yourself with the function and its parameters.

Putting it into practice, it would look like this:

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

Much shorter, right? Since the function returns a list of all the tables it found on the page, we simply need to pick out the one we're interested in, and the job's done. The upside here is that we get the table of changes as well, with no extra effort.

tables[0]
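
If you'd rather not rely on the table's position in the list, read_html can also narrow the search for you. A sketch using its attrs parameter to match the same id we used earlier:

# asking read_html for the table whose id is 'constituents'
constituents = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies',
    attrs={'id': 'constituents'},
)[0]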

This section has shown that tools exist, and continue to be developed, that speed up the workflow of collecting data from the internet.


In the next post in this series, we’ll look at getting data from websites that aren't as easily accessible and scrapable as a simple Wikipedia table.