Getting started with web scraping in Python

The internet is a gold mine of information, but how can we start extracting some of that data?

Fortunately, there are a number of tools within the Python ecosystem that we can leverage to achieve our data collection goals. In this blog post, you will learn about the different methods that can be used to extract data from online sources, a little bit about how the internet works, the layout of HTML documents and the power of automation in browsers. Once you've read through this article, you will be armed with the basic knowledge and tools to go out there and set up any data collection pipelines that you can think of!

This post assumes some basic knowledge about the Python programming language, as well as an understanding of file formats and a fundamental command of web technology. I make reference and link to resources to learn more about specific topics as I go along.

Extracting with the requests library

The easiest way to get started with data extraction from the web is to use the requests library. This simple package allows the user to make HTTP requests to any resource on the internet and to do so in a Pythonic way.

While its power lies in its simplicity, it is also complete with many useful features such as support for query strings, cookies and much more. Have a look at the documentation for a more detailed explanation, or check the course resources.
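
As a small taste of those extras, here's a sketch of passing query-string parameters and a custom header along with a request. The URL and the User-Agent value below are purely illustrative placeholders:

import requests

# params are encoded into the query string for us; the User-Agent string is a made-up placeholder
demo_response = requests.get(
    'https://en.wikipedia.org/w/index.php',
    params={'search': 'S&P 500'},
    headers={'User-Agent': 'my-scraper/0.1'},
)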


📖 The Anatomy of a Request: As a quick refresher, a request is made by a client (in this case, us) to a named host which is located on a server (in this case, the website). The URL maps out the request to the correct resource. For more information, check out MDN's explanation.


With that being said, let's have a look at what it looks like in action. As an example, let's say we're interested in scraping the list of companies that constitute the S&P 500 stock market index. Wikipedia maintains an up-to-date list on a dedicated page.

First, we're going to import the library and make a request with no additional parameters to the URL of interest:

import requests
response = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

Once that code has finished running, our response variable will contain a bunch of useful information about the result of the request. For example, we would first be interested in whether the request was successful. The way to check this is to look at the status code of the response. These are three-digit codes that are probably familiar to most internet users. Examples include 404 for resource not found, 500 for a server error and 200 for a successful request.

# checking the status code of the response
response.status_code
> 200
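
If you'd rather have a script fail loudly whenever a request doesn't succeed, requests also provides a convenient shortcut:

# raises requests.HTTPError if the status code is 4xx or 5xx, otherwise does nothing
response.raise_for_status()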

Our request was successful! We might also be interested in the type of content that was returned. While HTTP headers are a whole topic unto themselves that won't be covered here, let's see what the server responded with in this case:

# checking the content-type headers
response.headers['content-type']
> 'text/html; charset=UTF-8'

Looks like we got back the HTML of the page, encoded in the UTF-8 character set. This is as standard as it gets for a normal web page. While it's useful to know the type of content that we received, what we're really interested in is *what* the content is.

For this, we're going to require a parser that can make sense of the raw code that we received.

Parsing with the BeautifulSoup library

As we saw above, the content that is contained in a request response comes in a raw textual form that is not easily amenable to analysis. To make sense of this data, we will use BeautifulSoup - a parsing library built for pulling data out of HTML and XML files. As their documentation states, it commonly saves programmers hours or days of work.


💡 What's in an HTML: HTML documents are composed of text and tags, marked up so that they make semantic sense. In our analysis, we will focus on the tags that contain the information that we want to extract. Since knowing HTML is important to web scraping, it's useful to refer to MDN's guidebook.
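
To make that concrete, here's a tiny, made-up fragment being parsed - the tag names and the id attribute are purely illustrative:

from bs4 import BeautifulSoup

# a <p> tag carrying an id attribute and wrapping some text (plus a nested <b> tag)
fragment = BeautifulSoup('<p id="greeting">Hello, <b>world</b>!</p>', 'html.parser')
fragment.find('p', {'id': 'greeting'}).text
> 'Hello, world!'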


Continuing our analysis, let's import the library and create a BeautifulSoup object from the response HTML.

from bs4 import BeautifulSoup

# parse the raw HTML using Python's built-in parser
doc = BeautifulSoup(response.text, 'html.parser')

The doc variable now contains all the relevant information about the web page in a much more accessible form. For instance, we can confirm the title of the web page simply by checking its attribute:

# checking the page title
doc.title
> <title>List of S&P 500 companies - Wikipedia</title>

Moving on to the table, we might have observed that there are actually two tables on the page. Here it's worth looking into tag attributes, specifically id and class as these are most often used to find elements of interest. Since we're only interested in the index constituents (the first table) and not their changes (the second table), we will query the document for the table with the right identifier and assign it to a separate variable:

# finding the right table
table = doc.find('table', {'id': 'constituents'})
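
If the table didn't have a convenient id, we could filter on a class attribute instead - a quick sketch, assuming the page's tables carry Wikipedia's usual 'wikitable' class:

# alternative lookup by class (the 'wikitable' class name is an assumption about the markup)
wikitables = doc.find_all('table', {'class': 'wikitable'})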

Having found the table element, let's isolate the columns that are going to form the header of our table. These correspond to the <th> tag:

# finding and creating the columns of our table
columns = [x.text.strip() for x in table.find_all('th')]
columns
> ['Symbol',
 'Security',
 'SEC filings',
 'GICS Sector',
 'GICS Sub-Industry',
 'Headquarters Location',
 'Date first added',
 'CIK',
 'Founded']

These correspond to the columns we can see on Wikipedia as well. The final parsing step is to pull the values that will be used to populate the rows of the table. These are represented by the <tr> tags. Since the first row corresponds to the headers, we'll skip it:

# finding and creating the rows of our table
rows = table.find_all('tr')

row_values = []
for row in rows[1:]:
    row_value = [x.text.strip() for x in row.find_all('td')]
    row_values.append(row_value)

Now that we have both the column names and the row values, it's time to put them together. Since pandas is the lingua franca of data analysis in Python, let's create a dataframe using information that we've just scraped:

# putting it together into a dataframe
import pandas as pd
df = pd.DataFrame(row_values, columns=columns)
df

And that gives us a nice and tidy dataframe that can be used for any number of analysis and data processing tasks that we may have.
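
For example, a one-line sanity check could count how many constituents fall into each sector, using the 'GICS Sector' column we extracted above:

# number of index constituents per GICS sector
df['GICS Sector'].value_counts()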

In this section, we've learned to traverse the elements of an HTML document to find what values we need and then stitched them together to create a usable table. Armed with this understanding, let's also consider a more direct alternative using pandas.

Simplifying with pandas

Understanding how data is structured on the internet is an integral part of the web scraping workflow. Sending a request, receiving back a raw HTML file, parsing and analyzing that file and finally extracting and wrangling the data into a usable shape constitutes the essential loop of scraping.

As tooling has evolved around this workflow, new and easier ways of extracting data have been developed. Let's have a look at using the data analysis library pandas to directly obtain the data we're interested in.

Returning to our previous example of the constituents of the S&P 500 index, this task can easily be achieved by using the [pd.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) function provided by the library. I encourage you to read the linked documentation to familiarize yourself with the function and its parameters.

Putting it into practice, it would look like this:

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

Much shorter, right? Since the function returns a list of all the tables it found on the page, we simply need to pick out the one we're interested in, and the job's done. The upside here is that we also receive the table of changes with no extra effort.

tables[0]
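
If you'd rather not rely on the table's position on the page, read_html also accepts an attrs argument that filters tables by their HTML attributes - a sketch reusing the id we found earlier:

# select the constituents table directly by its id attribute
constituents = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies',
    attrs={'id': 'constituents'}
)[0]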

This section has shown that tools exist, and continue to be developed, that speed up the workflow of collecting data from the internet.


Now that we have a basic command of scraping static data from websites, it's time to move on to a more difficult task - extracting data from dynamic websites.

What exactly are "dynamic" websites? In this case, I will use the term to refer to any website that requires some kind of interaction before displaying the data that needs to be collected. This can be a login form, pagination, a widget that needs to be clicked, etc.

In this section, we'll go over how to approach these websites while learning about the powerful suite of tools offered by Selenium.

Browser automation through Selenium

We've all used a browser before - it's the most common way of accessing web pages thanks to its graphical interface (as opposed to the bare requests of the previous sections). The difference between requests and a browser is that the latter enables interactivity. Using a browser, we can click around, scroll the page, zoom in or out and perform many other actions that are impossible using requests alone.

When it comes to using browsers programmatically, the most powerful tool in our arsenal is the browser automation suite Selenium. In a nutshell, this tool gives us the full power of a web browser, combined with the customization and extensibility of using Python to control it.

At the core of Selenium lies the webdriver - the browser itself. The driver is an instance of a browser (e.g. Chrome) that is used to navigate and interact with pages. However, Selenium doesn't ship with "batteries included", meaning that we have to supply our own webdriver. This is actually a good thing, since it gives us the flexibility to choose whichever browser we want. In this case we'll stick to Chrome.

The easiest way to implement this is using the webdriver-manager Python library, which abstracts away many of the inconvenient details of working directly with browser executables. In a single line of code, we're able to instantiate a new browser and have it ready for action:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads and caches a chromedriver that matches the installed browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Running this code should have opened a blank new browser window on your machine. If so, congratulations! This browser can now be used to perform any actions that we're interested in. For example, we can visit Google by running the following:

driver.get('https://www.google.com')

While visiting simple web pages can be interesting, the real power lies in the ability to chain actions into repeatable sequences and deploy them systematically.

In the next section, I'll quickly demonstrate some of Selenium's features and what makes it such an asset in our toolkit. Before moving on, make sure to call driver.quit() to close the browser instance, otherwise processes may be left hanging in the background.
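
In code, closing the session is a single call:

# close the browser window and end the webdriver session
driver.quit()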


To The Stars - Selenium in Action

Having managed to get a simple instance of a Chrome browser up and running, it's time to see what can be done with it. Technically, the list of possibilities is almost endless, but for our purposes we will focus on the targeted scraping of information.

Let's say our workflow is the following:

1. Log in to the platform with our credentials.
2. Navigate to the search page and run a pre-built query.
3. Open the first few results and collect each one's project, key and title.


Be kind, don't rush! It is important to be mindful of the strain our requests are putting on the targeted servers. Nobody wants to accidentally bring down a website because they sent too many requests at once. Be sure to include sensible wait times between page loads!


What would that look like in practice? Let's see - first we initialize a new Chrome driver and point it to the website’s login page:

import time

# start a fresh browser session (instantiated the same way as before)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://website.com/login')
time.sleep(5)

Throughout this script, we'll be using the time.sleep() function from the time library that ships with Python to pace our requests. This basically tells the script to wait for a number of seconds. While functional for this example, for more involved work I strongly recommend you look into Selenium's suite of waits.
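
As a taste of that, here's a minimal sketch of an explicit wait that pauses only until an element is present rather than sleeping for a fixed time (it reuses the placeholder 'login-form-username' id from the login form below):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the element is present, instead of sleeping blindly
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'login-form-username')))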

Now we will have to input our username and password. First, we identify the elements on the page that we need to interact with; then we type into them using the send_keys() method that Selenium provides. Note that you should never share your credentials in a public repository.

from selenium.webdriver.common.by import By

# locate the form fields by their id attributes
username_field = driver.find_element(By.ID, 'login-form-username')
password_field = driver.find_element(By.ID, 'login-form-password')
login_button = driver.find_element(By.ID, 'login')

username_field.send_keys('vtasca')
password_field.send_keys('PASSWORD')
login_button.click()

Now we should be logged onto the platform, congrats! According to the workflow we specified above, we now need to get to the query screen. Once there, we'll use Selenium to find the query field and send it our pre-built query before signaling an Enter keypress.

from selenium.webdriver.common.keys import Keys

driver.get('https://website.com/objects')
time.sleep(5)

search_bar = driver.find_element(By.ID, 'advanced-search')
query = 'order by created desc'

# type the query and press Enter to submit it
search_bar.send_keys(query)
search_bar.send_keys(Keys.ENTER)
time.sleep(5)

That takes care of performing the query. The browser will now display its results, allowing us to see the information we need. However, to get detailed information about each result - an issue, in this platform's terminology - we actually need to click on it so that its details load. In the next code block, we find the issues within the HTML and loop through the first 5, clicking on each one and collecting the project it belongs to, its key (a unique identifier) and its title.

issue_view = driver.find_element(By.CLASS_NAME, 'issue-list')
issue_list = issue_view.find_elements(By.TAG_NAME, 'li')

issues = []
for issue in issue_list[:5]:
    issue.click()
    time.sleep(5)
    issue_project = driver.find_element(By.ID, 'project-name-val').text
    issue_key = driver.find_element(By.ID, 'key-val').text
    issue_title = driver.find_element(By.ID, 'summary-val').text
    issues.append({
        'Project': issue_project,
        'Key': issue_key,
        'Title': issue_title
    })

Finally, we make sure to quit the driver instance and then stitch these into a dataframe so that we can use them for analysis.

driver.quit()
pd.DataFrame(issues)

And that's it! You can see that automating a browser to perform scraping requires a bit more involvement compared to simply requesting the data from a static source, but this is also what makes it such a versatile tool - a kind of Swiss Army knife of scraping.


♻️ Keep in mind that many applications will also offer their data in API form through a dedicated endpoint. This is almost always preferable to scraping their website as it ensures both that we respect their server and that we get the data we need.
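
For illustration only - the endpoint below is made up, so check the target site's API documentation for real URLs and authentication requirements:

# a hypothetical JSON endpoint; response.json() parses the body into Python objects
api_response = requests.get('https://website.com/api/issues', params={'limit': 5})
issues = api_response.json()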


Having started from requests, learned a bit about the layout of HTML documents and the intricacies of querying them, and then developed more robust solutions with browser automation, you are now better equipped to delve into the open-ended world of data collection on the internet. I encourage you to check out the resources linked throughout this post, and don't be afraid to experiment.