Getting started with web scraping in Python
Extracting with the requests library
📖
The Anatomy of a Request: As a quick refresher, a request is made by a client (in this case, us) to a named host located on a server (in this case, the website's). The URL maps the request to the correct resource. For more information, check out MDN's explanation.
import requests
response = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
# checking the status code of the response
response.status_code
> 200
# checking the content-type headers
response.headers['content-type']
> 'text/html; charset=UTF-8'
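Before parsing, it's worth hardening the request itself. A minimal sketch, where the User-Agent string and timeout value are illustrative choices rather than requirements:

```python
import requests

# A Session reuses connections and lets us set a descriptive
# User-Agent once (the value below is a made-up example).
session = requests.Session()
session.headers.update({'User-Agent': 'sp500-scraper/0.1'})

response = session.get(
    'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies',
    timeout=10,  # seconds; fail fast instead of hanging
)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
```

With `raise_for_status()`, a 404 or 500 surfaces as an exception immediately instead of silently producing an empty parse later on.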
Parsing with the BeautifulSoup library
💡
What's in an HTML document: HTML documents are composed of text and tags, marked up so that they make semantic sense. In our analysis, we will focus on the tags that contain the information we want to extract. Since knowing HTML is important to web scraping, it's useful to refer to MDN's guide.
from bs4 import BeautifulSoup
doc = BeautifulSoup(response.text, 'html.parser')
# checking the page title
doc.title
> <title>List of S&P 500 companies - Wikipedia</title>
# finding the right table
table = doc.find('table', {'id': 'constituents'})
# finding and creating the columns of our table
columns = [x.text.strip() for x in table.find_all('th')]
columns
> ['Symbol',
'Security',
'SEC filings',
'GICS Sector',
'GICS Sub-Industry',
'Headquarters Location',
'Date first added',
'CIK',
'Founded']
# finding and creating the rows of our table
rows = table.find_all('tr')
row_values = []
for row in rows[1:]:
    row_value = [x.text.strip() for x in row.find_all('td')]
    row_values.append(row_value)
# putting it together into a dataframe
import pandas as pd
df = pd.DataFrame(row_values, columns=columns)
df
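With the rows in a DataFrame, a little cleanup is usually the next step. A sketch on a made-up two-row sample (the values below are illustrative, not scraped):

```python
import pandas as pd

# Tiny made-up sample mirroring two columns of the scraped table.
df = pd.DataFrame(
    [['MMM', '1902'], ['AOS', '1916']],
    columns=['Symbol', 'Founded'],
)

# Scraped cells arrive as strings; convert the years to integers
# and index the frame by ticker symbol for easy lookups.
df['Founded'] = df['Founded'].astype(int)
df = df.set_index('Symbol')
```

The same pattern (cast dtypes, pick an index) applies to the full scraped table.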
Simplifying with pandas
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
tables[0]
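read_html can also narrow down which table it returns. The sketch below feeds it an inline HTML snippet (a toy stand-in for the real page) and uses the `match` parameter to keep only tables containing a given string:

```python
from io import StringIO
import pandas as pd

# A toy table standing in for the Wikipedia page.
html = """
<table id="constituents">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
</table>
"""

# match keeps only the tables whose text matches the pattern,
# which saves hunting through the returned list by position.
tables = pd.read_html(StringIO(html), match='Symbol')
df = tables[0]
```

On the real page, `pd.read_html(url, match='Symbol')` would skip the unrelated tables the same way.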
Browser automation through Selenium
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.google.com')
To The Stars - Selenium in Action
Go to a website
Run a query
Extract the result details and compile them
✅
Be kind, don't rush! It is important to be mindful of the strain our requests are putting on the targeted servers. Nobody wants to accidentally bring down a website because they sent too many requests at once. Be sure to include sensible wait times between page loads!
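One simple way to honor this is to wrap the wait in a small helper that adds random jitter, so requests don't land in a perfectly regular rhythm (the default delays below are arbitrary choices):

```python
import random
import time

def polite_sleep(base=3.0, jitter=2.0):
    """Sleep for `base` seconds plus up to `jitter` seconds of
    random noise, and return the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Calling `polite_sleep()` between page loads can then stand in for the bare `time.sleep(5)` calls below.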
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get('https://website.com/login')
time.sleep(5)
# locate the login form fields (the find_element_by_* helpers
# were removed in Selenium 4; use find_element with a By locator)
username_field = driver.find_element(By.ID, 'login-form-username')
password_field = driver.find_element(By.ID, 'login-form-password')
login_button = driver.find_element(By.ID, 'login')
username_field.send_keys('vtasca')
password_field.send_keys('PASSWORD')
login_button.click()
driver.get('https://website.com/objects')
time.sleep(5)
# run the query through the search bar
search_bar = driver.find_element(By.ID, 'advanced-search')
query = 'order by created desc'
search_bar.send_keys(query)
search_bar.send_keys(Keys.ENTER)
time.sleep(5)
# extract the details of the first five results
issue_view = driver.find_element(By.CLASS_NAME, 'issue-list')
issue_list = issue_view.find_elements(By.TAG_NAME, 'li')
issues = []
for issue in issue_list[:5]:
    issue.click()
    time.sleep(5)
    issue_project = driver.find_element(By.ID, 'project-name-val').text
    issue_key = driver.find_element(By.ID, 'key-val').text
    issue_title = driver.find_element(By.ID, 'summary-val').text
    issues.append({
        'Project': issue_project,
        'Key': issue_key,
        'Title': issue_title
    })
pd.DataFrame(issues)
♻️
Keep in mind that many applications also offer their data in API form through a dedicated endpoint. This is almost always preferable to scraping their website, as it ensures both that we respect their server and that we get exactly the data we need.
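For instance, Wikipedia itself exposes a REST API whose summary endpoint returns clean JSON, with no HTML to parse (the page title here is just an example):

```python
import requests

# Wikipedia's REST API returns JSON directly -- no HTML parsing needed.
url = 'https://en.wikipedia.org/api/rest_v1/page/summary/Python_(programming_language)'
response = requests.get(url, timeout=10)
data = response.json()
print(data['title'])
```

One request yields structured fields (`title`, `extract`, and so on) that would otherwise take a parser and several selectors to recover.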