Introduction
I have worked on and maintained a good number of web-scrapers in the past. I have also written a few articles on web-scraping. However, I have never written a step by step guide on how to build a web scraper.
This post will aim to serve a starter guide for myself and anyone for a simple web scraping project. Though its not a complete guide, it will serve as a good starting point for anyone looking to build a web scraper.We will use a combination of technologies to build our web scraper. We will be using Python, Postgres, SQLAlchemy and Docker.
Audience and Objectives
This article at beginner developers, hobbyists and DIY folks who are looking to build a web scraper. Intermediate developers can also use this article as a refresher on how to build a web scraper.
This article aims to serve as a step by step guide on how to build a web scraper using Python. We will be using the Lifetime Leaderboards | UMG Gaming website as our target website. We will be scraping the data from the website and storing it in a database.
Prerequisites to Follow Along
To follow along with this article, you will need to have the following installed on your machine:
- Python3 and Pip (I am currently using 3.10) Any version above 3.8 shoul work.
- Git installed and configured on your machine. Instructions vary by Operating System.
- Docker installed and running. Docker Documentation
- Terminal or Command Line Interface (CLI) installed on your machine.
Database - Ensure Postgres is installed locally and running on port 5432
OR
run a Postgres container on Docker Easy PostgreSQL 10 and pgAdmin 4 Setup with Docker container on port 5432
Optional but recommended
- Visual Studio Code or any other IDE of your choice.
- Github account
Initial Setup Instructions
These instructions will work for most Unix (Linux, Mac) and Windows implementations. Kindly refer to the documentation for your Operating System for more information.
- Create a directory for your project and change into it. I will be using
simple-web-scraper
as my project name.
mkdir simple-web-scraper && cd simple-web-scraper
- Create a virtual environment for your project. I will be using
venv
as my virtual environment name.
python3 -m venv venv
- Activate the virtual environment.
source venv/bin/activate
- Install the required dependencies.
python3 -m pip install requests beautifulsoup4 psycopg2-binary SQLAlchemy python-dotenv pytest Faker factory-boy
- Save the dependencies to a requirements.txt file.
python3 -m pip freeze > requirements.txt
- Create a .gitignore file and add the following lines to it.
# .gitignore
venv
__pycache__
.vscode
.env
- Create a README.md file and add the following lines to it.
# simple-web-scraper
Simple web scraper for [Lifetime Leaderboards | UMG Gaming](https://www.umggaming.com/leaderboards)
## Installation
1.Clone this Repo
- Run the following command to initialize a git repository.
git init
- Add the files to the staging area.
git add .
- Commit the files to the repository.
git commit -m "Initial Commit"
Hopefully you have followed the instructions above and have a working project directory. If you have any issues, please feel free to reach out to me on Twitter or Github.
Reconnaissance Phase
The first step in any web scraping project is to understand the target website. As mentioned above we'll be using Lifetime Leaderboards | UMG Gaming website as our target website. We will be scraping the data from the website and storing it in a database.
The choice is mostly motivated by the fact that the website is simple and has a good amount of data to scrape. and as a bonus, it has a leaderboards for the most popular gamers sorted by xp and earnings.
Understanding the Target Website
Below is a screenshot of the xp and earnings leaderboards.
We can see that the xp leaderboard has a table with the following columns:
- Place
- Username
- Trophies
- Social
- Xp
The earnings leaderboard is similar with the only difference being the last column which is earnings.
This is a good starting point for our web scraper. We now have a good idea of the data as well as the structure of the data. We can now move on to the next step.
Understanding the Structure of the Target Website
The next step is to understand the structure of the target website. We will be using the Chrome Developer Tools to inspect the target website. Press F12
on your keyboard to open the developer tools.
Once the developer tools are open, navigate to the xp leaderboard. Right click on the table and select Inspect
. This will open the element inspector. Here we can see the HTML structure of the table and identify the elements we need to scrape.
A Game of Codes
Now Recon is complete and we have a good idea of the data we need to scrape. We can now move on to the next step. We will be using Python to scrape the data from the website. We will be using the requests library to make HTTP requests to the website. We will be using the BeautifulSoup library to parse the HTML response from the website.
Making HTTP Requests
Inside the simple-web-scraper
directory, create a new file called xp_scrape.py
. Open the file and add the following lines of code.
# xp_scrape.py
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get("https://www.umggaming.com/leaderboards")
# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")
leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")
We have imported the requests
library and used it to make a GET
request to the target website. We have also imported the BeautifulSoup
library. We will be using this library to parse the HTML response from the website.
We have also used the BeautifulSoup
library to find the table with the id leaderboard-table
. We have also found the tbody
element inside the table. We will be using this element to find the rows in the table.
Parsing the HTML Response
We will now use the BeautifulSoup
library to parse the HTML response from the website. We will be using the find_all
method to find all the tr
elements inside the tbody
element. We will then loop through the tr
elements and print the text inside each td
element.
# xp_scrape.py
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get("https://www.umggaming.com/leaderboards")
# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")
leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")
for tr in tbody.find_all("tr"):
place = tr.find_all("td")[0].text.strip()
username = tr.find_all("td")[1].find_all("a")[1].text.strip()
xp = tr.find_all("td")[4].text.strip()
# print('position','username','xp', sep='\t')
print("Place", "Username", "XP", sep=" ")
print(place, username, xp, sep=" ")
We have used the find_all
method to find all the tr
elements inside the tbody
element. We have then looped through the tr
elements and printed the text inside each td
element.
We have also used the strip
method to remove the whitespace from the text. We have also used the sep
argument to separate the columns with a tab.
Getting Earnings Leaderboard
The process for getting the earnings leaderboard is similar to the process for getting the xp leaderboard. We will be using the find_all
method to find all the tr
elements inside the tbody
element. We will then loop through the tr
elements and print the text inside each td
element.
# earnings_scrape.py
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get("https://umggaming.com/leaderboards/earnings")
# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")
leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")
for tr in tbody.find_all("tr"):
place = tr.find_all("td")[0].text.strip()
username = tr.find_all("td")[1].find_all("a")[1].text.strip()
earnings = tr.find_all("td")[4].text.strip()[1:]
# print('position','username','earnings', sep='\t')
print("Place", "Username", "Earnings", sep=" ")
print(place, username, earnings, sep="
We have used the find_all
method to find all the tr
elements inside the tbody
element. We have then looped through the tr
elements and printed the text inside each td
element.
We have also used the strip
method to remove the whitespace from the text. We have also used the sep
argument to separate the columns with spaces.
Saving the Data to a Postgres Database
We will now save the data to Postgres DB, this will give us persistence across sessions as well allow perform analysis on the data. To keep things consistent we'll use an Object Relational Mapper (ORM) SQLAlchemy to create the schema and psycopg2 to connect to the Postgres DB.
Creating the Schema
Inorder to keep all our files organized we'll create a new directory called db
inside our root directory and create a new file called base_sql.py
inside the db
directory. We'll create a Base
class that will be used to create the schema for our database. We'll also create a Session
class that will be used to create a session to the database. Following best security practice we'll also use environment variables to store our database credentials.
# base_sql.py
import os
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from dotenv import load_dotenv
load_dotenv()
db = os.getenv(
"DB", "postgresql+psycopg2://test:testpassword@localhost:5432/xp_db"
)
engine = create_engine(db)
engine.connect()
print(engine)
Session = sessionmaker(bind=engine)
Base = declarative_base()
The Base
class is used to create the schema for our database. We have used the declarative_base
method to create the Base
class. We have also used the create_engine
method to create an engine that will be used to connect to the database. We have also used the sessionmaker
method to create a session that will be used to create a session to the database.
The load_dotenv
method is used to load the environment variables from the .env
file. We have used the getenv
method to get the database credentials from the environment variables. If none is found an alternative provided for testing purposes. We have also used the connect
method to connect to the database.
Creating the Models
We'll create two files inside the db
directory. One file will be used to create the XP
model and the other file will be used to create the Earnings
model. We'll use the Base
class to create the schema for the database. We'll also use the Column
class to create the columns for the database. We'll also use the Integer
class to create the id
column. We'll also use the String
class to create the username
column. We'll also use the Float
class to create the xp
and earnings
columns.
# player_sql.py
from sqlalchemy import Column, String, Integer, Date
from datetime import date as dt
from base_sql import Base
class Player(Base):
__tablename__ = "players"
id = Column(Integer, primary_key=True)
username = Column(String(80))
place = Column(String())
xp = Column(Integer())
date = Column(Date)
def __init__(self, username, place, xp):
self.place = place
self.username = username
self.xp = xp
self.date = dt.today()
def __repr__(self):
return f"{self.username} {self.place} {self.xp} {self.date}"
def __str__(self):
return f"{self.username} {self.place} {self.xp} {self.date}"
if __name__ == "__main__":
print("Player class")
The code above is used to create the Player
model.
We have used the __tablename__
attribute to set the name of the table. We have also used the Column
class to create the columns for the database. We have also used the Integer
class to create the id
column. We have also used the String
class to create the username
column. We have also used the Float
class to create the xp
and earnings
columns.
We have also used the __init__
method to initialize the Player
class. We have also used the __repr__
method to return a string representation of the Player
class. We have also used the __str__
method to return a string representation of the Player
class.
We have also used the if __name__ == "__main__":
statement to check if the file is being run directly. If it is being run directly we will print a message.
# top_earning_players_sql.py
from sqlalchemy import Column, String, Date, Integer, Float
from datetime import date as dt
from base_sql import Base
class Earning_Player(Base):
__tablename__ = "paid_players"
id = Column(Integer, primary_key=True)
username = Column(String(80))
place = Column(String())
earnings = Column(Float())
date = Column(Date)
def __init__(self, username, place, earnings):
self.place = place
self.username = username
self.earnings = earnings
self.date = dt.today()
def __repr__(self):
return f"{self.username} {self.place} {self.earnings} {self.date}"
def __str__(self):
return f"{self.username} {self.place} {self.earnings} {self.date}"
if __name__ == "__main__":
print("Earning_Player class")
The code above is used to create the Earning_Player
model.
It is almost the same as the Player
model. The only difference is that we have used the Float
class to create the earnings
column.
The Date Column
is used to store the date the data was scraped. By default the Date
column will store the date in the format YYYY-MM-DD
. We have used the dt.today()
method to get the current date.
Creating the Database
We'll create a new file called postgres_inserts.py
inside the db
directory. We'll use the create_all
method to create the database. We'll also use the Session
class to create a session to the database. We'll also use the add
method to add the data to the database. We'll also use the commit
method to commit the changes to the database.
# postgres_inserts.py
from base_sql import Session, engine, Base
from player_sql import Player
from top_earning_players_sql import Earning_Player
# 1 generate database schema
Base.metadata.create_all(engine, Base.metadata.tables.values(), checkfirst=True)
# 2 Create a new session
session = Session()
session.commit()
session.close()
# create players
print(len(session.query(Player).all()))
print(len(session.query(Earning_Player).all()))
if __name__ == "__main__":
print("Creating tables")
The code above is used to create the database. We have used the create_all
method to create the database. We also query the database to check if the data has been added to the database.
Data Finally
Back to our scrapers we'll add the code to insert the data into the database. We'll add the code to insert the data into the database in the xp_scraper.py
file. We'll also add the code to insert the data into the database in the top_earning_players_scraper.py
file.
# xp_scraper.py
import requests
from bs4 import BeautifulSoup
from db.base_sql import Session, engine, Base
from db.player_sql import Player
# 1 generate database schema
Base.metadata.create_all(engine)
# 2 Create a new session
session = Session()
# get the data
data = requests.get("https://www.umggaming.com/leaderboards")
# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")
leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")
for tr in tbody.find_all("tr"):
place = tr.find_all("td")[0].text.strip()
username = tr.find_all("td")[1].find_all("a")[1].text.strip()
xp = tr.find_all("td")[4].text.strip()
# print('position','username','xp', sep='\t')
print("Place", "Username", "XP", sep=" ")
print(place, username, xp, sep=" ")
# leaderboard-table > tbody:nth-child(2) > tr:nth-child(1) > td:nth-child(5)
# create players
player = Player(username=username, place=place, xp=xp)
# Check if player exists
players = session.query(Player).all()
try:
pl = session.query(Player).filter(Player.username == username).first()
if pl:
session.delete(pl)
session.commit()
except Exception as e:
session.rollback()
print(e)
else:
session.add(player)
session.commit()
session.close()
if __name__ == "__main__":
print("Scraping data")
The code above is the completed code including importing the Player
model and the Session
class. We have also used the Session
class to create a session to the database. We have also used the add
method to add the data to the database. We have also used the commit
method to commit the changes to the database.
We first query the database for any existing data, if the data exists we delete it and add the new data. We also use the try
and except
statement to handle any errors. Finally we commit the changes to the database and close the session.
Lets do the same for the top_earning_players_scraper.py
file.
# earning_scrape.py
import requests
from bs4 import BeautifulSoup
from db.base_sql import Session, engine, Base
from db.top_earning_players_sql import Earning_Player
# 1 generate database schema
Base.metadata.create_all(engine)
# 2 Create a new session
session = Session()
# get the data
data = requests.get("https://umggaming.com/leaderboards/earnings")
# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")
leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")
for tr in tbody.find_all("tr"):
place = tr.find_all("td")[0].text.strip()
username = tr.find_all("td")[1].find_all("a")[1].text.strip()
earnings = tr.find_all("td")[4].text.strip()[1:]
# print('position','username','earnings', sep='\t')
print("Place", "Username", "Earnings", sep=" ")
print(place, username, earnings, sep=" ")
# create players
player = Earning_Player(username=username, place=place, earnings=float(earnings))
players = session.query(Earning_Player).all()
try:
if session.query(Earning_Player).filter(Earning_Player.id > 0).all():
# Check if player exists
pl = session.query(Earning_Player).filter(Earning_Player.username == username).first()
if pl:
# print(pl.username, pl.place, pl.earnings)
session.delete(pl)
session.commit()
except Exception as e:
session.rollback()
print(e)
else:
session.add(player)
session.commit()
session.close()
if __name__ == "__main__":
print("Scraping data")
The code above is the completed code including importing the Earning_Player
model and the Session
class. We have also used the Session
class to create a session to the database. We have also used the add
method to add the data to the database. We have also used the commit
method to commit the changes to the database.
3, 2, 1 Scrape
We have everything in place to run our scrapers and save the data accordingly, lets walkthrough that process below:
Database Setup and Insertion
Create a new database called xp_db
in postgresql. Then create a .env
file in the root of the project and add the following code to the .env
file.
DB=postgresql+psycopg2://test:testpassword@localhost:5432/xp_db
The code above is an example of the .env
file. Replace the test
and testpassword
with your postgresql username and password.
Then run the following commands in the terminal.
cd db
python3 base_sql.py
python3 player_scraper.py
python3 top_earning_players_scraper.py
python postgres_inserts.py
Running the Scrapers
Run the scrapers by entering the following commands:
python3 xp_scrape.py
python3 earnings_scrape.py
Test Everything
Inorder to have this project be as well-rounded as possible we'll code a series of tests to test the code using pytest. Create a new directory called tests
and create a new file called test_db.py
in the tests
directory. Inside create a test_players.py
file.
# test_players.py
import pytest
from db.player_sql import Player
from db.top_earning_players_sql import Earning_Player
from db.base_sql import Session, engine, Base
from db.player_factory_basic import PlayerFactory
@pytest.fixture
def player_func():
return PlayerFactory.build()
@pytest.fixture
def player_func2():
return PlayerFactory.build()
def my_func_to_delete_Player(session, Player_id):
session.query(Player).filter(Player.id == Player_id).delete()
def my_func_to_delete_Player2(session, Player_id):
session.query(Earning_Player).filter(Earning_Player.id == Player_id).delete()
def test_player(session, player_func):
session.add(player_func)
assert session.query(Player)
my_func_to_delete_Player(session, Player.id)
result = session.query(Player).one_or_none()
assert result is None
def test_player2(session, player_func2):
session.add(player_func2)
assert session.query(Earning_Player)
my_func_to_delete_Player2(session, Earning_Player.id)
result = session.query(Earning_Player).one_or_none()
assert result is None
The code above is the completed code for the test_players.py
file. We have used the pytest
library to create a fixture
to create a Player
object. We have also used the pytest
library to create a fixture
to create a Earning_Player
object. We have also used the pytest
library to create a fixture
to create a Session
object. We have also used the pytest
library to create a fixture
to create a engine
object. We have also used the pytest
library to create a fixture
to create a Base
object.
We have also used the pytest
library to create a fixture
to create a PlayerFactory
object. We have also used the pytest
library to create a fixture
to create a Earning_PlayerFactory
object.
Run the Tests
To run the tests, open the terminal and run the following command in the root of the project.
pytest .
The output should be similar to the following image:
Completed Code on GitHub
Alternatively you can check/clone the completed code from: GitHub
Conclusion
In this tutorial, we have learned how to scrape data from a website using the requests
library and BeautifulSoup
library. We have also learned how to store the scraped data in a database using the SQLAlchemy
library. We have also learned how to test the code using the pytest
library. We have also learned how to use the factory_boy
library to create test data.
I hope you liked this write-up and get inspired to extend it further. Keep coding! Feel free to leave comments below or reach out on Twitter: Ken_Mwaura1 or LinkedIn.