Getting Started with a Web Scraping Project

Getting Started with a Web Scraping Project

·

15 min read

Introduction

I have worked on and maintained a good number of web-scrapers in the past. I have also written a few articles on web-scraping. However, I have never written a step by step guide on how to build a web scraper.

This post will aim to serve a starter guide for myself and anyone for a simple web scraping project. Though its not a complete guide, it will serve as a good starting point for anyone looking to build a web scraper.We will use a combination of technologies to build our web scraper. We will be using Python, Postgres, SQLAlchemy and Docker.

Audience and Objectives

This article at beginner developers, hobbyists and DIY folks who are looking to build a web scraper. Intermediate developers can also use this article as a refresher on how to build a web scraper.

This article aims to serve as a step by step guide on how to build a web scraper using Python. We will be using the Lifetime Leaderboards | UMG Gaming website as our target website. We will be scraping the data from the website and storing it in a database.

Prerequisites to Follow Along

To follow along with this article, you will need to have the following installed on your machine:

  • Python3 and Pip (I am currently using 3.10) Any version above 3.8 shoul work.
  • Git installed and configured on your machine. Instructions vary by Operating System.
  • Docker installed and running. Docker Documentation
  • Terminal or Command Line Interface (CLI) installed on your machine.
  • Database - Ensure Postgres is installed locally and running on port 5432

    OR

    run a Postgres container on Docker Easy PostgreSQL 10 and pgAdmin 4 Setup with Docker container on port 5432

Initial Setup Instructions

These instructions will work for most Unix (Linux, Mac) and Windows implementations. Kindly refer to the documentation for your Operating System for more information.

  • Create a directory for your project and change into it. I will be using simple-web-scraper as my project name.
mkdir simple-web-scraper && cd simple-web-scraper
  • Create a virtual environment for your project. I will be using venv as my virtual environment name.
python3 -m venv venv
  • Activate the virtual environment.
source venv/bin/activate
  • Install the required dependencies.
python3 -m pip install requests beautifulsoup4 psycopg2-binary SQLAlchemy python-dotenv pytest Faker factory-boy
  • Save the dependencies to a requirements.txt file.
python3 -m pip freeze > requirements.txt
  • Create a .gitignore file and add the following lines to it.
# .gitignore
venv
__pycache__
.vscode
.env
  • Create a README.md file and add the following lines to it.
# simple-web-scraper

Simple web scraper for [Lifetime Leaderboards | UMG Gaming](https://www.umggaming.com/leaderboards)

## Installation

1.Clone this Repo
  • Run the following command to initialize a git repository.
git init
  • Add the files to the staging area.
git add .
  • Commit the files to the repository.
git commit -m "Initial Commit"

Hopefully you have followed the instructions above and have a working project directory. If you have any issues, please feel free to reach out to me on Twitter or Github.

Reconnaissance Phase

Recon illustration

The first step in any web scraping project is to understand the target website. As mentioned above we'll be using Lifetime Leaderboards | UMG Gaming website as our target website. We will be scraping the data from the website and storing it in a database.

The choice is mostly motivated by the fact that the website is simple and has a good amount of data to scrape. and as a bonus, it has a leaderboards for the most popular gamers sorted by xp and earnings.

Understanding the Target Website

Below is a screenshot of the xp and earnings leaderboards.

Lifetime Leaderboards | UMG Gaming

Earning Leaderboards | UMG Gaming

We can see that the xp leaderboard has a table with the following columns:

  • Place
  • Username
  • Trophies
  • Social
  • Xp

The earnings leaderboard is similar with the only difference being the last column which is earnings.

This is a good starting point for our web scraper. We now have a good idea of the data as well as the structure of the data. We can now move on to the next step.

Understanding the Structure of the Target Website

The next step is to understand the structure of the target website. We will be using the Chrome Developer Tools to inspect the target website. Press F12 on your keyboard to open the developer tools.

Once the developer tools are open, navigate to the xp leaderboard. Right click on the table and select Inspect. This will open the element inspector. Here we can see the HTML structure of the table and identify the elements we need to scrape.

Devtools | UMG Gaming

A Game of Codes

Now Recon is complete and we have a good idea of the data we need to scrape. We can now move on to the next step. We will be using Python to scrape the data from the website. We will be using the requests library to make HTTP requests to the website. We will be using the BeautifulSoup library to parse the HTML response from the website.

Making HTTP Requests

Inside the simple-web-scraper directory, create a new file called xp_scrape.py. Open the file and add the following lines of code.

# xp_scrape.py
import requests
from bs4 import BeautifulSoup

# get the data
data = requests.get("https://www.umggaming.com/leaderboards")

# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")

leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")

We have imported the requests library and used it to make a GET request to the target website. We have also imported the BeautifulSoup library. We will be using this library to parse the HTML response from the website.

We have also used the BeautifulSoup library to find the table with the id leaderboard-table. We have also found the tbody element inside the table. We will be using this element to find the rows in the table.

Parsing the HTML Response

We will now use the BeautifulSoup library to parse the HTML response from the website. We will be using the find_all method to find all the tr elements inside the tbody element. We will then loop through the tr elements and print the text inside each td element.

# xp_scrape.py
import requests
from bs4 import BeautifulSoup

# get the data
data = requests.get("https://www.umggaming.com/leaderboards")

# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")

leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")

for tr in tbody.find_all("tr"):
    place = tr.find_all("td")[0].text.strip()
    username = tr.find_all("td")[1].find_all("a")[1].text.strip()
    xp = tr.find_all("td")[4].text.strip()
    # print('position','username','xp', sep='\t')
    print("Place", "Username", "XP", sep="  ")
    print(place, username, xp, sep="  ")

We have used the find_all method to find all the tr elements inside the tbody element. We have then looped through the tr elements and printed the text inside each td element.

We have also used the strip method to remove the whitespace from the text. We have also used the sep argument to separate the columns with a tab.

Getting Earnings Leaderboard

The process for getting the earnings leaderboard is similar to the process for getting the xp leaderboard. We will be using the find_all method to find all the tr elements inside the tbody element. We will then loop through the tr elements and print the text inside each td element.

# earnings_scrape.py
import requests
from bs4 import BeautifulSoup

# get the data
data = requests.get("https://umggaming.com/leaderboards/earnings")

# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")

leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")

for tr in tbody.find_all("tr"):
    place = tr.find_all("td")[0].text.strip()
    username = tr.find_all("td")[1].find_all("a")[1].text.strip()
    earnings = tr.find_all("td")[4].text.strip()[1:]
    # print('position','username','earnings', sep='\t')
    print("Place", "Username", "Earnings", sep="  ")
    print(place, username, earnings, sep="

We have used the find_all method to find all the tr elements inside the tbody element. We have then looped through the tr elements and printed the text inside each td element.

We have also used the strip method to remove the whitespace from the text. We have also used the sep argument to separate the columns with spaces.

Saving the Data to a Postgres Database

We will now save the data to Postgres DB, this will give us persistence across sessions as well allow perform analysis on the data. To keep things consistent we'll use an Object Relational Mapper (ORM) SQLAlchemy to create the schema and psycopg2 to connect to the Postgres DB.

Creating the Schema

Inorder to keep all our files organized we'll create a new directory called db inside our root directory and create a new file called base_sql.py inside the db directory. We'll create a Base class that will be used to create the schema for our database. We'll also create a Session class that will be used to create a session to the database. Following best security practice we'll also use environment variables to store our database credentials.

# base_sql.py

import os

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from dotenv import load_dotenv

load_dotenv()


db = os.getenv(
    "DB", "postgresql+psycopg2://test:testpassword@localhost:5432/xp_db"
)
engine = create_engine(db)
engine.connect()
print(engine)
Session = sessionmaker(bind=engine)

Base = declarative_base()

The Base class is used to create the schema for our database. We have used the declarative_base method to create the Base class. We have also used the create_engine method to create an engine that will be used to connect to the database. We have also used the sessionmaker method to create a session that will be used to create a session to the database.

The load_dotenv method is used to load the environment variables from the .env file. We have used the getenv method to get the database credentials from the environment variables. If none is found an alternative provided for testing purposes. We have also used the connect method to connect to the database.

Creating the Models

We'll create two files inside the db directory. One file will be used to create the XP model and the other file will be used to create the Earnings model. We'll use the Base class to create the schema for the database. We'll also use the Column class to create the columns for the database. We'll also use the Integer class to create the id column. We'll also use the String class to create the username column. We'll also use the Float class to create the xp and earnings columns.

# player_sql.py

from sqlalchemy import Column, String, Integer, Date
from datetime import date as dt

from base_sql import Base


class Player(Base):
    __tablename__ = "players"

    id = Column(Integer, primary_key=True)
    username = Column(String(80))
    place = Column(String())
    xp = Column(Integer())
    date = Column(Date)

    def __init__(self, username, place, xp):
        self.place = place
        self.username = username
        self.xp = xp
        self.date = dt.today()

    def __repr__(self):
        return f"{self.username} {self.place} {self.xp} {self.date}"

    def __str__(self):
        return f"{self.username} {self.place} {self.xp} {self.date}"


if __name__ == "__main__":
    print("Player class")

The code above is used to create the Player model.

We have used the __tablename__ attribute to set the name of the table. We have also used the Column class to create the columns for the database. We have also used the Integer class to create the id column. We have also used the String class to create the username column. We have also used the Float class to create the xp and earnings columns.

We have also used the __init__ method to initialize the Player class. We have also used the __repr__ method to return a string representation of the Player class. We have also used the __str__ method to return a string representation of the Player class.

We have also used the if __name__ == "__main__": statement to check if the file is being run directly. If it is being run directly we will print a message.

# top_earning_players_sql.py

from sqlalchemy import Column, String, Date, Integer, Float
from datetime import date as dt

from base_sql import Base


class Earning_Player(Base):
    __tablename__ = "paid_players"

    id = Column(Integer, primary_key=True)
    username = Column(String(80))
    place = Column(String())
    earnings = Column(Float())
    date = Column(Date)

    def __init__(self, username, place, earnings):
        self.place = place
        self.username = username
        self.earnings = earnings
        self.date = dt.today()

    def __repr__(self):
        return f"{self.username} {self.place} {self.earnings} {self.date}"

    def __str__(self):
        return f"{self.username} {self.place} {self.earnings} {self.date}"


if __name__ == "__main__":
    print("Earning_Player class")

The code above is used to create the Earning_Player model.

It is almost the same as the Player model. The only difference is that we have used the Float class to create the earnings column. The Date Column is used to store the date the data was scraped. By default the Date column will store the date in the format YYYY-MM-DD. We have used the dt.today() method to get the current date.

Creating the Database

We'll create a new file called postgres_inserts.py inside the db directory. We'll use the create_all method to create the database. We'll also use the Session class to create a session to the database. We'll also use the add method to add the data to the database. We'll also use the commit method to commit the changes to the database.

# postgres_inserts.py

from base_sql import Session, engine, Base
from player_sql import Player
from top_earning_players_sql import Earning_Player

# 1 generate database schema
Base.metadata.create_all(engine, Base.metadata.tables.values(), checkfirst=True)

# 2 Create a new session
session = Session()
session.commit()
session.close()


# create players
print(len(session.query(Player).all()))
print(len(session.query(Earning_Player).all()))

if __name__ == "__main__":
 print("Creating tables")

The code above is used to create the database. We have used the create_all method to create the database. We also query the database to check if the data has been added to the database.

Data Finally

Data illustration

Back to our scrapers we'll add the code to insert the data into the database. We'll add the code to insert the data into the database in the xp_scraper.py file. We'll also add the code to insert the data into the database in the top_earning_players_scraper.py file.

# xp_scraper.py

import requests
from bs4 import BeautifulSoup

from db.base_sql import Session, engine, Base
from db.player_sql import Player

# 1 generate database schema
Base.metadata.create_all(engine)


# 2 Create a new session
session = Session()

# get the data
data = requests.get("https://www.umggaming.com/leaderboards")

# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")

leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")

for tr in tbody.find_all("tr"):
    place = tr.find_all("td")[0].text.strip()
    username = tr.find_all("td")[1].find_all("a")[1].text.strip()
    xp = tr.find_all("td")[4].text.strip()
    # print('position','username','xp', sep='\t')
    print("Place", "Username", "XP", sep="  ")
    print(place, username, xp, sep="  ")
    # leaderboard-table > tbody:nth-child(2) > tr:nth-child(1) > td:nth-child(5)

    # create players
    player = Player(username=username, place=place, xp=xp)
    # Check if player exists
    players = session.query(Player).all()
    try: 
        pl = session.query(Player).filter(Player.username == username).first()
        if pl:
            session.delete(pl)
            session.commit()
    except Exception as e:
        session.rollback()
        print(e)
    else:
        session.add(player)
    session.commit()

session.close()

if __name__ == "__main__":
 print("Scraping data")

The code above is the completed code including importing the Player model and the Session class. We have also used the Session class to create a session to the database. We have also used the add method to add the data to the database. We have also used the commit method to commit the changes to the database.

We first query the database for any existing data, if the data exists we delete it and add the new data. We also use the try and except statement to handle any errors. Finally we commit the changes to the database and close the session.

Lets do the same for the top_earning_players_scraper.py file.

# earning_scrape.py

import requests
from bs4 import BeautifulSoup

from db.base_sql import Session, engine, Base
from db.top_earning_players_sql import Earning_Player

# 1 generate database schema
Base.metadata.create_all(engine)


# 2 Create a new session
session = Session()

# get the data
data = requests.get("https://umggaming.com/leaderboards/earnings")

# load data into bs4
soup = BeautifulSoup(data.text, "html.parser")

leaderboard = soup.find("table", {"id": "leaderboard-table"})
tbody = leaderboard.find("tbody")

for tr in tbody.find_all("tr"):
    place = tr.find_all("td")[0].text.strip()
    username = tr.find_all("td")[1].find_all("a")[1].text.strip()
    earnings = tr.find_all("td")[4].text.strip()[1:]
    # print('position','username','earnings', sep='\t')
    print("Place", "Username", "Earnings", sep="  ")
    print(place, username, earnings, sep="   ")
    # create players
    player = Earning_Player(username=username, place=place, earnings=float(earnings))
    players = session.query(Earning_Player).all()
    try:
        if session.query(Earning_Player).filter(Earning_Player.id > 0).all():
            # Check if player exists
            pl = session.query(Earning_Player).filter(Earning_Player.username == username).first()
            if pl:
                # print(pl.username, pl.place, pl.earnings)
                session.delete(pl)
                session.commit()

    except Exception as e:
        session.rollback()
        print(e)
    else:
        session.add(player)

        session.commit()

session.close()

if __name__ == "__main__":
 print("Scraping data")

The code above is the completed code including importing the Earning_Player model and the Session class. We have also used the Session class to create a session to the database. We have also used the add method to add the data to the database. We have also used the commit method to commit the changes to the database.

3, 2, 1 Scrape

Illustration

We have everything in place to run our scrapers and save the data accordingly, lets walkthrough that process below:

Database Setup and Insertion

Data illustration

Create a new database called xp_db in postgresql. Then create a .env file in the root of the project and add the following code to the .env file.

DB=postgresql+psycopg2://test:testpassword@localhost:5432/xp_db

The code above is an example of the .env file. Replace the test and testpassword with your postgresql username and password.

Then run the following commands in the terminal.

cd db 
python3 base_sql.py
python3 player_scraper.py
python3 top_earning_players_scraper.py
python postgres_inserts.py

Running the Scrapers

Run the scrapers by entering the following commands:

python3 xp_scrape.py
python3 earnings_scrape.py

Test Everything

Inorder to have this project be as well-rounded as possible we'll code a series of tests to test the code using pytest. Create a new directory called tests and create a new file called test_db.py in the tests directory. Inside create a test_players.py file.

# test_players.py

import pytest

from db.player_sql import Player
from db.top_earning_players_sql import Earning_Player
from db.base_sql import Session, engine, Base


from db.player_factory_basic import PlayerFactory


@pytest.fixture
def player_func():
    return PlayerFactory.build()


@pytest.fixture
def player_func2():
    return PlayerFactory.build()


def my_func_to_delete_Player(session, Player_id):
    session.query(Player).filter(Player.id == Player_id).delete()


def my_func_to_delete_Player2(session, Player_id):
    session.query(Earning_Player).filter(Earning_Player.id == Player_id).delete()


def test_player(session, player_func):
    session.add(player_func)
    assert session.query(Player)
    my_func_to_delete_Player(session, Player.id)
    result = session.query(Player).one_or_none()
    assert result is None

def test_player2(session, player_func2):
    session.add(player_func2)
    assert session.query(Earning_Player)
    my_func_to_delete_Player2(session, Earning_Player.id)
    result = session.query(Earning_Player).one_or_none()
    assert result is None

The code above is the completed code for the test_players.py file. We have used the pytest library to create a fixture to create a Player object. We have also used the pytest library to create a fixture to create a Earning_Player object. We have also used the pytest library to create a fixture to create a Session object. We have also used the pytest library to create a fixture to create a engine object. We have also used the pytest library to create a fixture to create a Base object.

We have also used the pytest library to create a fixture to create a PlayerFactory object. We have also used the pytest library to create a fixture to create a Earning_PlayerFactory object.

Run the Tests

To run the tests, open the terminal and run the following command in the root of the project.

pytest .

The output should be similar to the following image:

Pytest | UMG Gaming

Completed Code on GitHub

Alternatively you can check/clone the completed code from: GitHub

Conclusion

In this tutorial, we have learned how to scrape data from a website using the requests library and BeautifulSoup library. We have also learned how to store the scraped data in a database using the SQLAlchemy library. We have also learned how to test the code using the pytest library. We have also learned how to use the factory_boy library to create test data.

I hope you liked this write-up and get inspired to extend it further. Keep coding! Feel free to leave comments below or reach out on Twitter: Ken_Mwaura1 or LinkedIn.

Next time gif

Resources