Daily News Scraper and SMS Notifications - Part One


Being the millennial I am, I usually get my news from either social media or online articles. I'm pretty sure I'm not the only one; however, finding the time or the energy to watch a news broadcast, or worse, read a newspaper, is personally a challenge. Of course, there are a tonne of subscription services that offer periodic news updates via text messages. However, I preferred a custom-made solution tailored to my use case.

Here in Kenya, news outlets don't allow open access to their APIs, hence I chose to use web scraping to get news articles.

Disclaimer: This article is purely for educational purposes; I am neither advocating nor endorsing crawling any website without proper authorisation. Always read the website's robots.txt. Kindly don't overload a website with too many concurrent requests; space your requests out to enable the servers to handle the traffic.
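If you want to check a site's crawling rules programmatically before scraping, here is a minimal sketch using only Python's standard library. The target URL is just an example; substitute the site you actually intend to scrape.

# robots_check.py - a minimal sketch for checking robots.txt and spacing out requests
import time
from urllib.robotparser import RobotFileParser

# Example target; replace with the site you intend to scrape
site = "https://www.businessdailyafrica.com"

parser = RobotFileParser()
parser.set_url(site + "/robots.txt")
parser.read()

# Check whether a generic user agent may fetch the homepage
if parser.can_fetch("*", site + "/"):
    print("Allowed to fetch the homepage")
else:
    print("Disallowed by robots.txt")

# When making multiple requests, pause between them to avoid overloading the server
time.sleep(2)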

With all that sorted, I'll explain the workflow. We first scrape the websites and get the top 3 headlines and the links to the full articles. We then craft a message with that data, send it to ourselves, and finally deploy the script to the cloud to run continuously.


Requirements to get started

To effectively follow along with this post and subsequent code, you will need the following prerequisites.

  • Python and pip (I am currently using 3.9.2); any version above 3.5 should work.
  • An Africa's Talking account.

    • API key and username from your account. Create an app and take note of the API key.

  • A Heroku account, if you plan on deploying this script (I recommend Heroku for that).

Phase One

Create a directory to hold all the code. Change into the directory.

          mkdir daily-news-scraper
          cd daily-news-scraper
  • Create a new virtual environment for the project or activate the previous one.
  • Using the Python package manager (pip), install the africastalking Python SDK, python-dotenv and newspaper3k libraries.
  • Save the installed libraries in a requirements.txt file.
            python -m venv .
            source bin/activate
            pip install africastalking python-dotenv newspaper3k
            pip freeze > requirements.txt
        

Phase Two

There are a plethora of web scraping libraries available in Python, e.g. BeautifulSoup, Requests and Scrapy. You can also read this article to get an extensive overview.

I chose the newspaper3k library as it is specifically designed for extracting and curating articles. It also makes use of requests, lxml and BeautifulSoup under the hood. This ensures a lot of the low-level code for fetching and extracting articles is abstracted away, making it easy to get started.
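To illustrate how much the library abstracts away, here is a minimal sketch that fetches and parses a single article; the URL is only a placeholder.

# quick_demo.py - a minimal newspaper3k sketch; the URL is just a placeholder
from newspaper import Article

# download() fetches the page, parse() extracts the title, text and other fields
article = Article("https://www.example.com/some-news-story")
article.download()
article.parse()
print(article.title)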

Enough talk, let's get coding! Alternatively, check the completed code on my GitHub.

Create a news_scraper.py script to hold all of our code.

touch news_scraper.py

Open it in your favorite IDE or editor, then let's import the required libraries.

# news_scraper.py
import os
import newspaper
import africastalking as at
from dotenv import load_dotenv

We then create a .env file to hold all of our important credentials. This is in line with best practices when it comes to security.

touch .env
  • Enter the following, replacing the placeholders with the proper credentials.

    # Both can be obtained from your account console on Africa's Talking
    username=Username-here
    api_key=apikey-here
    mobile_number=+2547XXXXXXXX

    Make sure your number above is in E.164 format.

Phase Three

Below we proceed to load our environment values and initialize the Africa's Talking client. This is done by assigning the username and api_key variables to our values and passing them as arguments to the Africa's Talking Python SDK.

# news_scraper.py
load_dotenv()
# get the environment values from the .env file
api_key = os.getenv('api_key')
username = os.getenv('username')
mobile_number = os.getenv('mobile_number')
# Initialize the Africa's Talking client using the username and API key
at.initialize(username, api_key)
# create a variable to reference the SMS client
sms = at.SMS

There are a considerable number of news sources, ranging from local to international and covering different categories. Which to choose is up to your personal preference. I chose the Business Daily Africa and Standard Media platforms.

I chose to scrape the homepage of each website. However, you have the option to specify categories via the website links, e.g. technology (see the short sketch after the next code block). I assigned the specified URLs to variables for easier reference. We also create an empty list, message, to hold the headlines and URLs.

# news_scraper.py
 # create variables to hold urls to be scraped
business_daily = "https://www.businessdailyafrica.com/"
standard_daily = "https://www.standardmedia.co.ke/"
# create an empty list to hold the headlines and urls
message = []
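If you'd rather scrape a specific section instead of the homepage, you can point a variable at a category page instead. The path below is purely hypothetical; check the site's navigation for the actual category links.

# Hypothetical category page; the exact path depends on the site's navigation
business_daily_tech = "https://www.businessdailyafrica.com/technology"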

Now we'll create a function that takes a URL as an argument, uses the newspaper library to scrape data from it, inserts the data in a list, and gets the top three headlines and links to the full articles.

The function will also append each headline and URL to our message list.

# Create a function to scrape the top 3 headlines from a news source
def top_news(url):
  # build the news source object from the given url
  news_source = newspaper.build(url)
  top_articles = []
  # download and parse the first three articles
  for index in range(3):
    article = news_source.articles[index]
    article.download()
    article.parse()
    top_articles.append(article)
  # append each headline and its url to the message list
  for a in top_articles:
    print(a.title, a.url)
    message.append(a.title)
    message.append(a.url)


top_news(business_daily)
top_news(standard_daily)

The newspaper library gives us the build() method, which takes a website link and returns a source whose articles can easily be iterated over. For each article in the news_source variable we call the download() and parse() methods. Each article is downloaded, parsed and appended to a list. Check the [documentation](https://newspaper.readthedocs.io/en/latest/) for further clarification.

Due to limits on the characters and special symbols that can be included in an SMS, we restrict ourselves to only three articles per source.

We then append each title and URL to the message list. This will be our custom news digest.
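Because the message list alternates titles and URLs, passing str(message) to the SMS API sends the raw Python list representation, brackets and quotes included. A slightly cleaner alternative (just a sketch, not part of the original script) is to join the entries with newlines before sending:

# Optional: build a newline-separated message body instead of using str(message)
news_digest = "\n".join(message)
print(news_digest)

Either way, keep an eye on the total length, since long messages are split into multiple SMS segments.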

Send me a Text NOW!

Now to finally notify ourselves of the current news, we will use the Africa's Talking SMS API. We will create a send_message function that takes the headlines and a mobile number as arguments. This function will then attempt to send us a message.

# news_scraper.py
# Create a function to send a message containing the scraped news headlines.
def send_message(news: str, number: str):
  try:
    # send the news digest to the given number via the SMS client
    response = sms.send(news, [number])
    print(response)
  except Exception as e:
    print(f"Houston, we have a problem: {e}")


# Call the function, passing the message and mobile_number as arguments
send_message(str(message), mobile_number)

Adding a try-except block ensures we first try to send the message and get notified in case of a failure. We also print out the exception for easier debugging. We finally call the send_message() function, passing our list converted to a string and the number defined in our environment variables.

Run the script and you should get a message containing the current headlines and links.
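To run it from the project directory (with the virtual environment activated):

# run the scraper once; it scrapes both sources and sends the SMS
python news_scraper.py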

This article was meant to cover both the development and deployment of the news scraper. However, it'll be easier if deployment and URL shortening get a separate article (Part 2). I'll update this article with the link to Part Two.