Being the millennial I am, I usually get my news from either social media or articles. I'm pretty sure I'm not the only one; finding the time or the energy to watch a news broadcast or, worse, read a newspaper is personally a challenge. Of course, there are plenty of subscription services that offer periodic news updates via messages. However, I preferred a custom-made solution tailored to my use case.
Here in Kenya, news outlets don't allow open access to their APIs, hence I chose to use web scraping to get news articles.
Disclaimer: This article is purely for educational purposes. I am neither advocating nor endorsing crawling any website without proper authorisation. Always read the website's robots.txt. Kindly don't overload a website with too many concurrent requests; space your requests so the servers can handle the traffic.
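As a rough illustration of the two habits above, here is a minimal sketch using Python's standard library `urllib.robotparser`. The helper names (`build_robot_parser`, `allowed_urls`, `polite_fetch`) and the two second delay are my own assumptions for illustration, not part of the scraper built in this post.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_fetch(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position:  # no need to wait before the very first request
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Against a live site you would more likely call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()`, rather than passing the text in yourself.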
With all that sorted, I'll explain the workflow. We first scrape the website and get the top 3 headlines and the links to the full articles. We then craft a message with the data and send it to ourselves. Finally, we deploy the script to the cloud and have it run continuously.
Requirements to get started
To effectively follow along with this post and subsequent code, you will need the following prerequisites.
- Python and pip (I am currently using 3.9.2). Any version above 3.5 should work.
- An Africa's Talking account.
- API key and username from your account. Create an app and take note of the API key.
If you plan on deploying this script, I recommend Heroku.
- Heroku Account
Phase One
Create a directory to hold all the code. Change into the directory.
mkdir daily-news-scraper
cd daily-news-scraper
- Create a new virtual environment for the project or activate the previous one.
- Using Python's package manager (pip), install the Africa's Talking Python SDK and the python-dotenv and newspaper3k libraries.
- Save the installed libraries in a requirements.txt file
python -m venv .
source bin/activate
pip install africastalking python-dotenv newspaper3k
pip freeze > requirements.txt
Phase Two
There is a plethora of web scraping libraries available in Python, e.g. beautifulsoup, requests and scrapy. You can also read this article to get an extensive overview.
I chose the newspaper3k library as it is specifically designed for extracting and curating articles. It also makes use of requests, lxml and beautifulsoup. This means a lot of the low-level code for fetching and extracting articles is abstracted away, which makes it easy to get started.
Enough talk, let's get coding! Alternatively, check the completed code on my GitHub.
Create a news_scraper.py script to hold all of our code.
touch news_scraper.py
Open it in your favorite IDE or editor, and let's import the required libraries.
# news_scraper.py
import os
import newspaper
import africastalking as at
from dotenv import load_dotenv
We then create a .env file to hold all of our important credentials. This is in line with best practices when it comes to security.
touch .env
- Enter the following, replacing the placeholders with the proper credentials.
# Both can be obtained from your account console on Africas Talking
username=Username-here
api_key=apikey-here
mobile_number=+2547XXXXXXXX
Make sure your number above is in E.164 format.
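If you are unsure whether a number is in E.164 format (a leading "+" followed by at most 15 digits, the first of which is not zero), a quick check like the one below can help. This is only a sketch; the `is_e164` helper is a name I made up for illustration and is not part of the Africa's Talking SDK.

```python
import re

# E.164: a "+" sign, then 2 to 15 digits, the first of which is not zero.
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number: str) -> bool:
    """Return True when the whole string looks like an E.164 number."""
    return E164_PATTERN.fullmatch(number) is not None
```

For example, `is_e164("+254712345678")` passes, while a local-style number such as `"0712345678"` does not.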
Phase Three
Below we proceed to load our environment values and initialize the Africa's Talking client. This is done by assigning the variables username and api_key to our values and passing them as arguments to the Africa's Talking Python SDK.
#news_scraper.py
load_dotenv()
# get the environment values from the .env file
api_key = os.getenv('api_key')
username = os.getenv('username')
mobile_number = os.getenv('mobile_number')
# Initialize the Africas talking client using username and api api_key
at.initialize(username, api_key)
# create a variable to reference the SMS client
sms = at.SMS
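One thing to watch for: `os.getenv` quietly returns `None` when a key is missing, so a typo in the .env file only shows up later, when sending fails. A small guard can catch this early. The sketch below is optional, and `require_env` is a hypothetical helper of my own, not something provided by python-dotenv.

```python
import os


def require_env(names, environ=None):
    """Return the values for `names`, raising if any is missing or empty."""
    # Default to the real process environment; a dict can be passed for testing.
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing values in .env: {', '.join(missing)}")
    return [environ[name] for name in names]
```

After `load_dotenv()` it could be used as `username, api_key, mobile_number = require_env(["username", "api_key", "mobile_number"])`.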
There are a considerable number of news sources ranging from local to international to different categories. Choosing will be up to your personal preference. I chose the Business Daily Africa and Standard Media platforms.
I chose to scrape the homepage of the respective websites. However, you have the option to specify categories via the website links, e.g. technology. I assigned the specified URLs to variables for easier reference. We also create an empty list, message, to hold the headlines and URLs.
# news_scraper.py
# create variables to hold urls to be scraped
business_daily = "https://www.businessdailyafrica.com/"
standard_daily = "https://www.standardmedia.co.ke/"
# create an empty list to hold the headlines and urls
message = []
Now we'll create a function that takes a URL as an argument, uses the newspaper library to scrape data from that URL, and collects the top three headlines and the links to the full articles in a list.
The function will also append each headline and URL to our message list.
# Create a function to scrape the top 3 headlines from news sources
def top_news(url):
    # build the news source for the given url
    news_source = newspaper.build(url)
    top_articles = []
    # slicing avoids an IndexError if fewer than three articles are found
    for article in news_source.articles[:3]:
        article.download()
        article.parse()
        top_articles.append(article)
    # print each headline and add it, with its link, to the message list
    for a in top_articles:
        print(a.title, a.url)
        message.append(a.title)
        message.append(a.url)


top_news(business_daily)
top_news(standard_daily)
The newspaper library gives us the build() function, which takes a website link and returns a source object whose articles list can easily be iterated over.
For each of the first three articles in the news_source variable we call the download() and parse() methods. Each article is downloaded, parsed and appended to the list. Check the documentation (newspaper.readthedocs.io/en/latest) for further clarification.
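A single article that fails to download would otherwise stop the whole run. The sketch below shows one way to skip such articles and still end up with three headlines. The `collect_headlines` helper is my own illustration; it only assumes each article exposes `download()`, `parse()`, `title` and `url`, as used above.

```python
def collect_headlines(articles, limit=3):
    """Return up to `limit` (title, url) pairs, skipping articles that fail."""
    headlines = []
    for article in articles:
        if len(headlines) == limit:
            break
        try:
            article.download()
            article.parse()
        except Exception as error:  # broad on purpose: this is only a sketch
            print(f"Skipping {article.url}: {error}")
            continue
        if article.title:  # ignore pages that parsed without a headline
            headlines.append((article.title, article.url))
    return headlines
```

It could be called as `collect_headlines(newspaper.build(url).articles)`. If repeated runs return fewer articles than expected, the `memoize_articles` option of `build()` is worth looking up in the documentation.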
Due to limits on the number of characters and special symbols that can be included in an SMS, we limit the list to only three articles per source.
We then append each title and URL to the message list. This will be our custom news headlines.
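To get a feel for the character limits, here is a small sketch that joins the headlines into a plain-text body and estimates how many SMS segments it would need. Both helper names are hypothetical, and the 160/153 character limits assume every character fits the basic GSM-7 alphabet; messages with other characters are encoded differently and fit fewer characters per segment.

```python
import math


def format_message(headlines):
    """Join (title, url) pairs into one plain-text SMS body."""
    return "\n".join(f"{title}\n{url}" for title, url in headlines)


def sms_segments(text, single=160, multipart=153):
    """Estimate segments for a GSM-7 message (assumed limits: 160 / 153)."""
    if len(text) <= single:
        return 1
    # Longer messages are split, and each part loses room to a header.
    return math.ceil(len(text) / multipart)
```

Six headlines with full-length links will usually span several segments, which is one reason link shortening is worth considering.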
Send me a Text NOW!
Now to finally notify ourselves of the current news. We will use the Africa's Talking SMS API.
We will create a function, send_message, that takes the headlines as a string and a mobile number as arguments. This function will then attempt to send us a message.
# news_scraper.py
# Create a function to send a message containing the scraped news headlines.
def send_message(news: str, number: str):
    try:
        response = sms.send(news, [number])
        print(response)
    except Exception as e:
        print(f"Houston we have a problem: {e}")


# Call the function passing the message and mobile_number as arguments
send_message(str(message), mobile_number)
Adding a try-except block ensures we first try to send the message and get notified in case of a failure. We also print out the exception for easier debugging. We finally call the send_message() function, passing our list as a string and the number as defined in our environment variables.
Run the script and you should get a message containing the current headlines and links.
This article was meant to cover both development and deployment of the news scraper. However, it'll be easier if deployment and URL shortening get a separate article (Part 2). I'll update this article with the link to part two.