Daily News Scraper and Notifications - Part Two

Previously...

To recap part one of this article: we built a daily news scraper using newspaper3k and scraped two news platforms, Business Daily Africa and Standard Media. We parsed the data and extracted the headline and URL for the top 3 stories. We also used the Africa's Talking SMS API and Python SDK to send the headlines as notifications to a mobile number.

This article covers URL shortening for the article links, deployment to Heroku, and scheduling the script to run at regular intervals.

Prerequisites

URL Shortening

Previously we were sending the links by text exactly as they came from the news source. This is not only inefficient but expensive. Each message exceeded the character limit for a single text, so the cost per message went up significantly. For context, one text usually costs around KES 0.8 (read more on Africa's Talking Pricing). Looking at my dashboard, the previous message cost went up to KES 5.80 per text.

This isn't a colossal amount, but we want the script to run every 30 minutes or so throughout the day. Factoring that in, URL shortening becomes a necessity. It also cleans up our text messages, which is important if you want to scale this solution up to more people.
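To get a feel for how message length drives the cost, here is a rough sketch of a segment estimator. It assumes plain GSM-7 text (160 characters for a single text, 153 per part once a message is split) and a flat price per segment; the function names and the default price are illustrative, and actual billing depends on the provider.

```python
import math

# Assumed limits for plain GSM-7 text; messages containing other characters
# (for example emoji) use smaller limits and are not handled here.
SINGLE_SMS_CHARS = 160
MULTIPART_SMS_CHARS = 153


def sms_segments(text: str) -> int:
    """Estimate how many SMS segments a GSM-7 message is billed as."""
    length = len(text)
    if length <= SINGLE_SMS_CHARS:
        return 1
    return math.ceil(length / MULTIPART_SMS_CHARS)


def estimated_cost(text: str, price_per_segment: float = 0.8) -> float:
    """Estimated cost in KES, assuming a flat price per segment."""
    return round(sms_segments(text) * price_per_segment, 2)
```

Running `sms_segments()` on a message before and after shortening its links gives a quick way to check whether the shorter URLs actually bring it down a segment or two.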

There are a good number of URL shortening services; one of the best known is Bitly. Free accounts get a generous allocation of about 1000 links per month. This should be enough for our use case; if you need more, you can upgrade or create multiple accounts. We could use the default Bitly API with the requests library, but I came across an easier solution: bitlyshortener. This library has a lot of cool features, including a pool of tokens that it can make requests against concurrently, which significantly increases the number of links you can shorten.

    pip install bitlyshortener

Once you've sorted all of the above, edit your .env file and add the token as bitly_token. We then import the library and initialize the Shortener object, passing a list of tokens and max_cache_size as arguments.

    # news_scraper.py
    import os

    import bitlyshortener as bts

    # ... other code here ...

    # Read the Bitly token from the environment
    token = os.getenv('bitly_token')
    # Create a Shortener object from a list of tokens
    shortener = bts.Shortener(tokens=[token], max_cache_size=256)
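If you do end up with several tokens, one way to feed the pool is to keep them in a single comma-separated variable. This is only a sketch: `bitly_tokens` and `parse_tokens` are hypothetical names, not part of the original script or of bitlyshortener.

```python
import os


def parse_tokens(raw: str) -> list[str]:
    """Split a comma-separated string into a clean list of tokens."""
    return [part.strip() for part in raw.split(",") if part.strip()]


# `bitly_tokens` is a hypothetical variable name; the article itself uses a
# single `bitly_token` entry.
tokens = parse_tokens(os.getenv("bitly_tokens", ""))
# The resulting list could then be passed as Shortener(tokens=tokens, ...).
```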

We could write a separate function to create the short URLs. However, integrating it with the existing top_news function while keeping each URL matched to the right headline proved a challenge, so I decided to do the shortening inside the for loop instead.

    # Create a function to scrape the top 3 headlines from a news source
    def top_news(url):
        # Get the top articles from the news source
        news_source = newspaper.build(url, memoize_articles=False)
        top_articles = []

        # Slicing avoids an IndexError if fewer than 3 articles are found
        for article in news_source.articles[:3]:
            article.download()
            article.parse()
            top_articles.append(article)

        for a in top_articles:
            # Shorten the long article url using the bitlyshortener lib
            short_urls = shortener.shorten_urls([a.url])
            # message is the list of SMS lines (assumed to be defined in part one)
            message.append(a.title)
            # shorten_urls returns a list, so we need to unpack it
            for short_url in short_urls:
                message.append(short_url)

In the code above, we iterate over the articles and call shorten_urls() on each link. Since the function returns a list, we unpack it and append each short URL to the message list.
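For comparison, the separate-function approach could look roughly like the sketch below. It shortens all links in one batched call and pairs them back up with `zip`, which assumes the shortener returns short URLs in the same order as its input (worth confirming against the bitlyshortener docs). `build_message` and `fake_shorten` are hypothetical helpers; `fake_shorten` stands in for `shortener.shorten_urls` so the example runs offline.

```python
from typing import Callable, Iterable, List, Tuple


def build_message(
    articles: Iterable[Tuple[str, str]],
    shorten: Callable[[List[str]], List[str]],
) -> List[str]:
    """Return alternating headline and short-URL lines for (title, url) pairs."""
    pairs = list(articles)
    titles = [title for title, _ in pairs]
    # One batched call; assumes output order matches input order.
    short_urls = shorten([url for _, url in pairs])
    if len(short_urls) != len(titles):
        raise ValueError("shortener returned a different number of URLs")
    lines: List[str] = []
    for title, short_url in zip(titles, short_urls):
        lines.extend([title, short_url])
    return lines


def fake_shorten(urls: List[str]) -> List[str]:
    """Stand-in for shortener.shorten_urls so the sketch runs offline."""
    return [f"https://short.example/{i}" for i, _ in enumerate(urls)]


message = build_message(
    [("Headline A", "https://example.com/a"),
     ("Headline B", "https://example.com/b")],
    fake_shorten,
)
```

The length check is there so that a mismatch fails loudly instead of silently pairing a headline with the wrong link.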

The Shortener object provides a usage() function that returns the fraction of the current quota that has been used, as a float. To have some visibility in our script, we multiply it by 100 to get a percentage and print it (this step is optional).

    # news_scraper.py
    usage = shortener.usage()
    print(f"Current url quota usage: {usage * 100:.1f}%")

Deployment

For deployment, I chose Heroku because it offers an easy setup via GitHub and I already had an account with them. It is easy enough to use any other cloud provider, e.g. Digital Ocean, AWS or GCP.

After creating an account and logging in, I recommend installing the Heroku CLI to make deploying easier. Now let's begin the deployment: run heroku create --app <daily-news-scraper-ke>. If you go to your app dashboard, you'll see your new app.

We need to create a runtime.txt file to tell Heroku which version of Python we want it to run. This is important because bitlyshortener requires at least version 3.7 and, by default, Heroku uses version 3.6. I set mine to 3.9.2 to replicate my development environment.

  echo "python-3.9.2" > runtime.txt

We also need to specify the config vars that Heroku will use at runtime. This is similar to how we've been storing our credentials in a .env file. You can set them either via the Heroku console in the browser or from the terminal using the Heroku CLI. Make sure you change the values to your actual credentials.

  heroku config:set bitly_token=bitly_token_here
  heroku config:set api_key=api_key_here
  heroku config:set username=Username_here
  heroku config:set mobile_number=2547XXXXXXXX
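On the Python side, these config vars show up as ordinary environment variables. A small sketch of reading them and failing early if one is missing might look like this; `load_settings` and `REQUIRED_VARS` are hypothetical names, while the variable names are the ones set above.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("bitly_token", "api_key", "username", "mobile_number")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing config vars: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` at the top of the script means a forgotten `heroku config:set` surfaces as a clear error in the logs rather than a failed API call later on.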

We now need to initialize a git repo and push the code to Heroku:

  git init
  heroku git:remote -a <daily-news-scraper-ke>
  git add .
  git commit -am 'Deploy news-scraper script'
  git push heroku master

Your app is now on Heroku, but it is not doing anything. Since this little script can't accept HTTP requests, visiting its .herokuapp.com address won't do anything, but that is not a problem. To have this script running 24/7 we need a simple Heroku add-on called "Heroku Scheduler". To install it, click on the "Configure Add-ons" button on your app dashboard.

Then, in the search bar, look for Heroku Scheduler.

Click on the result, and click on "Provision".

If you go back to your app dashboard, you'll see the add-on.

Click on the "Heroku Scheduler" link to configure a job, then click on "Create Job". Select "30 minutes" and, for the run command, enter python news_scraper.py. Click on "Save job".

While everything we have used so far on Heroku is free, Heroku Scheduler runs the job on the $25/month instance, prorated to the second. Since this script takes approximately 7 seconds to run, running it every 30 minutes adds up to just under 3 hours of instance time a month, so you should only have to spend roughly 10 to 12 cents a month.
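As a back-of-the-envelope check of that estimate, the sketch below prorates the $25 monthly price by the seconds the job actually runs. It assumes a 30-day month and ignores any start-up time Heroku may bill on top of the script's own run time; `monthly_cost` is a hypothetical helper.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def monthly_cost(run_seconds: float, interval_minutes: int,
                 dyno_price_per_month: float = 25.0) -> float:
    """Estimate the prorated monthly cost of a scheduled job in dollars."""
    runs_per_month = SECONDS_PER_MONTH / (interval_minutes * 60)
    billed_seconds = runs_per_month * run_seconds
    return dyno_price_per_month * billed_seconds / SECONDS_PER_MONTH


cost = monthly_cost(run_seconds=7, interval_minutes=30)
print(f"${cost:.3f} per month")  # prints "$0.097 per month"
```

Under these assumptions the job costs a little under 10 cents a month; a few extra seconds of start-up per run would push it slightly higher.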

Conclusion

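One key feature of the newspaper library is its built-in caching. Once it has crawled a site for headlines it caches the result, and the next time it crawls, it returns an empty list if no new headlines are found. This saves a lot of time and code, and prevents multiple texts with the same headlines. Note that this caching is controlled by the memoize_articles argument to newspaper.build(), which is on by default; the top_news function above passes memoize_articles=False, which turns it off.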
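To illustrate the idea (this is not how newspaper implements it internally), here is a minimal sketch of filtering out links that have already been seen. `only_new` and the in-memory `seen_urls` set are hypothetical; in a scheduled job the set would need to be stored somewhere that survives between runs.

```python
from typing import Iterable, List, Set


def only_new(urls: Iterable[str], seen: Set[str]) -> List[str]:
    """Return URLs not seen before and record them in `seen`."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen_urls: Set[str] = set()
first_run = only_new(["https://example.com/a", "https://example.com/b"], seen_urls)
second_run = only_new(["https://example.com/a", "https://example.com/b"], seen_urls)
```

The first call returns both links, while the second returns an empty list, which is the behavior that keeps repeat headlines out of the texts.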

I hope that you liked this post and that you learned something from reading it. I truly believe that this kind of project is one of the best ways to learn new tools and concepts.

I have many other ideas, and I hope you will like them. Do not hesitate to share what other things you build with this snippet; the possibilities are endless.

If you have any questions or comments, let me know in the comments or on Twitter. Happy coding.