Visualizing the News: Grab your PILlow

2017-11-12 python programming

A Picture’s Worth a Gazillion Bits

newsapi

This weekend, I tripped over a neat news REST-ful api called newsapi. Grab an API key and you’re off to the races.

There are tons of live headlines - News API can provide headlines from 70 worldwide sources.

There are basically two api endpoints:

Accessing the API with Python

Because the API uses plain GET requests, newsapi can easily be queried straight from a browser's address bar. But accessing the API from the browser is limiting.

Accessing the api programmatically with python isn’t difficult to do. There are libraries that we can use to make the task a breeze!

Two libraries are essential for accessing a REST-ful api:

urllib3 - a library to formulate requests and process the responses they return.
json - a library to marshal and unmarshal JSON data structures.

Let’s walk through the basics of using our library to interact with the api.

import urllib3
import json

h = urllib3.PoolManager()
r = h.request('GET', 'https://newsapi.org/v1/articles?source=abc-news-au&sortBy=top&apiKey=e8612ef18bcb4b9c932680026f6b6d42')

(note - you may need to use pip3 install urllib3 certifi if your imports fail to load)

That’s it - you just made a request to the ABC News (AU) source, sorted by top, using the apiKey that you received when you registered.
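The query string in that URL is just three key-value pairs, so it can be assembled from its parts rather than typed by hand. Here is a minimal sketch using the standard library's urllib.parse.urlencode; the helper name articles_url and the placeholder key YOUR_API_KEY are made up for illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the request above.
BASE_URL = "https://newsapi.org/v1/articles"


def articles_url(source, api_key, sort_by="top"):
    """Build the articles URL for one news source (hypothetical helper)."""
    query = urlencode({"source": source, "sortBy": sort_by, "apiKey": api_key})
    return BASE_URL + "?" + query


# YOUR_API_KEY is a placeholder; substitute the key you registered for.
print(articles_url("abc-news-au", "YOUR_API_KEY"))
```

The resulting string can be passed to h.request('GET', ...) in place of the hard-coded URL, which makes it easy to loop over several sources.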

How do we know if this request actually worked? We assigned the result of the request to a variable, r, which exposes several attributes. One of them, status, holds the HTTP status code of the response. Status codes are well defined, so the value can be inspected to determine whether the request succeeded.

If the request succeeded (i.e. status equals 200), then we can examine the data, located in r.data. Examining the data shows JSON-encoded information as raw bytes. In order to access the information, we want to decode the JSON into a Python data structure.

Using the json library, we can pass the JSON string to loads, which returns a Python dictionary of key-value pairs.

json_ds = json.loads(r.data)
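The status check and the decoding step can be combined into one small function. This is only a sketch: the name decode_response and the constant HTTP_OK are hypothetical, and the sample bytes stand in for a real r.data so the example runs without a network connection.

```python
import json

HTTP_OK = 200  # standard HTTP status code for a successful request


def decode_response(status, data):
    """Return the decoded JSON payload, or None if the request failed."""
    if status != HTTP_OK:
        return None
    return json.loads(data)  # json.loads accepts bytes as well as str


# With urllib3 this would be called as decode_response(r.status, r.data).
sample = b'{"source": "abc-news-au", "articles": []}'
print(decode_response(200, sample))
print(decode_response(404, b""))
```

Returning None on failure keeps the caller simple; raising an exception would be a reasonable alternative if a failed request should stop the program.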

Now that we have the data decoded into a Python dictionary, we can inspect it and see that there are the following keys:

status -> ‘ok’, indicating a successful result (this is separate from the HTTP status code 200 held in r.status)
source -> ‘abc-news-au’, the name of the source requested in the GET request
sortBy -> ‘top’, the value of the sortBy value passed to the GET request
articles -> a list of articles, each of which is itself a dictionary with the following keys:
- author
- title
- description
- url
- urlToImage
- publishedAt

Accessing any of these values is as simple as using the key in quotes as the index to the json_ds dictionary. For example, json_ds['articles'] retrieves the list of articles, and len(json_ds['articles']) tells us how many articles the request returned.
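To make the indexing concrete, here is a small sketch that decodes a made-up payload shaped like the response described above. The authors, titles, and example.com URLs are invented for illustration; real responses will differ.

```python
import json

# Made-up sample shaped like the response described above.
raw = b'''{
  "source": "abc-news-au",
  "sortBy": "top",
  "articles": [
    {"author": "A. Writer", "title": "First headline",
     "url": "https://example.com/1", "urlToImage": "https://example.com/1.jpg"},
    {"author": null, "title": "Second headline",
     "url": "https://example.com/2", "urlToImage": null}
  ]
}'''

json_ds = json.loads(raw)

articles = json_ds['articles']      # index the dictionary by key
print(len(articles))                # number of articles in the list
print(articles[0]['title'])         # each article is itself a dictionary
print(articles[1]['author'])        # JSON null becomes Python None
```

Since a JSON null decodes to None, it is worth checking a value such as urlToImage before trying to request it.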

We can iterate through all the articles and print out the author and title as follows:


for _, a in enumerate(json_ds['articles']):
    print(a['author'], a['title'])

A brief note on the enumerate function. Rather than use range, where we would have to wrap our list with len to produce a valid integer range, we use enumerate and pass the list, json_ds['articles'], directly to the function. The function yields tuples of (index, value). Since we don’t need the index, we assign it to an underscore, _, which by convention marks a value we intend to ignore, and the variable a receives each article in turn.
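A short sketch of what enumerate produces, using a made-up stand-in for json_ds['articles']. It also shows that when the index is not needed, iterating over the list directly gives the same values.

```python
# Made-up stand-in for json_ds['articles'].
articles = [
    {"author": "A. Writer", "title": "First headline"},
    {"author": "B. Reporter", "title": "Second headline"},
]

# enumerate yields (index, value) pairs; handy when the position matters.
numbered = [(i, a['title']) for i, a in enumerate(articles)]

# When the index is not needed, iterating over the list directly is equivalent.
direct = [a['title'] for a in articles]
ignored = [a['title'] for _, a in enumerate(articles)]

print(numbered)
print(direct == ignored)
```

The index becomes useful later on, for example to give each downloaded image a distinct filename.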

We can now programmatically access, manipulate, and do whatever we want with the data returned from the request. That is far more useful than just viewing the requested data in your browser.

So what about that `urlToImage` field?

The curious reader likely noticed that there are two urls in our list of articles. One of them (url) is a link to the full article. The other is a link to an image associated with the article. Let’s continue our programmatic quest and grab this image and create a thumbnail image for each image we retrieve.

In order to manipulate images, we need to do a couple of things:

issue a request to retrieve the image link data.
use the PIL library to save our image and create and save our thumbnail image.

Just like every other library, we must import PIL library components before we can access them. Specifically, to import the Image library from PIL, add the following: from PIL import Image.

Here’s some code to retrieve the image and then create thumbnails.

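from PIL import Image

OK = 200                  # HTTP status code for a successful request
THUMB_SIZE = (128, 128)   # assumed maximum thumbnail size (width, height)

def saveImage(h, url, filename):
    # Fetch the image and write the raw bytes to disk.
    r = h.request('GET', url)
    if r.status == OK:
        with open(filename, "wb") as f:
            f.write(r.data)

def thumb(filename, thumbFilename):
    # Shrink the saved image in place and write it out as a JPEG.
    try:
        im = Image.open(filename)
        im.thumbnail(THUMB_SIZE)
        im.save(thumbFilename, "JPEG")
    except IOError:
        print("cannot create thumbnail for", filename, thumbFilename)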
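One detail worth knowing about thumb: Image.thumbnail resizes the image in place and preserves the aspect ratio, so the size passed to it acts as an upper bound on width and height rather than an exact size. Here is a small sketch with an in-memory image; the 128x128 bound is an assumed value, not necessarily the one used in the original program.

```python
from PIL import Image

THUMB_SIZE = (128, 128)  # assumed bound; the original value is not shown

# Blank 800x600 image standing in for a downloaded article image.
im = Image.new("RGB", (800, 600), "white")

im.thumbnail(THUMB_SIZE)  # modifies im in place and returns None
print(im.size)            # fits within 128x128 with the 4:3 ratio kept
```

Because the call works in place, open a fresh copy of the file if you still need the full-size image afterwards.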

We can wrap both of these functions together to produce a program that retrieves the articles from multiple sources, retrieves the image for each article, and creates an associated thumbnail from each image. Here’s a gist of the complete program:

From the above code, you can see how easy it is to use urllib3 to grab the articles of interest from the newsapi, and then for each article, grab the image url and save it to a file. Now that you have these tools in your possession, the ability to create fun and interesting applications outside of the browser awaits!

Bonus Question

For fun, how would you create a collage image from the thumbnails? Can you use the Image library to construct a new image that is a composition of blocks of thumbnail images? Give it a try, and email me if you get stuck!
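If you want a hint, one possible starting point is to create a blank canvas with Image.new and copy each thumbnail onto it with paste. The function name make_collage, the grid layout, and the 128x128 cell size are all assumptions made for this sketch, and solid-colour images stand in for real thumbnails.

```python
from PIL import Image


def make_collage(thumbs, columns, cell=(128, 128)):
    """Paste thumbnails onto a blank canvas, left to right, top to bottom."""
    rows = (len(thumbs) + columns - 1) // columns   # ceiling division
    canvas = Image.new("RGB", (columns * cell[0], rows * cell[1]), "black")
    for i, im in enumerate(thumbs):
        x = (i % columns) * cell[0]   # column position in pixels
        y = (i // columns) * cell[1]  # row position in pixels
        canvas.paste(im, (x, y))
    return canvas


# Solid-colour stand-ins for real thumbnails loaded with Image.open.
thumbs = [Image.new("RGB", (128, 96), c) for c in ("red", "green", "blue")]
collage = make_collage(thumbs, columns=2)
print(collage.size)
```

Each thumbnail is pasted at the top-left corner of its cell, so images smaller than the cell leave part of the background showing; centring them is a nice next step.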