Consuming data from APIs · Scientific computing

Not all data come from static files. In a large number of scientific applications, we need to collect data from websites, often using a RESTful API. In this module, we will use a very simple example to show how we can collect data about postal codes, get them as a JSON file, and process them.

To make this example work, we will need two things: a way to interact with remote servers (provided by the HTTP package), and a way to read JSON data (using JSON, as we have seen in the previous module).

import HTTP
import JSON

In this example, we will get the names and coordinates of the various post codes in Rimouski, a (small) city in Québec, Canada. The zippopotam.us website offers an API that we can access without a login or an access token, and is therefore perfect for this example.

The basic loop of interaction with an API for data retrieval is:

1. Figure out the correct query parameters. There is no global recipe here: each API has documentation explaining which keywords can be used and what values they accept.
2. Write the URL to the correct endpoint, composed of the API root and the query parameters.
3. Perform an HTTP request (usually GET) on this endpoint.
4. Check the response status (200 is a good sign; anything else should be checked against the IANA status codes list).
5. Read the body of the response in the correct format (usually JSON, but each API will have its own documentation).

HTTP status codes are a “soft” standard that most people and services agree on. For example, 404 means that the resource was not found, 403 that access to it is forbidden, and 500 that something went wrong on the server. If you work with APIs often, it is very important to get acquainted with them.
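
The first digit of a status code gives its general class, which is often enough to decide what to do next. The sketch below illustrates this with a small helper; the name status_class is hypothetical and not part of any package.

```julia
# Map an HTTP status code to the class given by its first digit.
function status_class(status::Integer)
    100 <= status <= 599 || throw(ArgumentError("not an HTTP status code: $(status)"))
    classes = ("informational", "success", "redirection", "client error", "server error")
    return classes[div(status, 100)]
end

# Print the class of the codes mentioned in the text.
for code in (200, 403, 404, 500)
    println(code, " => ", status_class(code))
end
```

A client error (4xx) usually means that the request should be changed, whereas a server error (5xx) may go away if the same request is sent again later.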

In our case, after reading the API documentation, we know that we need to get to the endpoint that is given by country/province/name:

api_root = "https://api.zippopotam.us"
place = (country = "ca", province = "qc", name = "rimouski")
endpoint = "$(api_root)/$(place.country)/$(place.province)/$(place.name)"

"https://api.zippopotam.us/ca/qc/rimouski"

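If we expect to query more than one place, the same URL construction can be wrapped in a function. This is only a sketch: place_endpoint is a hypothetical helper, and it assumes that every part of the path is a plain word, so names that contain spaces or accents would need extra handling that is not shown here.

```julia
# Hypothetical helper: build the endpoint for any country/province/name triple.
function place_endpoint(country, province, name; api_root = "https://api.zippopotam.us")
    # Lowercase each part to match the URL used in this example.
    parts = lowercase.(String[country, province, name])
    return join([api_root; parts], "/")
end

place_endpoint("CA", "QC", "Rimouski")
```

Keeping the API root as a keyword argument makes it easy to point the same code at a different server, for example when testing.
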
Now that we have this endpoint set up, we can perform a GET request. GET is one of the HTTP verbs (also called methods), and requests the content of a resource for our consumption. Other commonly used verbs include POST (to upload formatted data), PUT (to replace a resource), PATCH (to edit data on the API), and DELETE (to remove a resource).

res = HTTP.get(endpoint)

HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Tue, 11 Jun 2024 22:56:13 GMT
Content-Type: application/json
Transfer-Encoding: chunked
Connection: keep-alive
x-cache: hit
charset: UTF-8
vary: Accept-Encoding
CF-Cache-Status: DYNAMIC
Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=e03Lul3DLOGtY3BLT45BlGGkgUkBP1%2FEDIT99EvU8r8Rp8lVw%2BndTuRccblsDu%2BcUl49A0%2BZhbKISF6KvAvE3EZIliS3jPqE1%2FH0xO2swAzBLgy8i%2BcHKTXH%2BQ1FPl0ys5lNMg%3D%3D"}],"group":"cf-nel","max_age":604800}
NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
access-control-allow-origin: *
Server: cloudflare
CF-RAY: 89253173afd67bc5-LAX
Content-Encoding: gzip
alt-svc: h3=":443"; ma=86400

{"country abbreviation": "CA", "places": [{"place name": "Rimouski Central", "longitude": "-68.5232", "post code": "G5L", "latitude": "48.4525"}, {"place name": "Rimouski Northeast", "longitude": "-68.4973", "post code": "G5M", "latitude": "48.4547"}, {"place name": "Rimouski Southwest", "longitude": "-68.5122", "post code": "G5N", "latitude": "48.4277"}], "country": "Canada", "place name": "Rimouski Central", "state": "Quebec", "state abbreviation": "QC"}"""

Because res is a new type of object, we can take a look at its fields:

typeof(res)

HTTP.Messages.Response

fieldnames(typeof(res))

(:version, :status, :headers, :body, :request)

The first thing we want to check is the response status, specifically that it is equal to 200:

isequal(200)(res.status)

true
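
One caveat is that, by default, HTTP.get throws an exception when the server answers with an error status such as 404 or 500, so the check above is only reached for successful requests. The sketch below uses the status_exception keyword to get the response back in every case and inspect its status ourselves. The helper name checked_get is hypothetical, and failures that happen before any response is received (such as a network outage) will still throw.

```julia
import HTTP

# Hypothetical helper: return the response when the status is 200, and `nothing` otherwise.
function checked_get(url::AbstractString)
    # Ask HTTP.jl to return error responses instead of throwing an exception.
    res = HTTP.get(url; status_exception = false)
    if res.status == 200
        return res
    end
    @warn "Request to $(url) returned status $(res.status)"
    return nothing
end
```

With this helper, checked_get(endpoint) gives either a response that is safe to process or nothing, which is easy to test with isnothing.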

We can also inspect the headers:

res.headers

15-element Vector{Pair{SubString{String}, SubString{String}}}:
                        "Date" => "Tue, 11 Jun 2024 22:56:13 GMT"
                "Content-Type" => "application/json"
           "Transfer-Encoding" => "chunked"
                  "Connection" => "keep-alive"
                     "x-cache" => "hit"
                     "charset" => "UTF-8"
                        "vary" => "Accept-Encoding"
             "CF-Cache-Status" => "DYNAMIC"
                   "Report-To" => "{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=e03Lul3DLOGtY3BLT45BlGGkgUkBP1%2FEDIT99EvU8r8Rp8lVw%2BndTuRccblsDu%2BcUl49A0%2BZhbKISF6KvAvE3EZIliS3jPqE1%2FH0xO2swAzBLgy8i%2BcHKTXH%2BQ1FPl0ys5lNMg%3D%3D\"}],\"group\":\"cf-nel\",\"max_age\":604800}"
                         "NEL" => "{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}"
 "access-control-allow-origin" => "*"
                      "Server" => "cloudflare"
                      "CF-RAY" => "89253173afd67bc5-LAX"
            "Content-Encoding" => "gzip"
                     "alt-svc" => "h3=\":443\"; ma=86400"

This stores a lot of interesting information, but as a vector of pairs. We can make our life significantly easier by turning this into a dictionary. For example, we can confirm that the API is indeed giving us data in the application/json format:

Dict(res.headers)["Content-Type"]

"application/json"

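Header names are case-insensitive, and the response above mixes several styles ("Content-Type", "vary", "x-cache"), so a lookup with the wrong capitalisation would fail on a plain dictionary. A minimal sketch of one way around this, using a few headers copied from the output above, is to lowercase the names when building the dictionary. HTTP.jl also provides HTTP.header to read a single header from a response, which may be more convenient in practice.

```julia
# A few headers copied from the response above, as a vector of pairs.
headers = [
    "Content-Type" => "application/json",
    "Server" => "cloudflare",
    "vary" => "Accept-Encoding",
]

# Lowercase every name so that lookups do not depend on the server's capitalisation.
lookup = Dict(lowercase(name) => value for (name, value) in headers)

# Look up a header by its lowercase name.
content_type = lookup["content-type"]

# Use `get` with a default for headers that may be absent.
missing_header = get(lookup, "x-missing", "")
```
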
With this information, we can get the actual content of our request, which is stored in the body field.

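The body field is emptied the first time it is converted to a String, because String takes ownership of the byte vector instead of copying it. This is a strange quirk, but that is how it works. The safest way to handle the body (because we do not want to lose our response!) is to convert a copy of it, or to convert it exactly once and keep the result. We start by giving the body a shorter name; note that this does not copy the data, so body and res.body refer to the same array.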

body = res.body
typeof(body)

Vector{UInt8} (alias for Array{UInt8, 1})

This is unexpected! We were promised application/json content, and here we are with a long array of unsigned 8-bit integers (bytes). Why? In a nutshell: there is no reason to expect that we will be querying text. We can use HTTP to request sound, images, videos, or even streaming data, so what we get is the raw output. Thankfully, we can transform it into a string:

String(copy(body))

"{\"country abbreviation\": \"CA\", \"places\": [{\"place name\": \"Rimouski Central\", \"longitude\": \"-68.5232\", \"post code\": \"G5L\", \"latitude\": \"48.4525\"}, {\"place name\": \"Rimouski Northeast\", \"longitude\": \"-68.4973\", \"post code\": \"G5M\", \"latitude\": \"48.4547\"}, {\"place name\": \"Rimouski Southwest\", \"longitude\": \"-68.5122\", \"post code\": \"G5N\", \"latitude\": \"48.4277\"}], \"country\": \"Canada\", \"place name\": \"Rimouski Central\", \"state\": \"Quebec\", \"state abbreviation\": \"QC\"}"

We are using copy here because String takes ownership of the byte vector it receives and leaves it empty; converting a copy keeps body intact. The recommended design pattern when dealing with HTTP responses is to process the body field in one go, to avoid losing this information.
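
This behaviour can be reproduced without any network access. The sketch below uses a short, made-up byte vector as a stand-in for a response body, and shows the difference between converting a copy and converting the vector itself.

```julia
# Stand-in for a response body: the bytes of a short JSON string.
bytes = Vector{UInt8}("{\"state\": \"Quebec\"}")

# Converting a copy leaves the original vector untouched.
from_copy = String(copy(bytes))
println("Bytes left after converting a copy: ", length(bytes))

# Converting the vector itself hands its memory to the new String and empties it.
direct = String(bytes)
println("Is the vector empty after a direct conversion? ", isempty(bytes))
```

Both conversions give the same text; the only difference is whether the original bytes are still available afterwards.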

We are now a step away from having our JSON object:

riki = JSON.parse(String(body))

Dict{String, Any} with 6 entries:
  "state abbreviation" => "QC"
  "country abbreviation" => "CA"
  "place name" => "Rimouski Central"
  "places" => Any[Dict{String, Any}("post code"=>"G5L", "latitude"=>"48.4525", "longitude"=>"-68.5232", "place name"=>"Rimouski Central"), Dict{String, Any}("post code"=>"G5M", "latitude"=>"48.4547", "longitude"=>"-68.4973", "place name"=>"Rimouski Northeast"), Dict{String, Any}("post code"=>"G5N", "latitude"=>"48.4277", "longitude"=>"-68.5122", "place name"=>"Rimouski Southwest")]
  "country" => "Canada"
  "state" => "Quebec"

If you run this line a second time, it will fail: String(body) has already consumed the bytes, so body is now empty. This module has a lot of warnings. Welcome to working with remote data.

The output we get is now our standard JSON object, so we can do a little thing like:

for place in riki["places"]
    @info "The post code for $(place["place name"]) is $(place["post code"])"
end

[ Info: The post code for Rimouski Central is G5L
[ Info: The post code for Rimouski Northeast is G5M
[ Info: The post code for Rimouski Southwest is G5N

Most APIs we use in practice for research are a lot more data-rich, and can have highly structured fields. When this is the case, it is a good idea to take the output and represent it as a custom type: an example of this approach can be found in, e.g., the GBIF package for biodiversity data retrieval.
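
As a minimal sketch of this idea for our small example, we can define a type for one entry of the "places" array. The type name PostalPlace and its field names are hypothetical choices, and the entry below is copied by hand from the parsed output above. Note that the API returns the coordinates as strings, so the constructor parses them into numbers.

```julia
# Hypothetical type for one entry of the "places" array.
struct PostalPlace
    name::String
    code::String
    latitude::Float64
    longitude::Float64
end

# Build a PostalPlace from one parsed JSON dictionary.
function PostalPlace(entry::AbstractDict)
    return PostalPlace(
        entry["place name"],
        entry["post code"],
        parse(Float64, entry["latitude"]),
        parse(Float64, entry["longitude"]),
    )
end

# One entry copied from the parsed response.
entry = Dict{String, Any}(
    "post code" => "G5M",
    "latitude" => "48.4547",
    "longitude" => "-68.4973",
    "place name" => "Rimouski Northeast",
)

northeast = PostalPlace(entry)
@info "$(northeast.name) ($(northeast.code)) is at latitude $(northeast.latitude)"
```

With the full response, the same constructor could be applied to every entry with PostalPlace.(riki["places"]), which gives a vector of typed values rather than a vector of dictionaries.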