Not all data come from static files. In a large number of scientific applications, we need to collect data from websites, often using a RESTful API. In this module, we will use a very simple example to show how we can collect data about postal codes, get them as a JSON file, and process them.
In order to make this example work, we will need two things: a way to interact with remote servers (provided by the HTTP package), and a way to read JSON data (using the JSON package, as we have seen in the previous module).
import HTTP
import JSON
In this example, we will get the names and coordinates of the various post codes in Rimouski, a (small) city in Québec, Canada. The zippopotam.us website offers an API that we can access without a login or an access token, and is therefore perfect for this example.
The basic loop of interaction with an API for data retrieval is:
- Figure out the correct query parameters; there is no global recipe here, each API will have its documentation explaining which keywords can be used and what values they accept
- Write the URL to the correct endpoint, composed of the API root and the query parameters
- Perform an HTTP request (usually GET) on this endpoint
- Check the response status (200 is a good sign, anything else should be checked against the IANA status codes list)
- Read the body of the response in the correct format (usually JSON, but each API will have its own documentation)
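The status check in the loop above can be made explicit with a small helper. This is a minimal sketch that only relies on the standard classes of HTTP status codes; the name describe_status is hypothetical, and is not part of the HTTP package.

```julia
# Hypothetical helper: map a status code to the standard HTTP status class
function describe_status(status::Integer)
    if 100 <= status < 200
        return "informational"
    elseif 200 <= status < 300
        return "success"
    elseif 300 <= status < 400
        return "redirection"
    elseif 400 <= status < 500
        return "client error"
    elseif 500 <= status < 600
        return "server error"
    else
        return "unknown"
    end
end

describe_status(200) # "success"
describe_status(404) # "client error"
```

The first digit of the code is enough to know who is likely at fault: codes in the 400s usually point to a problem with our request, and codes in the 500s to a problem on the server.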
404 means that the resource was not found, 403 that access to it is forbidden, and 500 that something went wrong on the server. If you work with APIs often as part of your work, it is very important to get acquainted with these status codes.
In our case, after reading the API documentation, we know that we need to get to the endpoint that is given by country/province/name:
api_root = "https://api.zippopotam.us"
place = (country = "ca", province = "qc", name = "rimouski")
endpoint = "$(api_root)/$(place.country)/$(place.province)/$(place.name)"
"https://api.zippopotam.us/ca/qc/rimouski"
Now that we have this endpoint set up, we can perform a GET request. GET is one of the many HTTP verbs, and asks for the content of a resource for our consumption. Other commonly used verbs include POST to upload formatted data, PATCH to edit data on the API, and DELETE to remove a resource.
res = HTTP.get(endpoint)
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Tue, 11 Jun 2024 22:56:13 GMT
Content-Type: application/json
Transfer-Encoding: chunked
Connection: keep-alive
x-cache: hit
charset: UTF-8
vary: Accept-Encoding
CF-Cache-Status: DYNAMIC
Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=e03Lul3DLOGtY3BLT45BlGGkgUkBP1%2FEDIT99EvU8r8Rp8lVw%2BndTuRccblsDu%2BcUl49A0%2BZhbKISF6KvAvE3EZIliS3jPqE1%2FH0xO2swAzBLgy8i%2BcHKTXH%2BQ1FPl0ys5lNMg%3D%3D"}],"group":"cf-nel","max_age":604800}
NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
access-control-allow-origin: *
Server: cloudflare
CF-RAY: 89253173afd67bc5-LAX
Content-Encoding: gzip
alt-svc: h3=":443"; ma=86400
{"country abbreviation": "CA", "places": [{"place name": "Rimouski Central", "longitude": "-68.5232", "post code": "G5L", "latitude": "48.4525"}, {"place name": "Rimouski Northeast", "longitude": "-68.4973", "post code": "G5M", "latitude": "48.4547"}, {"place name": "Rimouski Southwest", "longitude": "-68.5122", "post code": "G5N", "latitude": "48.4277"}], "country": "Canada", "place name": "Rimouski Central", "state": "Quebec", "state abbreviation": "QC"}"""
Because res is a new type of object, we can take a look at its fields:
typeof(res)
HTTP.Messages.Response
fieldnames(typeof(res))
(:version, :status, :headers, :body, :request)
The first thing we want to check is the response status, specifically that it is equal to 200:
isequal(200)(res.status)
true
We can also inspect the headers:
res.headers
15-element Vector{Pair{SubString{String}, SubString{String}}}:
"Date" => "Tue, 11 Jun 2024 22:56:13 GMT"
"Content-Type" => "application/json"
"Transfer-Encoding" => "chunked"
"Connection" => "keep-alive"
"x-cache" => "hit"
"charset" => "UTF-8"
"vary" => "Accept-Encoding"
"CF-Cache-Status" => "DYNAMIC"
"Report-To" => "{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=e03Lul3DLOGtY3BLT45BlGGkgUkBP1%2FEDIT99EvU8r8Rp8lVw%2BndTuRccblsDu%2BcUl49A0%2BZhbKISF6KvAvE3EZIliS3jPqE1%2FH0xO2swAzBLgy8i%2BcHKTXH%2BQ1FPl0ys5lNMg%3D%3D\"}],\"group\":\"cf-nel\",\"max_age\":604800}"
"NEL" => "{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}"
"access-control-allow-origin" => "*"
"Server" => "cloudflare"
"CF-RAY" => "89253173afd67bc5-LAX"
"Content-Encoding" => "gzip"
"alt-svc" => "h3=\":443\"; ma=86400"
This stores a lot of interesting information, but as a vector of pairs. We can make our life significantly easier by turning this into a dictionary. For example, we can confirm that the API is indeed giving us data in the application/json format:
Dict(res.headers)["Content-Type"]
"application/json"
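With this information, we can get the actual content of our request, which is stored in the body field.
The body field is emptied when it is converted to a String, because the String constructor takes ownership of the byte vector instead of copying it. This is a strange quirk, but it avoids copying large payloads. The safest way to handle the body (because we do not want to lose our response!) is to store it in a variable, and to convert a copy of it whenever we need to look at its content.
body = res.body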
typeof(body)
Vector{UInt8} (alias for Array{UInt8, 1})
This is unexpected! We were promised an application/json content, and here we are with a long array of unsigned 8-bit integers. Why? In a nutshell: there is no reason to expect that we will be querying text. We can use HTTP to request sound, images, videos, or even streaming data. And so what we get is the raw output. Thankfully, we can transform it into a string:
String(copy(body))
"{\"country abbreviation\": \"CA\", \"places\": [{\"place name\": \"Rimouski Central\", \"longitude\": \"-68.5232\", \"post code\": \"G5L\", \"latitude\": \"48.4525\"}, {\"place name\": \"Rimouski Northeast\", \"longitude\": \"-68.4973\", \"post code\": \"G5M\", \"latitude\": \"48.4547\"}, {\"place name\": \"Rimouski Southwest\", \"longitude\": \"-68.5122\", \"post code\": \"G5N\", \"latitude\": \"48.4277\"}], \"country\": \"Canada\", \"place name\": \"Rimouski Central\", \"state\": \"Quebec\", \"state abbreviation\": \"QC\"}"
We use copy here because if we convert body directly, it will be emptied. The recommended design pattern when dealing with HTTP responses is to process the body field in one go, to avoid losing this information.
We are now a step away from having our JSON object:
riki = JSON.parse(String(body))
Dict{String, Any} with 6 entries:
"state abbreviation" => "QC"
"country abbreviation" => "CA"
"place name" => "Rimouski Central"
"places" => Any[Dict{String, Any}("post code"=>"G5L", "latitude"=>"48.4525", "longitude"=>"-68.5232", "place name"=>"Rimouski Central"), Dict{String, Any}("post code"=>"G5M", "latitude"=>"48.4547", "longitude"=>"-68.4973", "place name"=>"Rimouski Northeast"), Dict{String, Any}("post code"=>"G5N", "latitude"=>"48.4277", "longitude"=>"-68.5122", "place name"=>"Rimouski Southwest")]
"country" => "Canada"
"state" => "Quebec"
We have converted body to a String already, and so it is now empty. This module has a lot of warnings. Welcome to working with remote data.
The output we get is now our standard JSON object, so we can do a little thing like:
for place in riki["places"]
@info "The post code for $(place["place name"]) is $(place["post code"])"
end
[ Info: The post code for Rimouski Central is G5L
[ Info: The post code for Rimouski Northeast is G5M
[ Info: The post code for Rimouski Southwest is G5N
Most APIs we use in practice for research are a lot more data-rich, and can have highly structured fields. When this is the case, it is a good idea to take the output and represent it as a custom type: an example of this approach can be found in, e.g., the GBIF package for biodiversity data retrieval.
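As a minimal sketch of this idea, and not the design used by the GBIF package, the places returned by the API could be stored in a small struct. PostalPlace and its field names are hypothetical; the dictionary is copied by hand from the output above, so that the example runs without a network connection. Note that the API returns the coordinates as strings, so they are parsed into floating point numbers.

```julia
# Hypothetical type to represent one entry of the "places" array
struct PostalPlace
    name::String
    code::String
    latitude::Float64
    longitude::Float64
end

# Constructor from one of the dictionaries produced by JSON.parse
function PostalPlace(entry::AbstractDict)
    return PostalPlace(
        entry["place name"],
        entry["post code"],
        parse(Float64, entry["latitude"]),  # coordinates are stored as strings
        parse(Float64, entry["longitude"]),
    )
end

# One entry, copied from the parsed response
entry = Dict{String,Any}(
    "place name" => "Rimouski Central",
    "post code" => "G5L",
    "latitude" => "48.4525",
    "longitude" => "-68.5232",
)

central = PostalPlace(entry)
central.code     # "G5L"
central.latitude # 48.4525
```

With this constructor in place, a comprehension such as [PostalPlace(p) for p in riki["places"]] should give a vector of typed places, and a missing or misspelled key would fail early with a KeyError.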