Data exploration on linked COVID-19 datasets

As efforts to fight the ongoing pandemic of COVID-19 caused by SARS-CoV-2 are ramping up across the world, more and more authoritative and high-quality datasets are becoming available for research and analysis.

Official COVID-19 datasets are published by governments in a variety of formats and usually do not share a common structure. Aggregating them is essential for getting a unified, global view of the pandemic.

When published as linked data in the RDF format, datasets "automatically" become part of the global data graph that connects all linked data sources. The interconnected data can then be viewed and analysed as a single dataset, which is key to revealing new information and generating new insights.

A number of linked datasets for COVID-19 cover various aspects of the disease and the pandemic, and they can be jointly queried using SPARQL.

The data, tools, and sample queries

A popular source of linked data, including data related to the pandemic, is Wikidata. Wikidata is the central storage for the structured data used by Wikipedia and other Wikimedia projects. It can be queried interactively through the Wikidata Query Service, which also offers a SPARQL endpoint for tapping into the data remotely and joining it with local RDF datasets.

The following query returns the number of COVID-19 cases recorded globally over time and is a good starting point for exploring the COVID-19 data stored in Wikidata:

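# The wd:, p:, ps: and pq: prefixes are predefined in the Wikidata Query Service.
SELECT ?date ?cases
WHERE {
  # Q81068910 = COVID-19 pandemic, P1603 = number of cases
  wd:Q81068910 p:P1603 ?casesNode .
  ?casesNode ps:P1603 ?cases ;
             pq:P585 ?date .   # P585 = point in time
}
ORDER BY ASC(?date)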

Results (truncated):

date        cases
20/01/2020  282
21/01/2020  314
...         ...
01/08/2020  17,396,943
02/08/2020  17,660,523
...         ...
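
The same query can also be sent to the endpoint programmatically. The following is a minimal sketch using only the Python standard library, assuming the endpoint accepts the query as a `query` URL parameter and returns the standard SPARQL JSON results format. The function names and the User-Agent string are illustrative and not part of any Wikidata tooling.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

GLOBAL_CASES_QUERY = """
SELECT ?date ?cases
WHERE {
  wd:Q81068910 p:P1603 ?casesNode .
  ?casesNode ps:P1603 ?cases ;
             pq:P585 ?date .
}
ORDER BY ASC(?date)
"""


def build_request(query, endpoint=WIKIDATA_ENDPOINT):
    """Build a GET request asking the endpoint for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical User-Agent; replace it with your own contact details.
        "User-Agent": "covid-linked-data-example/0.1 (example@example.org)",
    }
    return urllib.request.Request(url, headers=headers)


def parse_bindings(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


def run_query(query, endpoint=WIKIDATA_ENDPOINT):
    """Send the query over HTTP and return the flattened rows (needs network)."""
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        return parse_bindings(json.load(response))
```

Calling `run_query(GLOBAL_CASES_QUERY)` should return one dictionary per row with the keys `date` and `cases`. All values arrive as strings and need converting before any arithmetic.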

Another useful resource is the RDFised version of The New York Times' COVID-19 dataset provided by Stardog. This dataset contains the cumulative counts of coronavirus cases in the United States at the state and county level. The RDF dataset is quite extensive (over 2 million triples) and is available in Stardog Studio.

For example, the dataset represents the fact that, as of 1 August 2020, 190,693 cases had been recorded in Los Angeles County, California (FIPS code 06-037) as follows:

@prefix : <http://api.stardog.com/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:uuid:648d5c0b-89b9-42f6-b06b-b65109d25a3e>
    a :Report ;
    :date "2020-08-01"^^xsd:date ;
    :county :CountyLos%20Angeles-California ;
    :cases "190693"^^xsd:integer .

:CountyLos%20Angeles-California
    a :County ;
    rdfs:label "Los Angeles, California" ;
    :state :California ;
    :fips "06037" .

:California
    a :State ;
    rdfs:label "California" .

To get the number of cases in the county over time, this query can be used:

PREFIX : <http://api.stardog.com/>

SELECT ?date ?cases
WHERE {
  ?report a :Report ;
          :date ?date ;
          :county ?county ;
          :cases ?cases .
  # Los Angeles County, California
  ?county :fips "06037" .
}
ORDER BY ASC(?date)

Results (truncated):

date        cases
26/01/2020  1
27/01/2020  1
...         ...
01/08/2020  190,693
02/08/2020  192,167
...         ...

Because it is a linked dataset, it can be joined with the data from Wikidata by means of SPARQL federation. For example, the population of each county stored in Wikidata can be joined with the case statistics from The New York Times' dataset on the shared FIPS code, which allows the occurrence (cases per capita) to be calculated:

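PREFIX : <http://api.stardog.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?countyLabel ?cases ?population (?cases / ?population AS ?occurrence)
WHERE {
  # Local data: case counts for California counties on 1 August 2020
  ?report a :Report ;
          :date "2020-08-01"^^xsd:date ;
          :county ?county ;
          :cases ?cases .
  ?county rdfs:label ?countyLabel ;
          :state :California ;
          :fips ?fips .
  {
    # Remote data: FIPS code (P882) and population (P1082) from Wikidata
    SELECT ?fips ?population
    WHERE {
      SERVICE <https://query.wikidata.org/sparql> {
        ?countyWd wdt:P882 ?fips ;
                  wdt:P1082 ?population .
      }
    }
  }
}
ORDER BY DESC(?occurrence)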

Results (truncated):

countyLabel           cases   population  occurrence
Imperial, California  9,409   181,215     0.051922
Kings, California     4,380   152,940     0.028639
Kern, California      20,061  900,202     0.022285
Tulare, California    9,454   466,195     0.020279
Lassen, California    598     30,573      0.019560
...                   ...     ...         ...
Trinity, California   6       12,285      0.000488
Sierra, California    1       3,005       0.000333
Modoc, California     2       8,841       0.000226

The results show that, as of 1 August 2020, Imperial, Kings, and Kern had the highest numbers of recorded COVID-19 cases per capita among the 58 California counties.
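
The join and the per-capita calculation performed by the federated query can be sketched in plain Python. This is only an illustration of the logic, not of how a SPARQL engine evaluates federation. The case and population figures are copied from the first three result rows, the FIPS codes for the three counties are filled in here for illustration, and the function name is hypothetical.

```python
# Case counts as of 1 August 2020, keyed by FIPS code (figures from the results).
# The FIPS codes are supplied for illustration; they are not shown in the results.
cases_by_fips = {
    "06025": ("Imperial, California", 9409),
    "06031": ("Kings, California", 4380),
    "06029": ("Kern, California", 20061),
}

# Population figures as returned from Wikidata, keyed by the same FIPS codes.
population_by_fips = {
    "06025": 181215,
    "06031": 152940,
    "06029": 900202,
}


def join_and_rank(cases, populations):
    """Join the two mappings on FIPS code and rank counties by cases per capita."""
    rows = []
    for fips, (label, count) in cases.items():
        population = populations.get(fips)
        if not population:
            # Mirror the inner join of the query: skip counties with no match.
            continue
        rows.append((label, count, population, count / population))
    # Highest occurrence first, as in ORDER BY DESC(?occurrence).
    return sorted(rows, key=lambda row: row[3], reverse=True)


ranked = join_and_rank(cases_by_fips, population_by_fips)
for label, count, population, occurrence in ranked:
    print(f"{label}: {occurrence:.6f}")
```

Counties without a matching population entry are dropped, which corresponds to the behaviour of the query when Wikidata has no FIPS code or population recorded for a county.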

Data on COVID-19 in New Zealand

No linked RDF datasets that cover the COVID-19 pandemic in New Zealand in detail appear to be available online. The existing data can, however, be RDFised and interlinked with other datasets by using Wikidata identifiers (URIs) to specify regions and district health boards. This opens up rich possibilities for reliably joining and augmenting the data with geographic, demographic, relief, and mobility data from Wikidata and other data providers. The same can be achieved by minting local URIs and consistently linking them to Wikidata or other external URIs with owl:sameAs, instead of using the external URIs directly.

As an example, the fact that Auckland District Health Board (wd:Q24189683) recorded 3 new confirmed COVID-19 cases on 24 August 2020 can be represented in RDF as follows:

@prefix : <http://example.org/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:r0
    a :Report ;
    :date "2020-08-24"^^xsd:date ;
    :dhb wd:Q24189683 ;
    :newCases "3"^^xsd:integer .
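
RDFising tabular case data in this shape can be as simple as string templating. The sketch below is one possible approach, assuming each record carries a date, a count of new cases, and the Wikidata item identifier of the district health board. The function name and the `PREFIXES` constant are hypothetical, and the report identifier is assumed to already be a valid Turtle local name.

```python
from datetime import date

PREFIXES = (
    "@prefix : <http://example.org/> .\n"
    "@prefix wd: <http://www.wikidata.org/entity/> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def report_to_turtle(report_id, report_date, dhb_qid, new_cases):
    """Serialise one daily report as Turtle, pointing at the DHB's Wikidata item."""
    # Accept only identifiers of the form Q<digits>, such as Q24189683.
    if not (dhb_qid.startswith("Q") and dhb_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item identifier: {dhb_qid!r}")
    # report_id is inserted as-is; it is assumed to be a valid local name.
    return (
        f":{report_id}\n"
        f"    a :Report ;\n"
        f'    :date "{report_date.isoformat()}"^^xsd:date ;\n'
        f"    :dhb wd:{dhb_qid} ;\n"
        f'    :newCases "{int(new_cases)}"^^xsd:integer .\n'
    )


# The Auckland District Health Board record from the example above.
document = PREFIXES + "\n" + report_to_turtle("r0", date(2020, 8, 24), "Q24189683", 3)
print(document)
```

For anything beyond a quick conversion, a dedicated RDF library would be a safer choice than templating, since it takes care of escaping and validation.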

This data can be linked with the facts available in Wikidata, such as the fact that Auckland District Health Board is located in Auckland (wd:Q37100), which had a population of 1,467,800 in 2018.

In conclusion

The more COVID-19 datasets are published as linked data, the more data integration and enrichment techniques become possible. SPARQL's built-in federation capabilities make it easier to query such interlinked datasets, which facilitates comprehensive analysis of the pandemic both in New Zealand and globally.

Made by Anton Vasetenkov.
