The GSS Data Project and table2qb: what it is, why it’s useful and how we’re using it
The ONS and Swirrl are working on the GSS Data Project, which aims to demonstrate and test an approach for official statistics publishers across government to share their data in a more interoperable and reusable way.
During the project we have developed a data transformation tool called table2qb. In this post we discuss what that is, why it’s useful in the context of this project and how we’re using it. (It will be released soon as open source, once we have finished a little polishing and documenting.)
What is table2qb?
Making data more interoperable and reusable requires us to convert it from a range of formats and encodings into a standards-based representation, mapping various codes and labels into harmonised classification schemes with consistent identifiers.
That involves some heavy lifting from a data transformation point of view, but rather than needing someone with programming skills every time, we wanted a configurable tool that could be applied to all statistical datasets.
Table2qb takes data in a tabular format and converts it into a statistical ‘data cube’, expressed using the RDF Data Cube Vocabulary. ‘RDF’ stands for Resource Description Framework and is a very flexible way of representing data, well-suited to use on the web and a core part of the Linked Data approach to data publication. The RDF Data Cube Vocabulary is an agreed structure and set of terms specifically for statistical data, and was developed and standardised by a World Wide Web Consortium (W3C) working group.
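To give a feel for what a data cube observation looks like, here is a minimal Python sketch that writes one observation as N-Triples. The `qb:Observation` class and `qb:dataSet` property come from the RDF Data Cube Vocabulary; everything under `http://example.org`, the function name and the dimension names are assumptions made up for illustration, and this is not the output table2qb itself produces.

```python
# Terms from the RDF Data Cube Vocabulary and core RDF/XSD namespaces.
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"

# Hypothetical base URI: all identifiers below it are illustrative only.
BASE = "http://example.org"


def observation_triples(dataset, dimensions, measure, value):
    """Return N-Triples lines describing a single observation."""
    # Build the observation URI from the dataset name and dimension codes.
    slug = "/".join(dimensions.values())
    subject = f"<{BASE}/data/{dataset}/{slug}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{QB}Observation> .",
        f"{subject} <{QB}dataSet> <{BASE}/data/{dataset}> .",
    ]
    # One triple per dimension, pointing at a code in a classification scheme.
    for name, code in dimensions.items():
        triples.append(
            f"{subject} <{BASE}/def/dimension/{name}> "
            f"<{BASE}/def/concept/{name}/{code}> ."
        )
    # The measured value, typed as a decimal literal.
    triples.append(
        f'{subject} <{BASE}/def/measure/{measure}> "{value}"^^<{XSD_DECIMAL}> .'
    )
    return triples


triples = observation_triples(
    "life-expectancy",
    {"area": "W06000022", "sex": "female", "year": "2016"},
    "years",
    "82.3",
)
for line in triples:
    print(line)
```

Each dimension value is a URI rather than a bare label, which is what lets observations from different datasets refer to the same area, period or category.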
No data conversion tool can deal with every conceivable input format, so we came up with a specification for the starting point for table2qb. That is based on the ‘tidy data’ approach: a simple table where each row represents an observation and each column represents a variable. It is particularly popular and well-supported in the community of R users but is easy to produce with a wide range of tools readily available to statisticians and analysts. Most people working day to day with statistical data are able to generate data in this structure. Table2qb then takes input files for the observations and for any associated classification schemes, puts them through the wringer and comes out with Linked Data.
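As an illustration of the tidy data idea, the following Python sketch unpivots a typical ‘wide’ table (one column per year) into one row per observation. The column names and figures are invented for the example and are not the input specification for table2qb.

```python
import csv
import io

# Hypothetical wide-format table: one column per year (made-up figures).
WIDE_CSV = """area,sex,2015,2016
W06000022,Male,78.1,78.4
W06000022,Female,82.0,82.3
"""


def to_tidy(wide_text, id_columns, variable_name, value_name):
    """Unpivot a wide table so that each row holds exactly one observation."""
    reader = csv.DictReader(io.StringIO(wide_text))
    tidy_rows = []
    for row in reader:
        for column, value in row.items():
            if column in id_columns:
                continue  # identifying columns are copied, not unpivoted
            tidy_row = {key: row[key] for key in id_columns}
            tidy_row[variable_name] = column  # e.g. the year
            tidy_row[value_name] = value      # the observed value
            tidy_rows.append(tidy_row)
    return tidy_rows


tidy = to_tidy(WIDE_CSV, ["area", "sex"], "year", "value")
for row in tidy:
    print(row)
```

The two wide rows become four tidy rows, each with an area, a sex, a year and a value.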
Standards and more standards
The industrious members of the W3C produced another standard that is highly relevant to this task: the catchily titled “Model for Tabular Data and Metadata on the Web” (edited by Jeni Tennison, now CEO of the Open Data Institute). Part of this work was a generic process for generating RDF from tabular data, ‘CSV2RDF’ for short.
That is exactly what we are doing here, so we wanted to make use of this. CSV2RDF takes a CSV file, and an associated file of JSON metadata that describes its structure, and uses that to generate data in RDF format. All the cleverness in that process is tied up in the JSON metadata: table2qb is essentially a pre-processor for CSV2RDF, specific to the needs of statistical datasets, that automatically generates the metadata in the right structure.
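To make the role of the JSON metadata more concrete, here is a rough Python sketch that builds a small metadata document for a tidy CSV file. The property names (`url`, `tableSchema`, `columns`, `aboutUrl`, `propertyUrl`, `valueUrl`, `datatype`) are taken from the W3C tabular metadata vocabulary; the `build_metadata` function, the `http://example.org` URIs and the column names are assumptions for illustration, and the metadata table2qb really generates will be more detailed.

```python
import json

# Hypothetical base URI for every identifier generated below.
BASE = "http://example.org"


def build_metadata(csv_url, dataset_slug, dimensions, measure):
    """Build CSV-on-the-web style metadata describing a tidy observations file."""
    columns = []
    for name in dimensions:
        columns.append({
            "name": name,
            "titles": name,
            "datatype": "string",
            # The predicate used for this column's values.
            "propertyUrl": f"{BASE}/def/dimension/{name}",
            # URI template: the cell value is substituted for {name}.
            "valueUrl": f"{BASE}/def/concept/{name}/{{{name}}}",
        })
    columns.append({
        "name": measure,
        "titles": measure,
        "datatype": "decimal",
        "propertyUrl": f"{BASE}/def/measure/{measure}",
    })
    # Each row's subject URI is built from its dimension values.
    path = "/".join(f"{{{name}}}" for name in dimensions)
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": columns,
            "aboutUrl": f"{BASE}/data/{dataset_slug}/{path}",
        },
    }


metadata = build_metadata(
    "observations.csv", "life-expectancy", ["area", "sex", "year"], "value"
)
print(json.dumps(metadata, indent=2))
```

The URI templates in `aboutUrl` and `valueUrl` are where cell values get turned into consistent identifiers, which is why generating them automatically saves so much manual effort.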
A bonus of taking this approach is that the CSV + JSON metadata representation of the statistical dataset is in itself a useful output. Some people might want to use it like that, without going through the further step of converting to RDF.
In the GSS Data Project we had some quite large datasets to work with, and we were building on some existing software infrastructure (the open source Grafter library for processing Linked Data and Swirrl’s PublishMyData software platform for Linked Data publishing). While there were a couple of pre-existing implementations of the CSV2RDF process, it made sense for us to develop a new implementation of CSV2RDF that matched the rest of our tools and met our fairly stringent performance criteria. That is available under an open source licence for anyone to use.
This diagram summarises the steps in the process:
[Diagram: How table2qb transforms Tidy Data into RDF]
The hard bit
All of the above took a lot of work, but the really hard bit of this process is not so much the tools for data transformation but the choice of how you organise that data to start with.
All statisticians are familiar with classification schemes: the way that the lowest level data is divided up and aggregated to suit the needs of statistical publications. Unfortunately, in many cases different organisations use different schemes, and the documentation on how such classifications are defined and encoded is not always easy to get hold of.
That is at the heart of why some statistical data is hard for people to use: it’s not always clear what the data means and whether or how data from different sources can be compared. Our aim in the GSS Data Project has been to investigate the idea of ‘dataset families’ where the relationships between various datasets on a topic have been examined, with a view to promoting their effective use.
Greater harmonisation and re-use of classification schemes in statistical data is a challenge and will require changes in working processes and perhaps also in mindset. It’s a challenge that the whole GSS will need to collaborate on. Table2qb doesn’t solve that problem but it provides a simple and effective tool to work with statistical data and statistical classifications and we think it’s an important building block for the future.