We are family – working with Swirrl
If your job is to analyse data, then anything that gets in the way of accessing that data is a pain. And if you get to the data and it’s not interoperable or comparable, it can be frustrating and time consuming.
We’re Swirrl, data publishers who are currently working with the Office for National Statistics (ONS) on the Government Statistical Service Data Project. This project aims to demonstrate and test an approach for official statistics publishers across government to share their data in a more interoperable and reusable way.
It uses the idea of dataset families: considering data in terms of groups of related data collections, rather than in terms of which organisation it comes from, and aims to break down the barriers for users to find, understand and apply data from related datasets.
At the moment, we’re concentrating on data on international trade as the prime example, using data from the ONS, Her Majesty’s Revenue and Customs (HMRC) and Department for International Trade (DIT). We’re planning to extend the work to consider other families, such as migration, health and housing.
The use cases that Swirrl and ONS are focusing on came from working with the consortium of trade data experts and data users. This led to a list of “jobs to be done” and an initial idea of the challenges which make those jobs difficult, including: identifying and choosing the right data for a particular purpose, working out exactly what the data means and whether different datasets are comparable, getting that data into the right analysis tools, and using it to help understand and explain complex and important questions.
The family analogy turns out to be a very useful way of thinking about this data: what is it about different datasets that makes them part of a dataset family? What can we do when we share data to make those relationships between datasets explicit? And by strengthening the relationships, how do we make the data more useful?
Swirrl and ONS are investigating how to bring datasets closer together, by harmonising the structure of the data and the vocabularies of terms and identifiers used. At the core of that approach is a set of international open standards and best practices for data on the web, developed by the World Wide Web Consortium. Numerous established standards are relevant to the work: the Linked Data approach, the RDF Data Cube Vocabulary for representing statistical data on the web, and the Data on the Web Best Practices are particularly important.
To make this practical we need some good tools. Building on yet more W3C standards, Swirrl is working on “table2qb” (pronounced “table to cube”), an open source tool-chain that defines simple CSV file templates for preparing source data. Files in these templates are relatively easy for statisticians to produce. table2qb then puts them through automatic transformation and validation processes to create machine-readable, web-ready data in a more interoperable form.
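To give a feel for that kind of pipeline, here is a minimal sketch in Python. It is not table2qb itself and does not use table2qb's real templates: the column names, the example.org URIs and the validation rules are all assumptions made up for illustration. Only the RDF Data Cube terms (qb:Observation and qb:dataSet) come from the W3C vocabulary. The sketch reads a tidy CSV with one row per observation, checks it, and emits N-Triples.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"  # RDF Data Cube namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
BASE = "http://example.org/"  # hypothetical base URI, not a real service

# Hypothetical tidy input: one row per observation, made-up figures.
SAMPLE_CSV = """flow,partner,year,value
exports,FR,2017,120
imports,FR,2017,95
"""

DIMENSIONS = ["flow", "partner", "year"]  # assumed column names


def validate(rows, required):
    """Return a list of problems found; an empty list means the rows passed."""
    problems = []
    for n, row in enumerate(rows, start=1):
        for col in required:
            if not (row.get(col) or "").strip():
                problems.append(f"row {n}: missing {col}")
        value = (row.get("value") or "").strip()
        if value and not value.lstrip("-").replace(".", "", 1).isdigit():
            problems.append(f"row {n}: value is not numeric")
    return problems


def to_ntriples(rows, dataset="trade"):
    """Turn each row into a qb:Observation described by N-Triples lines."""
    lines = []
    for row in rows:
        key = "/".join(row[d] for d in DIMENSIONS)
        obs = f"<{BASE}data/{dataset}/{key}>"
        lines.append(f"{obs} <{RDF_TYPE}> <{QB}Observation> .")
        lines.append(f"{obs} <{QB}dataSet> <{BASE}data/{dataset}> .")
        for d in DIMENSIONS:
            # Dimension values become URIs so other datasets can share them.
            lines.append(f"{obs} <{BASE}dimension/{d}> <{BASE}code/{d}/{row[d]}> .")
        lines.append(f'{obs} <{BASE}measure/value> "{row["value"]}"^^<{XSD_DECIMAL}> .')
    return lines


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = validate(rows, DIMENSIONS + ["value"])
triples = [] if problems else to_ntriples(rows)
```

The point of using URIs for dimension values is that two publishers who agree on the same code for, say, a partner country end up with data that joins without further cleaning.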
Getting the data representation right, and putting some appropriate plumbing in place, is an essential first step. That then gives us the raw materials for building improved ways of accessing the data, to address the challenges around finding and understanding data that came out of our stakeholder group’s “jobs to be done”.
Work is in progress on the first of these data access challenges: improved data search tools. Once the hard work of making data more consistent in its structure and contents is done, search software has an easier task in helping users find data to meet their needs. As well as just looking for datasets by keywords appearing in their name or description, or according to which organisation published them, we can now look inside each dataset and see the details of what the data is about.
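As a rough illustration of searching inside datasets rather than only on their titles, the sketch below filters a small in-memory catalogue by the codes each dataset actually contains. The catalogue entries, publisher labels and dimension names are invented for the example, and the search function is a hypothetical helper, not part of any Swirrl or ONS software.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    title: str
    publisher: str
    # Maps a dimension name to the set of codes that appear in the data.
    dimensions: dict = field(default_factory=dict)


# Made-up catalogue entries, for illustration only.
CATALOGUE = [
    Dataset("Trade in goods by country", "Publisher A",
            {"partner": {"FR", "DE"}, "flow": {"imports", "exports"}}),
    Dataset("Trade in services", "Publisher B",
            {"partner": {"FR", "US"}, "service": {"finance"}}),
    Dataset("Regional exports", "Publisher A",
            {"region": {"Wales"}, "flow": {"exports"}}),
]


def search(catalogue, keyword=None, **codes):
    """Return titles matching an optional keyword and every requested code."""
    hits = []
    for ds in catalogue:
        if keyword and keyword.lower() not in ds.title.lower():
            continue
        # A dataset matches only if it contains each requested code
        # in the named dimension.
        if all(code in ds.dimensions.get(dim, set()) for dim, code in codes.items()):
            hits.append(ds.title)
    return hits


france = search(CATALOGUE, partner="FR")
trade_exports = search(CATALOGUE, keyword="trade", flow="exports")
```

Here `search(CATALOGUE, partner="FR")` picks out the two datasets that contain data about France, even though neither mentions France in its title.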
Building on the dataset family idea, if you have a dataset you know is relevant, we are also thinking about how to identify other datasets closely related to it.
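One simple way to think about relatedness, offered only as a sketch and not as the method the project will adopt, is to compare the dimensions two datasets share. The example below scores pairs with Jaccard similarity (shared dimensions divided by all dimensions across both). The dataset names and dimension sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical datasets and the dimensions each one uses.
DATASET_DIMENSIONS = {
    "goods-by-country": {"partner", "flow", "year", "commodity"},
    "services-by-country": {"partner", "flow", "year", "service"},
    "regional-exports": {"region", "year", "commodity"},
}


def related(name, catalogue, top=2):
    """Rank the other datasets by how many dimensions they share with `name`."""
    scores = [
        (other, jaccard(catalogue[name], dims))
        for other, dims in catalogue.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top]


ranking = related("goods-by-country", DATASET_DIMENSIONS)
```

A measure like this only works once dimensions are named consistently across publishers, which is exactly what harmonising vocabularies is meant to achieve.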
This process has been sped up by the ability to use Swirrl’s existing PublishMyData software as a place to manage the data and deliver it through a range of user interfaces and APIs (Application Programming Interfaces). PublishMyData is already designed for working with statistical data in this format, so a lot of the essentials are already in place. It’s already in operational use by the Office for National Statistics, NHS, Ministry of Housing, Communities and Local Government and the Scottish Government.
Once you have well-structured data on the web, and you’ve found some datasets that relate to your question, the next step is to do something with them. Better ways of flexibly selecting and extracting data to use in visualisation or analysis tools are lined up for later in the alpha stage of the project.
This is a journey that’s just beginning. The project has already thrown up numerous fascinating new ideas and we’ve had to be disciplined about putting some of those to one side for future investigation while we get the basics working and gather feedback on that. And we’re very aware that technology is only one part of the equation. Collaboration around standards, evolution of working processes in the GSS member organisations, and testing how ideas work in practice will all be essential to making this a success.
If you would like more information about this work, get in touch with Darren Barnes at the ONS. And if you’re interested in this, and other data projects from Swirrl, keep an eye on Swirrl’s Twitter account or subscribe to Swirrl’s newsletter.
Bill Roberts, Chief Executive Officer, Swirrl