The GSS Data Project and table2qb: a data transformation tool
a blog from Swirrl
Data science courses on the GSS website
The data science courses advertised on the GSS website
Users have difficulty in finding and using GSS data because:
Connected Open Government Statistics (COGS) – previously called the GSS Data Project – aims to fix this problem by:
This Linked Data is better for users because:
Government statistics are not produced or managed by a single entity. We operate a federated system under the common banner of the Government Statistical Service. All central government departments, plus arm's-length bodies and the devolved administrations, are part of this union, and between them they publish around 25,000 tables of statistical data each year. We distribute this huge number of outputs across the internet on many portals and websites.
This disparate landscape has been a problem for our users.
Making so much data available in the open, free to download and use (and reuse) is the right thing. But our users are struggling to get to grips with this wealth of data. Some of the problems are:
How do we overcome these barriers and challenges to accessing and using the data we produce?
A project can’t achieve this in isolation. We are all responsible, including our users, for the co-creation of this new vision for the GSS.
Connected Open Government Statistics (COGS) – previously called the GSS Data Project – is developing the tools that can help to support this. The heavy lifting equipment we are building will reduce the burden on departments. We are working in partnership with centralised teams that are already supporting the betterment of the GSS. We are also collaborating with the growing communities of practice, such as Reproducible Analytical Pipeline (RAP) champions and presentation champions, that are already working to support better data.
This project is about being that catalyst for change and running towards a better way.
To increase the impact and reach of GSS statistical data by improving how we organise, integrate and deliver GSS data online.
We need to re-imagine our current publication model. Spreadsheets have played the lead role for many years. They will continue to play a part but this should be a supporting role and not the primary channel as we expand our distribution routes.
To support this, our focus needs to be on how to make it easier for publishers of official statistics to share their data in ways that offer the greatest use to users.
This means moving to a place where we are producing data that is web-ready, not merely ready to sit on the web. COGS is building the blueprint that will enable us to exchange data and metadata over the web and make better use of all the different information we offer.
This will make our data more findable and useful for our users. This is the pain relief our users want.
To pioneer a better way for users to connect with our data, we need to understand the current journey taken by our users to find and reuse our data and the pain points they face along the way.
The user journey is simple. It starts with awareness that there is data available and the ability to find it. Users then want just enough information to judge whether the data is what they need before they make a selection. Once they have found the right source, they need to be able to work with it to complete their aim.
We have not catered for this simple user journey. Instead we have created a disparate and confusing landscape for our users. For example, housing data is published by a range of official statistics producers, in a range of different formats with various codes and labels. The myriad of possible entry points to this data confuses our users. Which data? Which website? Which organisation?
Users don’t care about departmental boundaries or services. They want our data. COGS is about creating a system that pulls the threads of all this related data together into related data groupings.
The approach is not as simple as just putting all the spreadsheets on one website. We need to make use of standards for data and metadata on the web. We must also do a better job of harmonising the statistics when possible and explaining the differences when harmonisation is not possible. This is the biggest challenge we face given the differences across our statistical system.
The hard bit of the project is taking the data we have and making it ready for use on the web.
Spreadsheets lock the data into a single presentation, yet that presentation can be whatever each publisher wants it to be.
This leads to organisations using different structures, deriving bespoke standards and formatting inconsistently. The documentation on how our reference data are defined and encoded is nowhere near the data. This makes the documentation hard for users to uncover and, even when found, it is incomprehensible to many of them.
That is at the heart of why some statistical data is hard for people to use: it’s not always clear what the data means and whether data from different sources can be compared.
This is where our concept of “dataset families” comes into play.
Dataset families aim to link groups of related data into collections for users to discover and navigate around. It requires no previous or explicit knowledge of what organisation the data comes from.
Users are interested in getting data on a subject or domain. Breaking down the organisational boundaries and offering an interconnected set of related datasets should make it easier for users to find, understand and apply the data to their needs.
By examining and modelling the relationship between the various datasets in that “family” collection it should allow:
The ONS has been working with a company called Swirrl, investigating how to bring datasets closer together as a family, by working on the structure of the data and the vocabulary of terms and identifiers used.
Our first step is to take the spreadsheets that have been published and remodel them. Using tools like Python and Pandas we normalise the data into a simpler format – stripping out all the presentational stuff from the spreadsheets.
The output of this we call “Tidy Data”. This is a machine-ready version of the data we started with. Having data in this form makes it much easier to push the data through the next stage of our pipelines to create the web-ready formats which are “Linked Data”.
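As a rough illustration of this first step, the sketch below reshapes a small spreadsheet-style table into Tidy Data with Pandas. The table, its column names and its figures are invented for the example, and the `to_tidy` helper is hypothetical rather than part of the project's pipeline.

```python
import pandas as pd

# Hypothetical spreadsheet-style table: one row per area, one column per year.
# The figures are made up purely for illustration.
wide = pd.DataFrame(
    {
        "Area": ["England", "Wales"],
        "2016": [100, 10],
        "2017": [110, 12],
    }
)


def to_tidy(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape a wide table so that each row holds exactly one observation."""
    # Turn the year columns into a single "Year" column with matching values.
    tidy = df.melt(id_vars=["Area"], var_name="Year", value_name="Value")
    # Column headers arrive as text, so convert the years to integers.
    tidy["Year"] = tidy["Year"].astype(int)
    return tidy.sort_values(["Area", "Year"]).reset_index(drop=True)


tidy = to_tidy(wide)
print(tidy)
```

Each row of the result describes one observation (an area, a year and a value), which is the shape that later pipeline stages can process without knowing how the original spreadsheet was laid out.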
This step is where the clever stuff happens and adds the real value to the data. We have developed ‘table2qb’ (pronounced ‘table to cube’), which takes the Tidy Data format and integrates it with an enriched metadata file. This produces a web standard format called CSV on the Web (CSVW). The World Wide Web Consortium (W3C) developed this format to support data on the web.
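To give a feel for what CSVW looks like, here is a minimal sketch that builds a CSVW-style metadata description for a tidy CSV file. It is not table2qb's actual output or interface: the `csvw_metadata` helper, the file name `observations.csv` and the column definitions are assumptions made for the example. Only the `@context`, `url`, `tableSchema` and `columns` keys come from the CSVW standard.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a minimal CSVW metadata document for one CSV file.

    `columns` is a list of (name, title, datatype) tuples.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


# Hypothetical file name and columns, matching a tidy table of
# area / year / value observations.
metadata = csvw_metadata(
    "observations.csv",
    [
        ("area", "Area", "string"),
        ("year", "Year", "gYear"),
        ("value", "Value", "integer"),
    ],
)

print(json.dumps(metadata, indent=2))
```

The point of the metadata file is that the CSV itself stays simple, while the description of what each column means travels alongside it in a form that software can read.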
The CSVW provides the ingredients we need to create Linked Data. Linked Data is the mechanism that links all these things together and makes it easier to discover new, related things about the data on the web. For our purposes, Linked Data outputs provide not only an enhanced machine-readable format but also a web-ready one.
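The idea behind Linked Data can be sketched in a few lines: each observation, area and year gets its own web identifier, so that different datasets referring to the same area point at the same thing. The example below writes one tidy row as RDF statements in N-Triples syntax. The `http://example.org/` identifiers, the dimension properties and the `observation_triples` helper are hypothetical; the observation class and the measure property are taken from the RDF Data Cube and SDMX vocabularies, but this is not a description of the project's actual output.

```python
BASE = "http://example.org/"  # hypothetical base URI, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB_OBSERVATION = "http://purl.org/linked-data/cube#Observation"
OBS_VALUE = "http://purl.org/linked-data/sdmx/2009/measure#obsValue"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def slug(text):
    """Make a label safe to use as the last part of a URI."""
    return text.strip().lower().replace(" ", "-")


def observation_triples(dataset, area, year, value):
    """Describe one tidy row as a list of N-Triples statements."""
    obs = f"{BASE}data/{dataset}/{slug(area)}/{year}"
    return [
        # The row is an observation in a data cube.
        f"<{obs}> <{RDF_TYPE}> <{QB_OBSERVATION}> .",
        # Dimensions point at shared identifiers rather than free text.
        f"<{obs}> <{BASE}def/dimension/area> <{BASE}id/area/{slug(area)}> .",
        f"<{obs}> <{BASE}def/dimension/year> <{BASE}id/year/{year}> .",
        # The measured value is a typed literal.
        f'<{obs}> <{OBS_VALUE}> "{value}"^^<{XSD_INTEGER}> .',
    ]


triples = observation_triples("trade", "England", 2016, 100)
print("\n".join(triples))
```

Because the area is an identifier rather than a label in a cell, a second dataset that uses the same identifier is automatically connected to this one.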
Having data that is machine-readable and web-ready means greater use for us working in the Government Statistical Service and our users working outside of government departments:
The project needs to grow from a small-scale feasibility study into a full-blown programme of work. This programme will need time (we are looking at four to five years to deliver the full benefits). The shape and size of the programme are being developed now.
The strands of the programme will be:
The team will need to expand to deliver on these areas. Most of the resource and effort will be two-fold:
An exciting development that will reduce, and eventually end, the reliance on this transformation is the Reproducible Analytical Pipeline (RAP). RAP initiatives build the tools that create Tidy Data from the source data and help support the automation of the outputs.
The good news is that some departments are already designing RAPs to produce their statistics and reproducible analysis is growing across the GSS. The RAP champions page and GangStaS Data RAP blog give further details.
The project is at the ‘proof of concept’ stage. We have focused our attention on two “dataset families” – trade and migration. This has allowed us to work out whether the approach is workable. We’ve also talked to teams within health and social care and teams within housing about the next families.
The project has engaged with many departments. Those most involved so far are:
Office for National Statistics; Department for Business, Energy and Industrial Strategy; Home Office; Department for Exiting the European Union; Ministry of Housing, Communities and Local Government; Scottish Government; Department for Education; Department for Environment, Food and Rural Affairs; Welsh Government; Northern Ireland Statistics and Research Agency; HM Revenue and Customs; Department for Work and Pensions; Department for International Trade; Foreign and Commonwealth Office; HM Treasury; NHS; Department for Health; Public Health England.
The project team is also getting great support from centralised teams and communities already supporting the GSS – the harmonisation team, the classifications teams, the Best Practice and Impact division, the RAP champions and the presentation champions.
Meet our project board:
Julie Stanborough – Sponsor (ONS)
Neil McIvor (Department for Education)
Ian Knowles (Department for Transport)
David Fry (Department for Business, Energy and Industrial Strategy)
Siobhan Carey (Northern Ireland Statistics and Research Agency)
Roger Halliday (Scottish Government)
Glyn Jones (Welsh Government)
Andrea Prophet (HM Revenue and Customs)
Sandra Tudor (Ministry of Housing, Communities and Local Government)
Nicola Tyson (Census)
The GSS Presentation and Dissemination Committee