The GSS Data Project
Users have difficulty in finding and using GSS data because:
- metadata (the information about data e.g. definitions, caveats, information about how the data was collected) is often kept separately from the data (e.g. in a methodology document)
- there are lots of different departments producing data on a similar topic but there is no easy way to bring this all together.
The GSS Data Project aims to fix this problem by:
- finding all the spreadsheets on a similar topic and bringing them together into “dataset families”
- putting these datasets into “Tidy Data” format by stripping out all the presentational stuff
- finding the metadata
- putting the Tidy Data and the metadata together and feeding it through a pipeline in order to get something called “Linked Data”
Why Linked Data is better
This Linked Data is better for users because:
- it makes data easier to find (search engines can work with Linked Data better than data in spreadsheets)
- users no longer have to navigate to different websites and different spreadsheets to find what they need
- search engines like Google can “scrape” the data to answer questions people type in
- technical users find it easier to build new tools with Linked Data
- Linked Data allows online tools to be automated (e.g. dashboards can pull through the most up to date data automatically).
Government statistics are not produced or managed by a single entity. We operate a federated system under the common banner of the GSS. A number of departments are in this union and between them they publish around 25,000 tables of statistical data each year. We distribute this huge number of outputs across the internet on many portals and websites.
This disparate landscape has been a problem for our users.
Making so much data available in the open, free to download and use (and reuse) is the right thing. But our users are struggling to get to grips with this wealth of data. Some of the problems are:
- low awareness of what is available: Where do I start? What do I look for? What is the data called?
- users are not able to find the data they need: Where do I look? Who does the data? What website?
- output formats are impenetrable to a lot of users: Where is the data I need? What does it mean?
How do we overcome these barriers and challenges to accessing and using the data we produce?
A project can’t achieve this in isolation. We are all responsible, even our users, for the co-creation of this new vision for the GSS.
The GSS Data project is developing the tools that can help with this. The heavy lifting equipment we are building will reduce the burden on departments. We are working in partnership with centralised teams that are already supporting the betterment of the GSS. We are collaborating with the growing communities of practice, such as Reproducible Analytical Pipeline (RAP) champions and presentation champions, that are already working to support better data.
This project is about being that catalyst for change and running towards a better way.
The aim – what do we need to do?
To increase the impact and reach of GSS statistical data by improving how we organise, integrate and deliver GSS data online.
We need to re-imagine our current publication model. Spreadsheets have played the lead role for many years. They will continue to play a part but this should be a supporting role and not the primary channel as we expand our distribution routes.
To support this, our focus needs to be on how to make it easier for publishers of official statistics to share their data in ways that offer the greatest use to users.
This means moving to a place where we are producing data that is web-ready, not merely ready to sit on the web. The GSS Data project is building the blueprint that will enable us to exchange data and metadata over the web and make better use of all the different information we offer.
This will make our data more findable and useful for our users. This is the pain relief our users want.
Understanding user relationships with our data
To pioneer a better way for users to connect with our data, we need to understand the current journey taken by our users to find and reuse our data and the pain points they face along the way.
The user journey is simple. Users first need to be aware that data is available and be able to find it. They then want just enough information to judge whether the data is what they need before they make any selection. Once they have found the right source, they need to be able to work with it to complete their aim.
We have not catered for this simple user journey. Instead we have created a disparate and confusing landscape for our users. For example, housing data is published by a range of official statistics producers, in a range of different formats with various codes and labels. The myriad of possible entry points to this data confuses our users. Which data? Which website? Which organisation?
Users don’t care about departmental boundaries or services. They want our data. The GSS data project aims to create a system that pulls the threads of all this related data together into related data groupings.
The approach is not as simple as just putting all the spreadsheets on one website. We need to make use of standards for data and metadata on the web. We must also do a better job of harmonising the statistics when possible and explain the differences when harmonisation is not possible. This is the biggest challenge we face given the differences across our statistical system.
Organising the data
The hard bit of the project is taking the data we have and making it ready for use on the web.
Spreadsheets lock the data into a single format. The formats can be whatever we want them to be.
This leads to organisations using different structures, bespoke standards and inconsistent formatting. The documentation on how our reference data are defined and encoded is kept nowhere near the data. This makes the documentation hard for users to uncover and, even when found, it is incomprehensible to many of them.
That is at the heart of why some statistical data is hard for people to use: it’s not always clear what the data means and whether data from different sources can be compared.
This is where our concept of “dataset families” comes into play.
Dataset families aim to link groups of related data into collections for users to discover and navigate around. It requires no previous or explicit knowledge of what organisation the data comes from.
Users are interested in getting data on a subject or domain. Breaking down the organisational boundaries and offering an interconnected set of related datasets should make it easier for users to find, understand and apply the data to their needs.
Examining and modelling the relationships between the various datasets in a “family” collection should allow:
- better discovery of GSS statistical data
- easier interaction with and connection to relevant data no matter where it came from
- consistent and understandable descriptions and definitions
- data and metadata that can be reused across a wide range of options
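As a rough illustration of the grouping idea, the sketch below collects catalogue entries by topic rather than by publisher. The entries, field names and the `group_into_families` helper are all hypothetical and are not taken from the project's own tooling.

```python
from collections import defaultdict

# Hypothetical catalogue entries: titles and field names are illustrative only.
catalogue = [
    {"title": "Regional trade (illustrative)", "publisher": "HM Revenue and Customs", "topic": "trade"},
    {"title": "Trade in goods (illustrative)", "publisher": "Office for National Statistics", "topic": "trade"},
    {"title": "Visa statistics (illustrative)", "publisher": "Home Office", "topic": "migration"},
    {"title": "Migration estimates (illustrative)", "publisher": "Office for National Statistics", "topic": "migration"},
]


def group_into_families(entries):
    """Group dataset entries by topic, ignoring which organisation published them."""
    families = defaultdict(list)
    for entry in entries:
        families[entry["topic"]].append(entry)
    return dict(families)


families = group_into_families(catalogue)
for topic, datasets in sorted(families.items()):
    publishers = sorted({d["publisher"] for d in datasets})
    print(topic, "->", len(datasets), "datasets from", publishers)
```

A user browsing the "trade" family in this sketch sees datasets from more than one publisher without needing to know in advance which organisation produced them.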
Integrating the data
The ONS has been working with a company called Swirrl, investigating how to bring datasets closer together as a family, by working on the structure of the data and the vocabulary of terms and identifiers used.
Our first step is to take the spreadsheets that have been published and remodel them. Using tools like Python and Pandas we normalise the data into a simpler format – stripping out all the presentational stuff from the spreadsheets.
The output of this we call “Tidy Data”. This is a machine-ready version of the data we started with. Having data in this form makes it much easier to push the data through the next stage of our pipelines to create the web-ready formats which are “Linked Data”.
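A minimal sketch of this kind of reshaping, assuming a small spreadsheet-style table with one column per year. The table, column names and values are made up for illustration, and the project's real transformations are likely to be more involved.

```python
import pandas as pd

# Hypothetical spreadsheet-style layout: one row per region, one column per year.
wide = pd.DataFrame(
    {
        "Region": ["North East", "London"],
        "2017": [10.5, 42.0],
        "2018": [11.0, 43.5],
    }
)

# Reshape so that each row holds exactly one observation.
tidy = wide.melt(id_vars="Region", var_name="Year", value_name="Value")
tidy["Year"] = tidy["Year"].astype(int)
tidy = tidy.sort_values(["Region", "Year"]).reset_index(drop=True)

print(tidy)
```

In the reshaped table every row is a single observation (region, year, value), which is the property that makes the data straightforward to feed into later pipeline stages.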
This step is where the clever stuff happens and adds the real value to the data. We have developed ‘table2qb’ (pronounced ‘table to cube’) that takes the Tidy Data format and integrates it with an enriched metadata file. This produces a web standard format called CSV on the Web (CSVW). The World Wide Web Consortium (W3C) developed this format to support data on the web.
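To give a feel for what CSVW metadata looks like, the sketch below builds a minimal metadata document by hand. It does not use table2qb or reproduce its output; the `build_csvw_metadata` helper, the file name and the column definitions are assumptions made for illustration.

```python
import json


def build_csvw_metadata(csv_url, title, columns):
    """Return a minimal CSVW metadata document describing one CSV file."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "dc:title": title,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": label, "datatype": datatype}
                for name, label, datatype in columns
            ]
        },
    }


# Hypothetical file name and columns for a Tidy Data CSV.
metadata = build_csvw_metadata(
    "trade-by-region.csv",
    "Exports by region (illustrative)",
    [
        ("region", "Region", "string"),
        ("year", "Year", "gYear"),
        ("value", "Value", "decimal"),
    ],
)

print(json.dumps(metadata, indent=2))
```

The metadata file sits alongside the CSV and tells software what each column means and what type of value it holds, so the description travels with the data rather than living in a separate document.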
CSVW provides the ingredients we need to create Linked Data, the mechanism that links related things together and makes it easier to discover new, related things about the data on the web. For our purposes, Linked Data outputs provide not only an enhanced machine-readable format but also a web-ready format.
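The sketch below shows, under stated assumptions, how one tidy observation might be expressed as Linked Data using the W3C RDF Data Cube vocabulary. The `example.org` identifiers, the `slug` and `observation_triples` helpers and the choice of N-Triples output are illustrative; they are not the identifiers or the process the project's pipeline uses.

```python
from urllib.parse import quote

QB = "http://purl.org/linked-data/cube#"  # RDF Data Cube vocabulary
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/"  # hypothetical base URI, for illustration only


def slug(text):
    """Turn a label such as 'North East' into a URI-safe path segment."""
    return quote(text.lower().replace(" ", "-"))


def observation_triples(dataset, region, year, value):
    """Return N-Triples lines describing one observation in a data cube."""
    obs = f"{BASE}data/{dataset}/{slug(region)}/{year}"
    return [
        f"<{obs}> <{RDF_TYPE}> <{QB}Observation> .",
        f"<{obs}> <{QB}dataSet> <{BASE}data/{dataset}> .",
        f"<{obs}> <{BASE}def/dimension/region> <{BASE}def/region/{slug(region)}> .",
        f'<{obs}> <{BASE}def/dimension/year> "{year}"^^<{XSD}gYear> .',
        f'<{obs}> <{BASE}def/measure/value> "{value}"^^<{XSD}decimal> .',
    ]


triples = observation_triples("exports", "North East", 2018, 11.0)
print("\n".join(triples))
```

Because the region is a URI rather than a text label, any other dataset that uses the same identifier for that region is linked to this observation automatically.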
Delivering the data
Having data that is machine-readable and web-ready means greater use for us working in the GSS and our users working outside of government departments:
- it creates data that is modelled in a way that allows improved findability and discoverability – search engines can look inside a data cube
- it reduces reliance on having to create spreadsheets as the main distribution of data – we can create a multitude of output formats free of charge when the data is managed in this way
- it changes the landscape for all our users (including us) as we can:
  - access that one observation we might need
  - customise the data more efficiently by pulling out data from multiple data cubes and multiple domains without having to navigate different sites and different spreadsheets
  - download the pre-canned spreadsheets if still desired
  - tap directly into the data via Application Programming Interfaces (APIs) and other query engines rather than going to websites – powering new web services and applications
- it makes it easier for technical users to create new innovations or services that can consume our data for the greater good
- it allows us to automate services that Local Authorities and charities are manually compiling to save time and improve services
- it will connect with other data sources that follow Linked Data models (Eurostat are working toward this format)
- it allows us to power, automate and iterate more of the dissemination processes we use (e.g. dashboards, interactives and reports)
- it will integrate our statistical data into search engine providers such as Google – potentially increasing the impact of our statistics
- it creates boundless possibilities as the machine-learning evolution continues
- Interactive map showing top 10 exports by area of UK.
- Interactive solar system of stats.
- Auto-generated report from the Department for International Trade.
- Prototype GSS data site with a video tutorial.
- Interactive map showing trade and migration figures for English regions.
- A tool from the Department for Work and Pensions to help policy colleagues gain better insights into data, with a short video about how this is revolutionising policy making.
Phases of the GSS Data Project
- Began (2016) – the Presentation and Dissemination Committee kicked the project off in 2016 to understand the issues different departments were having when publishing statistics on gov.uk.
- Pre-discovery phase (2016) – uncovered the GSS landscape to better understand how we were disseminating statistics.
- Discovery phase (2017) – investigated what might be a new way of integrating our data – making it easier to find, use and reuse.
- Alpha phase (2018) – developed a small-scale ‘proof of concept’ that proved we could organise and integrate data from various departments.
- Beyond Alpha phase (2019 to 2024) – we are planning a new programme of work to take this from ‘proof of concept’ to reality.
Plans for the project
The project needs to grow from a small-scale feasibility study into a full-blown programme of work. This programme will need time (we are looking at four to five years to deliver the full benefits). The shape and size of the programme are being developed now.
The strands of the programme will be:
- transformation and data modelling
- scaling and developing the technical infrastructure
- building capability (such as Reproducible Analytical Pipelines)
- user research
- harmonisation and standards
The team will need to expand to deliver on these areas. Most of the resource and effort will go into two areas:
- transforming data from spreadsheets into Tidy Data
- developing the pipelines that ensure a repeatable, automated and sustainable approach
An exciting development that will reduce, and eventually end, the need for this transformation is RAPs. These initiatives build the tools that create Tidy Data from the source data and help support the automation of the outputs.
The good news is that some departments are already designing RAPs to produce their statistics and this is growing across the GSS. See the GangStaS Data RAP blog for further details.
Who has been involved in the project so far?
The project is at the ‘proof of concept’ stage. We have focussed our attention on two “dataset families” – trade and migration. This has allowed us to work out whether the approach is workable. We’ve also talked to teams within health and social care and teams within housing about the next families.
The project has engaged with many departments. Those most involved so far are:
Office for National Statistics, Department for Business Energy and Industrial Strategy, Home Office, Department for Exiting the European Union, Ministry for Housing, Communities and Local Government, Scottish Government, Department for Education, Department for Environment, Food and Rural Affairs, Welsh Government, Northern Ireland Statistics and Research Agency, HM Revenue and Customs, Department for Work and Pensions, Department for International Trade, Foreign and Commonwealth Office, HM Treasury, NHS, Department for Health, Public Health England.
The project team is also getting great support from centralised teams and communities already supporting the GSS – the harmonisation team, the classifications teams, the Best Practice and Impact division, the RAP champions and the presentation champions.
Meet our project board:
Julie Stanborough – Sponsor (ONS)
Neil McIvor (Department for Education)
Ian Knowles (Department for Transport)
David Fry (Department for Business Energy and Industrial Strategy)
Siobhan Carey (Northern Ireland Statistics and Research Agency)
Roger Halliday (Scottish Government)
Glyn Jones (Welsh Government)
Andrea Prophet (HM Revenue and Customs)
Sandra Tudor (Ministry for Housing, Communities and Local Government)
Nicola Tyson (Census)