Connected Open Government Statistics (COGS)

In a nutshell

The problem

Users have difficulty in finding and using GSS data because:

metadata (the information about data, e.g. definitions, caveats, information about how the data was collected) is often kept separately from the data (e.g. in a methodology document)
there are lots of different departments producing data on a similar topic but there is no easy way to bring this all together.

The aim

Connected Open Government Statistics (COGS) – previously called the GSS Data Project – aims to fix this problem by:

1. finding all the spreadsheets on a similar topic and bringing them together into ‘dataset families’
2. putting these datasets into ‘Tidy Data’ format by stripping out all the presentational stuff
3. finding the metadata
4. putting the Tidy Data and the metadata together and feeding it through a pipeline in order to get something called ‘Linked Data’

Why Linked Data is better

This Linked Data is better for users because:

it makes data easier to find (search engines can work with Linked Data better than data in spreadsheets)
users no longer have to navigate to different websites and different spreadsheets to find what they need
search engines like Google can ‘scrape’ the data to answer questions people type in
technical users find it easier to build new tools with Linked Data
Linked Data allows online tools to be automated (e.g. dashboards can pull through the most up to date data automatically).

The problem

Government statistics are not produced or managed by a single entity. We operate a federated system under the common banner of the Government Statistical Service. All central government departments, plus arm’s-length bodies and the devolved administrations, are part of this union, and between them they publish around 25,000 tables of statistical data each year. We distribute this huge number of outputs across the internet on many portals and websites.

This disparate landscape has been a problem for our users.

Making so much data available in the open, free to download and use (and reuse) is the right thing. But our users are struggling to get to grips with this wealth of data. Some of the problems are:

low awareness of what is available: Where do I start? What do I look for? What is the data called?
users are not able to find the data they need: Where do I look? Who does the data? What website?
output formats are impenetrable to a lot of users: Where is the data I need? What does it mean?

How do we overcome these barriers and challenges to accessing and using the data we produce?

A project can’t achieve this in isolation. We are all responsible, including our users, for the co-creation of this new vision for the GSS.

Connected Open Government Statistics (COGS) – previously called the GSS Data Project – is developing the tools that can help to support this. The heavy lifting equipment we are building will reduce the burden on departments. We are working in partnership with centralised teams that are already supporting the betterment of the GSS. We are collaborating with the growing communities of practice out there, such as Reproducible Analytical Pipeline (RAP) champions and presentation champions, that are already working to support better data.

This project is about being that catalyst for change and running towards a better way.

The aim – what do we need to do?

To increase the impact and reach of GSS statistical data by improving how we organise, integrate and deliver GSS data online.

We need to re-imagine our current publication model. Spreadsheets have played the lead role for many years. They will continue to play a part but this should be a supporting role and not the primary channel as we expand our distribution routes.

To support this, our focus needs to be on how to make it easier for publishers of official statistics to share their data in ways that offer the greatest use to users.

This means moving to a place where we produce data that is web-ready, not merely ready to sit on the web. COGS is building the blueprint that will enable us to exchange data and metadata over the web and make better use of all the different information we offer.

This will make our data more findable and useful for our users. This is the pain relief our users want.

Understanding user relationships with our data

To pioneer a better way for users to connect with our data, we need to understand the current journey taken by our users to find and reuse our data and the pain points they face along the way.

The user journey is simple. It starts with awareness that data is available and the ability to find it. Users want just enough information to judge whether the data is what they need before they make any selection. Once they have found the right source, they need to be able to work with it to complete their aim.

We have not catered for this simple user journey. Instead we have created a disparate and confusing landscape for our users. For example, housing data is published by a range of official statistics producers, in a range of different formats with various codes and labels. The myriad of possible entry points to this data confuses our users. Which data? Which website? Which organisation?

Users don’t care about departmental boundaries or services. They want our data. COGS is about creating a system that pulls the threads of all this related data together into related data groupings.

The approach is not as simple as just putting all the spreadsheets on one website. We need to make use of standards for data and metadata on the web. We must also do a better job of harmonising the statistics when possible and explain the differences when harmonisation is not possible. This is the biggest challenge we face given the differences across our statistical system.
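
As a minimal sketch of what harmonising and explaining differences could look like in practice, the example below maps the labels that different producers might use for the same area onto one shared code, and reports any label it cannot map so the difference can be explained rather than hidden. The alias table and the `harmonise` function are hypothetical names made up for illustration; the three codes are the nine-character GSS codes for England, Wales and Scotland.

```python
# Shared identifiers: nine-character GSS codes for three UK countries.
SHARED_CODES = {
    "england": "E92000001",
    "wales": "W92000004",
    "scotland": "S92000003",
}

# Hypothetical spellings as they might appear in different spreadsheets.
ALIASES = {
    "eng": "england",
    "england": "england",
    "cymru": "wales",
    "wales": "wales",
    "scotland": "scotland",
}


def harmonise(labels):
    """Map each label to a shared code; collect labels that cannot be mapped."""
    matched = {}
    unmatched = []
    for label in labels:
        key = ALIASES.get(label.strip().lower())
        if key is None:
            # No harmonised equivalent: keep it visible so it can be explained.
            unmatched.append(label)
        else:
            matched[label] = SHARED_CODES[key]
    return matched, unmatched


matched, unmatched = harmonise(["England", " Cymru ", "Great Britain"])
print(matched)
print(unmatched)
```

Returning the unmatched labels separately, instead of dropping them, mirrors the point above: where harmonisation is not possible, the difference still needs to be surfaced and explained.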

Organising the data

The hard bit of the project is taking the data we have and making it ready for use on the web.

Spreadsheets lock the data into a single format, and that format can be whatever each producer wants it to be.

This leads to organisations using different structures, bespoke standards and inconsistent formatting. The documentation on how our reference data are defined and encoded is nowhere near the data. This makes the documentation hard for users to find and, even when found, it is incomprehensible to many of them.

That is at the heart of why some statistical data is hard for people to use: it’s not always clear what the data means and whether data from different sources can be compared.

This is where our concept of ‘dataset families’ comes into play.

Dataset families

Dataset families aim to link groups of related data into collections for users to discover and navigate around. They require no previous or explicit knowledge of which organisation the data comes from.

Users are interested in getting data on a subject or domain. Breaking down the organisational boundaries and offering an interconnected set of related datasets should make it easier for users to find, understand and apply the data to their needs.

Examining and modelling the relationships between the various datasets in that ‘family’ collection should allow:

better discovery of GSS statistical data
easier interaction with and connection to relevant data no matter where it came from
consistent and understandable descriptions and definitions
data and metadata that can be reused across a wide range of options
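
To make the idea concrete, here is a small sketch of grouping catalogue entries into families by subject rather than by publisher. The catalogue entries, their titles and the `group_into_families` function are hypothetical and only illustrate the grouping; they do not describe how COGS stores its catalogue.

```python
from collections import defaultdict

# Hypothetical catalogue entries: titles are invented for illustration.
CATALOGUE = [
    {"title": "Homelessness statistics A (hypothetical)",
     "publisher": "Scottish Government", "family": "Homelessness"},
    {"title": "Homelessness statistics B (hypothetical)",
     "publisher": "Welsh Government", "family": "Homelessness"},
    {"title": "Affordable housing supply (hypothetical)",
     "publisher": "Office for National Statistics", "family": "Affordable Housing"},
]


def group_into_families(catalogue):
    """Group datasets by subject so users need not know who publishes them."""
    families = defaultdict(list)
    for entry in catalogue:
        families[entry["family"]].append(entry)
    return dict(families)


families = group_into_families(CATALOGUE)
for name, members in families.items():
    publishers = sorted({member["publisher"] for member in members})
    print(name, len(members), publishers)
```

A user browsing the ‘Homelessness’ family in this sketch sees both datasets side by side, whichever organisation published them.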

Integrating the data

The ONS has been working with a company called Swirrl, investigating how to bring datasets closer together as a family, by working on the structure of the data and the vocabulary of terms and identifiers used.

Our first step is to take the spreadsheets that have been published and remodel them. Using tools like Python and Pandas we normalise the data into a simpler format – stripping out all the presentational stuff from the spreadsheets.

The output of this we call ‘Tidy Data’. This is a machine-ready version of the data we started with. Having data in this form makes it much easier to push the data through the next stage of our pipelines to create the web-ready formats which are ‘Linked Data’.
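
The sketch below illustrates the kind of reshaping involved. The project itself uses Python and Pandas; this example sticks to the Python standard library so it runs anywhere, and the same reshaping is commonly done in pandas with `DataFrame.melt`. The spreadsheet content, the figures and the `tidy` function are all made up for illustration.

```python
import csv
import io

# Hypothetical spreadsheet export: a title row, a spacer row, a header row,
# data in "wide" layout (one column per year) and a footnote.
RAW = """Table 1: Dwellings by tenure (illustrative figures only)
,,
Area,2017,2018
England,100,110
Wales,20,22
Source: made-up numbers for illustration
"""


def tidy(raw_text):
    """Return one record per observation: area, year, value."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    # The header is the first row whose first cell is "Area".
    header_index = next(
        i for i, row in enumerate(rows) if row and row[0] == "Area"
    )
    header = rows[header_index]
    years = header[1:]
    observations = []
    for row in rows[header_index + 1:]:
        # Footnotes and other presentational rows lack one cell per column.
        if len(row) != len(header):
            continue
        area = row[0]
        for year, value in zip(years, row[1:]):
            observations.append(
                {"area": area, "year": int(year), "value": int(value)}
            )
    return observations


tidy_rows = tidy(RAW)
for record in tidy_rows:
    print(record)
```

Each output record holds exactly one observation with its dimensions spelled out, which is what makes the data machine-ready for the next stage.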

This step is where the clever stuff happens and adds the real value to the data. We have developed ‘table2qb’ (pronounced ‘table to cube’), which takes the Tidy Data format and integrates it with an enriched metadata file. This produces a web standard format called CSV on the Web (CSVW). The World Wide Web Consortium (W3C) developed this format to support data on the web.
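
For readers who have not seen CSVW, the sketch below builds a very small CSVW-style metadata document for a tidy CSV file. It is not the output of table2qb, whose files are richer; it only shows the general shape of pairing a CSV with a JSON description of its columns. The file name, title and the `build_csvw_metadata` helper are hypothetical.

```python
import json


def build_csvw_metadata(csv_url, title, columns):
    """Describe a CSV file and its columns using CSVW metadata properties."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "dc:title": title,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": name, "datatype": datatype}
                for name, datatype in columns
            ]
        },
    }


# Hypothetical file name and columns, matching a tidy "one row per
# observation" layout.
metadata = build_csvw_metadata(
    "dwellings-by-tenure.csv",
    "Dwellings by tenure (illustrative)",
    [("area", "string"), ("year", "gYear"), ("value", "integer")],
)
print(json.dumps(metadata, indent=2))
```

The metadata travels alongside the CSV, so the definitions of each column are no longer kept in a separate document.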

The CSVW provides the ingredients we need to create Linked Data, the mechanism that links all these things together and makes it easier to discover new related things about the data on the web. For our purposes, Linked Data outputs provide not only an enhanced machine-readable format but also a web-ready format.
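
To give a feel for what ‘linking all the things together’ means, the sketch below writes a single observation as RDF triples using the W3C RDF Data Cube vocabulary and the SDMX dimension and measure terms. The `http://example.org/` base address, the address patterns and the helper functions are assumptions made for illustration; they are not the identifiers COGS actually publishes.

```python
from urllib.parse import quote

QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
BASE = "http://example.org/"  # hypothetical base address


def slug(text):
    """Turn a label into a lowercase, address-safe path segment."""
    return quote(text.strip().lower().replace(" ", "-"))


def observation_triples(dataset_id, area, year, value):
    """Return N-Triples lines describing one observation in a data cube."""
    dataset = f"<{BASE}data/{slug(dataset_id)}>"
    obs = f"<{BASE}data/{slug(dataset_id)}/{slug(area)}/{year}>"
    return [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> {dataset} .",
        f"{obs} <{SDMX_DIM}refArea> <{BASE}area/{slug(area)}> .",
        f"{obs} <{SDMX_DIM}refPeriod> <{BASE}year/{year}> .",
        f'{obs} <{SDMX_MEASURE}obsValue> "{value}"^^<{XSD_INTEGER}> .',
    ]


triples = observation_triples("dwellings-by-tenure", "England", 2018, 110)
for line in triples:
    print(line)
```

Because the area and the period are web addresses rather than free text, two datasets that use the same address for ‘England’ are linked without anyone having to match labels by hand.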

Delivering the data

Having data that is machine-readable and web-ready means greater use for us working in the Government Statistical Service and our users working outside of government departments:

it creates data that is modelled in a way that allows improved findability and discoverability – search engines can look inside a data cube
it reduces reliance on having to create spreadsheets as the main distribution of data – we can create a multitude of output formats free of charge when the data is managed in this way
it changes the landscape for all our users (including us) as we can:
- access that one observation we might need
- customise the data more efficiently by pulling out data from multiple data cubes and multiple domains without having to navigate different sites and different spreadsheets
- download the pre-canned spreadsheets if still desired
- tap directly into the data via Application Programming Interfaces (APIs) and other query engines rather than going to websites – powering new web services and applications
it makes it easier for technical users to create new innovations or services that can consume our data for the greater good
it allows us to automate outputs that local authorities and charities currently compile by hand, saving time and improving services
it will connect with other data sources that follow Linked Data models (Eurostat are working toward this format)
it allows us to power, automate and iterate more of the dissemination processes we use (e.g. dashboards, interactives and reports)
it will integrate our statistical data into search engine providers such as Google – potentially increasing the impact of our statistics
it creates boundless possibilities as the machine-learning evolution continues
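
As a sketch of what ‘accessing that one observation’ through a query engine might look like, the example below builds a SPARQL query for a single value and the request address for it, using the `query` parameter defined by the SPARQL protocol. The endpoint address and all the identifiers passed in are hypothetical, and the code only builds the request; it does not call any real service.

```python
from urllib.parse import urlencode

ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint


def one_observation_query(dataset_uri, area_uri, period_uri):
    """Build a SPARQL query that selects the value of one observation."""
    return f"""
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX dim: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX measure: <http://purl.org/linked-data/sdmx/2009/measure#>
SELECT ?value WHERE {{
  ?obs a qb:Observation ;
       qb:dataSet <{dataset_uri}> ;
       dim:refArea <{area_uri}> ;
       dim:refPeriod <{period_uri}> ;
       measure:obsValue ?value .
}}
LIMIT 1
""".strip()


def request_url(query):
    """Encode the query as a GET request to the (hypothetical) endpoint."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = one_observation_query(
    "http://example.org/data/dwellings-by-tenure",
    "http://example.org/area/england",
    "http://example.org/year/2018",
)
url = request_url(query)
print(query)
print(url)
```

A dashboard or application could issue a request like this each time it loads, which is how outputs can stay up to date without anyone re-downloading a spreadsheet.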

Examples

Phases of the COGS project

Began (2016) – the Presentation and Dissemination Committee kicked the project off in 2016 to understand the issues different departments were having when publishing statistics on gov.uk.
Pre-discovery phase (2016) – uncovered the GSS landscape to better understand how we were disseminating statistics.
Discovery phase (2017) – investigated what might be a new way of integrating our data – making it easier to find, use and reuse.
Alpha phase (2018-2019) – developed a small-scale ‘proof of concept’ that proved we could organise and integrate data from various departments.
GSS Data Project: Final Report.
From July 2019 the project was renamed Connected Open Government Statistics – we are taking the proof of concept into the first stages of a production-ready service.

Plans for the COGS project

The summer of 2019 saw the project grow from a small-scale feasibility study into a full project team of business analysts, data engineers, architects and data analysts. This project will need time to deliver all of its objectives. We have suggested a five-year programme of work. There will be several phases throughout the lifetime of the project. The first phase is to build the foundations – establishing standards for data and metadata. The second will involve developing the service to deliver a connected statistical service. The third will see the implementation of the products and services that will manage these processes into the future.

The strands of this first phase:

transformational and data modelling
scaling and developing the technical infrastructure
supporting capability within departments (e.g. Reproducible Analytical Pipelines)
carrying out user research and user testing
building on harmonisation and standards

The team has expanded to deliver on these areas. Most of the resource and effort will be focused on:

targeting the transformation of published data outputs into Tidy Data
creating generic pipelines to consume data from machine-to-machine services offered by the likes of Fingertips within Public Health England, StatWales and ScotStats from the devolved administrations and others
working with RAP and other initiatives developing Tidy-ish Data formats to support alignment of practices
ensuring we build the pipelines with a repeatable, automated and sustainable approach in mind
supporting the developing data and metadata standards across the GSS

It’s important that this work follows standards. Adopting data and metadata standards will have the biggest impact. It will improve our capabilities as individual producers and ensure operability between producers.

Who’s involved?

We welcome engagement with anyone across the GSS. You don’t have to be involved in RAP or any other initiative; just get in touch with us if you’d like to know more or get involved earlier. The project has engaged with many departments.

Those most involved so far are:

Office for National Statistics (ONS)
Department for Business, Energy and Industrial Strategy (BEIS)
Home Office
Department for Exiting the European Union
Ministry of Housing, Communities and Local Government
Scottish Government
Department for Education
Department for Environment, Food and Rural Affairs
Welsh Government
Northern Ireland Statistics and Research Agency (NISRA)
His Majesty’s Revenue and Customs (HMRC)
Department for Work and Pensions (DWP)
Department for International Trade (DIT)
Foreign and Commonwealth Office
His Majesty’s Treasury
NHS
Department for Health
Public Health England

The project team is also getting great support from centralised teams and communities already supporting the GSS: harmonisation team, classifications teams, RAP champions and presentation champions.

The dataset families we are working on

We are currently working on:

Trade
Migration
Alcohol-related Deaths
Dwellings By Tenure
Disabilities
Affordable Housing
Homelessness

And there are many more in the pipeline.

Immediate benefits

Once a dataset family is onboarded into the COGS platform the data is available immediately to users and has the following benefits:

Outputs are available as 5* Linked Open Statistics – not just at the 3* level required by the code of practice
Data is available via API
Data is customisable and downloadable in other formats
Automated method of updating once the next release is published
Easier to build Power BI, Tableau or other applications on top of the data
Data is available alongside other related data from other departments
Data is more discoverable through a new data registry
Data is more interoperable and easier to use with other data for better insights

Governance

Our project board is made up of:

Owen Brace – Senior Responsible Officer (ONS)
Tomas Sanchez – Technical (ONS)
Julie Stanborough – Sponsor (ONS)
Mike Jones (Department for Education)
Ian Knowles (Department for Transport)
David Fry (Department for Business, Energy and Industrial Strategy)
Siobhan Carey (Northern Ireland Statistics and Research Agency)
Roger Halliday (Scottish Government)
Glyn Jones (Welsh Government)
Kevin Fletcher (HM Revenue and Customs)
Paul Vickers (Ministry of Housing, Communities and Local Government)

The GSS Data Project and table2qb: a data transformation tool
a blog from Swirrl

Data science courses on the GSS website
The data science courses advertised on the GSS website

Related

Reproducible Analytical Pipeline (RAP) champion network

GSS Data Project: Final Report