Connected Open Government Statistics (COGS)
Users have difficulty in finding and using GSS data because:
- metadata (the information about data e.g. definitions, caveats, information about how the data was collected) is often kept separately from the data (e.g. in a methodology document)
- there are lots of different departments producing data on a similar topic but there is no easy way to bring this all together.
Connected Open Government Statistics (COGS) – previously called the GSS Data Project – aims to fix this problem by:
- finding all the spreadsheets on a similar topic and bringing them together into “dataset families”
- putting these datasets into “Tidy Data” format by stripping out all the presentational stuff
- finding the metadata
- putting the Tidy Data and the metadata together and feeding it through a pipeline in order to get something called “Linked Data”
Why Linked Data is better
This Linked Data is better for users because:
- it makes data easier to find (search engines can work with Linked Data better than data in spreadsheets)
- users no longer have to navigate to different websites and different spreadsheets to find what they need
- search engines like Google can “scrape” the data to answer questions people type in
- technical users find it easier to build new tools with Linked Data
- Linked Data allows online tools to be automated (e.g. dashboards can pull through the most up to date data automatically).
Government statistics are not produced or managed by a single entity. We operate a federated system under the common banner of the Government Statistical Service. All central government departments, plus arm’s-length bodies and the devolved administrations, are part of this union, and between them they publish around 25,000 tables of statistical data each year. We distribute this huge number of outputs across the internet on many portals and websites.
This disparate landscape has been a problem for our users.
Making so much data available in the open, free to download and use (and reuse) is the right thing. But our users are struggling to get to grips with this wealth of data. Some of the problems are:
- low awareness of what is available: Where do I start? What do I look for? What is the data called?
- users are not able to find the data they need: Where do I look? Who does the data? What website?
- output formats are impenetrable to a lot of users: Where is the data I need? What does it mean?
How do we overcome these barriers and challenges to accessing and using the data we produce?
A project can’t achieve this in isolation. We are all responsible, including our users, for the co-creation of this new vision for the GSS.
Connected Open Government Statistics (COGS) – previously called the GSS Data Project – is developing the tools that can help to support this. The heavy lifting equipment we are building will reduce the burden on departments. We are working in partnership with centralised teams that are already supporting the betterment of the GSS. We are collaborating with the growing communities of practice out there, such as Reproducible Analytical Pipeline (RAP) champions and presentation champions, that are already working to support better data.
This project is about being that catalyst for change and running towards a better way.
The aim – what do we need to do?
To increase the impact and reach of GSS statistical data by improving how we organise, integrate and deliver GSS data online.
We need to re-imagine our current publication model. Spreadsheets have played the lead role for many years. They will continue to play a part but this should be a supporting role and not the primary channel as we expand our distribution routes.
To support this, our focus needs to be on how to make it easier for publishers of official statistics to share their data in ways that offer the greatest use to users.
This means moving to a place where we produce data that is web-ready, not data that is merely ready to sit on the web. COGS is building the blueprint that will enable us to exchange data and metadata over the web and make better use of all the different information we offer.
This will make our data more findable and useful for our users. This is the pain relief our users want.
Understanding user relationships with our data
To pioneer a better way for users to connect with our data, we need to understand the current journey taken by our users to find and reuse our data and the pain points they face along the way.
The user journey is simple. Users need to be aware that data is available and be able to find it. They want just enough information to judge whether the data is what they need before they make a selection. Once they have found the right source, they need to be able to work with it to achieve their aim.
We have not catered for this simple user journey. Instead we have created a disparate and confusing landscape for our users. For example, housing data is published by a range of official statistics producers, in a range of different formats with various codes and labels. The myriad of possible entry points to this data confuses our users. Which data? Which website? Which organisation?
Users don’t care about departmental boundaries or services. They want our data. COGS is about creating a system that pulls the threads of all this related data together into related data groupings.
The approach is not as simple as just putting all the spreadsheets on one website. We need to make use of standards for data and metadata on the web. We must also do a better job of harmonising the statistics when possible and explain the differences when harmonisation is not possible. This is the biggest challenge we face given the differences across our statistical system.
Organising the data
The hard bit of the project is taking the data we have and making it ready for use on the web.
Spreadsheets lock the data into a single format. The formats can be whatever we want them to be.
This leads to organisations using different structures, deriving bespoke standards and inconsistent formatting. The documentation on how our reference data are defined and encoded is nowhere near the data. This makes it hard for users to uncover and, even when found, it is incomprehensible for many users.
That is at the heart of why some statistical data is hard for people to use: it’s not always clear what the data means and whether data from different sources can be compared.
This is where our concept of “dataset families” comes into play.
Dataset families aim to link groups of related data into collections for users to discover and navigate around. It requires no previous or explicit knowledge of what organisation the data comes from.
Users are interested in getting data on a subject or domain. Breaking down the organisational boundaries and offering an interconnected set of related datasets should make it easier for users to find, understand and apply the data to their needs.
Examining and modelling the relationships between the various datasets in that “family” collection should allow:
- better discovery of GSS statistical data
- easier interaction with and connection to relevant data no matter where it came from
- consistent and understandable descriptions and definitions
- data and metadata that can be reused across a wide range of options
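As a rough illustration of the idea, the sketch below groups catalogue entries into families by subject rather than by publisher, and maps the different area labels each publisher might use onto one shared identifier. The catalogue entries, publisher names and local labels are invented for illustration; only the two nine-character area codes are real GSS geography codes. It is not the COGS implementation.

```python
from collections import defaultdict

# Hypothetical catalogue entries: titles, publishers and family names are illustrative only.
CATALOGUE = [
    {"title": "Dwelling stock by tenure", "publisher": "Department A", "family": "housing"},
    {"title": "Affordable housing supply", "publisher": "Department B", "family": "housing"},
    {"title": "Alcohol-specific deaths", "publisher": "Department C", "family": "health"},
]

# Different publishers may label the same area differently.
# The local labels are assumptions; the codes are GSS area codes for England and Wales.
AREA_CODES = {
    "england": "E92000001",
    "eng": "E92000001",
    "wales": "W92000004",
    "cymru": "W92000004",
}


def group_by_family(entries):
    """Group datasets by subject, ignoring which organisation published them."""
    families = defaultdict(list)
    for entry in entries:
        families[entry["family"]].append(entry)
    return dict(families)


def harmonise_area(label):
    """Return the shared identifier for a local label, or None if it is not mapped."""
    return AREA_CODES.get(label.strip().lower())


families = group_by_family(CATALOGUE)
print(sorted(families))          # family names, independent of publisher
print(harmonise_area(" ENG "))   # same identifier as "England"
```

Returning `None` for an unmapped label, rather than guessing, leaves room to explain the difference where harmonisation is not possible.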
Integrating the data
The ONS has been working with a company called Swirrl, investigating how to bring datasets closer together as a family, by working on the structure of the data and the vocabulary of terms and identifiers used.
Our first step is to take the spreadsheets that have been published and remodel them. Using tools like Python and Pandas we normalise the data into a simpler format – stripping out all the presentational stuff from the spreadsheets.
The output of this we call “Tidy Data”. This is a machine-ready version of the data we started with. Having data in this form makes it much easier to push the data through the next stage of our pipelines to create the web-ready formats which are “Linked Data”.
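To make the reshaping concrete, here is a minimal sketch of turning a presentational table into one-observation-per-row form. The project uses Python and Pandas for this; the sketch deliberately uses only the standard library so it runs without extra packages, and the table, column names and figures are invented.

```python
import csv
import io

# Invented presentational table: a title row, a blank row,
# years spread across the columns, and a footnote at the end.
RAW = """Table 1: Dwellings by tenure (thousands),,
,,
Tenure,2017,2018
Owner occupied,100,110
Private rented,40,45
Source: illustrative figures only,,
"""


def tidy(raw_text, header_label="Tenure"):
    """Return one dictionary per observation, dropping titles, blanks and footnotes."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    # The header is the first row whose first cell matches the expected label.
    header_index = next(
        i for i, row in enumerate(rows) if row and row[0] == header_label
    )
    periods = rows[header_index][1:]
    observations = []
    for row in rows[header_index + 1:]:
        values = row[1:]
        # Rows with no values at all are presentational (blank lines, footnotes).
        if not any(value.strip() for value in values):
            continue
        for period, value in zip(periods, values):
            observations.append(
                {"tenure": row[0], "period": period, "value": float(value)}
            )
    return observations


for observation in tidy(RAW):
    print(observation)
```

Each output row carries its own tenure and period, so it can be read without reference to its position in the original layout.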
This step is where the clever stuff happens and adds the real value to the data. We have developed ‘table2qb’ (pronounced ‘table to cube’) that takes the Tidy Data format and integrates it with an enriched metadata file. This produces a web standard format called CSV on the Web (CSVW). The World Wide Web Consortium (W3C) developed this format to support data on the web.
CSVW provides the ingredients we need to create Linked Data, the mechanism that links related things together and makes it easier to discover new, related information about the data on the web. For our purposes, Linked Data outputs provide not only an enhanced machine-readable format but also a web-ready one.
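For readers who have not seen CSVW, the sketch below builds a small metadata description for a tidy CSV file. It is not the output of table2qb, whose interface is not shown here; the file name, title and column names are assumptions. The `@context`, `url`, `tableSchema`, `columns`, `name`, `titles` and `datatype` keys are taken from the W3C CSVW metadata vocabulary.

```python
import json


def csvw_metadata(csv_url, title, columns):
    """Build a minimal CSVW description for one CSV file.

    `columns` is a list of (name, datatype) pairs; both are illustrative here.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "dc:title": title,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": name, "datatype": datatype}
                for name, datatype in columns
            ]
        },
    }


# Hypothetical file and columns matching a tidy table of observations.
metadata = csvw_metadata(
    "observations.csv",
    "Dwellings by tenure (illustrative)",
    [("tenure", "string"), ("period", "gYear"), ("value", "decimal")],
)

# By convention the description sits next to the data as observations.csv-metadata.json.
print(json.dumps(metadata, indent=2))
```

The point of the description is that the meaning of each column travels with the data instead of living in a separate document.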
Delivering the data
Having data that is machine-readable and web-ready means greater use for us working in the Government Statistical Service and our users working outside of government departments:
- it creates data that is modelled in a way that allows improved findability and discoverability – search engines can look inside a data cube
- it reduces reliance on having to create spreadsheets as the main distribution of data – we can create a multitude of output formats free of charge when the data is managed in this way
- it changes the landscape for all our users (including us) as we can:
  - access that one observation we might need
  - customise the data more efficiently by pulling out data from multiple data cubes and multiple domains without having to navigate different sites and different spreadsheets
  - download the pre-canned spreadsheets if still desired
  - tap directly into the data via Application Programming Interfaces (APIs) and other query engines rather than going to websites – powering new web services and applications
- it makes it easier for technical users to create new innovations or services that can consume our data for the greater good
- it allows us to automate services that Local Authorities and charities are manually compiling to save time and improve services
- it will connect with other data sources that follow Linked Data models (Eurostat are working toward this format)
- it allows us to power, automate and iterate more of the dissemination processes we use (e.g. dashboards, interactives and reports)
- it will integrate our statistical data into search engine providers such as Google – potentially increasing the impact of our statistics
- it creates boundless possibilities as the machine-learning evolution continues
- Interactive map showing top 10 exports by area of UK.
- Interactive solar system of stats.
- Auto-generated report from the Department for International Trade.
- Prototype GSS data site with a video tutorial.
- A tool from the Department for Work and Pensions to help policy colleagues gain better insights into data, with a short video about how this is revolutionising policy making.
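The idea of reaching “that one observation” mentioned in this section can be sketched with a toy example. Linked Data describes each observation as a set of statements (triples), so a single value can be selected by its dimensions rather than by its position in a spreadsheet. In a real service this would be an API call or a query against a query engine; the in-memory list below only stands in for that. The `qb:` and `sdmx-` terms come from the W3C RDF Data Cube vocabulary, while the `ex:` identifiers and the figures are invented.

```python
# Each statement is a (subject, predicate, object) triple.
# Identifiers starting with "ex:" and all values are illustrative only.
TRIPLES = [
    ("ex:obs1", "rdf:type", "qb:Observation"),
    ("ex:obs1", "sdmx-dimension:refArea", "E92000001"),
    ("ex:obs1", "sdmx-dimension:refPeriod", "2018"),
    ("ex:obs1", "sdmx-measure:obsValue", 110.0),
    ("ex:obs2", "rdf:type", "qb:Observation"),
    ("ex:obs2", "sdmx-dimension:refArea", "W92000004"),
    ("ex:obs2", "sdmx-dimension:refPeriod", "2017"),
    ("ex:obs2", "sdmx-measure:obsValue", 11.0),
    ("ex:obs3", "rdf:type", "qb:Observation"),
    ("ex:obs3", "sdmx-dimension:refArea", "W92000004"),
    ("ex:obs3", "sdmx-dimension:refPeriod", "2018"),
    ("ex:obs3", "sdmx-measure:obsValue", 12.0),
]


def find_observations(triples, criteria):
    """Return the observations whose properties match every criterion."""
    by_subject = {}
    for subject, predicate, obj in triples:
        by_subject.setdefault(subject, {})[predicate] = obj
    return [
        subject
        for subject, props in by_subject.items()
        if props.get("rdf:type") == "qb:Observation"
        and all(props.get(p) == o for p, o in criteria.items())
    ]


def value_of(triples, subject):
    """Return the measured value attached to an observation, or None."""
    for s, predicate, obj in triples:
        if s == subject and predicate == "sdmx-measure:obsValue":
            return obj
    return None


matches = find_observations(
    TRIPLES,
    {"sdmx-dimension:refArea": "W92000004", "sdmx-dimension:refPeriod": "2018"},
)
for subject in matches:
    print(subject, value_of(TRIPLES, subject))
```

Because the selection is by area and period rather than by cell position, the same request keeps working when a new release adds rows.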
Phases of the COGS project
- Began (2016) – the Presentation and Dissemination Committee kicked the project off in 2016 to understand the issues different departments were having when publishing statistics on gov.uk.
- Pre-discovery phase (2016) – uncovered the GSS landscape to better understand how we were disseminating statistics.
- Discovery phase (2017) – investigated what might be a new way of integrating our data – making it easier to find, use and reuse.
- Alpha phase (2018-2019) – developed a small-scale ‘proof of concept’ that proved we could organise and integrate data from various departments.
- From July 2019 the project was renamed Connected Open Government Statistics – we are taking the proof of concept into the first stages of a production-ready service.
Plans for the COGS project
The summer of 2019 saw the project grow from a small-scale feasibility study into a full project team of business analysts, data engineers, architects and data analysts. This project will need time to deliver all of its objectives. We have suggested a five-year programme of work. There will be several phases throughout the lifetime of the project. The first phase is to build the foundations by establishing standards for data and metadata. The second will involve developing the service to deliver a connected statistical service. The third will see the implementation of the products and services that will manage these processes into the future.
The strands of this first phase:
- transformational and data modelling
- scaling and developing the technical infrastructure
- supporting capability within departments (e.g. Reproducible Analytical Pipelines)
- carrying out user research and user testing
- building on harmonisation and standards
The team has expanded to deliver on these areas. Most of the resource and effort will be focused on:
- targeting the transformation of published data outputs into Tidy Data
- creating generic pipelines to consume data from machine-to-machine services offered by the likes of Fingertips within Public Health England, StatWales and ScotStats from the devolved administrations and others
- working with RAP and other initiatives developing Tidy-ish Data formats to support alignment of practices
- ensuring we build the pipelines with a repeatable, automated and sustainable approach in mind
- supporting the development of data and metadata standards across the GSS
It’s important that this work follows standards. Adopting data and metadata standards will have the biggest impact. It will improve our capabilities as individual producers and ensure interoperability between producers.
We welcome engagement with anyone across the GSS. You don’t have to be involved in RAP or another initiative; just get in touch with us if you’d like to know more or get involved earlier. The project has engaged with many departments.
Those most involved so far are:
- Office for National Statistics
- Department for Business, Energy and Industrial Strategy
- Home Office
- Department for Exiting the European Union
- Ministry of Housing, Communities and Local Government
- Scottish Government
- Department for Education
- Department for Environment, Food and Rural Affairs
- Welsh Government
- Northern Ireland Statistics and Research Agency
- HM Revenue and Customs
- Department for Work and Pensions
- Department for International Trade
- Foreign and Commonwealth Office
- HM Treasury
- Department of Health and Social Care
- Public Health England
The project team is also getting great support from centralised teams and communities already supporting the GSS: harmonisation team, classifications teams, Best Practice and Impact division, RAP champions and presentation champions.
The dataset families we are working on
We are currently working on:
- Alcohol-related Deaths
- Dwellings By Tenure
- Affordable Housing
There are many more in the pipeline.
Once a dataset family is onboarded into the COGS platform the data is available immediately to users and has the following benefits:
- Outputs are available as 5* Linked Open Statistics – not just at the 3* level required by the code of practice
- Data is available via API
- Data is customisable and downloadable in other formats
- Automated method of updating once the next release is published
- Easier to build Power BI, Tableau or other applications on top of the data
- Data is available alongside other related data from other departments
- Data is more discoverable through a new data registry
- Data is more interoperable and easier to use with other data for better insights
Our project board is made up of:
- Owen Brace – Senior Research Officer (ONS)
- Tomas Sanchez – Technical (ONS)
- Julie Stanborough – Sponsor (ONS)
- Mike Jones (Department for Education)
- Ian Knowles (Department for Transport)
- David Fry (Department for Business, Energy and Industrial Strategy)
- Siobhan Carey (Northern Ireland Statistics and Research Agency)
- Roger Halliday (Scottish Government)
- Glyn Jones (Welsh Government)
- Kevin Fletcher (HM Revenue and Customs)
- Paul Vickers (Ministry of Housing, Communities and Local Government)
- The GSS Presentation and Dissemination Committee