This item is archived. Information presented here may be out of date.

It’s a kind of magic…

Sofi Nickson

The Connected Open Government Statistics (COGS) project will do some amazing things to bring data together from across the GSS. But the project doesn’t have magical fairies that sprinkle Python dust over all the spreadsheets to transform them into Tidy Data. To take our existing outputs and make them 5* Linked Open Data takes a fair bit of planning, research, analysis and finally the coding bit.

I wanted to pull back the veil and explain how the COGS “data squads” turn the aggregated published data from your outputs into this mysterious 5* Linked Open Data.

Our data squads are multidisciplinary units and we have two – Titan and Atlas! They comprise business analysts, data engineers, data architects, data managers, user researchers and testers. This is an overview of how we approach the whole dataset family transformation piece. The activities described are not linear and overlap as we work through to publication.

1. Engaging

We start off with the core team identifying a dataset family and engaging with the statistical producers of that family. This is our “consortium group” and we do things like agreeing the landscape and the data we will work on, etc. This forms the starting point for the data squads. 

2. Researching

The dataset family transformation journey begins. This stage involves user researcher and business analyst roles. 

Our user researcher (John) works with producers to learn who the potential users are and how to contact them to discuss their needs. We have many tools to gather intelligence on what users are doing with data, why they need data, and how the service can be made easier to use. This information validates existing requirements, or adds new ones, for our development of the service.

Our business analysis team (Rob, Tracey and David) reviews the dataset family landscape and works with producers to secure the best output format – the cleaner the format, the easier the data is to transform.

3. Analysing

This brings the business analysts and the data managers (Grace and Leigh) together.

The business analysts review what the outputs contain and document this, creating the first pass of our transformation instructions. The data managers then help with the modelling aspects, discussions around harmonisation of labels, mapping against standard code lists and finalising the steps for the full transformation record.

This is the Extract, Transform and Load (ETL) record. This is the set of instructions used by the data engineers to build the transformation pipelines. The data managers report any issues or problems back to the business analysts to review and work through with producers.
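As a rough illustration of the harmonisation step described above, the sketch below maps a producer's labels against a standard code list and collects anything that does not match, so it can be reported back for review. The code list, labels, column names and function names are all hypothetical and are not taken from the COGS pipelines.

```python
# Hypothetical code list: maps the labels a producer uses to standard codes.
# The labels and codes are illustrative, not taken from a real GSS code list.
SEX_CODE_LIST = {
    "male": "M",
    "males": "M",
    "female": "F",
    "females": "F",
    "all persons": "T",
}


def harmonise(label, code_list):
    """Return the standard code for a label, or None if it is not in the code list."""
    return code_list.get(label.strip().lower())


def apply_code_list(rows, column, code_list):
    """Map one column of each row to standard codes and collect unmapped labels."""
    harmonised, issues = [], []
    for row in rows:
        code = harmonise(row[column], code_list)
        if code is None:
            # Unmapped labels are collected for review rather than silently dropped.
            issues.append(row[column])
            continue
        harmonised.append({**row, column: code})
    return harmonised, issues


rows = [
    {"sex": "Males ", "value": 10},
    {"sex": "Female", "value": 12},
    {"sex": "Unknown", "value": 1},
]
harmonised, issues = apply_code_list(rows, "sex", SEX_CODE_LIST)
print(harmonised)
print(issues)
```

Keeping the unmapped labels in a separate list mirrors the feedback loop in the text: problems found during modelling go back to the business analysts and producers instead of being fixed silently in code.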

4. Transforming

This involves our data engineers (Vamshi, Mike, Shannon, JJ, and Martyn), data managers and user researchers. The point of this stage is to build the transformation pipelines that take the original published data source (both the observational data and its associated metadata) and transform it into the CSVW format.

At this point we will see back and forth between the data engineers and data managers as they work to understand or fix problems within the data. The end result is to push the data into the platform as 5* Open Data formats. This allows the user researcher to test the data and functionality with users and producers.
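To give a flavour of what such a pipeline does, here is a minimal sketch that unpivots a small wide table into tidy rows (one observation per row) and builds a CSVW metadata document describing the result. The sample data, the file name observations.csv and the function names are assumptions made for illustration; the real pipelines handle far messier spreadsheets and much richer metadata.

```python
import csv
import io
import json

# Illustrative wide-format table, as it might appear in a published spreadsheet.
WIDE_CSV = """area,2019,2020
Wales,10,12
Scotland,20,21
"""


def to_tidy(wide_text, id_column):
    """Unpivot a wide table into one row per observation."""
    reader = csv.DictReader(io.StringIO(wide_text))
    tidy = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {id_column: row[id_column], "period": column, "value": int(value)}
            )
    return tidy


def csvw_metadata(csv_url, columns):
    """Build a minimal CSVW metadata document describing the tidy CSV."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": name, "datatype": datatype}
                for name, datatype in columns
            ]
        },
    }


observations = to_tidy(WIDE_CSV, "area")
metadata = csvw_metadata(
    "observations.csv",  # hypothetical file name for the tidy output
    [("area", "string"), ("period", "gYear"), ("value", "integer")],
)
print(json.dumps(metadata, indent=2))
```

The metadata sits alongside the CSV and tells consumers what each column means and how to parse it, which is what lets the same file be read as plain CSV or loaded as linked data.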

To sum up . . .

This is a very simplified explanation of our transformation process. There is other stuff too, of course. The harmonisation team, architect community and consortium groups are all involved across the piece to help around standards, metadata and alignment of labelling, etc.

What we end up with is a repeatable, automated process for taking your published outputs and recreating them as datacubes, data that is ready for the web. And we want to do this with all of you! It comes with a free data explorer and API and most importantly without a whopping great price tag.

The project has *one dream, one soul, one prize, one goal. One golden glance of what the GSS should be. It’s not really a kind of magic but rather a really complex and collaborative piece of work to make all of this happen. Where are those fairies?


*Songwriter: Roger Meddows Taylor

A Kind Of Magic lyrics © Sony/ATV Music Publishing LLC, BMG Rights Management


Darren Barnes
Darren has been a civil servant for 30 years and has worked in the digital space for nearly 20 of these. He is currently working on the Connected Open Government Statistics project. He is also the open data lead at the Office for National Statistics and the co-chair of the GSS open data subgroup.