GangStaS Data RAP
I’m no 50 Cent or Dr Dre and this ain’t no piece on the musical genre, but it is coming “Straight Outta Newport”.
This is about a great potential future for the Government Statistical Service (GSS) data ecosystem. We met up with the team from the Department for Education (DfE), who are doing amazing things in advancing the transformation of data within their department using Reproducible Analytical Pipelines (RAP). I’d known about RAP through demos provided by Matt Upson from the Government Digital Service (GDS) some time ago. I was impressed by this early initiative – aside from them stealing the basic premise from DataBaker that I introduced to the Office for National Statistics several years ago!
Anyway, bitterness aside, the GSS Data Project is about reaching out toward a joined-up GSS community. Its aim is to break away from using data in traditional products and introduce a new model. We want to bring data together and create a rich interconnected resource we can all do much more with; one that still gives you your Excel spreadsheets!
To build a robust data landscape in the short term we need to do a lot of heavy lifting. The equipment to do this wasn’t available, so we had to build it ourselves. Over time the idea is to reduce the reliance on this heavy machinery and build leaner, more efficient tools that can still achieve the same outcome and probably more.
We envisaged this transition being a long time in the future. But that all changed for me one sunny day visiting a blue and yellow glass building behind Manchester Piccadilly Station. Who’d have known! We met with Laura, Tom, David and Sean at DfE. I bored them (only a little I hope) with our vision for the future but they blew me away with the possibility of making our longer-term vision happen in a much shorter time frame!
The team are working within their department to introduce RAP processes that produce Tidy Data formats. They are taking their statistical knowledge and understanding and putting it into code, which provides a “repeatable” and “automatable” process; they’re driving the process with tests on the outputs that can quickly identify where things are going wrong. They’ve taken great strides so far, and their ambition to roll this out across the department is admirable – they hope to achieve full RAP usage within two years. I don’t believe you’re delusional, Laura, you can do it!
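To give a flavour of what “tests on the outputs” might look like, here is a minimal sketch in plain Python. It is not DfE’s code: the function name `check_output`, the column names and the rules (no missing values, no negative counts) are all assumptions made up for illustration.

```python
def check_output(rows, required_columns, count_column):
    """Return a list of problems found in the output; an empty list means it passed.

    Hypothetical checks: every required column is filled in, and the
    count column never holds a negative number.
    """
    problems = []
    for index, row in enumerate(rows):
        missing = [c for c in required_columns if row.get(c) in ("", None)]
        if missing:
            problems.append(f"row {index}: missing {missing}")
        value = row.get(count_column)
        if isinstance(value, (int, float)) and value < 0:
            problems.append(f"row {index}: negative {count_column} {value}")
    return problems


# Made-up pipeline output: the second row is missing its year and has a negative count.
output_rows = [
    {"region": "North West", "year": "2017", "pupils": 120},
    {"region": "South East", "year": "", "pupils": -5},
]

problems = check_output(output_rows, ["region", "year", "pupils"], "pupils")
for problem in problems:
    print(problem)
```

Run on every refresh of the data, a check like this flags the row that went wrong rather than leaving someone to spot it by eye in a spreadsheet.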
The process of “tidying” the data before loading for analysis (for example in R) has been described in Tidy Data by Hadley Wickham in the Journal of Statistical Software. The main idea is to arrange the data into a table where each row represents an observation, and each column a variable/dimension.
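As a rough illustration of that idea, the sketch below reshapes a small “wide” table (one column per year) into one row per observation. The figures, the column names and the helper `tidy` are invented for the example; in Pandas the same reshaping is usually done with `melt`.

```python
# Made-up "wide" table, as it might appear in a presentational spreadsheet:
# one row per region, one column per year.
wide_rows = [
    {"region": "North West", "2017": 120, "2018": 135},
    {"region": "South East", "2017": 210, "2018": 198},
]


def tidy(rows, id_column, variable_name, value_name):
    """Return one record per observation: identifier, variable and value."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy_rows


tidy_rows = tidy(wide_rows, "region", "year", "pupils")
for record in tidy_rows:
    print(record)
```

Each output row now holds exactly one observation (a region, a year and a count), which is the shape that loads cleanly into R or Pandas for analysis.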
Our data geeks use Python and Pandas to tear apart presentation-heavy spreadsheets and create Tidy-ish Data formats. This creates the foundation onto which we add the metadata building blocks to create the data standard CSV on the Web (CSVW) and then connect these together to form Linked Open Data.
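For a sense of what those metadata building blocks look like, here is a minimal sketch that builds a CSVW metadata description for a tidy CSV file. The file name `pupils.csv`, the columns and the helper `csvw_metadata` are hypothetical, and this is far simpler than what our real pipeline produces; only the `@context`, `url` and `tableSchema` properties come from the CSVW standard itself.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a minimal CSVW table description.

    `columns` is a list of (name, title, datatype) tuples.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


# Hypothetical tidy file with one observation per row.
metadata = csvw_metadata(
    "pupils.csv",
    [
        ("region", "Region", "string"),
        ("year", "Year", "gYear"),
        ("pupils", "Pupils", "integer"),
    ],
)

# By convention the description sits alongside the CSV as pupils.csv-metadata.json.
print(json.dumps(metadata, indent=2))
```

The tidier the incoming CSV, the less there is to say in the metadata: each column maps straight to one variable with one datatype.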
The GSS Data Project is about getting our statistics in the Web, not just on it. This is about integrating our statistical data on the biggest platform available. Linked Open Data is the method by which we can harness the power of that platform. Here, the possibilities are boundless.
Why am I so excited about our meeting with DfE? The department’s move toward producing Tidy Data by default is a game changer for us. It can cut a significant portion of the time we spend creating transformation routines, as we get data that doesn’t need additional wrangling. We can reduce the time taken to produce Linked Data versions. We can improve the repeatability of the process and simplify the automation. It also reduces ongoing maintenance needs.
Over the next few months we are hoping to work with the DfE team and investigate ways of augmenting their Tidy Data outputs to transition to CSVW more seamlessly. This is an exciting opportunity. The GSS is at the start of its RAP journey, and we can play a big part in drawing a blueprint for departments to follow as they develop their own RAP initiatives. Setting up a common approach across the community is a massive win-win for the GSS. These two initiatives can form a powerful partnership, and one that can evidence what the GSS needs in terms of future capabilities and skills.
Working together I can see how we may realise our vision of a sustainable linked statistical community in the much nearer future.
As Fiddy Cent said: “Get Rich (data) or Die Tryin'”.