Mapping Data Ecosystems: GSS Alpha Project
A guest post by Sarah Roberts of Swirrl
We’re working on the GSS Alpha project, which aims to demonstrate and test an approach for official statistics publishers across government to share their data in a more interoperable and reusable way. In this post, we use the Open Data Institute’s Mapping Data Ecosystems tool to:
“help identify the data stewards responsible for managing and ensuring access to a dataset, the different types of data users and the relationships between them”.
The Open Data Institute (ODI) think this “can help to communicate where and how the use of open data creates value”.
We reckon it can too. Here’s a map of the data ecosystem being created by this project that we put together following the ODI approach:
So, where’s the value here? Who are the data stewards, users and their relationships? And how does the data flow?
The point of this project is to bring together data from lots of different government departments and to organise it into groups of related data collections (‘dataset families’), rather than by the organisation it comes from. So, for example, someone who wants data about housing can go to one place to find and use it, instead of heading to both MHCLG and the Land Registry.
The specific value created by this alpha project is to make related international trade data from different organisations interoperable and easy to find for researchers and analysts in government, business or the media. We’re planning to extend the work to other dataset families, such as migration, health and housing, which will in turn benefit the data analysts working with that data.
The ODI ecosystem mapping guide lists categories of actors involved in the ecosystem. We have worked through each of these categories below to highlight the most important actors in our data ecosystem.
The main ‘suppliers’ of data in the project are the trade statisticians and experts at HMRC, ONS and the Department for International Trade (DIT). The data specialists and programmers (‘tech team’ for short) at the ONS look after selecting, processing and enabling access to datasets in the project. That team is assisted by Swirrl: we provide the infrastructure and technology for publishing the data, through our PublishMyData platform, and we’ve been working with the ONS tech team (plus the indomitable Alex Tucker) on the metadata and on cleaning, curating and standardising the data.
During this alpha phase of the project, the technical team has been working with our group of trade experts to select or develop standard code-lists for the statistical classifications used in the datasets, for example for countries and for the products being exported or imported. In future, a system will be needed for domain experts to create and maintain these re-usable codes more systematically and on a larger scale.
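As a rough illustration of why shared code-lists matter, here is a minimal Python sketch of mapping the country labels used by two publishers onto one set of codes. Everything in it is hypothetical: the codes, the label variants, the sample rows and the `standardise_country` helper are made up for the example and are not the project’s actual code-lists or tooling.

```python
# Hypothetical shared code-list: code -> preferred label.
COUNTRY_CODELIST = {"DE": "Germany", "FR": "France", "US": "United States"}

# Hypothetical label variants used by different publishers, mapped to shared codes.
LOCAL_TO_SHARED = {
    "germany": "DE",
    "federal republic of germany": "DE",
    "france": "FR",
    "usa": "US",
    "united states": "US",
}


def standardise_country(label: str) -> str:
    """Return the shared code for a publisher-specific country label."""
    key = " ".join(label.lower().split())  # normalise case and whitespace
    try:
        return LOCAL_TO_SHARED[key]
    except KeyError:
        # An unknown label is a signal that the code-list needs extending.
        raise ValueError(f"No shared code for {label!r}") from None


# Made-up rows from two publishers that label the same country differently.
rows_a = [{"partner": "Germany", "value": 10}, {"partner": "USA", "value": 7}]
rows_b = [{"partner": "Federal Republic of Germany", "value": 3}]

combined = [
    {**row, "partner": standardise_country(row["partner"])}
    for row in rows_a + rows_b
]
```

Once both sources use the same codes, their rows can be filtered, joined or summed by country without any per-publisher special cases.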
People / Organisations
The data is about trade in goods and services, covering both imports and exports. The variety of people and organisations covered by that data includes:
- businesses in a variety of industries, including tourism, telecoms, sport, gambling and the creative, digital and cultural sectors
- companies with a presence abroad, including those with temporary movement of staff
The main contributors to our current selection of data are HMRC, DIT and ONS.
The UK Statistics Authority create the policies and legislative frameworks within which the project operates.
In the context of the current project, the main value-added services around the data are provided via the PublishMyData platform, but in a future operational context, we would anticipate external organisations making use of APIs and downloads to provide a range of additional services for particular audiences and needs.
Showing how we can aggregate related datasets from various organisations to provide easier and richer access is one of the themes of the project, so our prototype system is an example of an aggregator. In a future operational situation with many more datasets, easy machine-readable access to the data would let all kinds of other organisations build their own aggregations for their own purposes.
Creators (or re-users)
Our main target information-creators are data analysts, researchers and journalists. In some cases, these will come from the same organisations that provide the data, but not necessarily.
The principles and tools being tested and demonstrated in the project should benefit a wide range of public sector data users. Within the trade data exemplar, the key beneficiaries will probably be policy makers in the relevant government departments, supported by the trade data experts, analysts and researchers. Analysts working for businesses or in journalism should also benefit from better access to this data, to understand our current patterns of international trade and how they might change in the future, for example in different post-Brexit scenarios.
The ODI ecosystem mapping guidance notes that researchers are a sub-group of beneficiaries: probably one of the most important groups for the kinds of data we are looking at.
Providing reliable and usable data to policy makers will be a key outcome of the approach we are prototyping: those policy makers will come from organisations across the Government Statistical Service.
How the data flows
From a technology angle, this project involves several stages.
- At the start, the data is in lots of different formats.
- The ONS technical team review the contents of the data, identify the statistical components, consider the best sets of identifiers to use for the associated code-lists and turn the data into a tidy data structure.
- The ‘table2qb’ software processes the observation data and associated component and code definitions, to generate a CSV file (still ‘tidy data’) and associated JSON metadata (according to the W3C standard for ‘Generating RDF from Tabular Data on the Web’).
- Together, the CSV file and its JSON metadata form a complete description of the statistical dataset. They are processed through the project’s implementation of the W3C CSV2RDF processing algorithm to generate a Linked Data version of the data.
- The Linked Data is loaded into the PublishMyData linked data platform, making it reusable through advanced search and APIs, linked to related datasets and accessible to applications.
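The stages above can be sketched in a few lines of Python. This is a simplified illustration, not the table2qb software or a full implementation of the W3C CSV2RDF algorithm: the `http://example.org/` URIs, the column names, the sample figures and the `expand` and `to_ntriples` helpers are all made up for the example, and the placeholder expansion only handles simple `{column}` substitutions rather than full URI templates. The metadata keys (`tableSchema`, `aboutUrl`, `propertyUrl`, `valueUrl`) come from the W3C CSV on the Web vocabulary.

```python
import csv
import io
import json

# Tidy data: one observation per row, one column per component (made-up figures).
TIDY_CSV = """period,partner,flow,value
2017,DE,exports,120
2017,FR,imports,95
"""

BASE = "http://example.org/"  # placeholder namespace, not the project's URIs

# CSV on the Web style metadata describing how each column maps to RDF.
METADATA = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "observations.csv",
    "tableSchema": {
        "aboutUrl": BASE + "obs/{period}/{partner}/{flow}",
        "columns": [
            {"name": "period", "propertyUrl": BASE + "dim/period",
             "valueUrl": BASE + "year/{period}"},
            {"name": "partner", "propertyUrl": BASE + "dim/partner",
             "valueUrl": BASE + "country/{partner}"},
            {"name": "flow", "propertyUrl": BASE + "dim/flow",
             "valueUrl": BASE + "flow/{flow}"},
            {"name": "value", "propertyUrl": BASE + "measure/value",
             "datatype": "integer"},
        ],
    },
}

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def expand(template, row):
    """Fill {column} placeholders with the row's cell values."""
    for name, cell in row.items():
        template = template.replace("{" + name + "}", cell)
    return template


def to_ntriples(csv_text, metadata):
    """Emit one N-Triples line per cell, driven by the metadata."""
    schema = metadata["tableSchema"]
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = expand(schema["aboutUrl"], row)
        for column in schema["columns"]:
            if "valueUrl" in column:
                # Coded values become links to code-list entries.
                obj = "<" + expand(column["valueUrl"], row) + ">"
            elif column.get("datatype") == "integer":
                obj = '"' + row[column["name"]] + '"^^<' + XSD_INTEGER + ">"
            else:
                obj = '"' + row[column["name"]] + '"'
            lines.append(f"<{subject}> <{column['propertyUrl']}> {obj} .")
    return lines


metadata_json = json.dumps(METADATA, indent=2)  # what would sit beside the CSV
triples = to_ntriples(TIDY_CSV, json.loads(metadata_json))
print("\n".join(triples))
```

Because the coded columns turn into URIs rather than plain strings, observations from different datasets that share a code-list end up pointing at the same identifiers, which is what makes the linking step possible.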
Hear all about it!
Swirrl and the ONS will be speaking at Linked Data Cardiff about this project on Thursday 21st June 6pm - 8pm (register here). And if you can’t make that, Bill from Swirrl will be at the ODI in London presenting a Friday Lunchtime Lecture the following week (Friday 29 June). For more on this project head to our previous GSS guest post here and keep an eye on our Twitter stream as the project evolves.