Getting the most from the nation’s statistics: the importance of standards
The objective of the GSS Data Project is to pioneer a better way for users to connect with government data. By improving findability, usability and interoperability we will increase the impact and reach of GSS statistical data.
As reported in previous articles on the project, there are currently obstacles at each step of this process: finding the right data, understanding whether it applies to your problem, and then being able to analyse it to answer the question at hand.
At the heart of the solution is delivering data in a way that works well in software, so that it can be searched, filtered, processed further or analysed. It’s not the whole story, but it’s an essential enabler for the range of things people need to do with government statistics.
When all statistics publishers use the same approach to data structures and formats, software that works for one dataset should work for all of them. Some data formats are easier to write software for than others, and a good choice of format lets us build on the existing capabilities of the software tools used by data scientists and visualisation specialists.
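As a rough illustration of that point, the sketch below assumes two publishers both release CSV files with the same four columns (`area`, `period`, `measure`, `value`). That layout, the measure names and the numbers are invented for the example and are not the project's chosen format; the point is only that one small function then serves both datasets unchanged.

```python
import csv
import io

# Two hypothetical datasets from different publishers. Both use the same
# assumed column layout; the values are made up for illustration.
EMPLOYMENT_CSV = """area,period,measure,value
E92000001,2019,employment-rate,70.0
W92000004,2019,employment-rate,60.0
"""

HOUSING_CSV = """area,period,measure,value
E92000001,2019,dwellings-completed,500
W92000004,2019,dwellings-completed,200
"""


def values_for_area(csv_text, area_code):
    """Return {(period, measure): value} for one area.

    Works for any dataset that follows the shared (assumed) layout.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (row["period"], row["measure"]): float(row["value"])
        for row in reader
        if row["area"] == area_code
    }


# The same call works for either dataset without dataset-specific code.
wales_employment = values_for_area(EMPLOYMENT_CSV, "W92000004")
wales_housing = values_for_area(HOUSING_CSV, "W92000004")
```

If each publisher used its own layout, a separate reader would be needed for every file.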
The challenge is choosing or designing standards and then getting everyone to follow them.
Standards come at a range of levels: a whole stack of standards is in use every time you connect your phone or computer to the internet, covering how devices and programs talk to each other.
The web is the only show in town as a way of sharing data. That choice means that many layers of the stack of standards are a given. The GSS Data Project has been working on what sits on top of that: the data structures, formats and identifiers that are needed for effective dissemination of statistics.
We’re not forgetting that the aim is to present information to a person who can use it to make their decisions, and that will typically involve charts, tables and commentary. Getting to that point involves a lot of data processing steps.
Until recently that has most commonly happened behind the scenes, and publications have taken the form of reports, supported by sets of data tables. That approach is not wrong or bad: it’s generally easy for people to understand and a sensible way of doing things. A lot of good work has been going on in the GSS on authoring statistical publications in an accessible way. However, it’s just one part of what is needed and we’ve been missing an important aspect. The currently established approach often makes further processing of the data difficult and that is limiting the opportunities to apply that data to a broad range of government and societal challenges.
A good quality dataset can often be applied to a wide range of problems – and to understand the most important and complex problems you need data from many sources. So, the ability to process, filter and combine data is essential to getting the most value out of it. The human-facing presentation made for one purpose is often not what’s needed when applying the same information in a different context.
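To make "combine" concrete, here is a minimal sketch that joins two datasets on shared area and period identifiers to derive a new figure. The column names, the datasets and the numbers are all assumptions invented for the example.

```python
import csv
import io

# Hypothetical inputs: counts of some event, and population, by area and period.
CASES_CSV = """area,period,value
E92000001,2019,5
W92000004,2019,1
"""

POPULATION_CSV = """area,period,value
E92000001,2019,100
W92000004,2019,40
"""


def load(csv_text):
    """Index a dataset by its (area, period) identifiers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["area"], row["period"]): float(row["value"]) for row in reader}


def rate_per_head(numerator, denominator):
    """Divide one dataset by another wherever both have the same key."""
    shared_keys = numerator.keys() & denominator.keys()
    return {key: numerator[key] / denominator[key] for key in sorted(shared_keys)}


rates = rate_per_head(load(CASES_CSV), load(POPULATION_CSV))
```

The join is only this simple because both datasets identify areas and periods in the same way; with free-text labels, the matching step would need manual cleaning first.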
We are therefore advocating that the current approach to publishing statistical data be supplemented by also making the data available in the form best suited to further processing.
So, what does that look like? What does better use of standards mean in practice? The most important requirements are:
- the data should follow a regular structure to make it easy to process in software
- all statistical data publishers across government should follow the same approach
- the statistics need to be accompanied by good metadata and definitions of terms, so that the data is understandable and can be used reliably by people not involved in the original data collection
- everyone should call the same thing by the same name (i.e. use the same identifier) – or at least provide ‘translations’ or lookups between equivalent alternative identifiers
- two different things should always have different names: make sure identifiers are unique in the range of contexts they might be used in
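The last two requirements can be sketched in code. The example below assumes a hypothetical lookup table that translates a publisher's local area labels into shared identifiers, and a simple check that no combination of identifiers appears twice. The lookup contents, field names and function names are illustrative only.

```python
# Hypothetical 'translation' from local labels to shared identifiers.
LOCAL_TO_SHARED = {
    "England": "E92000001",
    "Eng": "E92000001",
    "Wales": "W92000004",
}


def normalise(rows, lookup):
    """Replace local area labels with shared identifiers.

    Raises KeyError for a label with no known translation, rather than
    silently passing an unrecognised name through.
    """
    normalised = []
    for row in rows:
        label = row["area"]
        if label not in lookup:
            raise KeyError(f"no shared identifier for {label!r}")
        normalised.append({**row, "area": lookup[label]})
    return normalised


def duplicate_keys(rows, key_fields=("area", "period")):
    """Return the identifier combinations that occur more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


rows = [
    {"area": "England", "period": "2019", "value": "1"},
    {"area": "Eng", "period": "2019", "value": "2"},
    {"area": "Wales", "period": "2019", "value": "3"},
]
clean = normalise(rows, LOCAL_TO_SHARED)
clashes = duplicate_keys(clean)
```

Here the two spellings of England collapse to one identifier, and the uniqueness check then reports that the same area and period appear twice, which is the kind of problem that inconsistent naming otherwise hides.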
We have investigated and tested well-established standards for data structures and data formats, and have chosen a set that has backing from W3C and other major organisations and is a good match to the needs of statistics.
That forms a solid foundation but there is a further key step that needs to be taken by the GSS, which is to agree on, maintain and apply sets of identifiers for the things that statistics describe, and for the classification schemes and codelists used to aggregate and organise data. The GSS’s well-used set of geographical identifiers is a great example here, because they are clearly defined, have a clear governance process and are almost universally used across government organisations. Establishing similar standards for other key statistical dimensions should be a priority in the coming months and years.
As well as establishing and maintaining standards, the community needs the necessary documentation, tooling and skills to apply them effectively and to benefit from the opportunities this new approach will bring. That will be an important part of the next phase of the GSS Data Project.