Synthetic data: innovation for public good
What is synthetic data, and how can it be used for public good? This was the question asked at a recent event at the Office for National Statistics (ONS).
Synthetic data are artificially generated data that have the look and structure of real data, but do not contain any information on individuals. They also preserve the more general characteristics of the real data, which can be used to find patterns.
They are modelled on real data, but designed in a way that safeguards the legal, ethical and confidentiality requirements of the original data. Given their resemblance to the original data, synthetic data are useful in a range of situations, for example when data are sensitive or missing. They are used widely as teaching materials, to test code or mathematical models, or as training data for machine learning models.
Given their potential, synthetic data are attracting growing attention and research. To share insights and advancements in this fast-changing space, over 60 synthetic data experts, representing 20 organisations from across government, industry and academia, gathered on 1 April for an afternoon of discussion in Newport. The event, hosted jointly by the Royal Statistical Society, the Data Science Campus and the Best Practice and Impact division, explored possible applications, the latest developments, and challenges to adopting synthetic data.
Innovation in this space
The afternoon’s presentations offered diverse perspectives on synthetic data, showcasing innovative work from across sectors. This ranged from academic research in synthesising administrative datasets, to developments in the data science sphere. Also discussed were approaches to disseminating sensitive synthesised data.
There’s currently a wealth of research emerging from the health sector, as health data are often sensitive. Public Health England have synthesised cancer data which can be freely accessed online. NHS Scotland are making advances in cutting-edge machine learning methods such as variational autoencoders (VAEs) and generative adversarial networks (GANs).
There is growing interest in this area of research, and its influence extends beyond the statistical community. The Data Science Campus have also used GANs to generate synthetic data in their latest research, but the power of GANs is not limited to generating datasets. They can be trained to produce imagery, music, speech and text that are almost indistinguishable from those created by people. In fact, a GAN was used to create a portrait of Edmond de Belamy, which sold for $432,500 in 2018!
Within the ONS, a pilot to create synthetic versions of securely held Labour Force Survey data has been carried out using an R package called “synthpop”. This synthetic dataset can be shared with approved researchers so that they can debug their code before analysing the data held in the Secure Research Service.
Although much progress has been made in this field, one challenge that persists is guaranteeing the accuracy of synthetic data. We must ensure that the statistical properties of synthetic data match those of the original data.
Additional features, such as the presence of non-numerical data, add to the difficulty of this task. For example, if a variable is listed as “animal” and can take the possible values “dog”, “cat” or “elephant”, it is difficult to convert this information into a format suitable for precise calculations. Furthermore, given that datasets have different characteristics, there is no straightforward solution that can be applied to all types of data.
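To make the “animal” example concrete, one common way of turning such a variable into numbers is one-hot encoding, where each possible value becomes its own 0/1 indicator. The sketch below is purely illustrative: the function name and the sample values are hypothetical, and this is not necessarily the approach taken by any of the teams that presented at the event.

```python
def one_hot_encode(values):
    """Turn a list of categorical values into rows of 0/1 indicators.

    Hypothetical helper for illustration only; returns the sorted list of
    categories and one indicator row per input value.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # mark the column for this value
        encoded.append(row)
    return categories, encoded


# Made-up sample values for the "animal" variable.
animals = ["dog", "cat", "elephant", "dog"]
categories, encoded = one_hot_encode(animals)

for animal, row in zip(animals, encoded):
    print(animal, row)
```

Each row contains a single 1, so no artificial ordering is imposed on the categories. The cost is one extra column per possible value, which is part of why variables with many categories remain awkward to synthesise.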
During the event, particular focus was also placed on the use of synthetic data in the field of privacy, following from the challenges and opportunities identified by the National Statistician’s Quality Review of privacy and data confidentiality methods published in December 2018.
Attendees at the event voiced concerns about the disclosure risk of synthesised datasets. They also questioned the usefulness of protected synthetic data for research purposes. There is indeed a trade-off between enhanced privacy and the utility of data, and the closing panel session explored the need to find a balance between the two, keeping the context of the data in mind.
Are you working in synthetic data, or interested in the field?
If you are currently working with synthetic data, or keen to explore its potential, we would like to hear from you! The Data Science Campus are leading on innovation in this space using a collaborative approach. They are formulating research questions and investigating the application of synthetic data across government.
To keep up with these developments, we have established a private Slack channel to facilitate ongoing discussion and knowledge sharing between analysts in the Government Statistical Service and academia. If you are interested in joining this community of experts, please email firstname.lastname@example.org for an invitation and further information.