Reproducible Analytical Pipelines
Reproducible Analytical Pipelines (RAPs) are automated statistical and analytical processes. They incorporate elements of software engineering best practice to ensure that the pipelines are reproducible, auditable, efficient, and high quality.
RAPs increase the efficiency of statistical and analytical processes, delivering value. Reproducibility and auditability increase trust in the statistics. The pipelines are easier to quality assure than manual processes, leading to higher quality.
Statisticians and analysts should look to implement the RAP principles in parts of their processes.
A RAP will:
- improve the quality of the analysis
- increase trust in the analysis by producers, their managers and users
- create a more efficient process
- improve business continuity and knowledge management
In order to achieve these benefits, at a minimum a RAP must:
- minimise manual steps, for example copy-paste, point-click or drag-drop operations. Where it is absolutely necessary to include a manual step in the process, this must be documented as described below
- be built using open source software which is available to anyone, preferably R or Python
- strengthen technical and quality assurance processes with peer review, to ensure that the process is reproducible and that the requirements below have been met
- guarantee an audit trail using version control software, preferably Git
- be open to anyone – this can be facilitated most easily through the use of file and code sharing platforms
- follow existing good practice for quality assurance – guidance set by departments or organisations, and by Best Practice and Impact and Data Quality Hub for the Analysis Function and Government Statistical Service
- contain well-commented code and have documentation embedded and version controlled within the product, rather than saved elsewhere
There may be restrictions, such as access to databases, which stop analysis producers building a RAP for their full end-to-end process. In this case, the above requirements apply to the selected part of the process.
There may be restrictions, such as sensitive or confidential content, which stop analysis producers from sharing their RAP publicly. In this case, it may be possible to share the RAP within a department or organisation instead.
It is recommended that, when possible, a RAP should be built collaboratively – this will improve the quality of the final product and help to facilitate knowledge sharing.
There is no specific tool that is required to build a RAP, but both R and Python provide the power and flexibility to carry out end-to-end analytical processes, from data source to final presentation.
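As a rough illustration of what "from data source to final presentation" can look like with no manual steps, here is a minimal sketch in Python using only the standard library. The data, the column names (`region`, `value`) and the function names (`load`, `summarise`, `render`, `run_pipeline`) are all hypothetical and chosen for this example; a real pipeline would read from its own sources and produce its own outputs.

```python
import csv
import io
from statistics import mean

# Hypothetical raw input. In a real pipeline this would be a file or database extract.
RAW_CSV = """region,value
North,10
North,14
South,8
South,12
"""


def load(source):
    """Read CSV rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def summarise(rows):
    """Return the mean of 'value' for each region, ordered by region name."""
    groups = {}
    for row in rows:
        groups.setdefault(row["region"], []).append(float(row["value"]))
    return {region: mean(values) for region, values in sorted(groups.items())}


def render(summary):
    """Format the summary as CSV text ready for publication."""
    lines = ["region,mean_value"]
    lines += [f"{region},{value:.1f}" for region, value in summary.items()]
    return "\n".join(lines)


def run_pipeline(source):
    """Run every step in order, so the output can be regenerated with one call."""
    return render(summarise(load(source)))


if __name__ == "__main__":
    print(run_pipeline(io.StringIO(RAW_CSV)))
```

Because each stage is a separate function and the whole process runs from a single call, the same input always yields the same output, and the script can be kept under version control alongside its documentation.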
Once the minimum RAP has been implemented, statisticians and analysts should attempt to further develop their pipeline using:
- functions or code modularity
- unit testing of functions
- error handling for functions
- documentation of functions
- code style
- input data validation
- logging of data and the analysis
- continuous integration
- dependency management
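Several of the practices listed above can be shown together in one small sketch: a modular, documented function with input validation, error handling and logging, plus unit tests. The function `percentage_change` and the logger name `rap_example` are hypothetical and chosen for illustration; they are not part of any RAP standard.

```python
import logging
import unittest

# Hypothetical logger name. Configure handlers once at the pipeline entry point,
# for example with logging.basicConfig(level=logging.INFO).
logger = logging.getLogger("rap_example")


def percentage_change(previous, current):
    """Return the percentage change from previous to current.

    Raises TypeError if either input is not a number and
    ValueError if previous is zero.
    """
    # Input validation: reject non-numeric values (including booleans) early.
    for name, value in (("previous", previous), ("current", current)):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number, got {type(value).__name__}")
    # Error handling: fail with a clear message instead of a ZeroDivisionError.
    if previous == 0:
        raise ValueError("previous must be non-zero")
    result = (current - previous) / previous * 100
    # Logging: record inputs and output so the run can be audited later.
    logger.info("percentage_change(%s, %s) = %s", previous, current, result)
    return result


class TestPercentageChange(unittest.TestCase):
    def test_increase(self):
        self.assertAlmostEqual(percentage_change(200, 250), 25.0)

    def test_zero_previous_is_rejected(self):
        with self.assertRaises(ValueError):
            percentage_change(0, 10)

    def test_non_numeric_input_is_rejected(self):
        with self.assertRaises(TypeError):
            percentage_change("200", 250)
```

If this were saved as a module, the tests could be run with `python -m unittest`, and the same command could be run automatically by a continuous integration service on every change.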
To read analysts’ experiences of RAP please see:
RAP champions lead the development of reproducible analytical pipelines in their organisations. Find out more about the responsibilities of RAP champions and how to become one.