24th GSS Methodology Symposium
- 17th July 2019 10:00am to 5:00pm
- The KIA Oval, London, SE11 5SS
Message from Charles Lound, Symposium Organiser, 22 July 2019
A big thank you to everyone who made the Symposium a success. Special thanks to the business support team, Jo, Karen, Jess, Matt and Joe, for locating the new venue at the eleventh hour and making the day run so smoothly. Great to have a full programme of keynotes and parallel sessions, with session chairs volunteering to keep the sessions flowing and, in some cases, to present as well.
Best wishes, and see you next year,
More data, better statistics
The 24th GSS Methodology Symposium was on 17 July 2019 at The Oval. The programme for the Symposium is linked here: (Link to Programme).
We are sharing and linking more data, transforming data collection systems to centre on the respondent and exploring non-traditional sources. This means we have a greater volume and variety of data for statistics, which presents both opportunities and challenges to extract maximum value and meet the needs of our users.
The Symposium included keynote talks from:
- Tom Smith, Managing Director, ONS Data Science Campus
- Rachel Skentelbery, Deputy Director, Head of Methodology at ONS (deputised on the day by Hannah Finselbach, Assistant Deputy Director)
We had parallel sessions with presentations on work including:
- working with large and complex data sets
- understanding the impact of linking data
- managing coherence and comparability when data or methods change
- explaining the quality of statistical products and its role in decision making
[Tables below under construction]
Morning Parallel Sessions
|Coverage Adjustment of the 2021 Census for England and Wales
Kirsten Piller; Andrew Penn; Alison Whitworth; Office for National Statistics
In 2021 there will be an online census of all usually resident households and communal establishments in England and Wales. Online collection means that more data will be available sooner and processing speed could be improved. The census coverage estimation of the population can therefore be carried out at higher geographical levels compared to 2011 (i.e. national or regional) and estimated for a larger range of population characteristics.
Census coverage adjustment amends the unit level census database so that it is consistent with the population estimates derived from coverage estimation. As the estimates will be provided at local authority level (by key characteristics), the adjustment will allow robust census population estimates to be obtained for lower level geographies.
The adjustment takes place in two steps: selecting donors to meet the estimates, and placing the donors into postcodes. The population estimates will be used as constraints for combinatorial optimisation (CO), a synthetic microsimulation method, to obtain unit level data for the shortfall. The placement method for 2011, which we are looking at building on for 2021, placed donor households into postcodes of census dummy forms, empty households or random postcodes. Administrative data are being considered as a source to potentially improve the placement of the donors, either to provide an indicator of a household or additional information about dummy forms.
The 2021 adjustment strategy will address two main factors: the practical difficulties that were experienced during implementation of the 2011 methods, and the new outputs from coverage estimation.
|Overcoverage estimation strategy for the 2021 Census of England & Wales
Viktor Račinskij; Ceejay Hammond; ONS
The aim of a census is to correctly enumerate everyone within the target population. However, every census of a human population is subject to a degree of missingness (undercoverage) and incorrect inclusion (overcoverage). Overcoverage occurs when a member of the population is enumerated more than once or in the wrong location, when a member of a non-target population is enumerated, or when a completely fictitious census enumeration is created. To estimate the extent of coverage errors across England and Wales, a post-enumeration survey, called the census coverage survey, is carried out.
In the 2011 Census, the level of overcount was estimated to be 0.6% across England and Wales. This level of overcoverage was due to contributing factors such as the lengthening of the census fieldwork period (from 4 to 6 weeks), the use of online questionnaires and the option to post back (both of which lead to less field control and contact), and social changes, such as an increase in individuals having second residences. As in the previous census, the extent of overcount will be estimated in the 2021 Census.
In this presentation the approach of detecting overcount cases will be discussed and two methods considered for the overcoverage estimation in the 2021 Census of England and Wales will be presented. This will be followed by the way in which the undercoverage and overcoverage estimation methods are combined to produce the census coverage error adjusted population size estimates. To support the discussion, the results of a simulation study will demonstrate the performance of the estimators under consideration.
|Toward the 2021 England & Wales Census Imputation Strategy: Building on 2011
Fern Leather; Katie Sharp; Office for National Statistics
One of the traditional aims of the England and Wales Census has been to produce a complete and consistent census database through editing and imputation of all census variables (Wardman, Aldrich, & Rogers, 2014). This meets user needs for complete coverage and reduces any bias in the results arising from differences between the characteristics of respondents and non-respondents.
Item imputation was carried out in 2011 using a nearest neighbour conditional hot deck approach implemented in the CANCEIS software. The 2011 Census edit and imputation methodology was successful in meeting its overall aims and objectives, and the quality of item imputation was of a high standard with few issues identified (ONS, 2012).
Since 2011, a programme of work has been undertaken to continue building on the success of the 2011 methodology and address lessons learnt. Here we present a summary of key pieces of research completed so far, including optimisation of the donor search and processing unit size, managing response mode bias and addressing changes since 2011. We highlight how, through building on the lessons from 2011, the design and development of the 2021 editing and imputation strategy will not only meet the aims and objectives of the 2021 Census, but also ensure that we continue to deliver a high-quality item imputation service for future large-scale imputation projects.
|More data, better statistics ... saving the world!
Matt Steel, Office for National Statistics
On 16th July Baroness Sugg (DfID) will present the UK’s first Voluntary National Review of the Sustainable Development Goals (SDGs) at the UN. Data is at the core of the review, providing a measure of the UK’s progress towards each Goal. Collation of the data and public reporting against each of the 17 Goals, 169 Targets and 244 Indicators is the responsibility of the Office for National Statistics. Identifying data for each of the indicators is challenging, and so far we have amassed data for 74%, through collaboration with statisticians across the GSS.
But our success to date hides the inherent difficulties in recycling data to measure against specifications decreed by another body: in the case of the SDGs, the difficulty of aligning data collected for UK policy requirements to those stipulated by the UN. Coherent explanations of differences between national and global methodology are key to ensuring transparency of the SDG data reporting. The SDGs are a global challenge and consequently the relevance of some indicators to the UK is less apparent and the data needed to measure progress are non-existent. In these instances, persuading Departments to fund data collection would be a non-starter: it is neither cost-effective nor high priority. Instead, innovative solutions are sought, working with big data and alternative data sources. The success of these new methodologies will enable the UK to challenge the UN SDG methodology and provide alternative solutions that can be influential internationally.
Not only are 26% of the indicators still without supporting data, but the UK has signed up to the Leave No One Behind charter with the aim of making the “invisible visible” through disaggregation of the data for each indicator. How can this be achieved? Through shared responsibility for the SDGs across the GSS and proactive collaboration with ONS when developing new and existing statistics to ensure that SDG requirements are at the heart.
|Exploration of the Global Database of Events, Language and Tone (GDELT) - with specific application to disaster reporting
(Methodology, Policy and Analysis)
This paper summarises an investigation into the potential, within statistics, of using data available in the Global Database of Events, Language and Tone (GDELT). The database contains information about news media from around the world and this investigation specifically assessed its use to inform on disaster reporting for the UN’s Sustainable Development Goals.
The main benefits of GDELT are its timeliness (updated every 15 minutes), geographic reach, level of spatial reporting and the detail that it automatically extracts from news articles. It is also freely accessible and actively maintained.
Using GDELT data from 2015, the two main UK disasters in that year could be identified from a simple frequency analysis of the proportion of articles with a designated “UK” location and “disaster” theme. However, the noise in the data, and difficulties in semantic understanding, prevented the extraction of reliable figures on the impact of these disasters, such as numbers of lives lost or people affected.
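The frequency analysis described above can be sketched roughly as follows. This is a minimal illustration, not the investigation's code: the record layout (`date`, `locations`, `themes`), the tag values and the toy articles are all assumptions made for the example and do not reflect the actual GDELT schema.

```python
from collections import Counter
from datetime import date


def daily_disaster_share(articles, country="UK", theme="disaster"):
    """Return {day: share of that day's articles tagged with both country and theme}."""
    totals = Counter()
    matches = Counter()
    for art in articles:
        day = art["date"]
        totals[day] += 1
        if country in art["locations"] and theme in art["themes"]:
            matches[day] += 1
    return {day: matches[day] / totals[day] for day in totals}


def peak_days(shares, top_n=2):
    """Days with the highest share, highest first (ties broken by earlier date)."""
    return sorted(shares, key=lambda d: (-shares[d], d))[:top_n]


# Made-up toy records (hypothetical field names and values), for illustration only.
articles = [
    {"date": date(2015, 1, 1), "locations": {"UK"}, "themes": {"disaster"}},
    {"date": date(2015, 1, 1), "locations": {"UK"}, "themes": {"disaster", "weather"}},
    {"date": date(2015, 1, 1), "locations": {"FR"}, "themes": {"sport"}},
    {"date": date(2015, 1, 2), "locations": {"UK"}, "themes": {"sport"}},
    {"date": date(2015, 1, 2), "locations": {"US"}, "themes": {"disaster"}},
]

shares = daily_disaster_share(articles)
print(peak_days(shares, top_n=1))
```

Working with a daily share rather than a raw count is one way to stop days with unusually heavy overall news volume from dominating the ranking.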
The paper presents the main findings of the investigation and provides discussion around data potential, limitations and quality, and outlines suggestions for additional research and different applications. The appendix provides technical details relevant for anyone interested in using GDELT data, including an overview of data access options, key databases, key variables and examples of inaccuracies discovered in the data that should be considered when using GDELT.
NB. We are currently preparing 1.3 TB of GDELT data for ingestion into the DAP in order to make it available to ONS colleagues interested in exploring applications for the rich data source or those looking for real life big data for training purposes.
|Faster Indicators of UK Economic Activity by using over 10^8 rows of VAT returns
Luke Shaw; Office for National Statistics (ONS)
There is currently a great appetite for faster information on UK economic activity. Indeed, the Independent Review of Economic Statistics (Bean, 2016) stated that “the longer a decision-maker has to wait for the statistics, the less useful are they likely to be”. Using cutting-edge big data techniques, this project is part of the Office for National Statistics’ response to the demand.
The project identifies close-to-real-time large administrative and alternative datasets that are related to important economic concepts. From these data sources, we develop a set of timely indicators that allow early identification of potential large economic changes. Datasets included thus far are: HM Revenue and Customs Value Added Tax (VAT) returns, ship tracking data from automatic identification systems for UK waters, and road traffic sensor data for England.
Here we focus on indicators built from VAT returns data. We have constructed monthly and quarterly diffusion indices using turnover and expenditure data from VAT returns, and several indicators based on VAT reporting behaviour. We discuss the methodology behind the indices and present results to date. We caution that these indicators should be interpreted with care: they are supplementary to, not a proxy for, gross domestic product. However, the suite of indicators shows promise in identifying large changes to economic activity.
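As a rough illustration of what a diffusion index measures, the sketch below scores each firm +1 if its turnover rose between two periods, -1 if it fell and 0 if it was unchanged, then averages the scores. This is one common definition and is assumed here for illustration; the abstract does not specify the exact construction used, and the firm identifiers and turnover figures are made up.

```python
def diffusion_index(previous, current):
    """Mean of +1 / 0 / -1 scores over firms reporting in both periods.

    `previous` and `current` map a firm identifier to its turnover.
    The result lies between -1 (all firms fell) and +1 (all firms rose).
    """
    common = previous.keys() & current.keys()
    if not common:
        raise ValueError("no firms reported in both periods")
    score = 0
    for firm in common:
        if current[firm] > previous[firm]:
            score += 1
        elif current[firm] < previous[firm]:
            score -= 1
    return score / len(common)


# Hypothetical turnover by firm for two consecutive quarters.
q1 = {"A": 100.0, "B": 250.0, "C": 80.0, "D": 40.0}
q2 = {"A": 120.0, "B": 240.0, "C": 95.0, "D": 40.0, "E": 10.0}

print(diffusion_index(q1, q2))  # 0.25
```

Because each firm contributes only the direction of its change, the index is not dominated by a handful of very large firms, which is also why it should not be read as a proxy for a level measure such as GDP.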
Since April 2019 we have been publishing these indicators monthly, within a month of the end of the period of interest. This is one month in advance of official GDP estimates.
|Assessing mode effects in the transformed ONS Opinions and Lifestyle Survey
Tariq Aziz; Office for National Statistics
As part of the Government Digital Strategy to be ‘Digital by Default’, the ONS Opinions and Lifestyle Survey (OPN) has moved from face-to-face to telephone data collection, as the first step towards an online survey followed up by telephone. The transformed survey should ensure more efficient data collection and reduced respondent burden, and could potentially capture harder-to-reach sub-population groups.
Adopting an online/telephone mixed mode requires telephone numbers to be available for the whole target population. However, telephone directories suffer from high under-coverage, so the sample was changed to become a follow-up to the last wave of the Labour Force Survey (LFS).
An iterative delivery was implemented using a two-stage approach to manage, monitor and control these major changes effectively. Specialist research, design and testing have been an intensive part of the OPN transformation process to optimise use of the new data collection methods and maintain statistical quality.
In this paper, we describe the sampling and weighting methods developed to address the effects of differential attrition in the LFS, and the analysis of the mode effects using direct comparisons between estimates and statistical modelling using data from parallel samples.
|Rocking the boat: new systems for collecting and validating maritime statistics using R and Cloud
Nicola George; Matthew Tranter; Sylvia Bolton; Department for Transport
Port freight statistics are collected to provide information on the handling of freight traffic at UK sea ports, to meet an EU directive and for policy monitoring. Until recently, data collection from over 400 providers was done via a 15-year-old website managed externally, which had become increasingly out of date, causing frustration for suppliers.
We have worked with the Department’s Digital Services team to develop a new system to collect, store and validate the data, using an Agile, Cloud-first approach and following GDS principles – including putting users at the heart of the development. We were able to identify where admin data could replace elements of the collection, and received positive feedback from users.
Changing the system has presented challenges in ensuring comparability and coherence of data collection, but also opportunities to enhance quality.
We developed metrics to compare data from the new and old systems, which identified areas for future improvements. We also improved the efficiency of the data validation process, developing new routines using R to more quickly identify and resolve discrepancies. As a result, timeliness of publication can be improved, and other user needs addressed.
This project was the first time a statistical collection in DfT has been developed in-house using Agile and Cloud approaches. Although there were obstacles to overcome, the new system delivers the core data at a considerable saving, and provides scope for enhancing the quality of the statistics in future.
|The impact of changing data and methodology on the proven reoffending National Statistics and the payment by result statistics
Liz Whiting and Andrea Solomou; Ministry of Justice
In 2015 there was a big change to the management and rehabilitation of offenders in the community in England and Wales, known as Transforming Rehabilitation. As part of this, 21 private Community Rehabilitation Companies (CRCs) were introduced, and a payment by results approach was used to reward providers who reduced reoffending. Following the introduction of these reforms and a public consultation, changes were made to the proven reoffending statistics to align them with the payment by results statistics. The aim of this was to create one consistent measure of proven reoffending, to allow users to relate the performance of the CRCs in reducing reoffending with the overall figure for England and Wales, and to better reflect the way offenders were being managed across the system.
Aligning the established proven reoffending statistics required a change in data source used to compile the statistics, and several changes to the methodology – which created many challenges for the team. This presentation will discuss the many issues faced and the approach to dealing with them, while trying to maintain a comparable time series and set of statistics for the user. The data source change had a far-reaching impact, as MoJ had to make contract adjustments for the CRCs. This came under particular scrutiny from the National Audit Office when they conducted their review of Transforming Rehabilitation. More recently, MoJ has announced plans to renationalise probation and end the CRC contracts. What does this mean for the future of the statistics?
|Addressing the methodological challenges of administrative data
Claire Shenton; Lucy Tinkler; Hannah Finselbach; Office for National Statistics
The Office for National Statistics (ONS) is transforming to put administrative and alternative data sources at the core of our statistics. Combining new sources with surveys will allow us to meet the ever-increasing user demand for improved and more detailed statistics. However, using this data involves addressing a range of statistical challenges. ‘The Admin Data Methods Research Project’ (ADMRP) is being led by ONS following an extraordinary general meeting of the Government Statistical Service Methodology Advisory Committee in response to a paper on the topic by Prof Hand. The ADMRP will support internal and external projects which work to address the methodological challenges set out in Prof Hand’s paper. This presentation will discuss the programme, highlighting the importance of sharing information, engaging and working collaboratively with the methodological community to help prioritise the challenges and shape the work programme.
|Identifying communal residences using linked administrative data
Karen Tingay; ONS Methodology and Swansea University
Charles Musselwhite; Swansea University
Communal residences, such as student halls of residence, care homes, or hostels, are important for monitoring migration, socioeconomic status and mobility, and health and wellbeing issues such as the spread of disease. However, these populations may be transitory or unable to complete surveys. Equally, being able to identify and exclude these residences may save interviewers time and effort for surveys of single-residence households. Previous work conducted by the authors using a k-means clustering model on Welsh demographic data successfully identified student halls but was less accurate for older adult care homes, and found two other “general” communal residences. This updated project linked the demographic administrative dataset to GP records, to find demographic and health data that indicate different types of communal residences. A training dataset was created based on aggregate data for each residence, and feature selection identified those variables most likely to indicate distinct residence types. Hierarchical clustering identified distinct groups based on numbers of children, young people, adults, and older people, as well as duration of residency and broad health problems. These clusters were further identified using k-means. The model requires further validation, but the methodology shows promise as a way of automatically clustering potential communal residences using administrative data.
|Why building a model of households using admin data isn’t as easy as it sounds
Karen Tingay; ONS Methodology
Athanasios Anastasiou; Malvern Panalytical
Charles Musselwhite; Swansea University
Longitudinal administrative data sources are increasingly being explored as an alternative to cross-sectional surveys for governmental censuses. For households, changes in both composition and location are routine and even longitudinal surveys might not capture household instability. Surely, though, this is just who is living with who over time, right? In theory, yes, but the data is based on people’s behaviour, and people often don’t think about data. We used Welsh administrative data to define rules for, and develop an algorithm to implement, a finite, deterministic household model. The rules cover different types of households, including transient and multi-generational, but not communal residences such as student halls or residential care homes.
The model was implemented in Python, and provides a running count of the total number of residents at each residence at any given time. Comparison against the 2011 Census showed fewer residences overall, but good accuracy for household size, with no household size category being more than 2% different from the Census. However, the model showed higher, but non-significant, percentages of 3- and 4-person households compared to Census: 17.76% and 12.57% vs 15.01% and 12.1% respectively. These findings might reflect known issues with the way data is collected.
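The running count of residents can be illustrated with a simple sweep over move-in and move-out dates. This is only a sketch of the general idea, not the authors' model: the spell layout `(residence_id, person_id, start, end)`, the convention that `end` is exclusive (or `None` for an ongoing spell) and the toy data are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def running_counts(spells):
    """Build a step function of resident counts for each residence.

    `spells` is an iterable of (residence_id, person_id, start, end), where
    `end` is the first day the person is no longer resident, or None.
    Returns {residence_id: [(change_date, count_from_that_date), ...]}.
    """
    deltas = defaultdict(lambda: defaultdict(int))
    for residence, _person, start, end in spells:
        deltas[residence][start] += 1
        if end is not None:
            deltas[residence][end] -= 1

    timeline = {}
    for residence, by_day in deltas.items():
        count = 0
        steps = []
        for day in sorted(by_day):
            count += by_day[day]
            steps.append((day, count))
        timeline[residence] = steps
    return timeline


def residents_on(timeline, residence, day):
    """Number of residents at `residence` on `day` (0 before the first spell)."""
    count = 0
    for change_day, new_count in timeline.get(residence, []):
        if change_day > day:
            break
        count = new_count
    return count


# Hypothetical spells for one residence.
spells = [
    ("R1", "p1", date(2010, 1, 1), None),
    ("R1", "p2", date(2010, 6, 1), date(2011, 3, 1)),
    ("R1", "p3", date(2011, 3, 1), None),
]
timeline = running_counts(spells)
print(residents_on(timeline, "R1", date(2010, 7, 1)))
```

Netting the changes per day before accumulating means that a move-out and a move-in on the same date leave the count unchanged, rather than producing a spurious dip.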
Further validation for household demographics is needed to identify potential data collection and quality issues. Additional data sources may be required to reduce these.
Afternoon Parallel Sessions
|Dealing with Discontinuities in Survey Reporting Periods and Their Impact on Seasonal Adjustment of Time Series
Sam Jukes; Duncan Elliott; ONS
Survey reporting periods refer to the dates for which respondents are asked to provide data. While the methods for seasonal adjustment can be tailored to account for specific calendar effects arising from the nature of the survey reporting periods, discontinuities in the time series may arise if these periods are changed. This research looks into the case of the UK Retail Sales Index, which has historically been based on a 4-4-5 week reporting period and will change to a calendar month period in 2019. A 4-4-5 reporting frequency does not feature trading day effects and causes artificial moving holidays and phase shifts that affect seasonality. The switch to calendar month reporting will remove these anomalies and introduce a trading day effect, which will be difficult to estimate due to the short span of data. Possible procedures are assessed for calendarizing the historic 4-4-5 series so they can be used for seasonal adjustment after the reporting period change. Methods include using a mixture of simulated and real values, temporal disaggregation, RegARIMA modelling for trading day estimation, state-space modelling, and benchmarking for consistency between different frequencies.
|Dealing with High Frequency Time Series: Seasonal Adjustment of Road Traffic Data
Atanaska Nikolova and Duncan Elliott; ONS
Time series in official statistics are typically of monthly, quarterly, or annual frequency. In recent years there has been increasing interest in higher frequency data, which is timelier and can be used to complement traditional lower frequencies and serve as a faster indicator for changes in the economic climate. This research compares different methods for seasonal adjustment of road traffic data from Highways England, which is available at hourly frequency, but is currently aggregated to monthly for the purpose of regular seasonal adjustment. Some of the methods used include fractional airline decomposition and state space modelling, using newly developed capabilities of the software JDemetra+. The estimated seasonal and calendar effects are compared between different aggregations of the data, investigating how information from higher frequencies (e.g. daily and weekly effects) can be used to inform movements in the monthly equivalent.
|Developing a model for monthly labour market data, constraining to population totals
Office for National Statistics, UK
A new method for single month estimates of labour market data for the number of people in employment, unemployment and inactivity has been included in regular monthly publications since June 2019. This new method uses a state space framework to estimate unobserved components in the data which provide improved estimates of monthly change. Estimation of the input data for the models ensures that the sum of estimates for employment, unemployment and inactivity is equal to known population totals. The outputs of the models can be used to produce non-seasonally adjusted and seasonally adjusted estimates, but these will not necessarily meet this population constraint. This research explores alternative models that attempt to meet such population constraints.
|A statistical quality framework for longitudinally linked administrative data on international migration
Louisa Blackwell, Sarah Cummins, ONS Methodology
Nicky Rogers, ONS Migration Statistics Division
The Office for National Statistics (ONS) is committed to increasing its use of administrative data in the production of statistics. As part of this transformation programme, the aim is to transition incrementally to an admin-based population statistics system. This programme of work forms part of a larger Government Statistical Service (GSS) transformation plan which is recognised in the Home Affairs Select Committee’s report and government response.
Our recent research has focused on the potential to use Home Office Exit Checks data to improve estimation of international migration. These data describe the travel patterns of people who arrived in the UK on non-visit visas. Since the data are generated for operational use, our focus has been to understand and describe their statistical properties. In this research, we are indebted to migration experts at the Home Office, whose encyclopedic knowledge of the data and their systems is greatly contributing to our understanding.
Methodologists working alongside migration experts within ONS have developed a framework for 1) reporting on the statistical quality of linked administrative data on international migration, and 2) identifying the key issues in the statistical design of linked administrative datasets. Our approach draws heavily on the Statistics New Zealand framework for describing administrative data quality. We have taken this taxonomy of statistical error a step further, to include the additional issues and errors associated with the longitudinal linkage of events-based data. We will describe the proposed framework, which will be valuable for administrative data linkage projects more generally.
|Assessing the quality of the Inter-Departmental Business Register through changes in enterprise structures
Rhonda Hypolite; Office for National Statistics
Verity Barnes; Office for National Statistics
The Inter-Departmental Business Register (IDBR) is the main sampling frame used for business surveys by ONS and across government. It contains data for approximately 2.7 million live businesses in all sectors of the UK economy.
Administrative data from Her Majesty’s Revenue and Customs (HMRC) and Companies House are uploaded to the IDBR on a continual basis. We present an investigation into the impact of these updates over time on the structure of businesses recorded on the IDBR. This approach provides a new perspective on the quality of the IDBR and will be a key measure of reliability for the development of a new Statistical Business Register.
|Improving estimates of confidence intervals around smoking quit rates
Marie Horton, Population Health Analysis, Public Health England
Paul Fryers, Public Health Data Science, Public Health England
Clare Griffiths, Population Health Analysis, Public Health England
The 95% confidence intervals for indicators in the Local Tobacco Control Profiles (LTCP) related to smoking quitters per 100,000 smoking population were previously calculated using Byar’s method, which indicates the level of uncertainty around the numerator but assumes there is no uncertainty around the denominator. However, the denominator for this indicator has been derived from population estimates and survey estimates of smoking prevalence, meaning there is some level of uncertainty. We explored other confidence interval methods in order to take this into account.
Using Excel and VBA we used the Wilson Score confidence interval method to calculate the standard errors for the logit-transformed smoking prevalence estimates from the Annual Population Survey (APS). A simulation method using random numbers generated by the Mersenne Twister algorithm was used to create randomised indicator values. From the array of randomised indicator values, the 95th centile smallest and largest values were identified and returned as the 95% confidence intervals.
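The simulation approach can be sketched in Python (the work itself used Excel and VBA). Several assumptions are made here that are not in the abstract: the standard error on the logit scale is taken from the width of the Wilson interval, the quitter count is varied with a normal approximation to a Poisson count, the interval is read off at the 2.5th and 97.5th percentiles of the simulated values, and all input figures are invented. Python's `random` module happens to use the Mersenne Twister generator.

```python
import math
import random

Z = 1.96  # normal quantile for a 95% interval


def wilson_interval(p, n, z=Z):
    """Wilson score interval for a proportion p estimated from n respondents."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


def simulate_quit_rate_ci(quitters, population, prevalence, sample_size,
                          n_sims=10_000, seed=1):
    """Simulated 95% interval for quitters per 100,000 smokers."""
    rng = random.Random(seed)  # Mersenne Twister under the hood
    lo, hi = wilson_interval(prevalence, sample_size)
    se_logit = (logit(hi) - logit(lo)) / (2 * Z)  # assumed SE on the logit scale
    mu = logit(prevalence)
    sims = []
    for _ in range(n_sims):
        prev = inv_logit(rng.gauss(mu, se_logit))                        # uncertain denominator
        num = max(0.0, rng.gauss(quitters, math.sqrt(quitters)))         # uncertain numerator
        sims.append(num / (population * prev) * 100_000)
    sims.sort()
    return sims[int(0.025 * n_sims)], sims[int(0.975 * n_sims) - 1]


# Invented inputs: 400 quitters, adult population 20,000, prevalence 15% from 500 respondents.
lower, upper = simulate_quit_rate_ci(400, 20_000, 0.15, 500)
print(round(lower), round(upper))
```

Rerunning the function with a very large `sample_size` makes the prevalence effectively certain, which gives a rough sense of how much of the interval width comes from the denominator.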
For the majority of local authority estimates the confidence intervals were widened using the new method. In the LTCP this is indicated by more local authorities appearing as amber (not significantly different to England); however, our estimates now present a more accurate reflection of the variation in these indicators.
Where indicators require complex calculations that include estimated values, for example from survey data, it is important to take this into consideration when calculating confidence intervals and ensure that users are aware of the uncertainty surrounding them.
|Improving Efficiency of Imputation Using Machine Learning
Vinayak Anand-Kumar; Katie Davies; ONS
Missing data can be problematic as they may reduce the accuracy and reliability of statistics. Imputation creates values and/or units that fill in the missingness, in an effort to create a dataset that is more representative of the population and concept of interest. Ideally, imputation methods would be informed by the nature of the missingness and developed using the data available. Unfortunately, imputation models are not always empirically tested due to the large volume of data or timeliness constraints. Methodology, in collaboration with the Data Science Campus, investigated the use of supervised Machine Learning (ML) to carry out imputation using an automated and data-driven approach, which would be faster than the current manual, multi-stage approach. The project used an ML library called XGBoost to impute missing values directly and compared this with the standard approach. The presentation will cover the key concepts behind XGBoost and the findings from this programme of work.
|Research into statistical methods for estimating weights for web-scraped data
Heledd Thomas; Daniel Ayoubkhani; Office for National Statistics
Web-scraped price data have the potential to increase sample coverage and reduce costs associated with the Consumer Prices Index (CPI), a headline ONS output and a key indicator of UK economic performance. However, lack of corresponding quantity data makes it impossible to directly calculate the relative importance of each product in the index. The focus of our research was therefore on estimating the products’ weights according to their webpage rankings when sorted “by popularity”.
Using historical transactional data for two consumer items supplied to ONS by a high-street retailer, we applied various formulae to directly translate product rankings to weights. However, the generalisability of these formulae to items beyond the two under analysis remains unknown, so we also investigated the use of statistical distributions for predicting product quantities using only basic summary statistics that may be supplied by retailers, an innovative methodological development in the measurement of consumer prices.
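To make the idea of translating rankings to weights concrete, the sketch below uses a power-law rule in which the weight of a product is proportional to 1 / rank^exponent. This rule is purely illustrative and is not taken from the research; the abstract does not state which formulae were tested, and the price relatives are invented.

```python
def rank_weights(n_products, exponent=1.0):
    """Weights proportional to 1 / rank**exponent, normalised to sum to 1.

    Rank 1 is the most popular product. An exponent of 0 gives equal weights;
    larger exponents concentrate weight on the top-ranked products.
    """
    raw = [1 / rank ** exponent for rank in range(1, n_products + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_price_relative(price_relatives, weights):
    """Weighted arithmetic mean of price relatives (current price / base price)."""
    return sum(w * r for w, r in zip(weights, price_relatives))


# Invented price relatives for three products, listed in popularity order.
relatives = [1.02, 0.98, 1.10]
weights = rank_weights(len(relatives))
print(weights)
print(weighted_price_relative(relatives, weights))
```

With transactional data available for a few items, the exponent could in principle be tuned so that the rank-based weights track observed expenditure shares, although whether such a tuning generalises to other items is exactly the open question the abstract raises.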
In this presentation we will outline our methods and results in detail, framing our research in the context of broader development work being undertaken by ONS to enhance the use of non-survey data sources in the production of the CPI.
|Valuing Green Spaces in Urban Areas: a Hedonic Price Approach using Machine Learning techniques
Luke Lorenzi & Vahé Nafilyan, ONS
In this paper we estimate the value of recreational and aesthetic services provided by green and blue spaces in urban areas in Great Britain. To do so, we create a unique house-level dataset by linking data from a property website to a comprehensive data set of urban green spaces, as well as data on air and noise pollution and measures of school distance and quality. We extend the traditional hedonic pricing approach by using Machine Learning techniques to flexibly model house prices. Unlike standard hedonic pricing via linear regression, our model does not rely on any assumptions regarding the relationship between house prices and the wide range of structural, neighbourhood and environmental characteristics. We compute partial dependency plots to display the marginal effects of green and blue spaces on house prices and test whether they are linear.
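The idea behind a partial dependence curve can be sketched in a few lines: fix one feature at each grid value for every house, average the model's predictions, and inspect how the average changes. The `toy_model` function, the feature layout and the numbers below are hypothetical stand-ins, not the authors' fitted model or data.

```python
from statistics import mean


def partial_dependence(model, rows, feature, grid):
    """Average model prediction with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature] = value  # override only the feature of interest
            preds.append(model(modified))
        curve.append(mean(preds))
    return curve


# Hypothetical stand-in for a fitted price model.
# Assumed feature layout: [distance_to_park_km, floor_area_m2]
def toy_model(row):
    return 200_000 - 10_000 * row[0] + 1_500 * row[1]


houses = [[0.2, 80.0], [1.0, 100.0], [2.5, 120.0]]
curve = partial_dependence(toy_model, houses, feature=0, grid=[0.0, 1.0, 2.0])
```

Because the stand-in model is linear, the curve falls by the same amount at each grid step; a flexible ML model could instead produce a curved or stepped relationship, which is what a linearity check would look for.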
|Developing a model to impute rent for owner-occupiers’ housing costs
Melanie Lewis; Daniel Ayoubkhani; Office for National Statistics
In addition to headline indices that measure price inflation for the UK economy, such as the Consumer Prices Index including owner occupiers’ housing costs (CPIH), the Office for National Statistics (ONS) produces CPIH-consistent inflation rates for different household groups. The CPIH uses “rental equivalence” methodology (also known as “imputed rents”) to estimate owner occupiers’ housing costs (OOH), reflecting the idea that dwellings provide a flow of services which are consumed by the homeowner. The actual rent paid by tenants is used to impute the equivalent rent for owner occupiers. In essence, the methodology asks, “How much rent would a homeowner need to pay to rent the home they live in?”
Using microdata from the Living Costs and Food Survey (LCF), as well as aggregate data from bodies such as the Valuation Office Agency, we have developed a statistical model to estimate OOH for different household groups. As tenancy type is not a random characteristic - people rent rather than owning their own home for a variety of reasons - a two-stage Heckman modelling approach was used to account for selection bias in the modelled sample. Another key feature was the use of a Box-Cox transformation to overcome skew in the distribution of rent, which required a bias correction to be applied to the back-transformed predictions.
This presentation will explain the CPIH-consistent rental equivalence methodology and outline details of the Heckman model, noting recent methodological improvements and discussing the subsequent impact on the final output compared with results in previous years.
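To show why back-transformed Box-Cox predictions need a bias correction, the sketch below compares a naive back-transform with a smearing-style correction that averages the back-transform over the model residuals. The abstract does not say which correction was applied, so the smearing estimator, the value of `lam` and all numbers here are assumptions for illustration.

```python
import math


def boxcox(y, lam):
    """Box-Cox transform of a positive value y."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def inv_boxcox(z, lam):
    """Inverse Box-Cox transform (requires lam * z + 1 > 0 when lam != 0)."""
    return math.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)


def smeared_prediction(z_hat, residuals, lam):
    """Smearing-style correction: average the back-transform over the residuals."""
    return sum(inv_boxcox(z_hat + e, lam) for e in residuals) / len(residuals)


lam = 0.5                      # hypothetical transformation parameter
z_hat = boxcox(900.0, lam)     # hypothetical prediction on the transformed scale
residuals = [-4.0, 0.0, 4.0]   # hypothetical transformed-scale residuals
naive = inv_boxcox(z_hat, lam)
corrected = smeared_prediction(z_hat, residuals, lam)
```

With `lam` below 1 the inverse transform is convex, so the naive back-transform understates the mean on the original scale and the corrected value is larger.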
|Developing measures to track dispersion in Public Health Outcomes Framework indicators
David Jephson; James Westwood; Paul Fryers; Public Health England
Public Health England have been developing a method to measure changes in the dispersion of values across local authorities for a range of indicators. Traditionally, we track whether national values for an indicator are improving or worsening over time and do the same, separately, for local authorities. This new measure aims to help assess both together by considering whether changes at national level are due to the gap between the best and worst performing areas narrowing or widening over time. A number of approaches have been investigated and a Gini-coefficient-based measure using bootstrapped confidence intervals has been developed. This has been applied to each year of data available for each indicator. A chi-squared test for trend has then been used to determine whether there have been changes over time for both national values and dispersion of local authorities. Initial results show a mixed picture. In some cases, whilst improvements have been made at national level, the dispersion of local authority values has been widening. This suggests that areas that began with better outcomes were improving at a faster rate than those with worse outcomes. In other cases, improvements at national level have seen a narrowing of the dispersion. This measure could be used to track progress for a broad range of public health topics. It aims to enable policy decision making to be based not only on the direction of travel at national level, but to also be informed by the spread of outcomes by local area.
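A minimal sketch of a Gini coefficient with a percentile bootstrap confidence interval is given below for a single year of data. The exact measure and resampling scheme used by the authors are not described in the abstract, so the formula choice, the percentile interval and the example values are assumptions.

```python
import random


def gini(values):
    """Gini coefficient of non-negative values (0 = all values equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    ranked_sum = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * ranked_sum / (n * total) - (n + 1.0) / n


def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    reps = sorted(stat(rng.choices(values, k=len(values))) for _ in range(n_boot))
    lower = reps[int((alpha / 2) * n_boot)]
    upper = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical indicator values for eight local authorities in one year
rates = [12.1, 15.4, 9.8, 20.3, 17.6, 11.2, 14.9, 25.0]
estimate = gini(rates)
low, high = bootstrap_ci(rates, gini)
```

Repeating this for each year would give a series of estimates and intervals whose trend could then be tested.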
|The changing costs of defence: how we improved our price index forecast tool using R
Anne Foulger, Ministry of Defence
The MOD's Price Indices team provides advice and data on inflation rates to help the department's procurement teams manage inflation in long-term contracts.
We collect data on around 200 price indices from ONS and run a backwards stepwise regression model on each index to produce a forecast. We update the forecasts regularly and present them on an interactive tool called Indigo, which can be accessed by anyone in MOD.
In this talk I will discuss recent developments to the Indigo tool, including how we have automated the collection and formatting of large amounts of ONS price index data using R, and how we have tested various regression modelling approaches to see if we can improve our forecasts.