GSS Data Blog – ‘What we can do with prices data scraped from the web’

ONS has recently published updated research into using web scraping technologies in the calculation of consumer price statistics. Read on for more information.

My name is Tanya Flower. I work in the Prices Division at ONS, on the prices web scraping project. As Jane Naylor, Head of the Big Data team at ONS, mentioned in her last blog, this is one of a number of projects investigating the benefits and challenges of using such data and the associated technologies within official statistics. Prices Division, together with the Big Data team and Methodology colleagues, has been investigating how price data from online supermarket chains could be used within price statistics.

The growth of online retailing over recent years means that price information for many goods and services can now be found online. Web scrapers are software tools for extracting these data from web pages. The Big Data team has developed prototype web scrapers for three online supermarket chains: Tesco, Sainsbury's and Waitrose, which have been running since June 2014. These scrapers were programmed in Python using the Scrapy framework. Every day at 5.00 am the web scrapers automatically extract prices for 33 items in the Consumer Price Index (CPI) basket, covering things like bread and alcohol.
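
To make the idea of a web scraper concrete, here is a minimal sketch of the extraction step. It is not the ONS scraper: it uses Python's standard-library html.parser rather than Scrapy so that it runs without extra installs, and the page layout, the class names ("product", "name", "price") and the prices in the sample page are all invented for illustration.

```python
from html.parser import HTMLParser


class PriceQuoteParser(HTMLParser):
    """Collect (product name, price) pairs from a hypothetical product listing.

    Assumes each product is marked up as
    <li class="product"><span class="name">...</span><span class="price">£1.50</span></li>
    which is an invented layout, not any retailer's real page structure.
    """

    def __init__(self):
        super().__init__()
        self.quotes = []      # finished (name, price_in_pounds) pairs
        self._field = None    # which span we are currently inside, if any
        self._current = {}    # fields gathered for the product in progress

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a name or price span.
        if self._field is not None:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            # A product is complete only if both fields were found.
            if {"name", "price"} <= self._current.keys():
                price = float(self._current["price"].strip().lstrip("£"))
                self.quotes.append((self._current["name"].strip(), price))
            self._current = {}


def extract_quotes(html):
    """Return the list of (name, price) pairs found in an HTML string."""
    parser = PriceQuoteParser()
    parser.feed(html)
    parser.close()
    return parser.quotes


# Made-up sample page; product names are from the article, prices are not.
SAMPLE_PAGE = """
<ul>
  <li class="product"><span class="name">Pink Lady Apples</span><span class="price">£2.00</span></li>
  <li class="product"><span class="name">Tesco Pack 4 Apples</span><span class="price">£1.50</span></li>
</ul>
"""

quotes = extract_quotes(SAMPLE_PAGE)
```

A production scraper would also need to fetch pages on a schedule, follow the site's category structure and cope with layout changes, none of which is shown here.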

The web scrapers use the websites' own classification structure to identify suitable products that fit within the CPI item description. For example, for the CPI item apples (dessert), per kg, products collected include Pink Lady Apples and Tesco Pack 4 Apples. The number of products extracted within each item category varies depending on the number of products stocked by each supermarket. On average over the period, approximately 5,000 price quotes are extracted by the web scrapers per day for the 33 items (approximately 150,000 a month). By contrast, the traditional collection approach for most grocery items is for a price collector to go into a local retailer once a month and collect prices for representative products. For these 33 items, this equates to approximately 6,800 price quotes a month.
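
The difference in scale between the two collection methods can be checked with a quick calculation. The sketch below uses the approximate figures quoted above and assumes a 30-day month.

```python
QUOTES_PER_DAY = 5_000          # approximate daily web scraped quotes (from the text)
TRADITIONAL_PER_MONTH = 6_800   # approximate monthly traditional quotes (from the text)
DAYS_PER_MONTH = 30             # assumption: a 30-day month

scraped_per_month = QUOTES_PER_DAY * DAYS_PER_MONTH
ratio = scraped_per_month / TRADITIONAL_PER_MONTH

print(f"Scraped quotes per month: {scraped_per_month:,}")
print(f"Roughly {ratio:.0f} times the traditional collection")
```

On these figures the web scrapers gather about 22 times as many price quotes a month as the traditional collection does for the same 33 items.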

Once the data are collected, there are a number of steps involved in developing experimental research indices from them. Methodology and Big Data have been experimenting with machine learning techniques to identify misclassified items. The results are then validated using an algorithm designed to identify anomalies, such as a loaf of bread priced at £100. The cleaned data are a much more accurate and reliable source of price information than the raw data scraped from the websites.
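
As an illustration of what an anomaly check can look like, here is a minimal sketch that flags prices lying far from the median of their item group. This is not the algorithm used by ONS, which is not described in this post; the rule, the function name flag_anomalies and the threshold of 5 are assumptions chosen for the example, and the bread prices are made up.

```python
from statistics import median


def flag_anomalies(prices, max_ratio=5.0):
    """Return a list of booleans, True where a price looks anomalous.

    A price is flagged when it is more than `max_ratio` times the median
    of its item group, or less than the median divided by `max_ratio`.
    Both the rule and the default threshold are illustrative assumptions.
    """
    if not prices:
        return []
    mid = median(prices)
    return [p > mid * max_ratio or p < mid / max_ratio for p in prices]


bread_prices = [1.00, 1.10, 0.95, 100.00, 1.20]   # made-up quotes in pounds
flags = flag_anomalies(bread_prices)

# Keep only the quotes that were not flagged.
clean = [p for p, bad in zip(bread_prices, flags) if not bad]
```

Here the £100 loaf is flagged and dropped, while the other four quotes pass through. Using the median rather than the mean keeps the reference point from being dragged upwards by the very outlier the check is trying to catch.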

Compiling high frequency data into price indices presents a unique set of challenges, which must be resolved before the data can be put to effective use. We may see differences in price levels or price dynamics depending on the choice of index compilation method or type of good.
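
To show how the choice of compilation method can change the result, the sketch below compares two standard elementary index formulae on the same made-up prices: the Jevons index (geometric mean of price relatives) and the Dutot index (ratio of average prices). This post does not say which formulae the research indices use, so treat these two as examples only. The sketch also assumes the same three products are observed in both periods.

```python
from math import prod
from statistics import mean


def jevons(base, current):
    """Jevons index: geometric mean of the price relatives, base period = 100."""
    relatives = [c / b for b, c in zip(base, current)]
    return 100 * prod(relatives) ** (1 / len(relatives))


def dutot(base, current):
    """Dutot index: ratio of arithmetic mean prices, base period = 100."""
    return 100 * mean(current) / mean(base)


# Made-up prices for three matched products; only the cheapest one rises (by 10%).
base_prices = [1.00, 2.00, 4.00]
current_prices = [1.10, 2.00, 4.00]

jevons_index = jevons(base_prices, current_prices)
dutot_index = dutot(base_prices, current_prices)

print(f"Jevons: {jevons_index:.2f}")
print(f"Dutot:  {dutot_index:.2f}")
```

With these prices the Jevons index comes out at about 103.2 and the Dutot index at about 101.4. The Dutot index gives more weight to expensive products, so a rise in the cheapest product moves it less.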

For more information about this work, and a list of methodological developments planned for ONS web scrapers over the next 6 to 12 months, please see the update we published on 23 May: “Research indices using web scraped price data: May 2016 update”.

This update includes an interactive tool, which is useful for comparing different indices across items and frequencies.