Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Date post: 25-Feb-2016
Q2014 - European Conference on quality in official statistics, Vienna, 2-5 June 2014 (4 June). Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation. Riccardo Giannini, Rosanna Lo Conte, Stefano Mosca, Federico Polidoro, Francesca Rossetti
Transcript
Page 1: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Q2014 - European Conference on quality in official statistics, Vienna, 2-5 June 2014 (4 June)

Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation

Riccardo Giannini, Rosanna Lo Conte, Stefano Mosca, Federico Polidoro, Francesca Rossetti

Page 2: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Outline of the presentation

1. Implementing European project “Multipurpose price statistics”

2. Centralised data collection for Italian Harmonized Index of Consumer Prices (HICP)

3. Testing and implementing web scraping techniques on the survey concerning prices of “consumer electronics” products

4. Testing web scraping techniques on the survey concerning “airfares”

5. IT choices adopted to implement web scraping procedures

6. Possible future developments and conclusive remarks

Page 3: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Implementing European project “Multipurpose price statistics”

• Price statistics play a key role in the choices of policy makers

• Quality must be ensured and improved in terms of methodology and production process, with specific reference to the data collection phase

• The demand for timely and cost-efficient production of high-quality statistical data is increasing, as is the need for new solutions to declining response levels (Scheveningen Memorandum)

• Multipurpose price statistics as a reply to these requirements

Page 4: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Implementing European project “Multipurpose price statistics”

Multipurpose price statistics:

• Modernizing data collection methods

• Linking of HICP and PPP processes

• Developing a data warehouse approach

• Providing more detailed and timely HICPs, Price Level Indices (PLIs) and Detailed Average Prices (DAP)

Page 5: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Implementing European project “Multipurpose price statistics”

• Modernization of data collection tools to improve HICP quality is one of the pillars of “Multipurpose Price Statistics”

• Focus on “scanner data” and “web scraping” techniques as tools to capture large amounts of data for the compilation of inflation estimates

• Concerning web scraping, Istat is testing and implementing procedures to “scrape” large amounts of data for HICP purposes, using the Internet as a data source

• Focus on two groups of products: “consumer electronics” (goods) and “airfares” (services)

Page 6: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Centralised data collection for Italian HICP

BREAKDOWN OF THE BASKET OF PRODUCTS IN TERMS OF WEIGHTS

Territorial data collection: 78.5%

Centralised data collection: 21.5%, of which:

• Entire external databases (medicines, school books, household contribution to National Health Service): 0.6%

• The most efficient way to collect prices necessary for indices compilation (e.g. camping, package holidays, highway tolls): 11.6%

• Prices referred to real purchases on the Internet (e.g. air tickets, consumer electronics and e-book readers): 2.3%

• Other prices centrally collected (e.g. tobacco and cigarettes): 7.0%
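As a quick consistency check on the weights above, the short Python sketch below adds up the four centralised components and compares them with the 21.5% and 78.5% shares reported on the slide. The dictionary keys are shortened labels chosen here for illustration; only the figures come from the slide.

```python
# Weights (% of the HICP basket) as reported on the slide.
# The keys are shortened labels introduced for this sketch.
centralised = {
    "entire external databases": 0.6,
    "most efficient way to collect prices": 11.6,
    "real purchases on the Internet": 2.3,
    "other prices centrally collected": 7.0,
}
territorial = 78.5

# Round to one decimal to avoid floating-point noise in the sums.
centralised_total = round(sum(centralised.values()), 1)
basket_total = round(centralised_total + territorial, 1)

# Share of the centralised collection that refers to Internet purchases.
internet_share = round(
    100 * centralised["real purchases on the Internet"] / centralised_total, 1
)

print(centralised_total, basket_total, internet_share)
```

The components reproduce the 21.5% centralised share, and prices referring to Internet purchases make up roughly a tenth of it.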

Page 7: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014


Centralised data collection for Italian HICP

CRITERIA TO SELECT THE PRODUCTS FOR TESTING WEB SCRAPING TECHNIQUES

Representativeness of both goods and services

Relevance of web as retail trade channel

Products for which the phase of data collection is extremely time consuming

Products for which it is important to widen the coverage of the sample in both temporal and spatial terms, overcoming the constraints of manual data collection

Page 8: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Centralised data collection for Italian HICP

RELEVANCE OF WEB AS RETAIL TRADE CHANNEL

Table 1. E-commerce. Individuals aged 14 and over who have used the web during the last 12 months who have bought or ordered goods or services for private use over the Internet, by groups of products purchased or ordered. 2012. Percentages

Group of products | %
Overnight stays for holidays (hotels, pensions, etc.) | 35.5
Other travel expenditures for holidays (railway and air tickets, car rental, etc.) | 33.5
Clothing and footwear | 28.9
Books, newspapers, magazines, including e-books | 25.1
Tickets for shows | 19.7
Consumer electronics products | 18.6
Articles for the house, furniture, toys, etc. | 17.9
Others | 15.1
Film, music | 14.4
Telecommunication services | 14.0
Software for computer and updates (excluding videogames) | 11.5
Hardware for computer | 8.4
Videogames and their updates | 8.0
Financial and insurance services | 6.0
Food products | 5.6
Material for e-learning | 2.8
Games of chance | 1.2
Medicines | 0.8

Source: Istat survey on “Aspects of daily life”

Page 9: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Centralised data collection for Italian HICP

• The choice finally fell on two groups of products: consumer electronics (goods) and airfares (services)

• Testing web scraping techniques on these two groups of products was aimed first of all at making online data collection more efficient for products for which the web is a relevant retail trade channel

• A further aim was to explore the potential of web scraping to allow better coverage of the reference population (linked to the issue of using big data for statistical purposes and the consequences of this use for traditional sampling methodologies)

Page 10: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing and implementing web scraping techniques on “consumer electronics” products

• The survey on prices of consumer electronics is carried out in ten phases and concerns:

Mobile phones
Smartphones
PC notebook
PC desktop
PC tablet
PC peripherals: monitors
PC peripherals: printers
Cordless or wired telephones
Digital cameras
Video cameras

Page 11: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing and implementing web scraping techniques on “consumer electronics” products

Phase 2: Market segmentation based on technical specifications and performance (annually fixed). Example, digital cameras: seg1 = ‘compact’ camera; seg2 = ‘bridge’ camera; etc.

Phase 3: Identification of minimum requirements to be satisfied (annually fixed). Example, PC desktop: O.S. at least Windows XP, HD capacity 160 GB or higher, RAM at least 2 GB, etc.

Phase 4: Monthly data collection of the whole range of models, in terms of commercial name and main technical specifications, offered on the market by the main brands, within the segments identified in phase 2 and satisfying the minimum requirements identified in phase 3. In this phase the sample is selected for a specific month (‘continually updated’ sample with ‘automated’ replacement of models that are losing importance in the market).

Phase 5: Price data collection, for all the models included in the sample, from the web site of each shop sampled.

• Manual detection: for some shops (9) price collectors scanned the corresponding websites manually and registered the prices in external files or databases.

• Semi-automatic detection: for other shops (9) price lists were manually downloaded (“copy and paste”), then formatted and submitted to SAS procedures that (automatically) linked the product codes identified in phase 4 to the codes in the list from each store.

The web scraping test focused on Phase 5 and semi-automatic detection (a time-consuming phase, and one where the test was feasible).
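The linking step of Phase 5 can be pictured with a small Python sketch. It is not the SAS procedure used by Istat, whose matching rules are not described in the slides. It assumes that a shop's code and a sample code refer to the same model when they are equal after removing case, spaces and punctuation, and that the first matching price is kept. The product codes and prices are invented for illustration.

```python
import re


def normalise(code: str) -> str:
    """Upper-case a product code and keep only letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", code.upper())


def link_prices(sample_codes, price_list):
    """Link one shop's price list to the model codes in the sample.

    sample_codes: model codes selected in phase 4.
    price_list: (shop_code, price) pairs taken from one shop's list.
    Returns {sample code: price}; the first match per model is kept
    (an assumption made for this sketch).
    """
    by_key = {normalise(code): code for code in sample_codes}
    linked = {}
    for shop_code, price in price_list:
        sample_code = by_key.get(normalise(shop_code))
        if sample_code is not None and sample_code not in linked:
            linked[sample_code] = price
    return linked


# Hypothetical codes and prices, for illustration only.
sample = ["AB-123", "CD 456"]
shop_list = [("ab 123", 99.0), ("ZZ-999", 15.5), ("CD-456", 349.9)]

linked = link_prices(sample, shop_list)
link_rate = 100 * len(linked) / len(shop_list)
print(linked, round(link_rate, 1))
```

Listed items whose code is not in the sample (here `ZZ-999`) are simply left unlinked, which is why the share of linked quotes can be far below 100%.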

Page 12: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing and implementing web scraping techniques on “consumer electronics” products

EVALUATION OF THE RESULTS OBTAINED IN TERMS OF

• Number of prices downloaded in the lists

• Number of prices that it was possible to link automatically for each store (via SAS procedures) to the product codes in the sample selected in phase 4

• Improvements in efficiency in terms of time saved

Page 13: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing and implementing web scraping techniques on “consumer electronics” products

Table 2. Number of elementary price quotes manually collected and web scraped and usable for indices compilation. A comparison between February and March 2013

Products | Manually downloaded lists of prices, February 2013 | Web scraped lists of prices, March 2013
Cordless or wired telephones | 195 | 185
Mobile phones | 102 | 111
Smartphones | 174 | 171
PC desktop | 102 | 83
PC peripherals: monitors | 142 | 310
PC notebook | 328 | 433
PC peripherals: printers | 383 | 421
PC tablet | 100 | 87
Digital cameras | 392 | 322
Video cameras | 103 | 83
Total | 2021 | 2206

Source: Istat

Page 14: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing and implementing web scraping techniques on “consumer electronics” products

Table 3. Current workload for monthly data collection. Comparison between manual and web scraping download. Minutes (totals in hours and minutes)

Online shop website | Number of products | Manual download: navigation, copy and paste (minutes) | Manual download: standardization of formats (minutes) | iMacros download: macro execution (minutes) | iMacros download: formatting output (minutes)
www.compushop.it | 10 | 50 | 80 | 15 | 50
www.ekey.it | 6 | 30 | 20 | 15 | 70
www.misco.it | 10 | 60 | 90 | 10 | 45
www.pmistore.it | 7 | 40 | 90 | 15 | 20
www.softprice.it | 10 | 90 | 180 | 25 | 40
www.syspack.it | 8 | 45 | 90 | 20 | 45
Total time | | 5 hours 15 minutes | 9 hours 10 minutes | 1 hour 40 minutes | 4 hours 30 minutes

Source: Istat

The comparison between the workload necessary for the manual detection of prices and that for the download of prices through web scraping macros shows the efficiency gains obtained
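The totals in Table 3 can be recomputed from the per-shop figures. The Python sketch below sums the four time columns and derives the overall monthly workload for each method. The percentage saving is computed here and is not a figure quoted in the presentation.

```python
# (shop, products, manual navigation, manual standardization,
#  iMacros execution, iMacros output formatting); times in minutes,
# copied from Table 3.
rows = [
    ("www.compushop.it", 10, 50, 80, 15, 50),
    ("www.ekey.it", 6, 30, 20, 15, 70),
    ("www.misco.it", 10, 60, 90, 10, 45),
    ("www.pmistore.it", 7, 40, 90, 15, 20),
    ("www.softprice.it", 10, 90, 180, 25, 40),
    ("www.syspack.it", 8, 45, 90, 20, 45),
]


def hours_minutes(minutes: int) -> str:
    """Format a duration in minutes as 'H h MM min'."""
    hours, rest = divmod(minutes, 60)
    return f"{hours} h {rest:02d} min"


# Column totals, in the same order as the time columns of the table.
column_totals = [sum(row[i] for row in rows) for i in range(2, 6)]

manual_total = column_totals[0] + column_totals[1]
scraping_total = column_totals[2] + column_totals[3]
saving_pct = round(100 * (manual_total - scraping_total) / manual_total, 1)

print([hours_minutes(t) for t in column_totals])
print(hours_minutes(manual_total), hours_minutes(scraping_total), saving_pct)
```

The column totals match those reported in the table. Most of the remaining web scraping time is spent formatting the output, not running the macros.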

Page 15: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing and implementing web scraping techniques on “consumer electronics” products

Table 4. Annual working hours for half of the shops sample for data collection of prices of consumer electronics products. Comparison between manual download and web scraping. Hours

Activity | Manual download | Web scraping
Starting workload (annual changing base) | 0 | 34
Current maintenance | 0 | 12
Current data collection | 173 | 74
Total working hours | 173 | 120

Source: Istat

A correct evaluation of the efficiency gains requires assessing the annual workload, taking into account the starting cost of developing the macros (starting workload, to be incurred at each annual change of the index base) and the maintenance of the macros (current maintenance).

Efficiency also increases on an annual basis: the workload necessary to manage the survey falls from about 23 working days to 16 working days (more than 30% of time saved, a gain that grows if we take into account that human resources become available for other tasks).

Hence the choice to move the web scraping macros for consumer electronics from test to production.
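The annual figures in Table 4 can be checked in a few lines of Python. The sketch below reproduces the totals and the percentage saving. The conversion into working days assumes a working day of 7.5 hours, a value inferred here from the slide's own figures (173 hours ≈ 23 days, 120 hours = 16 days) and not stated in the presentation.

```python
# Annual working hours from Table 4.
hours = {
    "manual": {"starting": 0, "maintenance": 0, "collection": 173},
    "web_scraping": {"starting": 34, "maintenance": 12, "collection": 74},
}

# Assumption for this sketch: one working day = 7.5 hours.
HOURS_PER_DAY = 7.5

totals = {method: sum(parts.values()) for method, parts in hours.items()}
days = {method: round(total / HOURS_PER_DAY, 1) for method, total in totals.items()}
saving_pct = round(
    100 * (totals["manual"] - totals["web_scraping"]) / totals["manual"], 1
)

print(totals, days, saving_pct)
```

Even after charging the 46 hours of macro development and maintenance to web scraping, the saving stays above 30%.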

Page 16: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing and implementing web scraping techniques on “consumer electronics” products

Table 5. Sample of models, price quotes scraped and price quotes collected for consumer electronics products survey for Italian CPI/HICP compilation. January 2014. Units and percentages

Product | Number of models in the sample | Number of price quotes web scraped | Number of price quotes collected and linked to the sample | Price quotes linked / price quotes scraped (%)
Cordless or wired telephones | 190 | 844 | 224 | 26.5
Mobile phones | 63 | 2024 | 108 | 5.3
Smartphones | 131 | 2396 | 187 | 7.8
Digital cameras | 352 | 2642 | 400 | 15.1
PC desktop | 37 | 1837 | 81 | 4.4
PC peripherals: monitors | 273 | 2734 | 299 | 10.9
PC notebook | 179 | 3597 | 288 | 8.0
PC peripherals: printers | 143 | 5887 | 370 | 6.3
PC tablet | 179 | 1824 | 42 | 2.3
Video cameras | 152 | 560 | 56 | 10.0
TOTAL | 1699 | 24345 | 2055 | 8.4

First column: the number of models selected in the sample in phase 4

Second column: the number of elementary price quotes scraped

Third column: the number of elementary quotes that it was possible to link with the codes of the models selected in phase 4
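The last column of Table 5 is the ratio of linked to scraped quotes. The Python sketch below recomputes it from the counts in the table and also derives the number of scraped quotes that could not be linked. That last number is computed here and is not reported on the slide.

```python
# product: (price quotes web scraped, price quotes linked to the sample),
# copied from Table 5.
quotes = {
    "Cordless or wired telephones": (844, 224),
    "Mobile phones": (2024, 108),
    "Smartphones": (2396, 187),
    "Digital cameras": (2642, 400),
    "PC desktop": (1837, 81),
    "PC peripherals: monitors": (2734, 299),
    "PC notebook": (3597, 288),
    "PC peripherals: printers": (5887, 370),
    "PC tablet": (1824, 42),
    "Video cameras": (560, 56),
}

# Linked / scraped, in percent, rounded to one decimal as in the table.
rates = {
    product: round(100 * linked / scraped, 1)
    for product, (scraped, linked) in quotes.items()
}

total_scraped = sum(scraped for scraped, _ in quotes.values())
total_linked = sum(linked for _, linked in quotes.values())
total_rate = round(100 * total_linked / total_scraped, 1)
unlinked = total_scraped - total_linked

print(rates)
print(total_scraped, total_linked, total_rate, unlinked)
```

More than nine scraped quotes out of ten are not used for index compilation.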

Page 17: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing and implementing web scraping techniques on “consumer electronics” products

The potential of web scraping techniques in terms of amount of information captured

A big “waste” of information: why?

Because it is, from the statistical point of view, useless for the aim of estimating inflation for consumer electronics products?

Or because the sampling schemes were conceived taking into account the constraints of data collection, and in particular the limits on collecting all the information available on the web?

Does the possible massive use of web scraping techniques open “new doors” to statisticians concerning the capture of the information necessary to measure the object of the survey?

Page 18: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing web scraping techniques on the survey concerning “airfares”

A BRIEF DESCRIPTION OF THE SURVEY

COICOP class (passenger transport by air; weight in the total HICP basket of products = 0.85% in 2013) divided into three consumption segments: domestic flights, European flights and intercontinental flights, further stratified by type of carrier, destination and route

In 2013 the final sample consisted of 208 routes (from/to 16 Italian airports): 47 national routes, 97 European routes and 64 intercontinental routes. 81 routes referred to traditional carriers (TCs) and 127 routes to low-cost carriers (LCCs).

Product definition: one ticket, economy class, adult, fixed route connecting two cities, outward and return trip, on fixed departure/return days, final price including airport or agency fees

Page 19: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing web scraping techniques on the survey concerning “airfares”

A BRIEF DESCRIPTION OF THE SURVEY

Prices are collected by means of purchasing simulations on the Internet, according to a pre-fixed yearly calendar (first Tuesday of the month, considering 2 or 4 time distances from the date of departure)

In 2013, data collection was carried out on 16 LCCs’ websites and on three web agencies selling air tickets (Opodo, Travelprice and Edreams), where only TCs’ airfares are collected

More than 960 elementary price quotes are registered monthly. Each corresponds to the cheapest economy fare available at the moment of booking for the route and dates selected, including taxes and compulsory service charges
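The pre-fixed yearly calendar (first Tuesday of each month) is easy to generate programmatically. The Python sketch below is one way to do it with the standard library. The function names are hypothetical, and the sketch covers only the collection days, not the departure dates derived from them.

```python
from datetime import date, timedelta


def first_tuesday(year: int, month: int) -> date:
    """Return the first Tuesday of the given month."""
    first_day = date(year, month, 1)
    # date.weekday(): Monday is 0, so Tuesday is 1.
    return first_day + timedelta(days=(1 - first_day.weekday()) % 7)


def collection_calendar(year: int) -> list:
    """Return the twelve collection days (first Tuesdays) of a year."""
    return [first_tuesday(year, month) for month in range(1, 13)]


calendar_2013 = collection_calendar(2013)
print([day.isoformat() for day in calendar_2013])
```

Fixing the calendar in advance keeps the purchasing simulations comparable from month to month, whether they are run manually or by a macro.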

Page 20: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing web scraping techniques on the survey concerning “airfares”

TESTING WEB SCRAPING

The aim of testing web scraping techniques on airfares is twofold:

• verifying the improvements in efficiency

• evaluating the chance of extending data collection to further dates (two and three months of “purchasing advance”) beyond those ordinarily scheduled (monthly or twice a month, with departure dates ten days and one month later), exploiting the potential of web scraping procedures

Given the characteristics and peculiarities of the survey on airfares, testing web scraping techniques on airfares data collection required not only developing and assembling scraping macros but also implementing a multitude of logic controls derived from the statistical design of the survey

Web scraping techniques have been tested on EasyJet, Ryanair and Meridiana and, for the traditional airline companies, on Opodo.it

Page 21: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing web scraping techniques on the survey concerning “airfares”

RESULTS OF TESTING WEB SCRAPING

With regard to the LCCs, each airline company site showed its own specific problems:

• EasyJet did not allow prices to be scraped directly using the traditional link www.easyjet.com/it/ and required specific airport descriptions (different from the simple IATA airport codes)

• Ryanair, at the very beginning of the tests, presented a CAPTCHA (“Completely Automated Public Turing test to tell Computers and Humans Apart”), a challenge-response test used to determine whether the web user is human or not

• The Meridiana website, in replying to a specific query, showed additional pages offering optional services or asking for travellers’ information before displaying the final price, thus obliging us to develop a distinct and more complex macro to scrape prices

Page 22: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing web scraping techniques on the survey concerning “airfares”

RESULTS OF TESTING WEB SCRAPING

Finally, attention was concentrated on EasyJet

The macros developed have provided excellent results in correctly replicating manual data collection. They have been used in current activities since the most recent data collections

Improvements in terms of time saving have been quite small, for two reasons. First, time is spent preparing the input files that the macros use to identify the routes and dates for which prices are scraped and to return an output usable for index compilation. Second, the limited number of elementary quotes involved (60) does not allow a meaningful measure of the time saving that web scraping can deliver as a tool for acquiring large amounts of elementary data efficiently.

Page 23: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing web scraping techniques on the survey concerning “airfares”

RESULTS OF TESTING WEB SCRAPING

For airfares offered by traditional airline companies, web scraping macros have been tested on the web agency Opodo (www.opodo.it). In this case, about 160 monthly price quotes were involved.

The improvements in efficiency were more meaningful than those obtained with the EasyJet macro (1 hour and 48 minutes to download the 160 elementary price quotes that take about two and a half hours to download manually)

For Opodo too, it is necessary to prepare an input file to drive the macro in searching the correct sample of routes and, beyond what the EasyJet macro requires, in managing the distinction between traditional and low-cost carriers
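The input file that drives a macro can be pictured as one row per route and departure date. The Python sketch below builds such a file under several assumptions that are not in the presentation: the column names, the semicolon-separated layout, the one-week stay, the example routes and the advances of 10 and 30 days are all hypothetical.

```python
import csv
import io
from datetime import date, timedelta


def build_input_rows(routes, collection_day, advance_days, stay_days=7):
    """Return one row per route and purchasing advance.

    routes: (origin, destination, carrier_type) tuples, where
    carrier_type is "TC" (traditional) or "LCC" (low cost).
    The stay_days default is an assumption made for this sketch.
    """
    rows = []
    for origin, destination, carrier_type in routes:
        for advance in advance_days:
            departure = collection_day + timedelta(days=advance)
            rows.append({
                "origin": origin,
                "destination": destination,
                "carrier_type": carrier_type,
                "departure": departure.isoformat(),
                "return": (departure + timedelta(days=stay_days)).isoformat(),
            })
    return rows


def write_input_file(rows, handle):
    """Write the rows as a semicolon-separated file with a header line."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter=";")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical routes, for illustration only.
routes = [("FCO", "LHR", "TC"), ("MXP", "CDG", "LCC")]
rows = build_input_rows(routes, date(2014, 6, 3), advance_days=[10, 30])

buffer = io.StringIO()  # stands in for a file on disk
write_input_file(rows, buffer)
print(buffer.getvalue())
```

Generating the file from the sample of routes and the collection calendar, instead of editing it by hand each month, is one way to reduce the preparation time mentioned above.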

Page 24: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Testing web scraping techniques on the survey concerning “airfares”

RESULTS OF TESTING WEB SCRAPING

Therefore the total time necessary for automatic detection of prices is not so different from that of manual detection, and time to update the macro is also needed

But it has to be considered that, if the Opodo macro works correctly and only marginal checking is needed, the two hours of manual work are saved and can be dedicated to other phases of the production process or to improving the quality and coverage of the survey

In the case of airfares too, the possibility of better covering the reference universe (by enlarging the amount of elementary data collected through web scraping) emerges clearly

Page 25: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

IT choices adopted to implement web scraping procedures

Why was iMacros chosen as the software for testing web scraping techniques in the field of consumer price data collection?

• It speeds up the acquisition of textual information on the web and, above all, it can be used together with programming and scripting languages (e.g. Java, JavaScript)

• iMacros tasks can be performed with the most popular browsers

• It is documented by a wiki (e.g. http://wiki.imacros.net/iMacros_for_Firefox) and forums (e.g. http://forum.iopus.com/viewforum.php)

• It is possible to take advantage of some projects (e.g. http://sourceforge.net/projects/jacob-project/) for the use of Java, giving the user great potential for interfacing and integration with other software solutions and legacy environments

Page 26: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

IT choices adopted to implement web scraping procedures

The approach adopted was to implement two different macros for each survey: a pointing macro and a scraping macro (the pointing macro reaches the page; the scraping macro collects the data and writes them to a flat file)

The main advantages are: a) easy maintenance, because modularity helps identify problems when they occur; b) whenever a problem resides in the pointing macro, no IT specialist support is needed for maintenance

The main disadvantages are: a) lower usability (collectors are forced to use two macros instead of one); b) more time is needed to execute the complete web scraping activity

But the advantages outweigh the disadvantages
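The two-macro design can be illustrated outside iMacros with a Python sketch that keeps the same separation: one function reaches the page, another extracts the prices and writes a flat file. This is not the iMacros code used in the tests. The page content, the `code` and `price` cell classes, the URL and the semicolon-separated output are all assumptions made for illustration, and the pointing step returns a canned page so that the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser

# Canned page standing in for a shop's price list (hypothetical markup).
SAMPLE_PAGE = """<table>
<tr><td class="code">AB-123</td><td class="price">99,00</td></tr>
<tr><td class="code">CD-456</td><td class="price">349,90</td></tr>
</table>"""


def pointing_step(url: str) -> str:
    """Stand-in for the pointing macro: reach the page, return its HTML."""
    return SAMPLE_PAGE


class PriceTableParser(HTMLParser):
    """Collect the text of cells whose class is 'code' or 'price'."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None
        self._current = {}

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        if self._field in ("code", "price") and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._current:
            self.rows.append(self._current)
            self._current = {}


def scraping_step(html: str, handle) -> int:
    """Stand-in for the scraping macro: extract prices, write a flat file."""
    parser = PriceTableParser()
    parser.feed(html)
    parser.close()
    writer = csv.writer(handle, delimiter=";")
    for row in parser.rows:
        price = float(row["price"].replace(",", "."))  # Italian decimal comma
        writer.writerow([row["code"], f"{price:.2f}"])
    return len(parser.rows)


flat_file = io.StringIO()  # stands in for a file on disk
n_rows = scraping_step(pointing_step("https://shop.example/price-list"), flat_file)
print(n_rows, flat_file.getvalue().splitlines())
```

Because the two steps share only the page content, a change in site navigation affects the pointing step alone, which mirrors the maintenance advantage described above.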

Page 27: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Possible future developments and conclusive remarks

Developing and testing web scraping procedures for the Italian consumer price survey have confirmed the enormous potential of automatic detection of prices (and related information)

Concerning efficiency, the improvements are clear when data collection is carried out on a few websites with a large amount of information. The situation appears to be partially different if it is necessary to collect a few prices on several distinct websites

This issue highlights the potential use of web scraping techniques to collect information for Purchasing Power Parity (PPP) or Detailed Average Price (DAP) exercises at the international level of comparison, but seems to limit their use for sub-national spatial comparison, for which data collection on a certain number of websites would be necessary

But the actual challenges that emerged have become clearer, and they now face statisticians in terms of the use of “big data” for statistical purposes

Page 28: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014

Possible future developments and conclusive remarks

The open questions regard the adoption of web scraping techniques to gather large amounts of data useful to better estimate inflation. This challenge has already been taken up by the study carried out by economic researchers at the Massachusetts Institute of Technology (MIT), within the project called "The Billion Prices Project @ MIT", which is aimed at monitoring daily price fluctuations of online retailers across the world

Are web scraping (and scanner data) the future of consumer price statistics as a basis for reengineering production processes, or are they also challenges that call for a deep revision of the statistical survey design?

Is it possible to fully exploit the potential of these “big data” (web scraped prices and scanner data) to enhance the quality of official statistical information in so delicate a field as inflation estimation?

Page 29: Q2014 European Conference on quality in official statistics Vienna, 2-5 June 2014


Thank you for your attention

