2023 | 68 | 12 | 49-64

Article title

Current challenges and possible big data solutions for the use of web data as a source for official statistics

Title variants

PL
Współczesne wyzwania i możliwości w zakresie stosowania narzędzi big data do uzyskania danych webowych jako źródła dla statystyki publicznej

Abstracts

PL
Web scraping jest coraz popularniejszy w badaniach naukowych, zwłaszcza w dziedzinie statystyki. Przygotowanie środowiska do scrapowania danych nie przysparza obecnie trudności i może być wykonane relatywnie szybko, a uzyskiwanie informacji w ten sposób wymaga jedynie podstawowych umiejętności cyfrowych. Dzięki temu statystyka publiczna w coraz większym stopniu korzysta z dużych wolumenów danych, czyli big data. W drugiej dekadzie XXI w. zarówno krajowe urzędy statystyczne, jak i Eurostat włożyły dużo pracy w doskonalenie narzędzi big data. Nadal istnieją jednak trudności związane z dostępnością, ekstrakcją i wykorzystywaniem informacji pobranych ze stron internetowych. Tym problemom oraz potencjalnym sposobom ich rozwiązania został poświęcony niniejszy artykuł. Omówiono studium przypadku masowego web scrapingu wykonanego w 2022 r. za pomocą narzędzi big data na próbie 503 700 stron internetowych. Z analizy wynika, że dostarczenie wiarygodnych danych na podstawie tak dużej próby jest niemożliwe, ponieważ w czasie badania zwykle do 20% stron internetowych może być niedostępnych. Co więcej, dokładna liczba aktywnych stron internetowych w poszczególnych krajach nie jest znana ze względu na dynamiczny charakter Internetu, skutkujący ciągłymi zmianami stron internetowych.
EN
Web scraping has become popular in scientific research, especially in statistics. Preparing an appropriate IT environment for web scraping is currently not difficult and can be done relatively quickly, and extracting data in this way requires only basic IT skills. This has resulted in the increased use of this type of data, widely referred to as big data, in official statistics. Over the past decade, much work has been done in this area, both at the national level within the national statistical institutes and at the international level by Eurostat. The aim of this paper is to present and discuss current problems related to accessing, extracting, and using information from websites, along with potential solutions. To this end, the paper presents a case study of large-scale web scraping performed in 2022 by means of big data tools. The results of the case study, conducted on a total population of approximately 503,700 websites, demonstrate that it is not possible to provide reliable data on the basis of such a large sample, as typically up to 20% of the websites might not be accessible at the time of the survey. Moreover, it is not possible to know the exact number of active websites in particular countries, due to the dynamic nature of the Internet, which causes websites to change continuously.

Year

2023

Volume

68

Issue

12

Pages

49-64

Dates

published
2023

Contributors

author
  • Eindhoven University of Technology, Department of Mathematics and Computer Science, the Netherlands
  • Uniwersytet Gdański, Wydział Zarządzania; Urząd Statystyczny w Gdańsku, Ośrodek Inżynierii Danych / University of Gdańsk, Faculty of Management; Statistical Office in Gdańsk, Centre for Data Engineering

References

  • Anglin, K. L. (2019). Gather-Narrow-Extract: A Framework for Studying Local Policy Variation Using Web-Scraping and Natural Language Processing. Journal of Research on Educational Effectiveness, 12(4), 685–706. https://doi.org/10.1080/19345747.2019.1654576.
  • Antonov, O., & Laktionova, O. (2020). Evaluation of Real Estate Market Value in Ukraine Using Web-Scraping. Galician Economic Journal, 63(2), 35–44. https://doi.org/10.33108/galicianvisnyk_tntu2020.02.035.
  • Ascheri, A., Marconi, G., Meszaros, M., & Reis, F. (2022). Online Job Advertisements for Labour Market Statistics using R. Romanian Statistical Review, (1), 3–26. https://www.revistadestatistica.ro/2022/03/online-job-advertisements-for-labour-market-statistics-using-r/.
  • Boegershausen, J., Datta, H., Borah, A., & Stephen, A. T. (2022). Fields of Gold: Scraping Web Data for Marketing Insights. Journal of Marketing, 86(5), 1–20. https://doi.org/10.1177/00222429221100750.
  • Cavallo, A., & Rigobon, R. (2016). The Billion Prices Project: Using Online Prices for Inflation Measurement and Research. Journal of Economic Perspectives, 30(2), 151–178. https://doi.org/10.1257/jep.30.2.151.
  • Daas, P. J. H., & van der Doef, S. (2020). Detecting Innovative Companies via their Website. Statistical Journal of the IAOS, 36(4), 1239–1251. https://doi.org/10.3233/SJI-200627.
  • Daas, P. J. H., Puts, M. J., Buelens, B., & van den Hurk, P. A. M. (2015). Big Data as a Source for Official Statistics. Journal of Official Statistics, 31(2), 249–262. https://doi.org/10.1515/jos-2015-0016.
  • Dogucu, M., & Çetinkaya-Rundel, M. (2020). Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities. Journal of Statistics and Data Science Education, 29(sup1), 112–122. https://doi.org/10.1080/10691898.2020.1787116.
  • European Commission. (n.d. a). ESSNet Big Data. Retrieved August 17, 2022, from https://ec.europa.eu/eurostat/cros/content/essnet-big-data-1_en.
  • European Commission. (n.d. b). ESSNet Big Data II. Retrieved August 17, 2022, from https://ec.europa.eu/eurostat/cros/essnet-big-data-2_en.
  • European Commission. (n.d. c). Experimental big data statistics. Retrieved August 17, 2022, from https://ec.europa.eu/eurostat/cros/content/Experimental_big_data_statistics_en.
  • European Commission. (n.d. d). Web scraping policy. Retrieved April 21, 2023, from https://cros-legacy.ec.europa.eu/content/item-04-web-scraping-policy_en.
  • European Commission. (n.d. e). Trusted Smart Statistics – Web Intelligence Network. Retrieved August 17, 2022, from https://ec.europa.eu/eurostat/cros/WIN_en.
  • European Commission. (2022a). Deliverable 2.1: WP2 1st Interim Progress Report. https://cros.ec.europa.eu/system/files/2023-12/wp2_deliverable_2_1_wp2_1st_interim_progress_report_20220331_revision_2.pdf.
  • European Commission. (2022b). Report: URL finding methodology. https://cros-legacy.ec.europa.eu/system/files/20220131_url_finding_methodology.pdf.
  • Khder, M. A. (2021). Web Scraping or Web Crawling: State of Art, Techniques, Approaches and Application. International Journal of Advances in Soft Computing and its Applications, 13(3), 144–168. https://doi.org/10.15849/ijasca.211128.11.
  • Krotov, V., & Tennyson, M. (2018). Research Note: Scraping Financial Data from the Web Using the R Language. Journal of Emerging Technologies in Accounting, 15(1), 169–181. https://doi.org/10.2308/jeta-52063.
  • Nasiboglu, R., & Akdogan, A. (2020). Estimation of the Second Hand Car Prices from Data Extracted via Web Scraping Techniques. Journal of Modern Technology & Engineering, 5(2), 157–166. http://jomardpublishing.com/UploadFiles/Files/journals/JTME/V5N2/NasibogluR.pdf.
  • Oancea, B., & Necula, M. (2019). Web scraping techniques for price statistics – the Romanian experience. Statistical Journal of the IAOS, 35(4), 657–667. https://doi.org/10.3233/SJI-190529.
  • Office for National Statistics. (n.d.). Web Scraping Policy. Retrieved August 17, 2022, from https://www.ons.gov.uk/aboutus/transparencyandgovernance/datastrategy/datapolicies/webscrapingpolicy.
  • Orbis. (n.d.). Overview [Data set]. Retrieved April 28, 2023, from https://www.bvdinfo.com/en-gb/our-products/data/international/orbis.
  • Palys, T. (2008). Purposive sampling. In L. M. Given (Ed.), The Sage Encyclopedia of Qualitative Research Methods, Vol. 2 (pp. 697–698). Sage. https://doi.org/10.4135/9781412963909.
  • Pegueroles, P., Guerrero, R., Fernández, A., & López, D. (2021). Price’s Index through of Web Scraping. Revista Chilena de Economía y Sociedad, 15(1), 32–54. https://rches.utem.cl/wp-content/uploads/sites/8/2022/01/revista-chilena-de-economia-y-sociedad-vol15-n1-2021-Pegueroles-Guerrero-Fernandez-Lopez.pdf.
  • Polidoro, F., Giannini, R., Lo Conte, R., Mosca, S., & Rossetti, F. (2015). Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation. Statistical Journal of the IAOS, 31(2), 165–176. https://doi.org/10.3233/SJI-150901.
  • Schedlbauer, J., Raptis, G., & Ludwig, B. (2021). Medical informatics labor market analysis using web crawling, web scraping, and text mining. International Journal of Medical Informatics, 150, 1–9. https://doi.org/10.1016/j.ijmedinf.2021.104453.
  • Wirthmann, A., & Reis, F. (2021). The Web Intelligence Hub – A tool for integrating web data in Official Statistics. 63rd ISI World Statistics Congress, Online. https://cros-legacy.ec.europa.eu/sites/default/files/isi_-_web_intelligence_hub_eurostat_paper.pdf.

Identifiers

Biblioteka Nauki
31232088

YADDA identifier

bwmeta1.element.ojs-doi-10_59139_ws_2023_12_3