Abstract
This chapter reviews the development of data collection procedures on the web, with an emphasis on current practices, data cleansing and matching, data quality, and transparency. Several issues must be considered when collecting data from the web. Transparency is essential for knowing what the data source includes, how recent and comprehensive the data are, and what timeframe is covered. Data quality relates to reliability and accuracy. Mistakes are inevitable: data providers, aggregators, and researchers all make them, but they should be reduced to a minimum so that meaningful conclusions can be drawn from the analysis. Extensive data cleansing is needed before the analysis begins in order to correct as many mistakes in the data as possible. When several data sources are used, records from the different sources should be matched and duplicates removed.
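The matching and duplicate-removal step mentioned in the abstract can be illustrated with a minimal sketch. It is not the chapter's own procedure: the record fields (`source`, `doi`, `title`, `year`), the function names, and the sample records with their DOIs are all hypothetical, and the matching rule (DOI first, otherwise normalized title plus year) is only one plausible heuristic.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def match_key(record: dict) -> tuple:
    """Prefer the DOI as the key; fall back to normalized title plus year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        # Drop a resolver prefix so URL-form and bare DOIs compare equal.
        doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)
        return ("doi", doi)
    return ("title", normalize_title(record.get("title", "")), record.get("year"))


def merge_sources(*sources: list) -> list:
    """Match records across sources and keep one entry per key."""
    merged = {}
    for source in sources:
        for record in source:
            key = match_key(record)
            if key not in merged:
                # First sighting: keep the record and start its provenance list.
                merged[key] = {**record, "found_in": [record["source"]]}
            elif record["source"] not in merged[key]["found_in"]:
                merged[key]["found_in"].append(record["source"])
    return list(merged.values())


# Hypothetical records from two hypothetical sources, "A" and "B".
source_a = [
    {"source": "A", "doi": "https://doi.org/10.1234/EX.1",
     "title": "Web Impact Factors", "year": 1998},
    {"source": "A", "doi": "",
     "title": "Search Engine Stability: A Case Study", "year": 1999},
]
source_b = [
    {"source": "B", "doi": "10.1234/ex.1",
     "title": "Web impact factors", "year": 1998},
    {"source": "B", "doi": None,
     "title": "Search engine stability - a case study", "year": 1999},
    {"source": "B", "doi": "10.1234/ex.2",
     "title": "Another record", "year": 2001},
]

merged = merge_sources(source_a, source_b)
for record in merged:
    print(record["title"], record["found_in"])
```

Keeping a `found_in` list rather than silently discarding the duplicate preserves provenance, which supports the transparency concern raised above. One limitation of this heuristic is that a record carrying a DOI in one source and none in another will not be matched, so real data would likely need a second, title-based pass and some manual checking.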
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Bar-Ilan, J. (2019). Data Collection from the Web for Informetric Purposes. In: Glänzel, W., Moed, H.F., Schmoch, U., Thelwall, M. (eds) Springer Handbook of Science and Technology Indicators. Springer Handbooks. Springer, Cham. https://doi.org/10.1007/978-3-030-02511-3_30
DOI: https://doi.org/10.1007/978-3-030-02511-3_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02510-6
Online ISBN: 978-3-030-02511-3
eBook Packages: Economics and Finance (R0)