Data collection methods on the Web for infometric purposes — A review and analysis

Scientometrics

Abstract

We present different methods of data collection from the Web for informetric purposes. For each method, some studies utilizing it are reviewed, and the advantages and shortcomings of each technique are discussed. The paper emphasizes that data collection must be carried out with great care. Since the Web changes constantly, the findings of any study are valid only for the time frame in which it was carried out, and they depend on the quality of the data collection tools, which are usually not under the control of the researcher. At present, the quality and reliability of most available search tools are unsatisfactory; informetric analyses of the Web therefore serve mainly as demonstrations of the applicability of informetric methods to this medium, rather than as a means of obtaining definite conclusions. A possible solution is for the scientific world to develop its own search and data collection tools.




Bar-Ilan, J. Data collection methods on the Web for infometric purposes — A review and analysis. Scientometrics 50, 7–32 (2001). https://doi.org/10.1023/A:1005682102768


  • DOI: https://doi.org/10.1023/A:1005682102768
