Abstract
We present different methods of collecting data from the Web for informetric purposes. For each method, we review studies that have used it and discuss its advantages and shortcomings. The paper emphasizes that data collection must be carried out with great care. Because the Web changes constantly, the findings of any study are valid only for the time frame in which it was carried out, and they depend on the quality of the data collection tools, which are usually not under the researcher's control. At present, the quality and reliability of most available search tools are unsatisfactory; informetric analyses of the Web therefore serve mainly as demonstrations of the applicability of informetric methods to this medium, not as a means of obtaining definite conclusions. A possible solution is for the scientific world to develop its own search and data collection tools.
Cite this article
Bar-Ilan, J. Data collection methods on the Web for infometric purposes — A review and analysis. Scientometrics 50, 7–32 (2001). https://doi.org/10.1023/A:1005682102768
DOI: https://doi.org/10.1023/A:1005682102768