A Theory and Approach to Improving Relevance Ranking in Web Retrieval

  • Conference paper
Web Intelligence: Research and Development (WI 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2198)

Abstract

The development of the World Wide Web (WWW) has made a huge amount of information available online, and that amount continues to grow. As of March 2001, the Google search engine searched 1,346,966,000 Web pages. Many search systems have been developed to manage this massive collection of information. Investigation shows that the primary method used by these systems is classification. Unfortunately, classification has an intrinsic restriction. Consider an example: we recently sent a query consisting of the single word “computer” to Google, and Google found 33,220,000 relevant Web pages, far more than anyone could possibly begin to read. This problem is intrinsic to classification and therefore cannot be avoided. It is explained by the Pigeonhole Principle (also known as Dirichlet’s Box Principle) [10]. Suppose we classify Web pages using all the English words in a dictionary. Given a particular keyword, let us calculate how many Web pages will, on average, be classified as relevant. Let totalKeywords be the number of keywords in the vocabulary list, averageKeywords the average number of keywords in a Web document, n the number of all Web pages, and numberRelevant the number of relevant Web pages. Then we have:

$$ \mathit{numberRelevant} \approx \frac{n \times \mathit{averageKeywords}}{\mathit{totalKeywords}}. $$

If n = 1,346,966,000, averageKeywords = 100, and totalKeywords = 10,000, then numberRelevant is 13,469,660.
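
The estimate above can be reproduced with a few lines of Python. This is only an illustrative sketch of the arithmetic in the abstract: the function name `estimate_relevant_pages` and its parameter names are hypothetical, chosen to mirror the symbols n, averageKeywords, and totalKeywords, and the uniform-distribution assumption is the one implied by the formula.

```python
def estimate_relevant_pages(n: int, average_keywords: int, total_keywords: int) -> float:
    """Estimate how many pages match one keyword, on average.

    Assumes (as the formula does) that keywords are spread evenly:
    n pages carry n * average_keywords keyword slots in total, and these
    are shared among total_keywords distinct keywords.
    """
    if total_keywords <= 0:
        raise ValueError("total_keywords must be positive")
    return n * average_keywords / total_keywords


# Values taken from the abstract (Google's page count as of March 2001).
n = 1_346_966_000
estimate = estimate_relevant_pages(n, average_keywords=100, total_keywords=10_000)
print(f"{estimate:,.0f}")  # prints 13,469,660
```

Under these values every keyword matches about 1% of the collection, which is why enlarging the collection alone makes the result set grow without bound.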

References

  1. G. Arfken. Curvilinear coordinates. In Mathematical Methods for Physicists, §2.1, pages 86–90. Academic Press, Orlando, FL, 3rd edition, 1985.

  2. P. Bollmann and S.K.M. Wong. Adaptive linear information retrieval models. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 157–163, 1987.

  3. Robert T. Craig. Modern Principles of Mathematics. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1969.

  4. A. Gray. Modern Differential Geometry of Curves and Surfaces with Mathematica, chapter Metrics on Surfaces. CRC Press, Boca Raton, FL, 2nd edition, 1997.

  5. M. E. Maron and J. L. Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the Association for Computing Machinery, 7:216–244, 1960.

  6. M. J. McGill, M. Koll, and T. Noreault. An evaluation of factors affecting document ranking by information retrieval systems. School of Information Studies, Syracuse University, Syracuse, New York 13210, 1979.

  7. P. M. Morse and H. Feshbach. Methods of Theoretical Physics, Part I, chapter Curvilinear Coordinates, pages 21–31. McGraw-Hill, New York, 1953.

  8. G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

  9. H. J. Schneider, P. Bollmann, F. Jochum, E. Konrad, U. Reiner, and V. Weissmann. Leistungsbewertung von Information Retrieval Verfahren (LIVE). Projektbericht, Technische Universität, Berlin, 1986.

  10. D. Shanks. Solved and Unsolved Problems in Number Theory, page 161. Chelsea, New York, 4th edition, 1993.

  11. H. F. Stiles. The association factor in information retrieval. Journal of the ACM, 8:271–279, 1961.

  12. Z. W. Wang. An analysis on vector space model based on computational geometry. Master’s thesis, Department of Computer Science, University of Regina, 1993.

  13. Z. W. Wang. Riemann space model and similarity-based web retrieval. Ph.D. thesis, Department of Computer Science, University of Regina, 2001.

  14. Z. W. Wang, R.B. Maguire, and Y. Y. Yao. A non-Euclidean model for web retrieval. In The First International Conference on Web-Age Information Management (WAIM’2000), Shanghai, 2000. Accepted.

  15. S. K. M. Wong, W. Ziarko, V. V. Raghavan, and P. C. N. Wong. On modeling of information retrieval concepts in vector spaces. ACM Transactions on Database Systems, 12(2):229–321, 1987.

  16. Y. Y. Yao. Measuring retrieval performance based on user preference of documents. Journal of the American Society for Information Science, 46(2):133–145, 1995.

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, Z.W., Maguire, R.B. (2001). A Theory and Approach to Improving Relevance Ranking in Web Retrieval. In: Zhong, N., Yao, Y., Liu, J., Ohsuga, S. (eds) Web Intelligence: Research and Development. WI 2001. Lecture Notes in Computer Science, vol 2198. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45490-X_37

  • DOI: https://doi.org/10.1007/3-540-45490-X_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42730-8

  • Online ISBN: 978-3-540-45490-8

  • eBook Packages: Springer Book Archive
