Web Document Modeling

  • Chapter
The Adaptive Web

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4321)

Abstract

A very common issue in adaptive Web-based systems is the modeling of documents. Such documents represent domain-specific information for a number of purposes. Application areas such as Information Search, Focused Crawling, and Content Adaptation (among many others) benefit from several techniques and approaches for modeling documents effectively. For example, a document usually needs preliminary processing to extract the relevant information in a format that the system can process automatically. The objective of this chapter is to support the other chapters by providing a basic overview of the most common and useful techniques and approaches related to document modeling. It describes high-level techniques for modeling Web documents, such as the Vector Space Model, and a number of AI approaches, such as Semantic Networks, Neural Networks, and Bayesian Networks. The chapter is not meant to substitute for more comprehensive discussions of the topics presented. Rather, it provides a brief and informal introduction to the main concepts of document modeling, focusing also on the systems presented in the rest of the book as concrete examples of those concepts.

Editor information

Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl

Copyright information

© 2007 Springer Berlin Heidelberg

About this chapter

Cite this chapter

Micarelli, A., Sciarrone, F., Marinilli, M. (2007). Web Document Modeling. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds) The Adaptive Web. Lecture Notes in Computer Science, vol 4321. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72079-9_5

  • DOI: https://doi.org/10.1007/978-3-540-72079-9_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72078-2

  • Online ISBN: 978-3-540-72079-9

  • eBook Packages: Computer Science (R0)
