Abstract
A common issue in adaptive Web-based systems is the modeling of documents, which represent domain-specific information for a number of purposes. Application areas such as Information Search, Focused Crawling and Content Adaptation (among many others) benefit from several techniques and approaches to model documents effectively. For example, a document usually needs preliminary processing to extract the relevant information in a format that the system can process automatically. The objective of this chapter is to support the other chapters by providing a basic overview of the most common and useful techniques and approaches related to document modeling. It describes high-level techniques for modeling Web documents, such as the Vector Space Model, and a number of AI approaches, such as Semantic Networks, Neural Networks and Bayesian Networks. The chapter is not meant to substitute for more comprehensive discussions of the topics presented. Rather, it provides a brief and informal introduction to the main concepts of document modeling, with a focus on the systems presented in the rest of the book as concrete examples of those concepts.
Copyright information
© 2007 Springer Berlin Heidelberg
About this chapter
Cite this chapter
Micarelli, A., Sciarrone, F., Marinilli, M. (2007). Web Document Modeling. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds) The Adaptive Web. Lecture Notes in Computer Science, vol 4321. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72079-9_5
DOI: https://doi.org/10.1007/978-3-540-72079-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72078-2
Online ISBN: 978-3-540-72079-9
eBook Packages: Computer Science, Computer Science (R0)