Abstract
The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents on the Internet and on intranets. HTML allows the author to organize the text into subparts delimited by special tags; the HTML browser then renders these subparts in distinct ways, i.e., with distinct typographical formats. In this paper we propose a model for indexing HTML documents that exploits the role of tags in encoding the importance of the text they delimit. Central to our model is a method for computing the significance degree of a term in a document by weighting the term's instances according to the tags in which they occur. The proposed indexing model is based on a contextual weighted representation of the document under consideration, by means of which a set of (normalized) numerical weights is assigned to the various tags containing the text. The weighted representation is contextual in the sense that the numerical weights assigned to the various tags and their respective text depend not only on the tags themselves but also on the particular document considered. By means of the contextual weighted representation, our indexing model reflects not only the general syntactic structure of the HTML language but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e., non-contextual) weighting; the second is based on a more complex nonlinear weighted representation and has a number of novel and interesting features.
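To make the idea of tag-dependent, document-dependent weighting concrete, the following Python sketch illustrates one possible linear scheme. It is not the formulation used in the paper: the base tag weights in BASE_TAG_WEIGHTS, the rule of normalizing them over only those tags that actually contain text in the document, and the use of within-tag relative term frequency are all assumptions chosen for illustration. The names TagTermCounter, contextual_weights, and significance are likewise hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical base importance per tag (illustrative values, not from the paper).
BASE_TAG_WEIGHTS = {"title": 6.0, "h1": 5.0, "h2": 4.0, "b": 3.0, "p": 1.0}


class TagTermCounter(HTMLParser):
    """Count term occurrences under the innermost open tag that has a base weight."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open weighted tags
        self.counts = defaultdict(lambda: defaultdict(int))  # tag -> term -> count

    def handle_starttag(self, tag, attrs):
        if tag in BASE_TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.stack:
            return  # text outside any weighted tag is ignored in this sketch
        tag = self.stack[-1]
        for raw in data.lower().split():
            term = raw.strip(".,;:!?()")
            if term:
                self.counts[tag][term] += 1


def contextual_weights(counts):
    """Assumed contextual rule: renormalize base weights over the tags
    that actually carry text in this document, so they sum to 1."""
    present = [tag for tag, terms in counts.items() if terms]
    total = sum(BASE_TAG_WEIGHTS[tag] for tag in present)
    return {tag: BASE_TAG_WEIGHTS[tag] / total for tag in present}


def significance(html):
    """Linear combination of within-tag relative frequencies,
    weighted by the contextual tag weights."""
    parser = TagTermCounter()
    parser.feed(html)
    parser.close()
    weights = contextual_weights(parser.counts)
    scores = defaultdict(float)
    for tag, terms in parser.counts.items():
        tag_total = sum(terms.values())
        for term, count in terms.items():
            scores[term] += weights[tag] * count / tag_total
    return dict(scores)


if __name__ == "__main__":
    doc = ("<html><head><title>fuzzy retrieval</title></head>"
           "<body><p>retrieval of web documents</p></body></html>")
    for term, score in sorted(significance(doc).items(), key=lambda kv: -kv[1]):
        print(f"{term}: {score:.3f}")
```

Because the weights are renormalized per document, the same tag can carry a different weight in different documents: in a document whose only weighted tag is `p`, that tag receives the full weight of 1, whereas in a document that also has a `title` its share shrinks. This is only one simple way to realize a contextual dependence; the nonlinear variant mentioned in the abstract is not sketched here.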
Cite this article
Pereira, R., Molinari, A. & Pasi, G. Contextual weighted representations and indexing models for the retrieval of HTML documents. Soft Comput 9, 481–492 (2005). https://doi.org/10.1007/s00500-004-0361-z
DOI: https://doi.org/10.1007/s00500-004-0361-z