Contextual weighted representations and indexing models for the retrieval of HTML documents

Soft Computing

Abstract

The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents on the Internet and on intranets. HTML allows the author to organize the text into subparts delimited by special tags; the HTML browser then renders these subparts in distinct ways, i.e. with distinct typographical formats. In this paper we propose a model for indexing HTML documents that exploits the role of tags in encoding the importance of the text they delimit. Central to our model is a method for computing the significance degree of a term in a document by weighting the term's instances according to the tags in which they occur. The proposed indexing model is based on a contextual weighted representation of the document under consideration, which assigns a set of (normalized) numerical weights to the various tags containing the text. The weighted representation is contextual in the sense that the weights assigned to the various tags and their text depend not only on the tags themselves but also on the particular document considered. By means of the contextual weighted representation, our indexing model reflects not only the general syntactic structure of HTML but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e. non-contextual) weighting; the second is based on a more complex nonlinear weighted representation and has a number of novel and interesting features.
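The idea of weighting term instances by their enclosing tags can be illustrated with a small Python sketch. It is not the model defined in the paper: the abstract gives no formulas, so the tag importance scores, the per-tag frequency normalization, and the names `TAG_IMPORTANCE`, `count_terms`, `universal_weights`, `contextual_weights` and `significance` are all assumptions made for illustration. Here "contextual" is approximated in the simplest linear way, by normalizing the weights over only those tags that actually contain text in the given document; the nonlinear form is not attempted.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical importance scores per tag; the values are illustrative only.
TAG_IMPORTANCE = {"title": 6.0, "h1": 5.0, "h2": 4.0, "b": 3.0, "p": 1.0}


class TagTermCounter(HTMLParser):
    """Counts term occurrences under the innermost enclosing tag of interest."""

    def __init__(self):
        super().__init__()
        self.stack = []                      # currently open tags of interest
        self.counts = defaultdict(Counter)   # tag -> term -> occurrences

    def handle_starttag(self, tag, attrs):
        if tag in TAG_IMPORTANCE:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack:
            self.counts[self.stack[-1]].update(data.lower().split())


def count_terms(html):
    """Parse an HTML string and return the per-tag term counts."""
    parser = TagTermCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def universal_weights():
    """Non-contextual weights: normalized over all known tags."""
    total = sum(TAG_IMPORTANCE.values())
    return {tag: score / total for tag, score in TAG_IMPORTANCE.items()}


def contextual_weights(counts):
    """Assumed contextual variant: normalized over tags present in this document."""
    present = {tag: TAG_IMPORTANCE[tag] for tag in counts if counts[tag]}
    total = sum(present.values())
    return {tag: score / total for tag, score in present.items()}


def significance(term, counts, weights):
    """Weighted sum over tags of the term's frequency, normalized per tag."""
    score = 0.0
    for tag, weight in weights.items():
        tag_counts = counts.get(tag)
        if tag_counts:
            score += weight * tag_counts[term] / max(tag_counts.values())
    return score


doc = "<h1>fuzzy retrieval</h1><p>retrieval of documents with <b>fuzzy</b> weights</p>"
counts = count_terms(doc)
print(significance("fuzzy", counts, universal_weights()))
print(significance("fuzzy", counts, contextual_weights(counts)))
```

In this toy document only `h1`, `p` and `b` occur, so the contextual weights are renormalized over those three tags and the same term receives a higher significance than under the universal weights, which reserve part of the total weight for tags the author never used.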

Author information

Corresponding author

Correspondence to R. A. Marques Pereira.

About this article

Cite this article

Pereira, R., Molinari, A. & Pasi, G. Contextual weighted representations and indexing models for the retrieval of HTML documents. Soft Comput 9, 481–492 (2005). https://doi.org/10.1007/s00500-004-0361-z

  • DOI: https://doi.org/10.1007/s00500-004-0361-z
