Contextual weighted representations and indexing models for the retrieval of HTML documents

Pereira, R. A. Marques; Molinari, A.; Pasi, G.

doi:10.1007/s00500-004-0361-z

Published: 19 November 2004

Volume 9, pages 481–492 (2005)

Soft Computing

R. A. Marques Pereira¹,
A. Molinari¹ &
G. Pasi²

Abstract

The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents on the Internet and on intranets. HTML allows the author to organize the text into subparts delimited by special tags; the HTML browser then renders these subparts in distinct ways, i.e. with distinct typographical formats. In this paper we propose a model for indexing HTML documents which exploits the role of tags in encoding the importance of the text they delimit. Central to our model is a method to compute the significance degree of a term in a document by weighting the term instances according to the tags in which they occur. The proposed indexing model is based on a contextual weighted representation of the document under consideration, by means of which a set of (normalized) numerical weights is assigned to the various tags containing the text. The weighted representation is contextual in the sense that the numerical weights assigned to the various tags and their text depend not only on the tags themselves but also on the particular document considered. By means of the contextual weighted representation, our indexing model reflects not only the general syntactic structure of the HTML language but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e. non-contextual) weighting; the second is based on a more complex nonlinear weighted representation and has a number of novel and interesting features.
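The general idea of tag-based term weighting can be illustrated with a minimal Python sketch. It is not the authors' model: the abstract does not give the paper's formulas, so the tag importances in BASE_WEIGHTS, the helper names, and the specific way the weights are made "contextual" (renormalizing over the tags that actually carry text in the given document) are all assumptions chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical universal tag importances (an assumption, not taken from the paper).
BASE_WEIGHTS = {"title": 6.0, "h1": 5.0, "h2": 4.0, "b": 2.0, "p": 1.0}


class TagTermCounter(HTMLParser):
    """Counts term occurrences under the innermost enclosing weighted tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.counts = {}  # tag -> Counter mapping term -> occurrences

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Attribute the text to the innermost open tag that has a weight.
        tag = next((t for t in reversed(self.stack) if t in BASE_WEIGHTS), None)
        terms = data.lower().split()
        if tag is not None and terms:
            self.counts.setdefault(tag, Counter()).update(terms)


def index_html(html):
    """Return per-tag term counts for one HTML document."""
    parser = TagTermCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def contextual_weights(counts):
    """Normalize the base weights over the tags present in this document.

    This document-dependent renormalization is an illustrative stand-in
    for contextual weighting; the paper's actual scheme may differ.
    """
    total = sum(BASE_WEIGHTS[tag] for tag in counts)
    return {tag: BASE_WEIGHTS[tag] / total for tag in counts}


def significance(term, counts):
    """Weighted sum of the term's relative frequency within each tag."""
    weights = contextual_weights(counts)
    score = 0.0
    for tag, weight in weights.items():
        tag_total = sum(counts[tag].values())
        score += weight * counts[tag][term] / tag_total
    return score


if __name__ == "__main__":
    doc = (
        "<html><head><title>fuzzy retrieval</title></head>"
        "<body><h1>fuzzy indexing</h1>"
        "<p>retrieval of html documents with fuzzy weights</p></body></html>"
    )
    counts = index_html(doc)
    for term in ("fuzzy", "retrieval", "documents"):
        print(term, round(significance(term, counts), 4))
```

Because the weights sum to one and each per-tag frequency lies between 0 and 1, the score stays in the unit interval, and a term appearing in the title or a heading scores higher than one appearing only in body text.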

Author information

Authors and Affiliations

Dipartimento di Informatica e Studi Aziendali, Università degli Studi di Trento, Via Inama 5, 38100, Trento, Italy
R. A. Marques Pereira & A. Molinari
Istituto Tecnologie della Costruzione, Consiglio Nazionale delle Ricerche CNR, Via Bassini 15, 20133, Milano, Italy
G. Pasi

Corresponding author

Correspondence to R. A. Marques Pereira.

About this article

Cite this article

Pereira, R., Molinari, A. & Pasi, G. Contextual weighted representations and indexing models for the retrieval of HTML documents. Soft Comput 9, 481–492 (2005). https://doi.org/10.1007/s00500-004-0361-z

Issue Date: July 2005
DOI: https://doi.org/10.1007/s00500-004-0361-z
