An Analytical Approach to Concept Extraction in HTML Environments

Fresno, Victor; Ribeiro, Angela

doi:10.1023/B:JIIS.0000019277.82436.17

Published: May 2004

Journal of Intelligent Information Systems, Volume 22, pages 215–235 (2004)

Abstract

The core of the Internet and World Wide Web revolution lies in their capacity to share huge quantities of data efficiently, but the rapid and chaotic growth of the Net has greatly complicated the task of sharing or mining useful information. Any inference process over Internet information requires an adequate characterization of the Web pages involved. The textual part of a page is one of the most important aspects to consider when characterizing it. This textual characterization should be made by extracting an appropriate set of relevant concepts that properly represent the text included in the Web page. This paper presents a method to obtain such a set of relevant concepts from a Web page, based essentially on an estimate of the relevance of each word in the page's text. Word relevance is defined by a combination of criteria that take into account characteristics of the HTML language as well as more classical measures such as the frequency and the position of a word in a document. In addition, heuristic rules for the most suitable fusion of these criteria are obtained through a statistical study. Several experiments are conducted to compare the performance of the proposed concept extraction method with that of other approaches, including a commercial tool. The results show that the proposed technique extracts concepts more successfully than the other tested methods.
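To make the idea of a combined word-relevance score concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes four illustrative criteria (normalized frequency, emphasis by HTML tags, presence in the page title, and position of first occurrence) and fuses them with a plain weighted sum. The names `WEIGHTS`, `EMPHASIS_TAGS`, `WordCollector`, `word_relevance` and `top_concepts` are hypothetical, and the weight values are arbitrary placeholders, whereas the paper obtains its fusion rules from a statistical study.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumed, arbitrary weights for illustration only (they sum to 1).
WEIGHTS = {"frequency": 0.4, "emphasis": 0.3, "title": 0.2, "position": 0.1}
# Assumed set of tags treated as visual or structural emphasis.
EMPHASIS_TAGS = {"b", "strong", "em", "i", "h1", "h2", "h3"}


class WordCollector(HTMLParser):
    """Collects every word together with flags derived from its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.words = []  # list of (word, in_title, emphasized)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if "script" in self.open_tags or "style" in self.open_tags:
            return  # ignore non-textual content
        in_title = "title" in self.open_tags
        emphasized = any(t in EMPHASIS_TAGS for t in self.open_tags)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.words.append((word, in_title, emphasized))


def word_relevance(html):
    """Returns a dict mapping each word to a relevance score in [0, 1]."""
    parser = WordCollector()
    parser.feed(html)
    parser.close()
    total = len(parser.words)
    if total == 0:
        return {}
    stats = defaultdict(lambda: {"count": 0, "emph": 0, "title": 0, "first": None})
    for index, (word, in_title, emphasized) in enumerate(parser.words):
        s = stats[word]
        s["count"] += 1
        s["emph"] += emphasized
        s["title"] += in_title
        if s["first"] is None:
            s["first"] = index
    max_count = max(s["count"] for s in stats.values())
    scores = {}
    for word, s in stats.items():
        criteria = {
            "frequency": s["count"] / max_count,    # relative to the most frequent word
            "emphasis": s["emph"] / s["count"],     # share of emphasized occurrences
            "title": 1.0 if s["title"] else 0.0,    # appears in <title>
            "position": 1.0 - s["first"] / total,   # earlier first occurrence scores higher
        }
        scores[word] = sum(WEIGHTS[name] * value for name, value in criteria.items())
    return scores


def top_concepts(html, k=5):
    """Returns the k highest-scoring words as candidate concepts."""
    scores = word_relevance(html)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Calling `top_concepts(page_html, k=10)` on the markup of a page returns its ten highest-scoring words. The sketch applies no stop-word filtering or stemming, so a practical version would need both before the ranking is meaningful.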

Author information

Authors and Affiliations

Victor Fresno: Escuela Superior de Ciencias Experimentales y Tecnologia, Rey Juan Carlos University, 28933, Mostoles, Madrid, Spain
Angela Ribeiro: Industrial Automation Institute (IAI), Spanish Council for Scientific Research (CSIC), 28500, Arganda del Rey, Madrid, Spain

About this article

Cite this article

Fresno, V., Ribeiro, A. An Analytical Approach to Concept Extraction in HTML Environments. Journal of Intelligent Information Systems 22, 215–235 (2004). https://doi.org/10.1023/B:JIIS.0000019277.82436.17
