Performing Binary-Categorization on Multiple-Record Web Documents Using Information Retrieval Models and Application Ontologies

Kwong, Linus W.; Ng, Yiu-Kai

doi:10.1023/A:1024653618816

Published: September 2003

World Wide Web, Volume 6, pages 281–303 (2003)

Abstract

To retrieve Web documents of interest, most Web users rely on Web search engines. All existing search engines provide a query facility that lets users search for the desired documents using search-engine keywords. However, when a search engine retrieves a long list of Web documents, the user might need to browse through each retrieved document to determine which ones are of interest. We observe that two kinds of problems are involved in the retrieval of Web documents: (1) an inappropriate selection of keywords specified by the user; and (2) poor precision in the retrieved Web documents. To solve these problems, we propose an automatic binary-categorization method for recognizing multiple-record Web documents of interest, which appear often in advertisement Web pages. Our categorization method uses application ontologies and is based on two information retrieval models, the Vector Space Model (VSM) and the Clustering Model (CM). We analyze and cull Web documents to just those applicable to a particular application ontology. The culling analysis (i) uses CM to find a virtual centroid for the records in a Web document, (ii) computes a vector in a multi-dimensional space for this centroid, and (iii) compares the vector with the predefined ontology vector of the same multi-dimensional space using VSM, for which we consider the magnitudes of the vectors as well as the angle between them. Our experimental results show that we have achieved an average of 90% recall and 97% precision in recognizing Web documents that belong to the same category (i.e., domain of interest). Thus, our categorization discards very few documents it should have kept and keeps very few it should have discarded.
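The three culling steps named in the abstract can be illustrated with a minimal sketch. The abstract does not say how record vectors are built, how the magnitude and angle comparisons are combined, or which thresholds are used, so everything specific below is an assumption made for illustration: each record is taken to be a vector of counts over the ontology's dimensions, the virtual centroid is the component-wise mean, and the function names and the thresholds `min_cosine` and `min_magnitude_ratio` are hypothetical.

```python
import math


def centroid(record_vectors):
    """Component-wise mean of equal-length record vectors (assumed centroid)."""
    n = len(record_vectors)
    dims = len(record_vectors[0])
    return [sum(vec[i] for vec in record_vectors) / n for i in range(dims)]


def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    mu, mv = magnitude(u), magnitude(v)
    if mu == 0 or mv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (mu * mv)


def is_applicable(record_vectors, ontology_vector,
                  min_cosine=0.8, min_magnitude_ratio=0.5):
    """Binary decision: keep the document only if its centroid is close to
    the ontology vector in both angle and magnitude.
    Both thresholds are illustrative assumptions, not values from the paper."""
    if not record_vectors or magnitude(ontology_vector) == 0:
        return False
    c = centroid(record_vectors)
    ratio = magnitude(c) / magnitude(ontology_vector)
    return cosine(c, ontology_vector) >= min_cosine and ratio >= min_magnitude_ratio


# Hypothetical three-dimensional ontology (for example Year, Make, Price),
# with one expected occurrence of each dimension per record.
ontology = [1.0, 1.0, 1.0]
car_ads = [[1, 1, 1], [1, 1, 0], [1, 1, 1]]   # records matching the ontology
off_topic = [[0, 0, 1], [0, 0, 0]]            # records that barely match

print(is_applicable(car_ads, ontology))    # True
print(is_applicable(off_topic, ontology))  # False
```

Checking the magnitude alongside the angle matters in this sketch because a document whose records mention the ontology's dimensions only rarely can still point in roughly the right direction while having a centroid far too short to be a plausible match.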

Author information

Authors and Affiliations

Department of Computer Science, Brigham Young University, Provo, Utah, 84602, USA
Linus W. Kwong & Yiu-Kai Ng

About this article

Cite this article

Kwong, L.W., Ng, YK. Performing Binary-Categorization on Multiple-Record Web Documents Using Information Retrieval Models and Application Ontologies. World Wide Web 6, 281–303 (2003). https://doi.org/10.1023/A:1024653618816

Issue Date: September 2003
DOI: https://doi.org/10.1023/A:1024653618816
