Performing Binary-Categorization on Multiple-Record Web Documents Using Information Retrieval Models and Application Ontologies

Journal: World Wide Web

Abstract

To retrieve Web documents of interest, most Web users rely on Web search engines. All existing search engines provide a query facility that lets users search for desired documents using keywords. However, when a search engine retrieves a long list of Web documents, the user might need to browse through each retrieved document to determine which ones are of interest. We observe two kinds of problems in the retrieval of Web documents: (1) an inappropriate selection of keywords by the user; and (2) poor precision in the retrieved Web documents. To solve these problems, we propose an automatic binary-categorization method for recognizing multiple-record Web documents of interest, which often appear as advertisement Web pages. Our categorization method uses application ontologies and is based on two information retrieval models, the Vector Space Model (VSM) and the Clustering Model (CM). We analyze and cull Web documents to just those applicable to a particular application ontology. The culling analysis (i) uses CM to find a virtual centroid for the records in a Web document, (ii) computes a vector in a multi-dimensional space for this centroid, and (iii) compares the vector with the predefined ontology vector of the same multi-dimensional space using VSM, considering the magnitudes of the vectors as well as the angle between them. Our experimental results show an average of 90% recall and 97% precision in recognizing Web documents that belong to the same category (i.e., domain of interest). Thus our categorization discards very few documents it should have kept and keeps very few it should have discarded.
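
The three culling steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so averaging per-record counts to form the centroid, the cosine measure for the angle, the magnitude-ratio test, and both thresholds are assumptions made for illustration. The names `centroid`, `cosine`, `is_applicable`, `min_cosine`, and `min_magnitude_ratio` are hypothetical, as are the sample dimensions.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]


def centroid(record_vectors: List[Vector], dimensions: List[str]) -> Vector:
    """Step (i)/(ii): average the per-record vectors into one centroid vector.

    Assumption: the "virtual centroid" is the plain mean over records.
    """
    if not record_vectors:
        return {d: 0.0 for d in dimensions}
    n = len(record_vectors)
    return {d: sum(r.get(d, 0.0) for r in record_vectors) / n for d in dimensions}


def magnitude(v: Vector) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v.values()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    ma, mb = magnitude(a), magnitude(b)
    if ma == 0.0 or mb == 0.0:
        return 0.0
    dot = sum(x * b.get(d, 0.0) for d, x in a.items())
    return dot / (ma * mb)


def is_applicable(
    record_vectors: List[Vector],
    ontology_vector: Vector,
    min_cosine: float = 0.8,           # hypothetical threshold
    min_magnitude_ratio: float = 0.5,  # hypothetical threshold
) -> bool:
    """Step (iii): compare the centroid with the ontology vector.

    Both the angle (via cosine) and the relative magnitude must pass.
    """
    dims = list(ontology_vector)
    c = centroid(record_vectors, dims)
    onto_mag = magnitude(ontology_vector)
    ratio = magnitude(c) / onto_mag if onto_mag else 0.0
    return cosine(c, ontology_vector) >= min_cosine and ratio >= min_magnitude_ratio


if __name__ == "__main__":
    # Hypothetical car-advertisement ontology with four dimensions.
    ontology = {"make": 1.0, "model": 1.0, "price": 1.0, "year": 1.0}
    car_ads = [
        {"make": 1.0, "model": 1.0, "price": 1.0, "year": 1.0},
        {"make": 1.0, "model": 1.0, "price": 1.0},
    ]
    price_list = [{"price": 1.0}, {"price": 1.0}]
    print(is_applicable(car_ads, ontology))     # kept
    print(is_applicable(price_list, ontology))  # discarded
```

Checking the magnitude in addition to the angle keeps a document whose records mention the ontology's concepts only sparsely from passing on direction alone; how the original method combines the two measures is not stated in the abstract.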

Cite this article

Kwong, L.W., Ng, YK. Performing Binary-Categorization on Multiple-Record Web Documents Using Information Retrieval Models and Application Ontologies. World Wide Web 6, 281–303 (2003). https://doi.org/10.1023/A:1024653618816

  • DOI: https://doi.org/10.1023/A:1024653618816
