A Probabilistic Model for Classification of Multiple-Record Web Documents

Tang, June; Ng, Yiu-Kai

doi:10.1007/978-1-4471-0299-1_29


Conference paper

Abstract

The amount of information available on the World Wide Web, which appears in various Web documents, has been increasing dramatically in recent years. Classification of Web documents is becoming an increasingly important method for organizing such information. In this paper, we adopt a probabilistic model to classify Web documents as relevant or irrelevant with respect to an application ontology. Our model is based on multivariate statistical analysis and differs from conventional probabilistic information retrieval models. The experiments we have conducted using our probabilistic model show promising results for the classification of multiple-record Web documents.
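The abstract does not spell out the model itself, so the following is only a rough illustration of how a relevant/irrelevant decision can be built on multivariate statistics. It swaps in a standard textbook technique, Fisher's two-population linear discriminant with a pooled covariance matrix, equal priors, and equal misclassification costs; it is not the authors' model. The two features (fraction of ontology object sets matched, number of record-like blocks), the sample values, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]  # hypothetical 2-feature document description


def mean(vectors: Sequence[Vector]) -> Vector:
    """Component-wise sample mean of 2-dimensional feature vectors."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def scatter(vectors: Sequence[Vector], mu: Vector) -> List[List[float]]:
    """2x2 matrix of summed outer products of deviations from the mean."""
    sxx = sum((v[0] - mu[0]) ** 2 for v in vectors)
    sxy = sum((v[0] - mu[0]) * (v[1] - mu[1]) for v in vectors)
    syy = sum((v[1] - mu[1]) ** 2 for v in vectors)
    return [[sxx, sxy], [sxy, syy]]


def fit_linear_discriminant(
    relevant: Sequence[Vector], irrelevant: Sequence[Vector]
) -> Tuple[Vector, float]:
    """Return (weights, threshold) of a two-class linear discriminant.

    Assumes both classes share one covariance matrix, estimated by pooling.
    """
    mu_r, mu_i = mean(relevant), mean(irrelevant)
    s_r, s_i = scatter(relevant, mu_r), scatter(irrelevant, mu_i)
    dof = len(relevant) + len(irrelevant) - 2
    pooled = [[(s_r[a][b] + s_i[a][b]) / dof for b in range(2)] for a in range(2)]

    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    inv = [
        [pooled[1][1] / det, -pooled[0][1] / det],
        [-pooled[1][0] / det, pooled[0][0] / det],
    ]

    # weights = inverse pooled covariance times the difference of class means
    diff = (mu_r[0] - mu_i[0], mu_r[1] - mu_i[1])
    w = (
        inv[0][0] * diff[0] + inv[0][1] * diff[1],
        inv[1][0] * diff[0] + inv[1][1] * diff[1],
    )
    # threshold = discriminant score of the midpoint between the class means
    mid = ((mu_r[0] + mu_i[0]) / 2, (mu_r[1] + mu_i[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def is_relevant(x: Vector, w: Vector, threshold: float) -> bool:
    """Classify a document as relevant if its score reaches the threshold."""
    return w[0] * x[0] + w[1] * x[1] >= threshold


# Made-up training data: (fraction of ontology matched, record-like blocks)
relevant_docs = [(0.9, 12.0), (0.8, 15.0), (0.85, 10.0), (0.95, 14.0)]
irrelevant_docs = [(0.2, 1.0), (0.3, 2.0), (0.1, 0.0), (0.25, 3.0)]

weights, cutoff = fit_linear_discriminant(relevant_docs, irrelevant_docs)
```

Calling `is_relevant((0.9, 13.0), weights, cutoff)` on a new document's feature vector then yields the binary decision. Unequal priors or costs, which a real application would likely need, would shift the threshold.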

Author information

Authors and Affiliations

Department of Computer Science, Brigham Young University, Provo, Utah, 84602, USA
June Tang & Yiu-Kai Ng

Editor information

Editors and Affiliations

School of Computing, Information Systems and Mathematics, South Bank University, 103 Borough Road, London, UK
Dilip Patel & Shushma Patel
School of Computing, London Guildhall University, 100 Minories, London, UK
Islam Choudhury
Department of Information Systems and Computing, Brunel University, Uxbridge, London, UK
Sergio de Cesare

About this paper

Cite this paper

Tang, J., Ng, YK. (2001). A Probabilistic Model for Classification of Multiple-Record Web Documents. In: Patel, D., Choudhury, I., Patel, S., de Cesare, S. (eds) OOIS 2000. Springer, London. https://doi.org/10.1007/978-1-4471-0299-1_29

DOI: https://doi.org/10.1007/978-1-4471-0299-1_29
Publisher Name: Springer, London
Print ISBN: 978-1-85233-420-8
Online ISBN: 978-1-4471-0299-1
eBook Packages: Springer Book Archive
