Abstract
The amount of information available on the World Wide Web, which appears in various Web documents, has been increasing dramatically in recent years. Classification of Web documents is becoming an increasingly important method for organizing such information. In this paper, we adopt a probabilistic model to classify Web documents as relevant or irrelevant with respect to an application ontology. Our model is based on multivariate statistical analysis and differs from conventional probabilistic information retrieval models. Experiments conducted with our probabilistic model show promising results for the classification of multiple-record Web documents.
Copyright information
© 2001 Springer-Verlag London Limited
About this paper
Cite this paper
Tang, J., Ng, YK. (2001). A Probabilistic Model for Classification of Multiple-Record Web Documents. In: Patel, D., Choudhury, I., Patel, S., de Cesare, S. (eds) OOIS 2000. Springer, London. https://doi.org/10.1007/978-1-4471-0299-1_29
DOI: https://doi.org/10.1007/978-1-4471-0299-1_29
Publisher Name: Springer, London
Print ISBN: 978-1-85233-420-8
Online ISBN: 978-1-4471-0299-1
eBook Packages: Springer Book Archive