Summary
In this paper, we present a theoretical framework for classifying web pages in a hierarchical directory using the Bayesian network formalism. In particular, we focus on the problem of multi-label text categorization, where a given document can be assigned to any number of categories in the hierarchy. The idea is to represent explicitly the dependence relationships between the categories in the hierarchy, in a representation that is also adapted to include the category descriptors. Given a new document (web page) to be classified, a Bayesian network inference process is used to compute the posterior probability of each category given the document. The web page is then assigned to the categories with the highest posterior probability.
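The summary does not give the network topology or the probability estimates, so the following is only a minimal sketch of the general idea under stated assumptions, not the chapter's model. It assumes a toy directory (the category names, descriptor terms, and all weights are hypothetical) and an additive conditional probability: a category's probability is the weighted sum of its active descriptor terms and of its sub-categories. Because the document's terms are the only evidence and they sit at the roots of this toy network, the posterior of each category follows exactly from linearity of expectation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy directory (an assumption, not taken from the chapter).
# Each category has weighted descriptor terms and weighted sub-categories.
# The weights attached to one category sum to 1, so the additive
# conditional probability used below always stays within [0, 1].
TERM_WEIGHTS: Dict[str, Dict[str, float]] = {
    "Science": {"research": 0.2},
    "Physics": {"quantum": 0.5, "particle": 0.5},
    "Biology": {"cell": 0.5, "gene": 0.5},
}
CHILD_WEIGHTS: Dict[str, Dict[str, float]] = {
    "Science": {"Physics": 0.4, "Biology": 0.4},
    "Physics": {},
    "Biology": {},
}


def posterior(category: str, doc_terms: Set[str], cache: Dict[str, float]) -> float:
    """P(category | document) under the assumed additive model."""
    if category not in cache:
        # Contribution of the descriptor terms that occur in the document.
        p = sum(w for t, w in TERM_WEIGHTS[category].items() if t in doc_terms)
        # Contribution propagated upwards from the sub-categories.
        p += sum(
            w * posterior(child, doc_terms, cache)
            for child, w in CHILD_WEIGHTS[category].items()
        )
        cache[category] = p
    return cache[category]


def classify(doc_terms: Set[str], top_k: int = 2) -> List[Tuple[str, float]]:
    """Return up to top_k categories with the highest non-zero posterior."""
    cache: Dict[str, float] = {}
    scores = {c: posterior(c, doc_terms, cache) for c in TERM_WEIGHTS}
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return [(c, scores[c]) for c in ranked[:top_k] if scores[c] > 0.0]


if __name__ == "__main__":
    # A page mentioning "quantum" and "research" is assigned to both
    # Physics and its parent Science (multi-label assignment).
    print(classify({"quantum", "research", "web"}))
```

Returning the top-ranked categories rather than a single best one is what makes the assignment multi-label; a fixed probability threshold would be an equally plausible selection rule.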
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
de Campos, L.M., Fernández-Luna, J.M., Huete, J.F. (2006). A Theoretical Framework for Web Categorization in Hierarchical Directories using Bayesian Networks. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_2
DOI: https://doi.org/10.1007/3-540-31590-X_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31588-9
Online ISBN: 978-3-540-31590-2
eBook Packages: Engineering (R0)