A Study of Approaches to Hypertext Categorization

Yang, Yiming; Slattery, Seán; Ghani, Rayid

doi:10.1023/A:1013685612819

Published: March 2002

Volume 18, pages 219–241 (2002)

Journal of Intelligent Information Systems

Yiming Yang¹, Seán Slattery¹ & Rayid Ghani²,³

Abstract

Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related Web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our results show that the identification of hypertext regularities in the data and the selection of appropriate representations for hypertext in particular domains are crucial, but seldom obvious, in real-world problems. We found that adding the words in the linked neighborhood (both inlinks and outlinks) to the page having those links was helpful for all our classifiers on one dataset, but more harmful than helpful for two out of the three classifiers on the remaining datasets. We also observed that extracting meta data from related Web sites was extremely useful for improving classification accuracy in some of those domains. Finally, the relative performance of the classifiers being tested provided insights into their strengths and limitations for solving classification problems involving diverse and often noisy Web pages.
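
One of the representations mentioned in the abstract, adding the words of linked neighbors to a page, can be illustrated with a small sketch. This is not the authors' implementation: the toy graph `PAGES`, the whitespace tokenizer, and the function names `inlinks` and `augmented_bag` are all hypothetical, chosen only to show the general idea of merging inlink and outlink neighbor text into a page's bag of words.

```python
from collections import Counter

# Hypothetical toy web graph: page id -> (text, list of outlinked page ids).
PAGES = {
    "a": ("machine learning course", ["b"]),
    "b": ("course syllabus", ["c"]),
    "c": ("faculty home page", []),
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def inlinks(pages):
    """Invert the outlink lists: for each page, list the pages that link to it."""
    incoming = {pid: [] for pid in pages}
    for src, (_, outs) in pages.items():
        for dst in outs:
            if dst in incoming:
                incoming[dst].append(src)
    return incoming


def augmented_bag(pages, pid):
    """Bag of words for `pid` plus the words of its inlink and outlink neighbors."""
    incoming = inlinks(pages)
    text, outs = pages[pid]
    bag = Counter(tokenize(text))
    # Each distinct neighbor contributes its words once, whichever way the link points.
    for neighbor in set(outs) | set(incoming[pid]):
        if neighbor in pages and neighbor != pid:
            bag.update(tokenize(pages[neighbor][0]))
    return bag
```

Calling `augmented_bag(PAGES, "b")` merges the text of page `b` with that of `a` (an inlink neighbor) and `c` (an outlink neighbor). As the abstract notes, whether such augmentation helps or hurts depends on the dataset, so a sketch like this would be one candidate representation to evaluate rather than a default.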

Author information

Authors and Affiliations

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Yiming Yang & Seán Slattery
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Rayid Ghani
Accenture Technology Labs—Research, Northbrook, IL, 60062, USA
Rayid Ghani

About this article

Cite this article

Yang, Y., Slattery, S. & Ghani, R. A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems 18, 219–241 (2002). https://doi.org/10.1023/A:1013685612819

Issue Date: March 2002
DOI: https://doi.org/10.1023/A:1013685612819
