Costco: Robust Content and Structure Constrained Clustering of Networked Documents

Yan, Su; Lee, Dongwon; Wang, Alex Hai

doi:10.1007/978-3-642-19437-5_24

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6609)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Connectivity analysis of networked documents provides high-quality link structure information, which is usually lost in a content-based learning system. It is well known that combining links and content has the potential to improve text analysis. However, exploiting link structure is non-trivial because links are often noisy and sparse. Moreover, it is difficult to balance term-based content analysis and link-based structure analysis so as to reap the benefits of both. We introduce a novel networked document clustering technique that integrates content and link information in a unified optimization framework. Under this framework, a novel dimensionality reduction method called COntent & STructure COnstrained (Costco) Feature Projection is developed. To extract robust link information from sparse and noisy link graphs, two link analysis methods are introduced. Experiments on benchmark data and diverse real-world text corpora validate the effectiveness of the proposed methods.

Author information

Authors and Affiliations

IBM Almaden Research Center, San Jose, CA, 95120, USA
Su Yan
The Pennsylvania State University, University Park, PA, 16802, USA
Dongwon Lee
The Pennsylvania State University, Dunmore, PA, 18512, USA
Alex Hai Wang

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico
Alexander Gelbukh

Cite this paper

Yan, S., Lee, D., Wang, A.H. (2011). Costco: Robust Content and Structure Constrained Clustering of Networked Documents. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19437-5_24

DOI: https://doi.org/10.1007/978-3-642-19437-5_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19436-8
Online ISBN: 978-3-642-19437-5
eBook Packages: Computer Science, Computer Science (R0)
