DOI: 10.1145/1458082.1458241
research-article

Scalable community discovery on textual data with relations

Published: 26 October 2008

ABSTRACT

Every piece of textual data is written to convey its authors' opinions on specific topics. Authors deliberately organize their writing and create links, such as references and acknowledgments, to express themselves better. It is therefore of interest to study texts together with their relations in order to understand the underlying topics and communities. Although the literature offers many approaches to data clustering and topic mining, they are not applicable to community discovery on large document corpora for several reasons. First, few of them consider both textual attributes and relations. Second, scalability remains a significant issue for large-scale datasets. Third, most algorithms rely on a set of initial parameters that are hard to determine and tune. Motivated by these observations, this paper proposes a hierarchical community model that distinguishes community cores from affiliated members, and presents a scalable community discovery solution for large-scale document corpora. The proposed method first quickly identifies potential cores as seeds of communities through relation analysis. To eliminate the influence of initial parameters, it then applies an attribute-based core merge process, so that the algorithm is designed to return consistent communities regardless of the initial parameters. Experimental results suggest that the proposed method scales well with corpus size and feature dimensionality, and improves topical precision by more than 15% compared with popular clustering techniques.

          Published in

          CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
          October 2008
          1562 pages
          ISBN:9781595939913
          DOI:10.1145/1458082

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
