ABSTRACT
Every piece of textual data is written to convey its authors' opinions on specific topics. Authors deliberately organize their writings and create links, e.g., references and acknowledgments, to express themselves better. It is therefore of interest to study texts together with their relations in order to understand the underlying topics and communities. Although the literature contains many efforts in data clustering and topic mining, they are not applicable to community discovery on large document corpora, for several reasons. First, few of them consider both textual attributes and relations. Second, scalability remains a significant issue for large-scale datasets. Third, most algorithms rely on a set of initial parameters that are hard to determine and tune. Motivated by these observations, this paper proposes a hierarchical community model that distinguishes community cores from affiliated members, and presents a scalable community discovery solution for large-scale document corpora. The proposed method first uses relation analysis to quickly identify potential cores, which serve as seeds of communities. To eliminate the influence of initial parameters, it then applies an attribute-based core merge process, so that the algorithm returns consistent communities regardless of the initial parameters. Experimental results suggest that the proposed method scales well with corpus size and feature dimensionality, and improves topical precision by more than 15% compared with popular clustering techniques.
Index Terms
- Scalable community discovery on textual data with relations