Scalable community discovery on textual data with relations

Authors:
Huajing Li, The Pennsylvania State University, University Park, PA, USA
Zaiqing Nie, Microsoft Research Asia, Beijing, China
Wang-Chien Lee, The Pennsylvania State University, University Park, PA, USA
Lee Giles, The Pennsylvania State University, University Park, PA, USA
Ji-Rong Wen, Microsoft Research Asia, Beijing, China

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008
Pages 1203–1212
https://doi.org/10.1145/1458082.1458241

Published: 26 October 2008

ABSTRACT

Every piece of textual data is written to convey its authors' opinions on specific topics. Authors deliberately organize their writing and create links, i.e., references and acknowledgments, to express themselves better. It is therefore of interest to study texts together with their relations in order to understand the underlying topics and communities. Although the literature offers many approaches to data clustering and topic mining, they are not applicable to community discovery on large document corpora, for several reasons. First, few of them consider both textual attributes and relations. Second, scalability remains a significant issue for large-scale datasets. Third, most algorithms rely on a set of initial parameters that are hard to determine and tune. Motivated by these observations, this paper proposes a hierarchical community model that distinguishes community cores from affiliated members, and presents a scalable community discovery solution for large-scale document corpora. The approach quickly identifies potential cores as seeds of communities through relation analysis. To eliminate the influence of initial parameters, an attribute-based core merge process is introduced so that the algorithm returns consistent communities regardless of the initial parameters. Experimental results suggest that the proposed method scales well with corpus size and feature dimensionality, with more than 15% improvement in topical precision compared with popular clustering techniques.
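The abstract names three ingredients: cores found through relation analysis, an attribute-based merge of cores, and affiliated members attached to cores. The following is only a minimal sketch of how such a pipeline could be wired together, not the paper's algorithm. Every concrete choice here is an assumption made for illustration: treating mutually linked documents as core seeds, bag-of-words vectors with cosine similarity as the attribute measure, the `threshold` parameter, and all function names and toy data. In particular, the sketch relies on a fixed threshold and so does not reproduce the parameter-insensitivity the abstract claims.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(doc_ids, vectors):
    """Summed term counts of a group of documents."""
    total = Counter()
    for d in doc_ids:
        total.update(vectors[d])
    return total

def find_cores(links):
    """Relation step (assumed rule): documents joined by mutual links seed a core."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in links:
        if a != b and (b, a) in links:
            parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

def merge_cores(cores, vectors, threshold):
    """Attribute step (assumed rule): merge cores whose centroids are similar."""
    cores = [set(c) for c in cores]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(cores)), 2):
            ci, cj = centroid(cores[i], vectors), centroid(cores[j], vectors)
            if cosine(ci, cj) >= threshold:
                cores[i] |= cores[j]
                del cores[j]
                merged = True
                break  # restart the scan after every merge
    return cores

def discover(vectors, links, threshold=0.5):
    """Return communities as {'core': set, 'members': set}, ordered by smallest core id."""
    cores = merge_cores(find_cores(links), vectors, threshold)
    communities = [{"core": c, "members": set()} for c in sorted(cores, key=min)]
    in_core = set().union(*cores)
    for doc in sorted(vectors):
        if doc in in_core or not communities:
            continue
        sim = lambda c: cosine(vectors[doc], centroid(c["core"], vectors))
        best = max(communities, key=sim)
        if sim(best) > 0:  # attach only if the texts share at least one term
            best["members"].add(doc)
    return communities

# Hypothetical toy corpus: term counts per document and directed links.
vectors = {
    "d1": {"graph": 2, "cluster": 1},
    "d2": {"graph": 1, "cluster": 2},
    "d3": {"graph": 2, "community": 1},
    "d4": {"graph": 1, "community": 1, "cluster": 1},
    "d5": {"protein": 2, "gene": 1},
    "d6": {"protein": 1, "gene": 2},
    "d7": {"gene": 1},
}
links = {("d1", "d2"), ("d2", "d1"), ("d3", "d4"), ("d4", "d3"),
         ("d5", "d6"), ("d6", "d5"), ("d7", "d5")}

result = discover(vectors, links)
for community in result:
    print(sorted(community["core"]), sorted(community["members"]))
```

On this toy input the two graph-related seed cores are merged because their centroids are similar, while `d7`, which has only a one-way link, ends up as an affiliated member of the gene-related core rather than part of it.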

Index Terms

Scalable community discovery on textual data with relations
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering
2. Software and its engineering

Published in
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008
1562 pages
ISBN: 9781595939913
DOI: 10.1145/1458082
General Chair:
James G. Shanahan, Church and Duncan Group Inc, USA
Program Chairs:
Sihem Amer-Yahia, Yahoo! Research, USA
Ioana Manolescu, INRIA, France
Yi Zhang, University of California, Santa Cruz, USA
David A. Evans, JustSystems Evans Research, USA
Alek Kolcz, Microsoft Live Labs, USA
Key-Sun Choi, KAIST, Korea
Abdur Chowdury, Twitter, USA
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering
community mining
relational data
Qualifiers
- research-article
Acceptance Rates
Overall acceptance rate: 1,861 of 8,427 submissions (22%)