research-article

CluChunk: clustering large scale user-generated content incorporating chunklet information

Authors:

Alok Choudhary

BigMine '12: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications

Pages 12 - 19

https://doi.org/10.1145/2351316.2351319

Published: 12 August 2012

Abstract

The exponential rise of online content in the form of blogs, microblogs, forums, and multimedia sharing sites has created an urgent demand for efficient, high-quality text clustering algorithms that organize documents well enough for users to navigate and browse them quickly. For several kinds of user-generated content, it is much easier to obtain the input in small sets, where the data in each set belongs to the same class but the class label is unknown. Such data is viewed as weakly labeled, and the inherent chunklet information is very useful for improving clustering performance. In this paper, we propose CluChunk (clustering chunklet data), a system that clusters unlabeled web data by incorporating chunklet information. We transform the original feature space with a discriminatively learned linear transformation so that simple unsupervised learning techniques (such as K-Means) can achieve good clustering accuracy in the transformed space. Using large-scale data from web applications (social media and online forums), we demonstrate that clustering performance can be significantly improved by: 1) incorporating the inherent weakly labeled information into the clustering framework; 2) enriching the representation of short text with additional features extracted from the chunklet subset. The proposed approach can be applied to other mining tasks on large-scale user-generated content, such as product review summarization and blog content clustering or classification.
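The abstract's central idea, learning a linear transformation from chunklets and then running K-Means in the transformed space, can be illustrated with a minimal sketch. The abstract does not specify how CluChunk learns its transformation, so the code below swaps in relevant component analysis (RCA), a known chunklet-based technique that whitens the data with the within-chunklet covariance. The function names (`rca_transform`, `kmeans`, `cluster_with_chunklets`), the `-1` convention for points outside any chunklet, and the `eps` regularizer are all assumptions made for illustration, not part of the paper.

```python
import numpy as np


def rca_transform(X, chunklet_ids, eps=1e-6):
    """Learn an RCA-style whitening matrix from within-chunklet scatter.

    chunklet_ids[i] is the chunklet of row i; -1 (an assumed convention)
    marks rows that belong to no chunklet and are ignored here.
    """
    X = np.asarray(X, dtype=float)
    ids = np.asarray(chunklet_ids)
    d = X.shape[1]
    scatter = np.zeros((d, d))
    count = 0
    for c in np.unique(ids):
        if c < 0:
            continue
        members = X[ids == c]
        centered = members - members.mean(axis=0)  # remove the chunklet mean
        scatter += centered.T @ centered
        count += len(members)
    if count == 0:
        raise ValueError("no chunklet members found")
    # Small ridge term (assumed) keeps the covariance invertible.
    cov = scatter / count + eps * np.eye(d)
    evals, evecs = np.linalg.eigh(cov)
    # Inverse matrix square root: W = cov^(-1/2), symmetric by construction.
    return evecs @ np.diag(evals ** -0.5) @ evecs.T


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-Means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cluster_with_chunklets(X, chunklet_ids, k):
    """Transform the feature space with RCA, then cluster with K-Means."""
    W = rca_transform(X, chunklet_ids)
    return kmeans(np.asarray(X, dtype=float) @ W, k)
```

Calling `cluster_with_chunklets(X, chunklet_ids, k)` on a feature matrix `X` returns one cluster label per row. The whitening step shrinks directions in which members of the same chunklet vary a lot, on the assumption that such variation is irrelevant to class membership, which is why a simple method like K-Means can do better in the transformed space.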


Cited By

Naik S, Pawar J (2017). Mining association rules between values across attributes in data streams. 2017 International Conference on Computational Intelligence in Data Science (ICCIDS), pages 1-6. Online publication date: Jun-2017.
https://doi.org/10.1109/ICCIDS.2017.8272634
Cheng Y, Chen Z, Liu L, Wang J, Agrawal A, Choudhary A, He Q, Iyengar A, Nejdl W, Pei J, Rastogi R (2013). Feedback-driven multiclass active learning for data streams. Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1311-1320. Online publication date: 27-Oct-2013.
https://dl.acm.org/doi/10.1145/2505515.2505528
Cheng Y, Xie Y, Chen Z, Agrawal A, Choudhary A, Guo S, Grossman R, Uthurusamy R, Dhillon I, Koren Y (2013). JobMiner. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1450-1453. Online publication date: 11-Aug-2013.
https://dl.acm.org/doi/10.1145/2487575.2487704


Published In

BigMine '12: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications

August 2012

134 pages

ISBN:9781450315470

DOI:10.1145/2351316

Program Chairs:
Wei Fan, IBM T.J. Watson Research
Albert Bifet, University of Waikato
Qiang Yang, Hong Kong University of Science and Technology
Philip Yu, University of Illinois at Chicago

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

KDD '12

KDD '12: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 12, 2012

Beijing, China

Acceptance Rates

Overall Acceptance Rate 13 of 23 submissions, 57%


Bibliometrics

Total Citations: 3
Total Downloads: 302
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 14 Feb 2025
