DOI: 10.1145/2351316.2351319
research-article

CluChunk: clustering large scale user-generated content incorporating chunklet information

Published: 12 August 2012

Abstract

The exponential rise of online content in the form of blogs, microblogs, forums, and multimedia sharing sites has created an urgent demand for efficient, high-quality text clustering algorithms that organize documents better and so let users navigate and browse quickly. For several kinds of user-generated content, it is much easier to obtain the input in small sets, where the data in each set belong to the same class but the class label is unknown. Such data are viewed as weakly labeled, and the inherent chunklet information is very useful for improving clustering performance. In this paper, we propose a system, CluChunk (clustering chunklet data), that clusters unlabeled web data by incorporating chunklet information. We transform the original feature space with a discriminatively learned linear transformation such that simple unsupervised learning techniques (such as K-Means) in the transformed space can achieve good clustering accuracy. Using large-scale data from several web applications (social media and online forums), we demonstrate that clustering performance improves significantly by: 1) incorporating the inherent weakly labeled information into the clustering framework; 2) enriching the representation of short text with additional features extracted from the chunklet subset. The proposed approach can be applied to other mining tasks with large-scale user-generated content, such as product review summarization and blog content clustering/classification.
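The idea of learning a linear transformation from chunklets and then running K-Means in the transformed space can be illustrated with a small sketch. The abstract does not specify the transformation CluChunk learns, so the code below swaps in a deliberately simple stand-in: a diagonal rescaling that divides each feature by its pooled within-chunklet standard deviation. The function names, the toy data, and the farthest-point initialization are all assumptions made for illustration, not the paper's method.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_chunklet_scaling(chunklets, eps=1e-9):
    """Learn a diagonal linear transformation from chunklets (assumed stand-in).

    Each feature is divided by its pooled standard deviation inside
    chunklets, so features that vary a lot within a class are shrunk.
    """
    dim = len(chunklets[0][0])
    sq_sum = [0.0] * dim
    count = 0
    for chunk in chunklets:
        mean = [sum(p[d] for p in chunk) / len(chunk) for d in range(dim)]
        for p in chunk:
            for d in range(dim):
                sq_sum[d] += (p[d] - mean[d]) ** 2
            count += 1
    return [1.0 / math.sqrt(sq_sum[d] / count + eps) for d in range(dim)]

def transform(points, scale):
    """Apply the diagonal transformation to every point."""
    return [[x * s for x, s in zip(p, scale)] for p in points]

def kmeans(points, k, iters=20):
    """Lloyd's K-Means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: feature 0 separates the two classes, feature 1 is large noise.
class_a = [[0.0, -10.0], [0.1, 10.0], [0.05, -8.0], [0.0, 9.0]]
class_b = [[1.0, -9.0], [1.1, 10.0], [0.95, 8.0], [1.0, -10.0]]
points = class_a + class_b

# Chunklets: small sets known to share a class, with the class itself unknown.
chunklets = [class_a[:2], class_a[2:], class_b[:2], class_b[2:]]

scale = within_chunklet_scaling(chunklets)
raw_labels = kmeans(points, k=2)                    # clusters follow the noise
new_labels = kmeans(transform(points, scale), k=2)  # clusters follow the classes
```

On this toy data, K-Means in the raw space splits along the noisy feature, while K-Means after the chunklet-derived rescaling groups the first four points together and the last four together. A full-matrix whitening of the within-chunklet covariance would be the natural generalization of this diagonal sketch.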

Cited By

  • (2017) Mining association rules between values across attributes in data streams. 2017 International Conference on Computational Intelligence in Data Science (ICCIDS), pages 1-6. DOI: 10.1109/ICCIDS.2017.8272634. Online publication date: Jun-2017
  • (2013) Feedback-driven multiclass active learning for data streams. Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1311-1320. DOI: 10.1145/2505515.2505528. Online publication date: 27-Oct-2013
  • (2013) JobMiner. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1450-1453. DOI: 10.1145/2487575.2487704. Online publication date: 11-Aug-2013

Published In

BigMine '12: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
August 2012
134 pages
ISBN:9781450315470
DOI:10.1145/2351316

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. chunklet
  2. data transformation
  3. text clustering
  4. user-generated content

Qualifiers

  • Research-article

Conference

KDD '12

Acceptance Rates

Overall Acceptance Rate 13 of 23 submissions, 57%