Parallel Massive Clustering of Discrete Distributions

Published: 02 June 2015

Abstract

The trend toward analyzing big data in artificial intelligence demands highly scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm was developed to cluster data represented by bags of weighted vectors, which are widely adopted data signatures in many emerging information retrieval and multimedia learning applications. However, the high computational complexity of D2-clustering limits its impact on massive learning problems. Here we present the parallel D2-clustering (PD2-clustering) algorithm, which has substantially improved scalability. We developed a hierarchical multipass algorithm structure for parallel computing that balances the computation on individual nodes against the integration process of the algorithm. We conduct experiments and extensive comparisons between PD2-clustering and other clustering algorithms on synthetic datasets. The results show that the proposed parallel algorithm achieves significant speed-up with minor accuracy loss. We apply PD2-clustering to image concept learning. In addition, by extending D2-clustering to symbolic data, we apply PD2-clustering to protein sequence clustering. For both applications, we demonstrate that our new algorithm is highly competitive with other state-of-the-art methods.
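
The hierarchical multipass idea mentioned in the abstract can be illustrated with a rough sketch. This is not the authors' PD2-clustering implementation: it assumes plain k-means on ordinary vectors as a stand-in for the per-node D2-clustering step, ignores cluster sizes when centroids are passed up a level, and uses hypothetical names (`kmeans`, `multipass_cluster`, `chunk_size`, `workers`) chosen only for illustration. It shows the general pattern of clustering chunks independently and then re-clustering the resulting centroids until the data fits in a single chunk.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd k-means on tuples (stand-in for the per-node clustering step)."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # Recompute each center as the group mean; keep the old center if a group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def multipass_cluster(points, k, chunk_size=50, workers=4):
    """Cluster chunks independently, then repeat on the chunk centroids."""
    if k >= chunk_size:
        raise ValueError("k must be smaller than chunk_size so each pass shrinks the data")
    level = list(points)
    while len(level) > chunk_size:
        chunks = [level[i:i + chunk_size] for i in range(0, len(level), chunk_size)]
        # Chunks do not depend on each other, so they can be processed concurrently.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda c: kmeans(c, k), chunks))
        # Integration step (simplified): the centroids become the next pass's input.
        level = [center for centers in results for center in centers]
    return kmeans(level, k)
```

In this sketch, `chunk_size` controls the trade-off the abstract alludes to: larger chunks mean more work per node and fewer passes, while smaller chunks mean cheaper per-node work and more integration passes.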

Supplementary Material

a49-zhang-app.pdf (zhang.zip)
Supplemental movie, appendix, image and software files for, A reward-and-punishment-based approach for concept detection using adaptive ontology rules

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 11, Issue 4
April 2015
231 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/2788342
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 June 2015
Accepted: 01 December 2014
Revised: 01 September 2014
Received: 01 May 2014
Published in TOMM Volume 11, Issue 4

Author Tags

  1. Discrete distribution
  2. clustering
  3. image annotation
  4. large-scale learning
  5. parallel computing
  6. protein clustering

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Science Foundation

Cited By

  • (2023) 2BiVQA: Double Bi-LSTM based Video Quality Assessment of UGC Videos. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3632178. Online publication date: 8-Nov-2023
  • (2023) Unlocking the Emotional World of Visual Media: An Overview of the Science, Research, and Impact of Understanding Emotion. Proceedings of the IEEE 111:10 (1236-1286). DOI: 10.1109/JPROC.2023.3273517. Online publication date: Oct-2023
  • (2022) Balance-driven automatic clustering for probability density functions using metaheuristic optimization. International Journal of Machine Learning and Cybernetics 14:4 (1063-1078). DOI: 10.1007/s13042-022-01683-8. Online publication date: 22-Oct-2022
  • (2021) Optimal Transport With Relaxed Marginal Constraints. IEEE Access 9 (58142-58160). DOI: 10.1109/ACCESS.2021.3072613. Online publication date: 2021
  • (2021) Unsupervised and Semisupervised Learning. Wiley StatsRef: Statistics Reference Online (1-18). DOI: 10.1002/9781118445112.stat08320. Online publication date: 18-Aug-2021
  • (2020) Automatic Transformation of a Video Using Multimodal Information for an Engaging Exploration Experience. Applied Sciences 10:9 (3056). DOI: 10.3390/app10093056. Online publication date: 27-Apr-2020
  • (2019) Survey of Compressed Domain Video Summarization Techniques. ACM Computing Surveys 52:6 (1-29). DOI: 10.1145/3355398. Online publication date: 16-Oct-2019
  • (2019) Color Theme-based Aesthetic Enhancement Algorithm to Emulate the Human Perception of Beauty in Photos. ACM Transactions on Multimedia Computing, Communications, and Applications 15:2s (1-17). DOI: 10.1145/3328991. Online publication date: 3-Jul-2019
  • (2019) Objective Reduction Using Objective Sampling and Affinity Propagation for Many-Objective Optimization Problems. IEEE Access 7 (68392-68403). DOI: 10.1109/ACCESS.2019.2914069. Online publication date: 2019
  • (2017) Fast Discrete Distribution Clustering Using Wasserstein Barycenter With Sparse Support. IEEE Transactions on Signal Processing 65:9 (2317-2332). DOI: 10.1109/TSP.2017.2659647. Online publication date: 1-May-2017
