Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases

Authors:

Karin Verspoor

Journal of Data and Information Quality (JDIQ), Volume 9, Issue 3

Article No.: 17, Pages 1 - 27

https://doi.org/10.1145/3131611

Published: 27 January 2018

Abstract

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis.

Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale to high volumes of data, heuristic approaches have been adopted, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for more effective use of sequence clustering tools for deduplication.
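To make the threshold heuristic concrete, the following is a minimal sketch rather than the implementation of either tool. It assumes a greedy incremental scheme in which each sequence is compared only against cluster representatives, and it uses Python's `difflib.SequenceMatcher` ratio as a stand-in for alignment-based sequence identity. The function names (`similarity`, `greedy_cluster`, `cohesion`, `remaining_redundancy`) are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not a tool's identity score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering (assumed scheme).

    Sequences are processed longest first. Each one joins the first cluster
    whose representative it matches at or above `threshold`; otherwise it
    becomes the representative of a new cluster.
    """
    clusters = []  # clusters[i][0] is the representative of cluster i
    for seq in sorted(sequences, key=len, reverse=True):
        for members in clusters:
            if similarity(members[0], seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def cohesion(members):
    """Mean pairwise similarity inside one cluster (1.0 for a singleton)."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def remaining_redundancy(clusters, threshold):
    """Pairs of representatives that still meet the threshold after clustering."""
    reps = [members[0] for members in clusters]
    return [(a, b) for a, b in combinations(reps, 2)
            if similarity(a, b) >= threshold]


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTACGTAC"]
    result = greedy_cluster(toy, threshold=0.85)
    for members in result:
        print(members[0], len(members), round(cohesion(members), 3))
    print("redundant representative pairs:", remaining_redundancy(result, 0.85))
```

Because members are compared only with the representative and never with each other, two members of the same cluster can be less similar to one another than the threshold suggests. This is one way in which lower thresholds and larger clusters can reduce cohesion in a sketch of this kind.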

Published In

Journal of Data and Information Quality Volume 9, Issue 3

Special Issue on Improving the Veracity and Value of Big Data

September 2017

140 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/3183573

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 January 2018

Accepted: 01 July 2017

Revised: 01 July 2017

Received: 01 March 2017

Published in JDIQ Volume 9, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Melbourne International Research Scholarship from the University of Melbourne
Australian Research Council through a Discovery Project
