DOI:10.1145/2851613.2851887
research-article

Eventually consistent cardinality estimation with applications in biodata mining

Published: 04 April 2016

Abstract

Large set cardinality estimators and other streaming-oriented operations are a cornerstone of big data processing. Combined with in-memory storage systems, cardinality estimators provide a fast framework for keeping valuable application data easily queryable and maintainable. This has many applications. A common use case is maintaining a number of counters that monitor application statistics for real-time dashboards. Another is large set size estimation for internal operations of big data systems, such as counting. This paper addresses the issue of scaling the computation of a cardinality estimator in a distributed setting in the presence of node failures. Moreover, eventual consistency is proved for the proposed estimation technique, which is adequate for most distributed applications. To the best of the authors' knowledge, this functionality is not currently provided by commonly used commercial and open source systems. Additionally, the proposed approach is generic enough to be applied to other algorithms, which can help build a basic framework for more complex operations in the big data field. We demonstrate this with graph metric calculation applications in the field of large-scale biodata mining.
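As an illustration of why cardinality sketches lend themselves to eventual consistency, the following is a minimal sketch under stated assumptions, not the construction proposed in the paper. The abstract does not name the estimator or the replication protocol, so a HyperLogLog-style sketch is assumed here, and the class name `HyperLogLog`, the precision parameter `p`, and the SHA-1 based hash are all hypothetical choices made for the example. The point it shows is that merging two sketches by register-wise maximum is commutative, associative, and idempotent, so replicas that exchange state in any order, possibly more than once after a node failure, end up with the same registers.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog-style sketch with 2**p registers (assumed design)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash value from SHA-1 (an arbitrary choice for this example).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                   # leading p bits select a register
        rest = h & ((1 << width) - 1)      # remaining bits determine the rank
        # Rank is the 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max: commutative, associative, and idempotent.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)   # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction via linear counting.
            return m * math.log(m / zeros)
        return raw


# Two replicas observe overlapping parts of the same stream.
replica_a, replica_b = HyperLogLog(), HyperLogLog()
for i in range(15000):
    replica_a.add(f"user-{i}")
for i in range(10000, 25000):
    replica_b.add(f"user-{i}")

# Merging in either order yields identical state.
ab = HyperLogLog()
ab.merge(replica_a)
ab.merge(replica_b)
ba = HyperLogLog()
ba.merge(replica_b)
ba.merge(replica_a)
print(ab.registers == ba.registers, round(ab.estimate()))
```

Because a repeated merge leaves the registers unchanged, a node that recovers from a failure can simply resend its whole sketch without risk of double counting. With 4096 registers the standard error of the estimate is roughly 1.6%.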

Published In

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing
April 2016
2360 pages
ISBN:9781450337397
DOI:10.1145/2851613

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. BioData mining
  2. CAP theorem
  3. cardinality estimation
  4. large graph metrics
  5. on-line analytics

Qualifiers

  • Research-article

Conference

SAC 2016: Symposium on Applied Computing
April 4 - 8, 2016
Pisa, Italy

Acceptance Rates

SAC '16 Paper Acceptance Rate 252 of 1,047 submissions, 24%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
