research-article

Eventually consistent cardinality estimation with applications in biodata mining

Authors:

Georgios Drakopoulos,

Stavros Kontopoulos,

Christos Makris

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing

Pages 941 - 944

https://doi.org/10.1145/2851613.2851887

Published: 04 April 2016

Abstract

Large set cardinality estimators and other streaming-oriented operations are a cornerstone of big data processing. Combined with in-memory storage systems, cardinality estimators provide a fast framework for keeping valuable application data easily queryable and maintainable. This has many applications. For instance, a common use case is maintaining a number of counters that monitor application statistics for real-time dashboards. Another is large set size estimation for internal operations of big data systems, such as counting. This paper addresses the issue of scaling the computation of a cardinality estimator in the presence of node failures in a distributed setting. Moreover, eventual consistency is proved for the proposed estimation technique, which is adequate for most distributed applications. To the best of the authors' knowledge, this functionality is not currently provided by commonly used commercial and open source systems. Additionally, the proposed approach is generic enough to be applied to other algorithms, which can help build a basic framework for more complex operations in the big data field. We demonstrate this with graph metric calculation applications in the large-scale biodata mining field.
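The abstract does not spell out the proposed estimator, so the following is only a minimal sketch of one reason sketch-based estimators suit eventually consistent settings. It uses a simplified HyperLogLog as a stand-in and should not be read as the authors' method. Its merge is a register-wise maximum, which is commutative, associative, and idempotent, so replicas that exchange state in any order, or more than once after a failure and retry, converge to the same sketch. The class name `HyperLogLog`, the parameter `p`, and the SHA-1 based hash are assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers (illustrative assumption)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take a 64-bit hash; the top p bits select a register.
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width
        rest = h & ((1 << width) - 1)
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max: order and repetition of merges do not matter.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # standard constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw


# Two hypothetical nodes observe overlapping parts of a stream.
node_a, node_b = HyperLogLog(), HyperLogLog()
for i in range(12000):
    node_a.add(i)
for i in range(8000, 20000):
    node_b.add(i)

# A replica that receives both states, in any order, ends up identical.
replica = HyperLogLog()
replica.merge(node_b)
replica.merge(node_a)
replica.merge(node_b)  # a duplicated message changes nothing
print(round(replica.estimate()))
```

Because a repeated or reordered merge leaves the registers unchanged, a node that fails and later resends its state does not distort the count, which is the property an eventually consistent scheme relies on.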

Index Terms

Eventually consistent cardinality estimation with applications in biodata mining
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
2. Theory of computation
  1. Design and analysis of algorithms
    1. Distributed algorithms
    2. Streaming, sublinear and near linear time algorithms
      1. Sketching and sampling

Published In

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing

April 2016

2360 pages

ISBN:9781450337397

DOI:10.1145/2851613

Conference Chair:
Sascha Ossowski
University Rey Juan Carlos, Spain

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SAC 2016

Sponsor:

SIGAPP

SAC 2016: Symposium on Applied Computing

April 4 - 8, 2016

Pisa, Italy

Acceptance Rates

SAC '16 Paper Acceptance Rate 252 of 1,047 submissions, 24%;

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Bibliometrics

Total Citations: 13
Total Downloads: 88
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 17 Feb 2025

Citations

Cited By

Drakopoulos G, Bardis G, Mylonas P (2024). Power Iteration Graph Clustering With functools Higher Order Methods. 2024 19th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP), pp. 182-189. Online publication date: 21-Nov-2024.
https://doi.org/10.1109/SMAP63474.2024.00042
Drakopoulos G, Mylonas P (2024). Clustering MBTI Personalities With Graph Filters And Self Organizing Maps Over Pinecone. 2024 IEEE International Conference on Big Data (BigData), pp. 5674-5681. Online publication date: 15-Dec-2024.
https://doi.org/10.1109/BigData62323.2024.10825637
Peng C, Lin G, Lin J, Chen G, Liu W (2023). Predicting Alzheimer's Disease with AI and Brain Imaging Data. Artificial Intelligence Applications and Innovations, pp. 291-301. Online publication date: 1-Jun-2023.
https://doi.org/10.1007/978-3-031-34111-3_25
Drakopoulos G, Kafeza E, Mylonas P, Sioutas S (2021). Approximate High Dimensional Graph Mining With Matrix Polar Factorization: A Twitter Application. 2021 IEEE International Conference on Big Data (Big Data), pp. 4441-4449. Online publication date: 15-Dec-2021.
https://doi.org/10.1109/BigData52589.2021.9671926
Drakopoulos G, Kafeza E, Mylonas P, al Katheeri H (2020). Building Trusted Startup Teams From LinkedIn Attributes: A Higher Order Probabilistic Analysis. 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 867-874. Online publication date: Nov-2020.
https://doi.org/10.1109/ICTAI50040.2020.00136
Drakopoulos G, Mylonas P (2020). Evaluating graph resilience with tensor stack networks: a Keras implementation. Neural Computing and Applications 32:9, pp. 4161-4176. Online publication date: 1-May-2020.
https://dl.acm.org/doi/10.1007/s00521-020-04790-1
Drakopoulos G, Spyrou E, Voutos Y, Mylonas P, Manolopoulos Y, Papadopoulos G, Stassopoulou A, Dionysiou I, Kyriakides I, Tsapatsoulis N (2019). A semantically annotated JSON metadata structure for open linked cultural data in Neo4j. Proceedings of the 23rd Pan-Hellenic Conference on Informatics, pp. 81-88. Online publication date: 28-Nov-2019.
https://dl.acm.org/doi/10.1145/3368640.3368659
Drakopoulos G, Spyrou E, Mylonas P (2019). Tensor Clustering: A Review. 2019 14th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), pp. 1-6. Online publication date: Jun-2019.
https://doi.org/10.1109/SMAP.2019.8864898
Drakopoulos G, Liapakis X, Tzimas G, Mylonas P (2018). A Graph Resilience Metric Based On Paths: Higher Order Analytics With GPU. 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 884-891. Online publication date: Nov-2018.
https://doi.org/10.1109/ICTAI.2018.00138
Drakopoulos G, Kanavos A, Tsakalidis K (2017). Fuzzy Random Walkers with Second Order Bounds: An Asymmetric Analysis. Algorithms 10:2 (40). Online publication date: 30-Mar-2017.
https://doi.org/10.3390/a10020040
