CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing

Zhang, An-Zhen; Li, Jian-Zhong; Gao, Hong; Chen, Yu-Biao; Ma, Heng-Zhao; Bah, Mohamed Jaward

doi:10.1007/s11390-018-1824-5

Published: 23 March 2018

Journal of Computer Science and Technology, Volume 33, pages 366–379 (2018)

Abstract

Recently, there has been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick approximate answer over massive data instead of making the user wait for the final exact query result, has drawn significant research attention. However, running OLA directly on duplicate data leads to incorrect query answers, because sampling from duplicate records overrepresents the duplicated data in the sample. This violates the uniform-sampling prerequisite of most statistical theories. In this paper, we propose CrowdOLA, a novel framework that integrates online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After the sample is cleaned, an unbiased estimator corrects the bias introduced by duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
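
The bias described in the abstract can be illustrated with a small sketch. It is not the CrowdOLA estimator, which this page does not specify. It uses a generic inverse-duplicate-count reweighting and assumes that the number of copies of each sampled record's entity is already known, which in practice would have to come from entity resolution on the sample. The toy records and the names `naive_sum_estimate`, `dedup_sum_estimate`, and `dup_count` are hypothetical.

```python
import random
from collections import Counter


def naive_sum_estimate(sample, n_records):
    """Scale the sample mean by the dataset size, ignoring duplicates."""
    return n_records * sum(value for _, value in sample) / len(sample)


def dedup_sum_estimate(sample, n_records, dup_count):
    """Down-weight each sampled record by its entity's duplicate count.

    Assumption: dup_count[entity] is the true number of copies of that
    entity in the full dataset.
    """
    weighted = sum(value / dup_count[entity] for entity, value in sample)
    return n_records * weighted / len(sample)


# Hypothetical toy data: (entity_id, value); entity "a" is stored three times.
records = [("a", 10.0), ("a", 10.0), ("a", 10.0), ("b", 20.0), ("c", 30.0)]
dup_count = Counter(entity for entity, _ in records)

# SUM over distinct entities: 10 + 20 + 30 = 60.
true_sum = sum({entity: value for entity, value in records}.values())

# Uniform sampling of records (with replacement) over-samples entity "a".
rng = random.Random(0)
sample = [rng.choice(records) for _ in range(20000)]

naive = naive_sum_estimate(sample, len(records))
corrected = dedup_sum_estimate(sample, len(records), dup_count)
print(f"true={true_sum:.1f} naive={naive:.1f} corrected={corrected:.1f}")
```

In this toy setting the naive estimate converges to 80 rather than 60, because entity "a" is counted three times, while the reweighted estimate converges to the sum over distinct entities.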

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
An-Zhen Zhang, Jian-Zhong Li, Hong Gao, Yu-Biao Chen, Heng-Zhao Ma & Mohamed Jaward Bah

Corresponding author

Correspondence to An-Zhen Zhang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1 (PDF, 375 KB)

About this article

Cite this article

Zhang, AZ., Li, JZ., Gao, H. et al. CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing. J. Comput. Sci. Technol. 33, 366–379 (2018). https://doi.org/10.1007/s11390-018-1824-5

Received: 26 February 2017
Revised: 29 January 2018
Published: 23 March 2018
Issue Date: March 2018
DOI: https://doi.org/10.1007/s11390-018-1824-5
