Cross-Checking Data Sources in MapReduce

Afrati, Foto; Momani, Zaid; Stasinopoulos, Nikos

doi:10.1007/978-3-319-23201-0_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 539)

Included in the following conference series: East European Conference on Advances in Databases and Information Systems

Abstract

Fact checking from multiple sources has been investigated from many angles, and the complexity and diversity of the problem call for a wide range of methods and techniques [1]. Fact-checking tasks are not easy to perform and, most importantly, it is not clear what kind of computations they involve. Fact checking usually involves a large number of data sources that describe the same things, but we do not know which source holds the correct information, or which has any information at all about the query of interest [2]. A join among all or some of the data sources can guide us through a fact-checking process. However, when we want to perform this join in a distributed computational environment such as MapReduce, it is not obvious how to distribute the records of the data sources to the reduce tasks efficiently so that any subset of the sources can be joined in a single MapReduce job. In this paper, we show that the nature of such sources (they describe similar things) offers this opportunity, namely to distribute the records with low replication. We also show that the multiway algorithm in [3] can be implemented efficiently in MapReduce when the relations in the join have large overlaps in their schemas (i.e., they share a large number of attributes).
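To make the replication argument concrete, the following is a minimal single-process sketch of the general idea behind a one-round multiway join in the style of [3], applied to a chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D). It is not the authors' implementation: the relation names, the function names (`bucket`, `map_phase`, `reduce_phase`, `multiway_join`) and the share parameters are assumptions chosen for illustration. Reducers form a grid indexed by a bucket of B and a bucket of C; a tuple is replicated only along the grid dimensions whose attribute it lacks, so a relation that contains every join attribute is sent to exactly one reducer.

```python
from collections import defaultdict
from itertools import product


def bucket(value, shares):
    """Toy hash function mapping a join-attribute value to one of `shares` buckets."""
    return hash(value) % shares


def map_phase(R, S, T, b_shares, c_shares):
    """Route tuples of R(A,B), S(B,C), T(C,D) to a b_shares x c_shares reducer grid."""
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for a, b in R:
        # R has no C attribute: replicate to every C bucket in its B row.
        for j in range(c_shares):
            reducers[(bucket(b, b_shares), j)]["R"].append((a, b))
    for b, c in S:
        # S has both join attributes: it goes to exactly one reducer.
        reducers[(bucket(b, b_shares), bucket(c, c_shares))]["S"].append((b, c))
    for c, d in T:
        # T has no B attribute: replicate to every B bucket in its C column.
        for i in range(b_shares):
            reducers[(i, bucket(c, c_shares))]["T"].append((c, d))
    return reducers


def reduce_phase(reducers):
    """Join locally at each reducer and collect the results."""
    out = set()
    for parts in reducers.values():
        for (a, b), (b2, c), (c2, d) in product(parts["R"], parts["S"], parts["T"]):
            if b == b2 and c == c2:
                out.add((a, b, c, d))
    return out


def multiway_join(R, S, T, b_shares=2, c_shares=2):
    """Simulate the whole job: one map phase followed by one reduce phase."""
    return reduce_phase(map_phase(R, S, T, b_shares, c_shares))


def tuples_sent(reducers):
    """Communication cost: total number of tuple copies shipped to reducers."""
    return sum(len(rel) for parts in reducers.values() for rel in parts.values())
```

In this sketch the number of tuple copies shipped is |R|·c_shares + |S| + |T|·b_shares. The more join attributes a relation carries, the fewer grid dimensions it has to be replicated along, which is the intuition behind low replication for sources with heavily overlapping schemas.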

This work was supported by the project Handling Uncertainty in Data Intensive Applications, co-financed by the European Union (European Social Fund - ESF) and Greek national funds, through the Operational Program “Education and Lifelong Learning”, under the program THALES.

References

1. Dong, X.L., Gabrilovich, E., Heitz, G., Horn, W., Murphy, K., Sun, S., Zhang, W.: From data fusion to knowledge fusion. Proceedings of the VLDB Endowment 7(10) (2014)
2. Dong, X.L., Berti-Equille, L., Srivastava, D.: Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment 2(1), 550–561 (2009)
3. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 99–110. ACM (2010)
4. Juster, F.T., Smith, J.P.: Improving the quality of economic data: Lessons from the HRS and AHEAD. Journal of the American Statistical Association 92(440), 1268–1278 (1997)
5. Graham, J.W.: Missing data analysis: Making it work in the real world. Annual Review of Psychology 60, 549–576 (2009)
6. Acock, A.C.: Working with missing values. Journal of Marriage and Family 67(4), 1012–1028 (2005)
7. Grzymała-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001)
8. Padmanabhan, B., Zheng, Z., Kimbrough, S.O.: Personalization from incomplete data: what you don’t know can hurt. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 154–163. ACM (2001)
9. Magnani, M.: Techniques for dealing with missing data in knowledge discovery tasks. Obtido 15(01), 2007 (2004). http://magnanim.web.cs.unibo.it/index.html
10. Li, X., Dong, X.L., Lyons, K., Meng, W., Srivastava, D.: Truth finding on the deep web: is the problem solved? Proceedings of the VLDB Endowment 6(2), 97–108 (2012)
11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38 (1977)
12. Afrati, F.N., Delorey, D., Pasumansky, M., Ullman, J.D.: Storing and querying tree-structured records in Dremel. PVLDB 7(12), 1131–1142 (2014)
13. Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. Commun. ACM 54(6), 114–123 (2011)
14. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
15. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 495–506. ACM (2010)
16. Kolb, L., Thor, A., Rahm, E.: Load balancing for MapReduce-based entity resolution. In: 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 618–629. IEEE (2012)
17. McNeill, N., Kardes, H., Borthwick, A.: Dynamic record blocking: efficient linking of massive databases in MapReduce. In: Proceedings of the 10th International Workshop on Quality in Databases (QDB) (2012)
18. Mestre, D.G., Pires, C.E.: An adaptive blocking approach for entity matching with MapReduce
19. Kolb, L., Rahm, E.: Parallel entity resolution with Dedoop. Datenbank-Spektrum 13(1), 23–32 (2013)
20. Kolb, L., Thor, A., Rahm, E.: Don’t match twice: redundancy-free similarity computation with MapReduce. In: Proceedings of the Second Workshop on Data Analytics in the Cloud, pp. 1–5. ACM (2013)
21. U.S. General Services Administration: U.S. government’s open data (2013). http://www.data.gov/ (accessed June 19, 2015)

Author information

Authors and Affiliations

National Technical University of Athens, Kesariani, Greece
Foto Afrati, Zaid Momani & Nikos Stasinopoulos

Corresponding author

Correspondence to Nikos Stasinopoulos.

Editor information

Editors and Affiliations

Poznan University of Technology, Poznan, Poland
Tadeusz Morzy
INRIA, Montpellier, France
Patrick Valduriez
National Engineering School for Mechanics and Aerotechnics, Poitiers, France
Ladjel Bellatreche

About this paper

Cite this paper

Afrati, F., Momani, Z., Stasinopoulos, N. (2015). Cross-Checking Data Sources in MapReduce. In: Morzy, T., Valduriez, P., Bellatreche, L. (eds) New Trends in Databases and Information Systems. ADBIS 2015. Communications in Computer and Information Science, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-319-23201-0_19

DOI: https://doi.org/10.1007/978-3-319-23201-0_19
Published: 28 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23200-3
Online ISBN: 978-3-319-23201-0
eBook Packages: Computer Science, Computer Science (R0)
