Abstract
Fact checking from multiple sources has been investigated from many different angles, and the complexity and diversity of the problem call for a wide range of methods and techniques [1]. Fact-checking tasks are not easy to perform and, most importantly, it is not clear what kind of computations they involve. Fact checking usually involves a large number of data sources that talk about the same thing, but we do not know which source holds the correct information, or which has any information at all about the query we care about [2]. A join among all or some of the data sources can guide us through a fact-checking process. However, when we want to perform this join in a distributed computational environment such as MapReduce, it is not obvious how to distribute the records of the data sources efficiently to the reduce tasks so that any subset of the sources can be joined in a single MapReduce job. In this paper, we show that the nature of such sources (they talk about similar things) makes it possible to distribute the records with low replication. We also show that the multiway join algorithm in [3] can be implemented efficiently in MapReduce when the relations in the join have large overlaps in their schemas (i.e., they share a large number of attributes).
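As an illustration of the distribution problem described above, the following is a minimal Python sketch of share-based record routing in the spirit of the multiway join in [3]; it is not the implementation from this paper. The reducers are assumed to form a grid with one dimension per join attribute. A record is hashed on the join attributes it contains and replicated along the dimensions of the attributes it lacks, so a record whose schema covers most of the join attributes is replicated little. The function names `bucket` and `reducer_keys`, the relation schemas R(A, B), S(B, C), T(C, D), and the share values are hypothetical choices made for this example.

```python
import zlib
from itertools import product


def bucket(value, shares):
    """Deterministically hash a join-attribute value into one of `shares` buckets."""
    return zlib.crc32(repr(value).encode("utf-8")) % shares


def reducer_keys(record, shares):
    """Return the reducer grid coordinates that must receive `record`.

    record: dict mapping attribute name -> value (one tuple of one source).
    shares: dict mapping join attribute -> number of buckets (hypothetical values);
            the product of the shares is the number of reducers.
    """
    axes = []
    for attr, n in shares.items():
        if attr in record:
            # Attribute present: the record goes to exactly one bucket on this axis.
            axes.append([bucket(record[attr], n)])
        else:
            # Attribute missing: the record is replicated across the whole axis.
            axes.append(range(n))
    return list(product(*axes))


# Hypothetical chain join R(A, B), S(B, C), T(C, D) with 2 x 3 = 6 reducers.
shares = {"B": 2, "C": 3}
r_keys = reducer_keys({"A": 1, "B": 10}, shares)   # lacks C: sent to 3 reducers
s_keys = reducer_keys({"B": 10, "C": 20}, shares)  # has B and C: sent to 1 reducer
t_keys = reducer_keys({"C": 20, "D": 5}, shares)   # lacks B: sent to 2 reducers

# The three joining records meet at exactly one reducer.
meeting_point = set(r_keys) & set(s_keys) & set(t_keys)
print(len(r_keys), len(s_keys), len(t_keys), len(meeting_point))
```

In this sketch the replication of a record equals the product of the shares of the join attributes it does not contain, which is why large schema overlaps keep the communication cost low.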
This work was supported by the project Handling Uncertainty in Data Intensive Applications, co-financed by the European Union (European Social Fund - ESF) and Greek national funds, through the Operational Program “Education and Lifelong Learning”, under the program THALES.
References
Dong, X.L., Gabrilovich, E., Heitz, G., Horn, W., Murphy, K., Sun, S., Zhang, W.: From data fusion to knowledge fusion. Proceedings of the VLDB Endowment 7(10) (2014)
Dong, X.L., Berti-Equille, L., Srivastava, D.: Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment 2(1), 550–561 (2009)
Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 99–110. ACM (2010)
Juster, F.T., Smith, J.P.: Improving the quality of economic data: Lessons from the hrs and ahead. Journal of the American Statistical Association 92(440), 1268–1278 (1997)
Graham, J.W.: Missing data analysis: Making it work in the real world. Annual Review of Psychology 60, 549–576 (2009)
Acock, A.C.: Working with missing values. Journal of Marriage and Family 67(4), 1012–1028 (2005)
Grzymała-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001)
Padmanabhan, B., Zheng, Z., Kimbrough, S.O.: Personalization from incomplete data: what you don’t know can hurt. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 154–163. ACM (2001)
Magnani, M.: Techniques for dealing with missing data in knowledge discovery tasks. Obtido 15(01), 2007 (2004). http://magnanim.web.cs.unibo.it/index.html
Li, X., Dong, X.L., Lyons, K., Meng, W., Srivastava, D.: Truth finding on the deep web: is the problem solved? Proceedings of the VLDB Endowment 6(2), 97–108 (2012)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (methodological), 1–38 (1977)
Afrati, F.N., Delorey, D., Pasumansky, M., Ullman, J.D.: Storing and querying tree-structured records in dremel. PVLDB 7(12), 1131–1142 (2014)
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. Commun. ACM 54(6), 114–123 (2011)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 495–506. ACM (2010)
Kolb, L., Thor, A., Rahm, E.: Load balancing for mapreduce-based entity resolution. In: 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 618–629. IEEE (2012)
McNeill, N., Kardes, H., Borthwick, A.: Dynamic record blocking: efficient linking of massive databases in mapreduce. In: Proceedings of the 10th International Workshop on Quality in Databases (QDB) (2012)
Mestre, D.G., Pires, C.E.: An adaptive blocking approach for entity matching with mapreduce
Kolb, L., Rahm, E.: Parallel entity resolution with dedoop. Datenbank-Spektrum 13(1), 23–32 (2013)
Kolb, L., Thor, A., Rahm, E.: Don’t match twice: redundancy-free similarity computation with mapreduce. In: Proceedings of the Second Workshop on Data Analytics in the Cloud, pp. 1–5. ACM (2013)
U.S. General Services Administration: U.S. government's open data (2013). http://www.data.gov/ (accessed June 19, 2015)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Afrati, F., Momani, Z., Stasinopoulos, N. (2015). Cross-Checking Data Sources in MapReduce. In: Morzy, T., Valduriez, P., Bellatreche, L. (eds) New Trends in Databases and Information Systems. ADBIS 2015. Communications in Computer and Information Science, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-319-23201-0_19
DOI: https://doi.org/10.1007/978-3-319-23201-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23200-3
Online ISBN: 978-3-319-23201-0
eBook Packages: Computer Science, Computer Science (R0)