Cross-Checking Data Sources in MapReduce

  • Conference paper
New Trends in Databases and Information Systems (ADBIS 2015)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 539))

Included in the following conference series:

  • East European Conference on Advances in Databases and Information Systems

Abstract

Fact checking across multiple sources has been investigated from many angles, and the complexity and diversity of the problem call for a wide range of methods and techniques [1]. Fact checking tasks are not easy to perform and, most importantly, it is not clear what kind of computations they involve. A fact checking task usually involves a large number of data sources that describe the same entities, but we do not know which source holds the correct information, or which has any information at all about the query of interest [2]. A join among all or some of the data sources can guide a fact checking process. However, when this join is performed in a distributed computational environment such as MapReduce, it is not obvious how to distribute the records of the data sources to the reduce tasks efficiently, so that any subset of the sources can be joined in a single MapReduce job. In this paper, we show that the nature of such sources (they describe similar things) makes it possible to distribute the records with low replication. We also show that the multiway join algorithm in [3] can be implemented efficiently in MapReduce when the relations in the join have large overlaps in their schemas (i.e., they share a large number of attributes).
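
To make the record distribution problem in the abstract concrete, the following is a minimal, single-process Python sketch of the general share-based idea behind the multiway join in [3]: each chosen join attribute is hashed into a number of buckets, reducers are identified by one bucket per attribute, and a record is copied along every axis whose attribute its schema lacks. This is an illustration under stated assumptions, not the implementation from the paper. The function names (`bucket`, `reducer_keys`, `natural_join`, `shares_join`), the CRC32 hash, the toy relations, and the share sizes are all hypothetical choices made for this example.

```python
import zlib
from collections import defaultdict
from itertools import product


def bucket(value, n):
    """Hash a value into one of n buckets (CRC32 is an arbitrary, assumed choice)."""
    return zlib.crc32(repr(value).encode("utf-8")) % n


def reducer_keys(record, schema, shares):
    """List every reducer coordinate that must receive `record`.

    `shares` maps a join attribute to its number of buckets. An attribute
    present in the schema fixes one coordinate; a missing attribute forces
    a copy of the record to every bucket along that axis.
    """
    axes = []
    for attr, n in shares.items():
        if attr in schema:
            axes.append([bucket(record[schema.index(attr)], n)])
        else:
            axes.append(range(n))
    return list(product(*axes))


def natural_join(relations):
    """Naive natural join of [(schema, tuples), ...]; returns a list of dicts."""
    results = [{}]
    for schema, tuples in relations:
        extended = []
        for partial in results:
            for t in tuples:
                row = dict(zip(schema, t))
                # Keep the pair only if it agrees on every shared attribute.
                if all(partial.get(a, v) == v for a, v in row.items()):
                    extended.append({**partial, **row})
        results = extended
    return results


def shares_join(relations, shares):
    """Simulate one map and one reduce phase; return (output rows, record copies shipped)."""
    reducers = defaultdict(lambda: [[] for _ in relations])
    shipped = 0
    for i, (schema, tuples) in enumerate(relations):      # "map" phase
        for t in tuples:
            for key in reducer_keys(t, schema, shares):
                reducers[key][i].append(t)
                shipped += 1
    output = []
    for parts in reducers.values():                        # "reduce" phase
        local = [(schema, parts[i]) for i, (schema, _) in enumerate(relations)]
        output.extend(natural_join(local))
    return output, shipped


# Toy chain join: the schemas overlap in only one attribute per pair.
chain = [
    (("A", "B"), [(1, 2), (3, 4)]),
    (("B", "C"), [(2, 5), (4, 6)]),
    (("C", "D"), [(5, 7), (6, 8)]),
]
# Toy sources with heavily overlapping schemas (both contain A and B).
overlapping = [
    (("A", "B", "C"), [(1, 2, 3), (4, 5, 6)]),
    (("A", "B", "D"), [(1, 2, 7), (4, 5, 8), (9, 9, 9)]),
]

chain_rows, chain_shipped = shares_join(chain, {"B": 2, "C": 2})
overlap_rows, overlap_shipped = shares_join(overlapping, {"A": 2, "B": 2})
print(chain_shipped, overlap_shipped)
```

In this toy run the chain join ships 10 record copies for 6 input records, because the first and last relations each lack one of the hashed attributes. The two overlapping sources ship exactly 5 copies for 5 input records, since every hashed attribute appears in both schemas. That difference is the low-replication effect the abstract attributes to sources with large schema overlaps.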

This work was supported by the project Handling Uncertainty in Data Intensive Applications, co-financed by the European Union (European Social Fund - ESF) and Greek national funds, through the Operational Program “Education and Lifelong Learning”, under the program THALES.

References

  1. Dong, X.L., Gabrilovich, E., Heitz, G., Horn, W., Murphy, K., Sun, S., Zhang, W.: From data fusion to knowledge fusion. Proceedings of the VLDB Endowment 7(10) (2014)

  2. Dong, X.L., Berti-Equille, L., Srivastava, D.: Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment 2(1), 550–561 (2009)

  3. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 99–110. ACM (2010)

  4. Juster, F.T., Smith, J.P.: Improving the quality of economic data: Lessons from the HRS and AHEAD. Journal of the American Statistical Association 92(440), 1268–1278 (1997)

  5. Graham, J.W.: Missing data analysis: Making it work in the real world. Annual Review of Psychology 60, 549–576 (2009)

  6. Acock, A.C.: Working with missing values. Journal of Marriage and Family 67(4), 1012–1028 (2005)

  7. Grzymała-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001)

  8. Padmanabhan, B., Zheng, Z., Kimbrough, S.O.: Personalization from incomplete data: what you don’t know can hurt. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 154–163. ACM (2001)

  9. Magnani, M.: Techniques for dealing with missing data in knowledge discovery tasks. Obtido 15(01), 2007 (2004). http://magnanim.web.cs.unibo.it/index.html

  10. Li, X., Dong, X.L., Lyons, K., Meng, W., Srivastava, D.: Truth finding on the deep web: is the problem solved? Proceedings of the VLDB Endowment 6(2), 97–108 (2012)

  11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38 (1977)

  12. Afrati, F.N., Delorey, D., Pasumansky, M., Ullman, J.D.: Storing and querying tree-structured records in Dremel. PVLDB 7(12), 1131–1142 (2014)

  13. Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. Commun. ACM 54(6), 114–123 (2011)

  14. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)

  15. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 495–506. ACM (2010)

  16. Kolb, L., Thor, A., Rahm, E.: Load balancing for MapReduce-based entity resolution. In: 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 618–629. IEEE (2012)

  17. McNeill, N., Kardes, H., Borthwick, A.: Dynamic record blocking: efficient linking of massive databases in MapReduce. In: Proceedings of the 10th International Workshop on Quality in Databases (QDB) (2012)

  18. Mestre, D.G., Pires, C.E.: An adaptive blocking approach for entity matching with MapReduce

  19. Kolb, L., Rahm, E.: Parallel entity resolution with Dedoop. Datenbank-Spektrum 13(1), 23–32 (2013)

  20. Kolb, L., Thor, A., Rahm, E.: Don’t match twice: redundancy-free similarity computation with MapReduce. In: Proceedings of the Second Workshop on Data Analytics in the Cloud, pp. 1–5. ACM (2013)

  21. U.S. General Services Administration: U.S. government’s open data (2013). http://www.data.gov/ (accessed June 19, 2015)

Author information

Corresponding author

Correspondence to Nikos Stasinopoulos.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Afrati, F., Momani, Z., Stasinopoulos, N. (2015). Cross-Checking Data Sources in MapReduce. In: Morzy, T., Valduriez, P., Bellatreche, L. (eds) New Trends in Databases and Information Systems. ADBIS 2015. Communications in Computer and Information Science, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-319-23201-0_19

  • DOI: https://doi.org/10.1007/978-3-319-23201-0_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23200-3

  • Online ISBN: 978-3-319-23201-0

  • eBook Packages: Computer Science, Computer Science (R0)
