Join Query Processing in Data Quality Management

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9645)

Abstract

Data quality management is an essential problem for information systems. As a basic operation of data quality management, joins on large-scale data play an important role in document clustering. MapReduce is a programming model commonly used to process large-scale data, and many tasks can be implemented within the framework, such as data processing for search engines and machine learning. However, current implementations of MapReduce offer no efficient support for join operations. In this paper, we present a strategy for building an extended Bloom filter over a large dataset using MapReduce. We use the extended Bloom filter to improve the performance of two-way and multi-way joins.
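
To illustrate the general idea of filtering before a join, here is a minimal single-machine sketch in Python. It uses a plain Bloom filter (Bloom, 1970), not the extended Bloom filter of the paper, whose construction is not described in this abstract. The names `BloomFilter`, `bloom_join`, `num_bits`, and `num_hashes` are hypothetical and chosen only for this example.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter; an assumption standing in for the paper's extended variant."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_join(small, large, num_bits=1024, num_hashes=3):
    """Equi-join two lists of (key, value) pairs, pre-filtering `large`."""
    bf = BloomFilter(num_bits, num_hashes)
    index = {}
    for key, value in small:
        bf.add(key)
        index.setdefault(key, []).append(value)

    result = []
    for key, value in large:
        if not bf.might_contain(key):
            continue  # in a MapReduce setting, this record would not be shuffled
        for other in index.get(key, []):  # exact lookup discards false positives
            result.append((key, other, value))
    return result
```

In a MapReduce setting the filter would typically be built over one relation and applied in the map phase to the other, so that records with no possible join partner are dropped before the shuffle. The sketch keeps everything in memory only to show the filtering step.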

Acknowledgements

This paper was partially supported by National Sci-Tech Support Plan 2015BAH10F01 and NSFC grants U1509216, 61472099, and 61133002.

Author information

Corresponding author

Correspondence to Mingliang Yue.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Yue, M., Gao, H., Shi, S., Wang, H. (2016). Join Query Processing in Data Quality Management. In: Gao, H., Kim, J., Sakurai, Y. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9645. Springer, Cham. https://doi.org/10.1007/978-3-319-32055-7_27

  • DOI: https://doi.org/10.1007/978-3-319-32055-7_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32054-0

  • Online ISBN: 978-3-319-32055-7

  • eBook Packages: Computer Science, Computer Science (R0)
