Efficient Unique Column Combinations Discovery Based on Data Distribution

  • Conference paper
Web-Age Information Management (WAIM 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9658)

Abstract

Discovering all unique column combinations in a relation is a fundamental research problem for modern data management and knowledge discovery applications. With the rapid growth of data volume and the popularity of distributed platforms, several algorithms attempt to discover uniques in large-scale datasets. However, their performance is not always satisfactory on datasets that have few unique values in each column. This paper proposes a parallel algorithm to discover unique column combinations in large-scale datasets on Hadoop. We first construct a prefix tree to represent all unique candidates. Then we parallelize the verification of candidates in the same layer of the prefix tree. Two parallel strategies are available: parallelizing across all subtrees, or parallelizing within a single subtree. The parallel strategies and pruning methods adapt automatically to the data distribution. Experimental results demonstrate the advantages of the proposed method.
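
As a rough illustration of the layer-by-layer candidate verification outlined above, the following minimal Python sketch enumerates column combinations by size and checks each one for uniqueness. It is not the authors' algorithm: it runs sequentially in memory, does not use Hadoop, and applies only the simplest pruning rule (skipping supersets of combinations already known to be unique). The function names `is_unique` and `minimal_uniques` and the sample rows are hypothetical.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows agree on every column in `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uniques(rows, num_cols):
    """Level-wise search for minimal unique column combinations."""
    found = []
    # Each size corresponds to one layer of the candidate prefix tree.
    for size in range(1, num_cols + 1):
        # combinations() yields candidates in lexicographic (prefix) order.
        for cand in combinations(range(num_cols), size):
            # Prune: a superset of a known unique is unique but not minimal.
            if any(set(u) <= set(cand) for u in found):
                continue
            if is_unique(rows, cand):
                found.append(cand)
    return found


# Hypothetical sample relation: (name, city, age).
rows = [
    ("alice", "NY", 30),
    ("bob", "NY", 30),
    ("alice", "LA", 25),
    ("bob", "LA", 30),
]
print(minimal_uniques(rows, 3))  # [(0, 1)]
```

Candidates of the same size cannot be subsets of one another, so the checks within one layer depend only on results from earlier layers. That independence is what makes per-layer verification a natural unit to parallelize.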

Acknowledgments

This work is partially supported by National 863 Project of China under Grant No. 2015AA015401, and Research Foundation of The Ministry of Education and China Mobile under Grant No. MCM20150507. This work is also partially supported by Tianjin Municipal Science and Technology Commission under Grant Nos. 13ZCZDGX01098, 14JCQNJC00200 and 14JCYBJC15500.

Corresponding author

Correspondence to Haiwei Zhang.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, C., Han, S., Cai, X., Zhang, H., Wen, Y. (2016). Efficient Unique Column Combinations Discovery Based on Data Distribution. In: Cui, B., Zhang, N., Xu, J., Lian, X., Liu, D. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9658. Springer, Cham. https://doi.org/10.1007/978-3-319-39937-9_35

  • DOI: https://doi.org/10.1007/978-3-319-39937-9_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-39936-2

  • Online ISBN: 978-3-319-39937-9

  • eBook Packages: Computer Science, Computer Science (R0)
