EISA: An Efficient Information Theoretical Approach to Value Segmentation in Large Databases

  • Conference paper
Web Technologies and Applications (APWeb 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8709)

Abstract

Value disparity is a widely known problem that contributes to poor data quality and raises many issues in data integration tasks. Value disparity, also known as column heterogeneity, occurs when the same entity is represented by disparate values, often within the same column of a database table. A first step in overcoming value disparity is to identify the distinct segments. This is a highly challenging task because of the high number of features that define a particular segment and the need to undertake value comparisons, whose number can grow exponentially in large databases. In this paper, we propose an efficient information theoretical approach to value segmentation, namely EISA. EISA not only reduces the number of relevant features but also compresses the values to be segmented. We have applied our method to three datasets of varying sizes. Our experimental evaluation of the method demonstrates a high level of accuracy with reasonable efficiency.
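
To make the notion of value disparity concrete, the following is a minimal sketch of a column that mixes two kinds of values, grouped by a coarse character-class signature. It is not the EISA method, whose details are not given in the abstract. The functions signature, segment and signature_entropy, and the sample column, are hypothetical names and data introduced only for illustration.

```python
import math
from collections import defaultdict


def signature(value: str) -> str:
    """Coarse pattern of a value (illustrative assumption, not EISA's features).

    Digits map to '9', letters to 'A', other characters are kept,
    and runs of the same symbol are collapsed into one.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            sym = "9"
        elif ch.isalpha():
            sym = "A"
        else:
            sym = ch
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)


def segment(values):
    """Group values that share the same signature."""
    groups = defaultdict(list)
    for v in values:
        groups[signature(v)].append(v)
    return dict(groups)


def signature_entropy(values):
    """Shannon entropy (in bits) of the signature distribution.

    Zero when every value has the same signature; larger when the
    column mixes several kinds of values.
    """
    n = len(values)
    sizes = [len(g) for g in segment(values).values()]
    return -sum((s / n) * math.log2(s / n) for s in sizes)


# Hypothetical column mixing phone numbers and e-mail addresses.
column = ["555-1234", "555-9876", "bob@example.com", "amy@example.org"]
segments = segment(column)
heterogeneity = signature_entropy(column)
```

Here segments separates the phone-like values from the e-mail-like values, and the entropy gives a rough single-number indication of how mixed the column is. A hand-written signature of this kind does not address the feature explosion or the cost of value comparisons that the abstract identifies as the hard part of the problem.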


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, W., Sadiq, S., Zhou, X. (2014). EISA: An Efficient Information Theoretical Approach to Value Segmentation in Large Databases. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_20

  • DOI: https://doi.org/10.1007/978-3-319-11116-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11115-5

  • Online ISBN: 978-3-319-11116-2

  • eBook Packages: Computer Science, Computer Science (R0)
