Abstract
Value disparity is a widely known problem that contributes to poor data quality and raises many issues in data integration tasks. Value disparity, also known as column heterogeneity, occurs when the same entity is represented by disparate values, often within the same column of a database table. A first step in overcoming value disparity is to identify the distinct segments. This is a highly challenging task because of the high number of features that define a particular segment and the need to undertake value comparisons, the number of which can grow exponentially in large databases. In this paper, we propose EISA, an efficient information-theoretic approach to value segmentation. EISA not only reduces the number of relevant features but also compresses the set of values to be segmented. We have applied our method to three datasets of varying sizes. Our experimental evaluation demonstrates a high level of accuracy with reasonable efficiency.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, W., Sadiq, S., Zhou, X. (2014). EISA: An Efficient Information Theoretical Approach to Value Segmentation in Large Databases. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_20
DOI: https://doi.org/10.1007/978-3-319-11116-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11115-5
Online ISBN: 978-3-319-11116-2
eBook Packages: Computer Science, Computer Science (R0)