Abstract
In relational databases, the schema provides the context in which the data should be interpreted. As a consequence, the quality of a relational database depends strongly on the assumption that the data fit this context. In this paper, we investigate the case where the information carried by an attribute value exceeds the framework provided by the schema. We show that such an information overflow can have two orthogonal causes: (i) data about multiple attributes are jointly stored as one attribute and (ii) data about multiple tuples are jointly stored as one tuple. Such erroneous storage deteriorates the quality of the database. We investigate how data quality can be improved by means of a split operator. The major difficulty is to take into account the constraints that are present in a relational database. A generic algorithm is provided and tested on the well-known Cora dataset.
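The two causes of overflow, and the reason constraints make splitting difficult, can be illustrated with a small sketch. This is not the algorithm of the paper: the function names, the regular expression, the semicolon delimiter and the sample rows are all assumptions made for illustration only.

```python
import re


def split_attributes(value, pattern, names):
    """Cause (i): one stored value holds data about several attributes.

    Returns a dict mapping each hypothetical attribute name to its part,
    or None when the value does not match the assumed pattern.
    """
    match = re.fullmatch(pattern, value.strip())
    if match is None:
        return None
    return dict(zip(names, match.groups()))


def split_tuples(row, attribute, delimiter=";"):
    """Cause (ii): one stored tuple holds data about several tuples.

    Returns one copy of the row per delimited part of the given attribute.
    """
    parts = [p.strip() for p in row[attribute].split(delimiter) if p.strip()]
    return [{**row, attribute: part} for part in parts]


def satisfies_key(rows, key):
    """Check a uniqueness (key) constraint over the given attributes."""
    seen = set()
    for row in rows:
        key_value = tuple(row[a] for a in key)
        if key_value in seen:
            return False
        seen.add(key_value)
    return True


# Cause (i): publisher and year stored together in one attribute value.
parts = split_attributes("Springer, 2006", r"(.+),\s*(\d{4})", ("publisher", "year"))

# Cause (ii): two authors stored together in one tuple.
rows = split_tuples({"id": 1, "author": "Batini, C.; Scannapieca, M."}, "author")

# A naive tuple split duplicates the old key value, so the original key
# constraint on "id" no longer holds; a wider key would still hold.
key_on_id_holds = satisfies_key(rows, ("id",))
key_on_id_author_holds = satisfies_key(rows, ("id", "author"))
```

In this sketch the tuple split yields two rows that share the same `id`, so a key on `id` alone is violated. This is the kind of conflict a constraint-aware split operator has to resolve.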
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Bronselaer, A., De Tré, G. (2015). Data Quality Improvement by Constrained Splitting. In: Angelov, P., et al. Intelligent Systems'2014. Advances in Intelligent Systems and Computing, vol 322. Springer, Cham. https://doi.org/10.1007/978-3-319-11313-5_48
DOI: https://doi.org/10.1007/978-3-319-11313-5_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11312-8
Online ISBN: 978-3-319-11313-5
eBook Packages: Engineering, Engineering (R0)