Data Quality Improvement by Constrained Splitting

Bronselaer, Antoon; De Tré, Guy

doi:10.1007/978-3-319-11313-5_48


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 322)


Abstract

In relational databases, the schema provides the context in which the data should be interpreted. The quality of a relational database therefore depends strongly on the assumption that the data fit this context. In this paper, we investigate the case where the information carried by an attribute value exceeds the framework provided by the schema. We show that such an information overflow can have two orthogonal causes: (i) data about multiple attributes are stored jointly as one attribute, and (ii) data about multiple tuples are stored jointly as one tuple. Storing information in this way degrades the quality of the database. We investigate how data quality can be improved by a split operator. The major difficulty is taking into account the constraints that are present in a relational database. A generic algorithm is provided and tested on the well-known Cora dataset.
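
The two causes of information overflow named in the abstract can be illustrated with a minimal Python sketch. This is not the algorithm from the chapter: the function names, the regular expression, the separator, and the sample row are all assumptions made up for illustration, and the only constraint checked is a simple key-uniqueness rule.

```python
import re

def split_attributes(row, source, pattern, targets):
    """Cause (i): one attribute value holds data for several attributes.
    Hypothetical helper: distributes regex groups over the target attributes."""
    match = re.fullmatch(pattern, row[source])
    if match is None:
        return dict(row)  # value already fits the schema; nothing to split
    out = dict(row)
    out.update(zip(targets, (g.strip() for g in match.groups())))
    return out

def split_tuples(row, attribute, separator):
    """Cause (ii): one tuple holds data for several tuples.
    Hypothetical helper: emits one copy of the row per separated value."""
    parts = [p.strip() for p in row[attribute].split(separator)]
    return [{**row, attribute: p} for p in parts if p]

def satisfies_key(rows, key):
    """Illustrative constraint check: the key attributes must stay unique."""
    seen = [tuple(r[k] for k in key) for r in rows]
    return len(seen) == len(set(seen))

# Made-up sample row, not taken from the Cora dataset.
row = {"id": 1, "author": "A. Smith; B. Jones",
       "venue": "Some Workshop, pp. 10-20", "pages": ""}

# Attribute-level split: page range is moved out of the venue value.
row = split_attributes(row, "venue",
                       r"(.*),\s*pp\.\s*(\d+\s*-\s*\d+)", ["venue", "pages"])

# Tuple-level split: one row per author.
rows = split_tuples(row, "author", ";")

print(rows)
print(satisfies_key(rows, ["id"]))            # key on id alone is violated
print(satisfies_key(rows, ["id", "author"]))  # a composite key still holds
```

In this sketch the tuple-level split duplicates the `id` value, so a key defined on `id` alone no longer holds. That is one simple instance of why a split has to take the constraints of the database into account.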



Author information

Authors and Affiliations

Dept. of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, B-9000, Ghent, Belgium
Antoon Bronselaer & Guy De Tré


Corresponding author

Correspondence to Antoon Bronselaer.

Editor information

Editors and Affiliations

School of Computing and Communications, Lancaster University, Lancaster, United Kingdom
P. Angelov
Institute of Biophysics and Biomedical Engineering, Bulgarian Academy of Sciences, Sofia, Bulgaria
K.T. Atanassov
Intelligent Systems Department, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
L. Doukovska
University of Chemical Technology and Metallurgy, Sofia, Bulgaria
M. Hadjiski
University of Library Studies and IT (ULSIT), Sofia, Bulgaria
V. Jotsov
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
J. Kacprzyk
Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand
N. Kasabov
Intelligent Systems Laboratory, Prof. Assen Zlatarov University Faculty of Technical Sciences, Bourgas, Bulgaria
S. Sotirov
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
E. Szmidt
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
S. Zadrożny


About this paper

Cite this paper

Bronselaer, A., De Tré, G. (2015). Data Quality Improvement by Constrained Splitting. In: Angelov, P., et al. Intelligent Systems'2014. Advances in Intelligent Systems and Computing, vol 322. Springer, Cham. https://doi.org/10.1007/978-3-319-11313-5_48


DOI: https://doi.org/10.1007/978-3-319-11313-5_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11312-8
Online ISBN: 978-3-319-11313-5
eBook Packages: Engineering, Engineering (R0)
