Data Quality Improvement by Constrained Splitting

  • Conference paper
Intelligent Systems'2014

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 322)

Abstract

In relational databases, the schema provides the context in which the data should be interpreted. As a consequence, the quality of a relational database depends strongly on the assumption that the data fit this context description. In this paper, we investigate the case where the information provided by an attribute value exceeds the framework provided by the schema. We show that such an information overflow can have two orthogonal causes: (i) data about multiple attributes are jointly stored as one attribute and (ii) data about multiple tuples are jointly stored as one tuple. Such erroneous storage deteriorates the quality of the database. We investigate how data quality can be improved by a split operator. The main difficulty is taking into account the constraints that are present in a relational database. A generic algorithm is provided and tested on the well-known Cora dataset.
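To make the two causes of information overflow concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. The function names `split_attributes` and `split_tuples`, the separator characters, and the sample rows are all hypothetical. The only constraint modelled here is that the number of split parts must match the number of target attributes.

```python
from typing import Dict, List

Row = Dict[str, str]


def split_attributes(row: Row, source: str, targets: List[str], sep: str = ",") -> Row:
    """Case (i): one attribute value holds data about several attributes.

    Splits row[source] on `sep` and distributes the parts over `targets`.
    Toy constraint: the number of parts must equal the number of targets;
    otherwise the row is returned unchanged.
    """
    parts = [p.strip() for p in row[source].split(sep)]
    if len(parts) != len(targets):
        return dict(row)
    out = {k: v for k, v in row.items() if k != source}
    out.update(zip(targets, parts))
    return out


def split_tuples(row: Row, attribute: str, sep: str = ";") -> List[Row]:
    """Case (ii): one tuple holds data about several tuples.

    Emits one row per non-empty part of row[attribute]; all other
    attribute values are copied as-is. A row with nothing to split
    is returned as a single-element list.
    """
    parts = [p.strip() for p in row[attribute].split(sep) if p.strip()]
    return [{**row, attribute: p} for p in parts] or [dict(row)]


# Hypothetical sample rows, invented for illustration only.
overloaded = {"id": "1", "venue": "Machine Learning, 1999"}
print(split_attributes(overloaded, "venue", ["journal", "year"]))

merged = {"id": "2", "author": "A. Smith; B. Jones"}
print(split_tuples(merged, "author"))
```

In the second call, both resulting rows keep the value `"2"` for `id`. If `id` were a key, this naive split would violate the key constraint, which hints at why the constraints of the database are the main difficulty for a split operator.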

Author information

Corresponding author

Correspondence to Antoon Bronselaer.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Bronselaer, A., De Tré, G. (2015). Data Quality Improvement by Constrained Splitting. In: Angelov, P., et al. Intelligent Systems'2014. Advances in Intelligent Systems and Computing, vol 322. Springer, Cham. https://doi.org/10.1007/978-3-319-11313-5_48

  • DOI: https://doi.org/10.1007/978-3-319-11313-5_48

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11312-8

  • Online ISBN: 978-3-319-11313-5

  • eBook Packages: Engineering, Engineering (R0)
