A masking index for quantifying hidden glitches

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Data glitches are errors in a dataset. They are complex entities that often span multiple attributes and records. When glitches co-occur in data, the presence of one type of glitch can hinder the detection of another type. This phenomenon is called masking. In this paper, we define two important types of masking and propose a novel, statistically rigorous indicator, called the masking index, for quantifying hidden glitches. We outline four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration. It enables a user to measure the extent of masking and hence the confidence that can be placed in the data. In this sense, it is a valuable data quality index for choosing the anomaly detection method best suited to the glitches present in a given dataset. We demonstrate the utility and effectiveness of the masking index through extensive experiments on synthetic and real-world datasets.
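To make the first case (outliers masked by missing values) concrete, the following is a minimal sketch and not the masking index defined in the paper. It assumes a toy table with hypothetical columns `x` and `y`, a preprocessing step that drops every record containing a missing value, and a simple median/MAD rule for flagging outliers; all names, data, and thresholds are illustrative assumptions.

```python
from statistics import median


def mad_outlier_ids(rows, column, threshold=3.5):
    """Return ids of rows whose value in `column` lies far from the median.

    Uses a median/MAD rule (an assumed detector, chosen for illustration).
    """
    values = [row[column] for row in rows]
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return set()  # no spread, so the rule cannot flag anything
    return {
        row["id"]
        for row in rows
        if 0.6745 * abs(row[column] - center) / mad > threshold
    }


# Hypothetical toy data: record 6 is extreme in x and also has a missing y.
rows = [
    {"id": 1, "x": 10.0, "y": 1.0},
    {"id": 2, "x": 11.0, "y": 2.0},
    {"id": 3, "x": 9.0, "y": 1.5},
    {"id": 4, "x": 10.5, "y": 2.5},
    {"id": 5, "x": 9.5, "y": 1.0},
    {"id": 6, "x": 95.0, "y": None},
]

# Treating missing values first: keep only complete records.
complete = [r for r in rows if all(v is not None for v in r.values())]

# Outliers that remain detectable after the missing-value treatment.
visible = mad_outlier_ids(complete, "x")

# Outliers detectable when every record with an observed x is considered.
observed_x = [r for r in rows if r["x"] is not None]
actual = mad_outlier_ids(observed_x, "x")

# Outliers hidden by the missing-value glitch.
masked = actual - visible
print("visible:", visible, "masked:", masked)
```

In this toy setting, record 6 is flagged when all observed values of `x` are examined, but it is never seen by the detector once incomplete records are dropped. Counting such hidden records is only a crude stand-in for the statistically grounded index the abstract describes.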

Notes

  1. Data set from OpenData by Socrata retrieved on March 26, 2013: https://opendata.socrata.com/Government/Unclaimed-bank-accounts/.

  2. http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.

  3. Permanent Service for Mean Sea Level (PSMSL): http://www.psmsl.org/.

References

  1. Acuna E, Rodriguez CA (2004) Meta analysis study of outlier detection methods in classification, IPSI

  2. Barnett V, Lewis T (1994) Outliers in statistical data. Wiley, New York

  3. Ben-Gal I (2005) Outlier detection. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook: a complete guide for practitioners and researchers. Kluwer, Dordrecht

  4. Berti-Equille L, Dasu T, Srivastava D (2011) Discovery of complex glitch patterns: a novel approach to quantitative data cleaning, ICDE, pp 733–744

  5. Blake R, Mangiameli P (2011) The effects and interactions of data quality and problem complexity on classification. J Data Inf Qual 2(2):8:1–8:28

  6. Dasu T, Loh JM (2012) Statistical distortion: consequences of data cleaning. PVLDB 5(11):1674–1683

  7. Davies L, Gather U (1993) The identification of multiple outliers. J Am Stat Assoc 88(423):782–792

  8. Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26

  9. Hawkins D (1980) Identification of outliers. Chapman and Hall, London

  10. Iglewicz B, Martinez J (1982) Outlier detection using robust measures of scale. J Stat Comput Simul 15:285–293

  11. Kushmerick N (1999) Learning to remove internet advertisements. In: Proceedings of the third annual conference on autonomous agents, AGENTS ’99, pp 175–181

  12. Rao CR (1973) Linear statistical inference and its applications. Wiley, New York

  13. Xiong H, Pandey G, Steinbach M, Kumar V (2006) Enhancing data analysis with noise removal. IEEE Trans Knowl Data Eng 18(2):304–319

Author information

Corresponding author

Correspondence to Laure Berti-Équille.

About this article

Cite this article

Berti-Équille, L., Loh, J.M. & Dasu, T. A masking index for quantifying hidden glitches. Knowl Inf Syst 44, 253–277 (2015). https://doi.org/10.1007/s10115-014-0760-0

  • DOI: https://doi.org/10.1007/s10115-014-0760-0
