A masking index for quantifying hidden glitches

Berti-Équille, Laure; Loh, Ji Meng; Dasu, Tamraparni

doi:10.1007/s10115-014-0760-0

Regular Paper
Published: 01 July 2014

Volume 44, pages 253–277 (2015)
Knowledge and Information Systems

Laure Berti-Équille¹,², Ji Meng Loh³ & Tamraparni Dasu⁴

Abstract

Data glitches are errors in a dataset. They are complex entities that often span multiple attributes and records. When they co-occur in data, the presence of one type of glitch can hinder the detection of another type of glitch. This phenomenon is called masking. In this paper, we define two important types of masking and propose a novel, statistically rigorous indicator called masking index for quantifying the hidden glitches. We outline four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration. It enables a user to measure the extent of masking and hence the confidence in the data. In this sense, it is a valuable data quality index for choosing an anomaly detection method that is best suited for the glitches that are present in any given dataset. We demonstrate the utility and effectiveness of the masking index by intensive experiments on synthetic and real-world datasets.
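
To make one of the four cases concrete (outliers masked by duplicates), the following is a minimal sketch on hypothetical toy data. It does not compute the masking index proposed in the paper, whose definition is not given in this abstract; it only illustrates the phenomenon. The function name `zscore_outliers`, the record identifiers, the values, and the threshold of 3 are all assumptions made for illustration, and the z-score rule stands in for whatever outlier detector is actually used.

```python
from statistics import mean, pstdev


def zscore_outliers(records, threshold=3.0):
    """Return records whose value lies more than `threshold` population
    standard deviations from the mean (a deliberately simple detector)."""
    values = [value for _, value in records]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [rec for rec in records if abs(rec[1] - mu) / sigma > threshold]


# 30 ordinary records with values 9, 10, 11 (hypothetical toy data).
clean = [(f"r{i}", float(9 + i % 3)) for i in range(30)]

# One extreme record that was accidentally loaded 10 times.
raw = clean + [("r-out", 100.0)] * 10

# With the duplicates present, the repeated extreme value shifts the mean
# and inflates the standard deviation, so the detector flags nothing.
flagged_raw = zscore_outliers(raw)

# Removing exact duplicate records (order-preserving) exposes the outlier.
deduplicated = list(dict.fromkeys(raw))
flagged_dedup = zscore_outliers(deduplicated)

print("flagged before deduplication:", flagged_raw)
print("flagged after deduplication: ", flagged_dedup)
```

In this toy setting the duplicate glitch hides the outlier glitch from the detector, which is the kind of interaction the abstract calls masking.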

Notes

Data set from OpenData by Socrata retrieved on March 26, 2013: https://opendata.socrata.com/Government/Unclaimed-bank-accounts/.
http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
Permanent Service for Mean Sea Level (PSMSL): http://www.psmsl.org/.

Author information

Authors and Affiliations

IRD ESPACE DEV, 500, rue J.F. Breton, Montpellier, France
Laure Berti-Équille
Qatar Computing Research Institute, Tornado Tower, 18th Floor, West Bay, Doha, Qatar
Laure Berti-Équille
Department of Mathematical Sciences, New Jersey Institute of Technology, Newark, NJ, USA
Ji Meng Loh
AT&T Labs-Research, Bedminster, NJ, USA
Tamraparni Dasu

Corresponding author

Correspondence to Laure Berti-Équille.

About this article

Cite this article

Berti-Équille, L., Loh, J.M. & Dasu, T. A masking index for quantifying hidden glitches. Knowl Inf Syst 44, 253–277 (2015). https://doi.org/10.1007/s10115-014-0760-0

Received: 31 December 2013
Revised: 02 May 2014
Accepted: 18 May 2014
Published: 01 July 2014
Issue Date: August 2015
DOI: https://doi.org/10.1007/s10115-014-0760-0
