Detecting the Data Group Most Prone to a Specific Disguise Value

  • Conference paper
Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8643)

Abstract

Disguised missing data, a data quality problem named by Pearson in 2006, is a special kind of missing data: the values are not actually absent from the data entries, but they do not reflect the true facts and so may severely bias analysis results. In this paper, we present a novel problem in detecting disguised missing data, namely finding the data group most prone to a specific disguise value. We show that this problem can be formalized as an optimization problem and propose a method based on genetic algorithms to solve it. In preliminary experiments on real datasets, our method discovers the same optimal data groups as an exhaustive search. A further evaluation on the FDA adverse drug event reporting dataset shows that our method yields results similar to those reached through manual examination by experienced analysts.
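To make the idea concrete, the following is a minimal Python sketch of searching for the group most prone to a disguise value with a genetic algorithm and checking it against an exhaustive search. It is not the authors' method: the abstract does not specify the encoding, fitness function, or genetic operators, so everything here is an assumption for illustration. That includes the synthetic records, the attribute names, the disguise value `0` in a `weight` field, the fitness (share of the disguise value in the group, with a minimum support threshold), and the tournament/uniform-crossover operators.

```python
import functools
import itertools
import random

# Hypothetical categorical attributes; None in a chromosome means "any value".
ATTRS = {"sex": ["F", "M"], "age": ["young", "mid", "old"], "region": ["north", "south"]}
NAMES = list(ATTRS)
DISGUISE = 0  # assumed disguise value in the "weight" field


def make_records(rng, n=1200):
    """Synthetic data: old males are assumed to report weight 0 far more often."""
    records = []
    for _ in range(n):
        rec = {name: rng.choice(values) for name, values in ATTRS.items()}
        p = 0.6 if (rec["sex"] == "M" and rec["age"] == "old") else 0.05
        rec["weight"] = DISGUISE if rng.random() < p else rng.randint(40, 120)
        records.append(rec)
    return records


def fitness(chrom, records, min_support=40):
    """Share of the disguise value inside the group; groups below min_support score -1."""
    group = [r for r in records
             if all(g is None or r[a] == g for a, g in zip(NAMES, chrom))]
    if len(group) < min_support:
        return -1.0
    return sum(r["weight"] == DISGUISE for r in group) / len(group)


def random_gene(rng, name):
    return rng.choice([None] + ATTRS[name])


def genetic_search(records, rng, pop_size=20, generations=40, mutation_rate=0.2):
    # Cache scores: chromosomes are hashable tuples and are revisited often.
    score = functools.lru_cache(maxsize=None)(lambda c: fitness(c, records))
    pop = [tuple(random_gene(rng, a) for a in NAMES) for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(pop, key=score)]  # elitism: the best group always survives
        while len(children) < pop_size:
            # Tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=score)
            p2 = max(rng.sample(pop, 3), key=score)
            # Uniform crossover, then per-gene mutation
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            child = [random_gene(rng, a) if rng.random() < mutation_rate else g
                     for a, g in zip(NAMES, child)]
            children.append(tuple(child))
        pop = children
    return max(pop, key=score)


def exhaustive_search(records):
    """Score every possible group description; feasible only for tiny spaces."""
    space = itertools.product(*[[None] + ATTRS[a] for a in NAMES])
    return max(space, key=lambda c: fitness(c, records))


rng = random.Random(0)
records = make_records(rng)
ga_best = genetic_search(records, rng)
ex_best = exhaustive_search(records)
print("GA:        ", ga_best, round(fitness(ga_best, records), 3))
print("Exhaustive:", ex_best, round(fitness(ex_best, records), 3))
```

The toy search space has only 36 candidate groups, so the exhaustive baseline is cheap here; the case for a genetic algorithm rests on the space growing multiplicatively with the number of attributes and values.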

References

  1. Belen, R.: Detecting disguised missing data. Master's thesis, Middle East Technical University (2009)

  2. Belen, R., Temizel, T.T.: A framework to detect disguised missing data. In: Senthil Kumar, A.V. (ed.) Knowledge Discovery Practices and Emerging Applications of Data Mining: Trends and New Domains, pp. 1–22. IGI Global, Hershey (2010)

  3. FDA Adverse Event Reporting System. http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm083765.htm

  4. Hua, M., Pei, J.: Cleaning disguised missing data: a heuristic approach. In: Proceedings of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 950–958 (2007)

  5. Hua, M., Pei, J.: DiMaC: a system for cleaning disguised missing data. In: Proceedings of 2008 ACM SIGMOD International Conference on Management of Data, pp. 1263–1266 (2008)

  6. Little, R., Rubin, D.: Statistical Analysis with Missing Data. Wiley Publishers, New York (1987)

  7. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1996)

  8. Natarajan, K., Li, J., Koronios, A.: Detecting mis-entered values in large data sets. In: Proceedings of the 4th World Congress on Engineering Asset Management, pp. 805–812 (2009)

  9. Pearson, R.K.: The Problem of Disguised Missing Data. ACM SIGKDD Explor. Newslett. 8(1), 83–92 (2006)

  10. UCI Machine Learning Repository: Pima Indians Diabetes Data Set. http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

Author information

Corresponding author

Correspondence to Wen-Yang Lin.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Lin, WY., Feng, WY. (2014). Detecting the Data Group Most Prone to a Specific Disguise Value. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_10

  • DOI: https://doi.org/10.1007/978-3-319-13186-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13185-6

  • Online ISBN: 978-3-319-13186-3

  • eBook Packages: Computer Science (R0)
