Automatic Detecting Documents Containing Personal Health Information

  • Conference paper
Artificial Intelligence in Medicine (AIME 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5651)

Abstract

With the increasing use of computers and the Internet, personal health information (PHI) is distributed across multiple institutions, often scattered over multiple devices and stored in diverse formats. Non-traditional medical records, such as emails and e-documents containing PHI, are at high risk of privacy leakage. This creates the challenge of locating and managing PHI in a distributed environment. The goal of this study is to classify electronic documents as PHI or non-PHI. A supervised machine learning method was used for this text categorization task. Three classifiers (SVM, decision tree, and Naive Bayes) were trained and tested on three data sets. Lexical, semantic, and syntactic features, and their combinations, were compared in terms of their effectiveness in classifying PHI documents. The results show that combining semantic and/or syntactic features with lexical features is more effective than using lexical features alone for PHI classification. The supervised machine learning method is effective in classifying documents as PHI or non-PHI.
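
As an illustration of the kind of supervised text categorization the abstract describes, the following is a minimal sketch of a multinomial Naive Bayes classifier over lexical (bag-of-words) features only. It is not the authors' implementation: the class name, the tokenizer, and the toy training documents are all hypothetical and invented for this example, and the semantic and syntactic features discussed in the abstract are not modeled.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens: a simple stand-in for lexical features."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        # Count documents per class (for priors) and tokens per class.
        self.doc_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocabulary = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            # Log prior plus sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore tokens never seen in training
                score += math.log(
                    (counts[token] + 1) / (total_tokens + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, invented for illustration (not from the paper).
TRAIN_DOCS = [
    "Patient John was diagnosed with diabetes and prescribed insulin",
    "Her blood pressure medication was adjusted after the clinic visit",
    "The quarterly budget meeting is moved to Friday afternoon",
    "Please review the attached project schedule before the deadline",
]
TRAIN_LABELS = ["PHI", "PHI", "non-PHI", "non-PHI"]

classifier = NaiveBayesTextClassifier().fit(TRAIN_DOCS, TRAIN_LABELS)
print(classifier.predict("The patient was prescribed new medication"))
print(classifier.predict("Budget review meeting on Friday"))
```

With this toy data the first query is labeled PHI and the second non-PHI. A realistic evaluation would need a labeled corpus and held-out test data, and could extend `tokenize` with additional feature types to compare feature combinations as the abstract describes.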

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, Y., Liu, H., Geng, L., Keays, M.S., You, Y. (2009). Automatic Detecting Documents Containing Personal Health Information. In: Combi, C., Shahar, Y., Abu-Hanna, A. (eds) Artificial Intelligence in Medicine. AIME 2009. Lecture Notes in Computer Science, vol 5651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02976-9_46

  • DOI: https://doi.org/10.1007/978-3-642-02976-9_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02975-2

  • Online ISBN: 978-3-642-02976-9

  • eBook Packages: Computer Science, Computer Science (R0)
