Automatic Detecting Documents Containing Personal Health Information

Wang, Yunli; Liu, Hongyu; Geng, Liqiang; Keays, Matthew S.; You, Yonghua

doi:10.1007/978-3-642-02976-9_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5651)

Included in the following conference series:

Conference on Artificial Intelligence in Medicine in Europe

Abstract

With the increasing use of computers and the Internet, personal health information (PHI) is distributed across multiple institutes, often scattered across multiple devices, and stored in diverse formats. Non-traditional medical records such as emails and e-documents containing PHI are at high risk of privacy leakage. This creates the challenge of locating and managing PHI in a distributed environment. The goal of this study is to classify electronic documents as PHI or non-PHI. A supervised machine learning method was used for this text categorization task. Three classifiers (SVM, decision tree, and naive Bayes) were tested on three data sets. Lexical, semantic, and syntactic features and their combinations were compared in terms of their effectiveness in classifying PHI documents. The results show that combining semantic and/or syntactic features with lexical features is more effective than using lexical features alone for PHI classification. The supervised machine learning method is effective in classifying documents as PHI or non-PHI.
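
As a rough illustration of the text categorization task described in the abstract, the sketch below trains a naive Bayes classifier on lexical (bag-of-words) features only. It is not the authors' implementation: the class name `NaiveBayesTextClassifier`, the `tokenize` helper, the add-one smoothing, and the four toy training documents are all assumptions made for this example, and the semantic and syntactic features studied in the paper are not modeled.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (lexical features)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.doc_counts = Counter(labels)  # number of documents per class
        self.word_counts = {c: Counter() for c in self.doc_counts}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        # The vocabulary is the set of all tokens seen during training.
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, doc):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, invented for illustration only.
train_docs = [
    "Patient John was diagnosed with diabetes and prescribed insulin",
    "Her blood pressure medication dosage was increased after the clinic visit",
    "The quarterly sales report is due on Friday",
    "Please review the attached meeting agenda and budget",
]
train_labels = ["PHI", "PHI", "non-PHI", "non-PHI"]

classifier = NaiveBayesTextClassifier().fit(train_docs, train_labels)
prediction_phi = classifier.predict("The patient was prescribed new medication")
prediction_other = classifier.predict("The budget report is attached")
```

On this toy data the first query is labeled `PHI` and the second `non-PHI`. A realistic evaluation would need labeled corpora and held-out test sets such as the three data sets the abstract mentions, which are not reproduced here.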

Author information

Authors and Affiliations

Institute for Information Technology, National Research Council Canada, 46 Dineen Dr. Fredericton, NB, Canada
Yunli Wang, Hongyu Liu, Liqiang Geng, Matthew S. Keays & Yonghua You

Editor information

Editors and Affiliations

University of Verona, Department of Computer Science, Ca’ Vignal 2, strada le Grazie 15, 37134, Verona, Italy
Carlo Combi
Department of Information Systems Engineering, Ben Gurion University of the Negev, P.O. Box 653, 84105, Beer-Sheva, Israel
Yuval Shahar
Department of Medical Informatics, University of Amsterdam, Academic Medical Center, Meibergdreef 15, 1105 AZ, Amsterdam, The Netherlands
Ameen Abu-Hanna

About this paper

Cite this paper

Wang, Y., Liu, H., Geng, L., Keays, M.S., You, Y. (2009). Automatic Detecting Documents Containing Personal Health Information. In: Combi, C., Shahar, Y., Abu-Hanna, A. (eds) Artificial Intelligence in Medicine. AIME 2009. Lecture Notes in Computer Science, vol 5651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02976-9_46

DOI: https://doi.org/10.1007/978-3-642-02976-9_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02975-2
Online ISBN: 978-3-642-02976-9
eBook Packages: Computer Science, Computer Science (R0)
