ABSTRACT
This paper presents the current solutions to ethical and security issues in the EVTIMA system, an environment supporting text mining of patient records (PRs) in Bulgarian. Confidentiality and anonymisation of the analysed documents are crucial from an ethical point of view and are treated as leading development principles. We describe in detail our method for PR de-identification, which uses data vocabularies, regular expressions and additional heuristics to locate identifying information. It is trained on a corpus of 197 documents and tested on 1000 documents. The algorithm works in three steps and de-identifies 97% of the identifying information, which is comparable to the results reported for similar tasks in English.
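As a rough illustration of how vocabularies, regular expressions and heuristics can be combined for de-identification, the following is a minimal sketch. It is not the EVTIMA implementation: the abstract does not specify the three steps, so the staging below, the name list, the patterns (a dotted date and a 10-digit identifier), the title-based heuristic and the placeholder labels are all assumptions chosen for illustration, and Latin-script sample text is used for simplicity.

```python
import re

# Hypothetical vocabulary of known names (assumption for illustration);
# a real system would load far larger lists.
NAME_VOCABULARY = {"ivan", "petrov", "maria", "georgieva"}

# Hypothetical patterns for structured identifiers (assumptions).
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")),  # e.g. 12.03.2009
    ("ID", re.compile(r"\b\d{10}\b")),                     # 10-digit number
]

# Hypothetical heuristic: a capitalised word right after a title is a name.
TITLE_CONTEXT = re.compile(r"\b(Dr\.|Prof\.)\s+([A-Z][a-z]+)")


def deidentify(text: str) -> str:
    """Replace identifying spans with placeholder labels in three passes."""

    # Pass 1: vocabulary lookup on every alphabetic token.
    def replace_known(match: re.Match) -> str:
        word = match.group(0)
        return "[NAME]" if word.lower() in NAME_VOCABULARY else word

    text = re.sub(r"[A-Za-z]+", replace_known, text)

    # Pass 2: regular expressions for dates and numeric identifiers.
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)

    # Pass 3: context heuristic for names missing from the vocabulary.
    text = TITLE_CONTEXT.sub(lambda m: f"{m.group(1)} [NAME]", text)
    return text


sample = ("Patient Ivan Petrov, ID 7501011234, "
          "examined on 12.03.2009 by Dr. Stoyanov.")
print(deidentify(sample))
```

In this sketch the heuristic pass catches "Stoyanov", which is absent from the vocabulary, because it follows a title. Running the vocabulary pass first means already-masked tokens are not matched again by the heuristic.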
Ethics and security in text mining of patient records in Bulgarian: the EVTIMA solution