Abstract
Many applications of Formal Concept Analysis (FCA) start with a set of structured data such as objects and their properties. In practice, most readily available data is in the form of unstructured or semistructured text. A typical application of FCA assumes that objects and their properties have already been extracted by some other method or technique. For example, in the 2003 Los Alamos National Lab (LANL) project on Advanced Knowledge Integration In Assessing Terrorist Threats, a data extraction tool was used to mine the text for the structured data. In this paper, we provide a detailed description of our approach to the extraction of personal names for possible subsequent use in FCA. Our basic approach is to integrate statistics on names and other words into an adaptation of a Hidden Markov Model (HMM). We use lists of names and their relative frequencies compiled from U.S. Census data. We also use a list of non-name words along with their frequencies in a training set from our collection of documents. These lists are compiled into one master list to be used as a part of the design.
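As a rough illustration of the master list described above, the following Python sketch merges a table of name frequencies with a table of non-name word counts. It is not the authors' implementation: the function name `build_master_list`, the tuple layout, the lower-casing of tokens, and the sample values are all assumptions made for this example. The abstract does not say how the master list feeds the HMM; one plausible reading, assumed here, is that each entry supplies per-token estimates for a "name" state and a "non-name" state.

```python
from collections import Counter


def build_master_list(name_freqs, word_counts):
    """Merge name frequencies and non-name word counts into one table.

    name_freqs:  token -> relative frequency as a name (hypothetical input,
                 e.g. taken from census-style frequency lists).
    word_counts: token -> raw count as a non-name word in a training set.
    Returns:     token -> (name_frequency, non_name_frequency).
    """
    # Normalise keys to lower case, summing entries that collide.
    names = Counter()
    for token, freq in name_freqs.items():
        names[token.lower()] += freq

    words = Counter()
    for token, count in word_counts.items():
        words[token.lower()] += count

    # Convert raw non-name counts to relative frequencies.
    total = sum(words.values())

    master = {}
    for token in set(names) | set(words):
        word_freq = words[token] / total if total else 0.0
        master[token] = (float(names[token]), word_freq)
    return master


# Made-up sample values, for illustration only.
sample_names = {"Smith": 0.01, "Mark": 0.005}
sample_words = {"mark": 3, "the": 7}
sample_master = build_master_list(sample_names, sample_words)
print(sample_master["mark"])
```

A token such as "mark" appears in both source lists, so its entry carries a non-zero value in both positions. This is the kind of ambiguity a sequence model would need surrounding context to resolve.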
International Workshop on the Concept Formation and Extraction in Under-Traversed Domains (CFEUTD-2011).
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Taghva, K., Beckley, R., Coombs, J. (2011). Name Extraction and Formal Concept Analysis. In: Andrews, S., Polovina, S., Hill, R., Akhgar, B. (eds) Conceptual Structures for Discovering Knowledge. ICCS 2011. Lecture Notes in Computer Science, vol 6828. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22688-5_28
DOI: https://doi.org/10.1007/978-3-642-22688-5_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22687-8
Online ISBN: 978-3-642-22688-5
eBook Packages: Computer Science, Computer Science (R0)