Names: A New Frontier in Text Mining

  • Conference paper
Intelligence and Security Informatics (ISI 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2665)

Abstract

Over the past 15 years the government has funded research in information extraction, with the goal of developing the technology to extract entities, events, and their interrelationships from free text for further analysis. A crucial component of linking entities across documents is the ability to recognize when different name strings are potential references to the same entity. Given the extraordinary range of variation international names can take when rendered in the Roman alphabet, this is a daunting task. This paper surveys existing technologies for name matching and for accomplishing pieces of the cross-document extraction and linking task. It proposes a direction for future work in which existing entity extraction, coreference, and database name matching technologies would be harnessed for cross-document coreference and linking capabilities. The extension of name variant matching to free text will add important text mining functionality for intelligence and security informatics toolkits.
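
To make the matching problem concrete, the sketch below flags two name strings as potential references to the same entity when a generic string-similarity score clears a threshold. It is an illustration only, not the method surveyed or proposed in the paper: the functions `normalize`, `similarity`, and `is_potential_match`, and the 0.7 threshold, are assumptions introduced here, and the score comes from Python's standard `difflib.SequenceMatcher` rather than from any name-specific or phonetic model.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "Müller" and "Muller" compare equal.
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or space with a space.
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_marks)
    return " ".join(kept.lower().split())


def similarity(a: str, b: str) -> float:
    """Generic similarity in [0, 1] between two normalized name strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_potential_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair as a candidate match; the threshold is illustrative only."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    pairs = [
        ("Muhammad", "Mohammed"),
        ("Müller", "Muller"),
        ("Muhammad", "Patman"),
    ]
    for left, right in pairs:
        print(left, right, round(similarity(left, right), 2),
              is_potential_match(left, right))
```

A purely character-based score of this kind knows nothing about transliteration conventions or name structure, which is one reason the range of variation described above makes the task hard.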

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Patman, F., Thompson, P. (2003). Names: A New Frontier in Text Mining. In: Chen, H., Miranda, R., Zeng, D.D., Demchak, C., Schroeder, J., Madhusudan, T. (eds) Intelligence and Security Informatics. ISI 2003. Lecture Notes in Computer Science, vol 2665. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44853-5_3

  • DOI: https://doi.org/10.1007/3-540-44853-5_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40189-6

  • Online ISBN: 978-3-540-44853-2

  • eBook Packages: Springer Book Archive
