A Sequence Labeling Method Using Syntactical and Textual Patterns for Record Linkage

  • Conference paper
Pattern Recognition and Data Mining (ICAPR 2005)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3686)

Abstract

Record linkage is an important application area of text pattern analysis. In this paper we propose a new sequence labeling method that can be used to extract entities from a string for record linkage. The proposed method combines a classifier and a Hidden Markov Model (HMM) to utilize both syntactical and textual information from the string. We first describe the model used in the proposed method and then discuss the parameter estimation for this model. The proposed method incorporates a classifier for handling textual information and integrates the classifier with the HMM statistically by estimating the error probability of the classifier. We applied the proposed method to the bibliographic sequence labeling problem, in which bibliographic components are extracted from reference strings. We compared the proposed method with other methods that use textual or syntactical information alone and showed that the proposed method outperforms them.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Takasu, A. (2005). A Sequence Labeling Method Using Syntactical and Textual Patterns for Record Linkage. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds) Pattern Recognition and Data Mining. ICAPR 2005. Lecture Notes in Computer Science, vol 3686. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551188_22

  • DOI: https://doi.org/10.1007/11551188_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28757-5

  • Online ISBN: 978-3-540-28758-2

  • eBook Packages: Computer Science, Computer Science (R0)
