A System for Adaptive Information Extraction from Highly Informal Text

Alonso i Alemany, Laura; Carrascosa, Rafael

doi:10.1007/978-3-642-22327-3_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6716)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

We present a first version of ado, a system for Adaptive Data Organization, that is, for information extraction from highly informal text such as short text messages, classified ads, and tweets. It is built on a modular architecture that transparently integrates off-the-shelf NLP tools, general procedures on strings and machine learning, and processes tailored to a domain.
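
As an illustration of what such a modular architecture can look like, the following minimal sketch chains independent stages over a token list. It is not the ado implementation: the stage names (`tokenize`, `lowercase`, `make_expander`), the `run_pipeline` helper, and the tiny abbreviation lexicon are all assumptions made for this example.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stage type: every stage maps a token list to a token list,
# so general-purpose and domain-specific stages can be mixed freely.
Stage = Callable[[List[str]], List[str]]


def tokenize(text: str) -> List[str]:
    """Split text into digit runs, letter runs, and single punctuation marks.

    Informal ads often glue numbers to words ("2dorm"); this separates them.
    """
    return re.findall(r"\d+|[^\W\d_]+|[^\w\s]", text)


def lowercase(tokens: List[str]) -> List[str]:
    """General string procedure: case folding."""
    return [t.lower() for t in tokens]


def make_expander(lexicon: Dict[str, str]) -> Stage:
    """Build a domain-specific stage that expands known abbreviations."""
    def expand(tokens: List[str]) -> List[str]:
        return [lexicon.get(t, t) for t in tokens]
    return expand


def run_pipeline(text: str, stages: List[Stage]) -> List[str]:
    """Tokenize the text, then apply each stage in order."""
    tokens = tokenize(text)
    for stage in stages:
        tokens = stage(tokens)
    return tokens


# Illustrative hand-built lexicon for Spanish real estate ads (assumed entries).
REAL_ESTATE_ABBREVIATIONS = {"depto": "departamento", "dorm": "dormitorios"}

stages = [lowercase, make_expander(REAL_ESTATE_ABBREVIATIONS)]
result = run_pipeline("Vendo depto 2dorm", stages)
```

Because every stage shares the same signature, a stage can be replaced (for example, swapping the dictionary lookup for a learned normalizer) without touching the rest of the chain.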

The system is called adaptive because it implements a semi-supervised approach. Knowledge resources are initially built by hand and then updated automatically with feeds from the corpus. This allows ado to adapt to rapidly changing user-generated language.
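
One way to picture this kind of semi-supervised update is a hand-built seed lexicon that absorbs frequent unseen spellings from the corpus when they closely resemble a known entry. The sketch below is only an assumption about how such an update might work, not the method used in ado: the function `update_lexicon`, its thresholds, and the sample data are hypothetical, and string similarity comes from Python's standard `difflib` rather than from any measure described in the paper.

```python
from collections import Counter
from difflib import get_close_matches
from typing import Dict, Iterable


def update_lexicon(
    lexicon: Dict[str, str],
    corpus_tokens: Iterable[str],
    min_count: int = 2,    # assumed frequency threshold
    cutoff: float = 0.8,   # assumed similarity threshold (difflib ratio)
) -> Dict[str, str]:
    """Return a copy of the lexicon extended with frequent unseen variants.

    An unseen token is added when it occurs at least `min_count` times and
    is similar enough to a variant that the seed lexicon already knows.
    """
    updated = dict(lexicon)
    known_variants = list(lexicon)
    for token, count in Counter(corpus_tokens).items():
        if token in updated or count < min_count:
            continue
        match = get_close_matches(token, known_variants, n=1, cutoff=cutoff)
        if match:
            # Map the new spelling to the same normal form as its neighbour.
            updated[token] = lexicon[match[0]]
    return updated


# Hand-built seed resource (illustrative entries only).
seed = {"depto": "departamento", "dorm": "dormitorio"}
# Tokens observed in new, unlabeled ads (made-up sample).
observed = ["dpto", "dpto", "dorm", "cochera", "cochera", "dto"]
adapted = update_lexicon(seed, observed)
```

In this sample, "dpto" is frequent and close to the known variant "depto", so it inherits the normal form "departamento"; "cochera" is frequent but resembles no known entry, and "dto" appears only once, so neither is added.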

To estimate the impact of future developments, we carried out a preliminary evaluation of the system on a small corpus of Spanish-language classified advertisements from the real estate domain. This evaluation shows that tokenization and chunking can be handled well by simple techniques, whereas normalization and morphosyntactic and semantic tagging require either more complex techniques or a larger training corpus.

Author information

Authors and Affiliations

NLP group, FaMAF-UNC, Córdoba, Argentina
Laura Alonso i Alemany & Rafael Carrascosa

Editor information

Editors and Affiliations

School of Computing, University of Alicante, 03080, Alicante, Spain
Rafael Muñoz
Department of Software and Computing Systems, University of Alicante, Aptdo. de Correos 99, 03080, Alicante, Spain
Andrés Montoyo
CNAM- Laboratoire Cédric, 292 Rue St. Martin, 75141, Paris Cedex 03, France
Elisabeth Métais

About this paper

Cite this paper

Alonso i Alemany, L., Carrascosa, R. (2011). A System for Adaptive Information Extraction from Highly Informal Text. In: Muñoz, R., Montoyo, A., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2011. Lecture Notes in Computer Science, vol 6716. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22327-3_14

DOI: https://doi.org/10.1007/978-3-642-22327-3_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22326-6
Online ISBN: 978-3-642-22327-3
eBook Packages: Computer Science, Computer Science (R0)
