One Size Fits All? A Simple Technique to Perform Several NLP Tasks

Gayo-Avello, Daniel; Álvarez-Gutiérrez, Darío; Gayo-Avello, José

doi:10.1007/978-3-540-30228-5_24

Conference paper
First Online: 20 October 2004

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3230)

Abstract

Word fragments, or n-grams, have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] and even genetic classification of languages [5]. These techniques share some common aspects: (1) documents are mapped to a vector space where n-grams are used as coordinates and their relative frequencies as vector weights, (2) many of them compute a context which plays a role similar to stop-word lists, and (3) cosine distance is commonly used for document-to-document and query-to-document comparisons. blindLight is a new approach related to these classical n-gram techniques, but it introduces two major differences: (1) relative frequencies are no longer used as vector weights and are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques, although not as computationally expensive. This new approach can be used simultaneously to perform document categorization and clustering, information retrieval, and text summarization. In this paper we describe the foundations of the technique and its application to a particular categorization problem (i.e., language identification) and to information retrieval tasks.
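
The classical n-gram scheme summarized in the abstract (n-grams as coordinates, relative frequencies as weights, cosine comparison) can be sketched roughly as follows. This is only an illustration of that baseline, not of blindLight itself, whose significance weights and alignment-inspired metric are not specified in the abstract. The function names `ngram_profile` and `cosine_similarity`, the use of character trigrams, and the lowercasing and whitespace normalization are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 3) -> dict[str, float]:
    """Map a text to its character n-grams, weighted by relative frequency.

    Hypothetical helper: lowercasing and whitespace collapsing are
    illustrative choices, not taken from the paper.
    """
    normalized = " ".join(text.lower().split())
    grams = [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}  # text shorter than n characters
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse n-gram vectors."""
    dot = sum(weight * b[gram] for gram, weight in a.items() if gram in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For a toy language-identification run under these assumptions, one could build a profile per language from sample text, build a profile for an unknown document, and pick the language whose profile gives the highest `cosine_similarity`.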

References

[1] D’Amore, R., Mah, C.P.: One-time complete indexing of text: Theory and practice. In: Proc. of SIGIR 1985, pp. 155–164 (1985)
[2] Kimbrell, R.E.: Searching for text? Send an n-gram! Byte 13(5), 297–312 (1988)
[3] Damashek, M.: Gauging similarity with n-grams: Language-independent categorization of text. Science 267, 843–848 (1995)
[4] Cohen, J.D.: Highlights: Language and Domain-Independent Automatic Indexing Terms for Abstracting. JASIS 46(3), 162–174 (1995)
[5] Huffman, S.: The Genetic Classification of Languages by n-gram Analysis: A Computational Technique. Ph.D. thesis, Georgetown University (1998)
[6] Thomas, T.R.: Document retrieval from a large dataset of free-text descriptions of physician-patient encounters via n-gram analysis. Technical Report LA-UR-93-0020, Los Alamos National Laboratory, Los Alamos, NM (1993)
[7] Cavnar, W.B.: Using an n-gram-based document representation with a vector processing retrieval model. In: Proc. of TREC-3, pp. 269–277 (1994)
[8] Huffman, S.: Acquaintance: Language-Independent Document Categorization by N Grams. In: Proceedings of The Fourth Text REtrieval Conference (1995)
[9] Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J.: Naïve Algorithms for Keyphrase Extraction and Text Summarization from a Single Document Inspired by the Protein Biosynthesis Process. In: Ijspeert, A.J., Murata, M., Wakamiya, N. (eds.) BioADIT 2004. LNCS, vol. 3141, pp. 440–455. Springer, Heidelberg (2004)
[10] Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
[11] Ferreira da Silva, J., Pereira Lopes, G.: A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. In: Proc. of MOL6 (1999)
[12] Ferreira da Silva, J., Pereira Lopes, G.: Extracting Multiword Terms from Document Collections. In: Proc. of VExTAL, Venice, Italy (1999)
[13] Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals (English translation from Russian). Soviet Physics Doklady 10(8), 707–710 (1966)
[14] Ziegler, D.: The Automatic Identification of Languages Using Linguistic Recognition Signals. Ph.D. thesis, State University of New York, Buffalo (1991)
[15] Souter, C., Churcher, G., Hayes, J., Johnson, S.: Natural Language Identification using Corpus-based Models. Hermes Journal of Linguistics 13, 183–203 (1994); Faculty of Modern Languages, Aarhus School of Business, Denmark
[16] Beesley, K.R.: Language Identifier: A Computer Program for Automatic Natural-Language Identification of Online Text. In: Language at Crossroads: Proceedings of the 19th Annual Conference of the American Translators Association, pp. 47–54 (1988)
[17] Dunning, T.: Statistical identification of language. Technical Report MCCS 94-273, New Mexico State University (1994)
[18] Kessler, B.: Computational Dialectology in Irish Gaelic. Dublin: EACL. In: Proceedings of the European Association for Computational Linguistics, pp. 60–67 (1995)
[19] Nerbonne, J., Heeringa, W.: Measuring Dialect Distance Phonetically. In: Coleman, J. (ed.) Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, pp. 11–18 (1997)
[20] Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, Cambridge (1999)
[21] Jarvis, R.A., Patrick, E.A.: Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Transactions on Computers 22(11), 1025–1034 (1973)
[22] Verdaguer, P.: Grammaire de la langue catalane. Les origines de la langue, Curial (1999)
[23] Koehn, P.: Europarl: A Multilingual Corpus for Evaluation of Machine Translation, Draft (unpublished), http://www.isi.edu/~koehn/publications/europarl.ps

Author information

Authors and Affiliations

Department of Informatics, University of Oviedo, Calvo Sotelo s/n, 33007, Oviedo, Spain
Daniel Gayo-Avello, Darío Álvarez-Gutiérrez & José Gayo-Avello

Editor information

Editors and Affiliations

Department of Software and Computing Systems, University of Alicante, Spain
José Luis Vicedo
Natural Language Processing and Information Systems Group, Department of Software and Computing Systems, University of Alicante, Spain
Patricio Martínez-Barco
Grupo de investigación del Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
Rafael Muñoz
Departamento de Lenguajes y Sistemas Informáticos, Carretera de San Vicente del Raspeig, Universidad de Alicante, 03690 San Vicente del Raspeig, Alicante, Spain
Maximiliano Saiz Noeda

Cite this paper

Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J. (2004). One Size Fits All? A Simple Technique to Perform Several NLP Tasks. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_24

DOI: https://doi.org/10.1007/978-3-540-30228-5_24
Published: 20 October 2004
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23498-2
Online ISBN: 978-3-540-30228-5
eBook Packages: Springer Book Archive
