Abstract
Word fragments, or n-grams, have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] and even genetic classification of languages [5]. All these techniques share some common aspects: (1) documents are mapped to a vector space where n-grams are used as coordinates and their relative frequencies as vector weights, (2) many of them compute a context which plays a role similar to stop-word lists, and (3) cosine distance is commonly used for document-to-document and query-to-document comparisons. blindLight is a new approach related to these classical n-gram techniques, although it introduces two major differences: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques, though less computationally expensive. This new approach can be used simultaneously to perform document categorization and clustering, information retrieval, and text summarization. In this paper we describe the foundations of the technique and its application to both a particular categorization problem (i.e., language identification) and information retrieval tasks.
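The classical scheme summarized above (n-grams as coordinates, relative frequencies as weights, cosine comparison) can be illustrated with a minimal sketch. It shows only that baseline, not blindLight itself, whose significance weights and alignment-inspired metric are not specified in this abstract. The function names `ngram_profile` and `cosine_similarity`, the use of character trigrams, and the sample strings are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Map text to a sparse vector: character n-gram -> relative frequency.

    Assumption: character n-grams with n=3; the abstract does not fix n.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse n-gram vectors."""
    dot = sum(weight * q[gram] for gram, weight in p.items() if gram in q)
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


if __name__ == "__main__":
    # Hypothetical sample texts, chosen only to exercise the functions.
    english = ngram_profile("the quick brown fox jumps over the lazy dog")
    spanish = ngram_profile("el veloz zorro marron salta sobre el perro")
    query = ngram_profile("the dog jumps over the fox")
    print("query vs english:", cosine_similarity(query, english))
    print("query vs spanish:", cosine_similarity(query, spanish))
```

In a language-identification setting of this kind, a query would typically be assigned to the reference profile with the highest similarity; blindLight replaces both the weights and the comparison metric used in this sketch.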
References
D’Amore, R., Mah, C.P.: One-time complete indexing of text: Theory and practice. In: Proc. of SIGIR 1985, pp. 155–164 (1985)
Kimbrell, R.E.: Searching for text? Send an n-gram! Byte 13(5), 297–312 (1988)
Damashek, M.: Gauging similarity with n-grams: Language-independent categorization of text. Science 267, 843–848 (1995)
Cohen, J.D.: Highlights: Language and Domain-Independent Automatic Indexing Terms for Abstracting. JASIS 46(3), 162–174 (1995)
Huffman, S.: The Genetic Classification of Languages by n-gram Analysis: A Computational Technique, Ph. D. thesis, Georgetown University (1998)
Thomas, T.R.: Document retrieval from a large dataset of free-text descriptions of physician-patient encounters via n-gram analysis. Technical Report LA-UR-93-0020, Los Alamos National Laboratory, Los Alamos, NM (1993)
Cavnar, W.B.: Using an n-gram-based document representation with a vector processing retrieval model. In: Proc. of TREC-3, pp. 269–277 (1994)
Huffman, S.: Acquaintance: Language-Independent Document Categorization by N Grams. In: Proceedings of The Fourth Text REtrieval Conference (1995)
Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J.: Naïve Algorithms for Keyphrase Extraction and Text Summarization from a Single Document Inspired by the Protein Biosynthesis Process. In: Ijspeert, A.J., Murata, M., Wakamiya, N. (eds.) BioADIT 2004. LNCS, vol. 3141, pp. 440–455. Springer, Heidelberg (2004)
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
Ferreira da Silva, J., Pereira Lopes, G.: A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. In: Proc. of MOL6 (1999)
Ferreira da Silva, J., Pereira Lopes, G.: Extracting Multiword Terms from Document Collections. In: Proc. of VExTAL, Venice, Italy (1999)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals (English translation from Russian). Soviet Physics Doklady 10(8), 707–710 (1966)
Ziegler, D.: The Automatic Identification of Languages Using Linguistic Recognition Signals. PhD Thesis, State University of New York, Buffalo (1991)
Souter, C., Churcher, G., Hayes, J., Johnson, S.: Natural Language Identification using Corpus-based Models. Hermes Journal of Linguistics 13, 183–203 (1994); Faculty of Modern Languages, Aarhus School of Business, Denmark
Beesley, K.R.: Language Identifier: A Computer Program for Automatic Natural-Language Identification of Online Text. In: Language at Crossroads: Proceedings of the 19th Annual Conference of the American Translators Association, pp. 47–54 (1988)
Dunning, T.: Statistical identification of language. Technical Report MCCS 94-273, New Mexico State University (1994)
Kessler, B.: Computational Dialectology in Irish Gaelic. In: Proceedings of the European Association for Computational Linguistics (EACL), Dublin, pp. 60–67 (1995)
Nerbonne, J., Heeringa, W.: Measuring Dialect Distance Phonetically. In: Coleman, J. (ed.) Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, pp. 11–18 (1997)
Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, Cambridge (1999)
Jarvis, R.A., Patrick, E.A.: Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Transactions on Computers 22(11), 1025–1034 (1973)
Verdaguer, P.: Grammaire de la langue catalane. Les origines de la langue, Curial (1999)
Koehn, P.: Europarl: A Multilingual Corpus for Evaluation of Machine Translation, Draft (unpublished), http://www.isi.edu/~koehn/publications/europarl.ps
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J. (2004). One Size Fits All? A Simple Technique to Perform Several NLP Tasks. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_24
DOI: https://doi.org/10.1007/978-3-540-30228-5_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23498-2
Online ISBN: 978-3-540-30228-5
eBook Packages: Springer Book Archive