One Size Fits All? A Simple Technique to Perform Several NLP Tasks

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3230)

Abstract

Word fragments, or n-grams, have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] and even genetic classification of languages [5]. These techniques share some common aspects: (1) documents are mapped to a vector space where n-grams are used as coordinates and their relative frequencies as vector weights, (2) many of them compute a context which plays a role similar to stop-word lists, and (3) cosine distance is commonly used for document-to-document and query-to-document comparisons. blindLight is a new approach related to these classical n-gram techniques, although it introduces two major differences: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques, although it is not as computationally expensive. This new approach can be used simultaneously to perform document categorization and clustering, information retrieval, and text summarization. In this paper we describe the foundations of the technique and its application to both a particular categorization problem (i.e., language identification) and information retrieval tasks.
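
As an illustration of the classical n-gram scheme summarized in the abstract (not of blindLight itself, whose significance weights and alignment-inspired metric are not specified here), the following minimal Python sketch maps a text to character n-grams weighted by relative frequency and compares two texts by cosine similarity. The function names `ngram_profile` and `cosine_similarity`, the default n-gram size of 3, and the lowercasing and whitespace normalization are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 3) -> dict[str, float]:
    """Map a text to its character n-grams weighted by relative frequency.

    Hypothetical helper: lowercasing and whitespace collapsing are
    illustrative preprocessing choices, not taken from the paper.
    """
    normalized = " ".join(text.lower().split())
    grams = [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}  # text shorter than n characters
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse n-gram vectors."""
    dot = sum(weight * b[gram] for gram, weight in a.items() if gram in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as sharing nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = ngram_profile("the quick brown fox")
    document = ngram_profile("the quick brown dog")
    print(round(cosine_similarity(query, document), 3))
```

A cosine distance, as mentioned in the abstract, can be obtained as one minus this similarity. Texts that share no n-grams score 0.0, and identical texts score 1.0 up to floating-point rounding.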

References

  1. D’Amore, R., Mah, C.P.: One-time complete indexing of text: Theory and practice. In: Proc. of SIGIR 1985, pp. 155–164 (1985)

  2. Kimbrell, R.E.: Searching for text? Send an n-gram! Byte 13(5), 297–312 (1988)

  3. Damashek, M.: Gauging similarity with n-grams: Language-independent categorization of text. Science 267, 843–848 (1995)

  4. Cohen, J.D.: Highlights: Language and Domain-Independent Automatic Indexing Terms for Abstracting. JASIS 46(3), 162–174 (1995)

  5. Huffman, S.: The Genetic Classification of Languages by n-gram Analysis: A Computational Technique, Ph. D. thesis, Georgetown University (1998)

  6. Thomas, T.R.: Document retrieval from a large dataset of free-text descriptions of physician-patient encounters via n-gram analysis. Technical Report LA-UR-93-0020, Los Alamos National Laboratory, Los Alamos, NM (1993)

  7. Cavnar, W.B.: Using an n-gram-based document representation with a vector processing retrieval model. In: Proc. of TREC-3, pp. 269–277 (1994)

  8. Huffman, S.: Acquaintance: Language-Independent Document Categorization by N Grams. In: Proceedings of The Fourth Text REtrieval Conference (1995)

  9. Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J.: Naïve Algorithms for Keyphrase Extraction and Text Summarization from a Single Document Inspired by the Protein Biosynthesis Process. In: Ijspeert, A.J., Murata, M., Wakamiya, N. (eds.) BioADIT 2004. LNCS, vol. 3141, pp. 440–455. Springer, Heidelberg (2004)

  10. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)

  11. Ferreira da Silva, J., Pereira Lopes, G.: A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. In: Proc. of MOL6 (1999)

  12. Ferreira da Silva, J., Pereira Lopes, G.: Extracting Multiword Terms from Document Collections. In: Proc. of VExTAL, Venice, Italy (1999)

  13. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals (English translation from Russian). Soviet Physics Doklady 10(8), 707–710 (1966)

  14. Ziegler, D.: The Automatic Identification of Languages Using Linguistic Recognition Signals. PhD Thesis, State University of New York, Buffalo (1991)

  15. Souter, C., Churcher, G., Hayes, J., Johnson, S.: Natural Language Identification using Corpus-based Models. Hermes Journal of Linguistics 13, 183–203 (1994); Faculty of Modern Languages, Aarhus School of Business, Denmark

  16. Beesley, K.R.: Language Identifier: A Computer Program for Automatic Natural-Language Identification of Online Text. In: Language at Crossroads: Proceedings of the 19th Annual Conference of the American Translators Association, pp. 47–54 (1988)

  17. Dunning, T.: Statistical identification of language. Technical Report MCCS 94-273, New Mexico State University (1994)

  18. Kessler, B.: Computational Dialectology in Irish Gaelic. In: Proceedings of the European Association for Computational Linguistics, pp. 60–67. EACL, Dublin (1995)

  19. Nerbonne, J., Heeringa, W.: Measuring Dialect Distance Phonetically. In: Coleman, J. (ed.) Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, pp. 11–18 (1997)

  20. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, Cambridge (1999)

  21. Jarvis, R.A., Patrick, E.A.: Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Transactions on Computers 22(11), 1025–1034 (1973)

  22. Verdaguer, P.: Grammaire de la langue catalane. Les origines de la langue, Curial (1999)

  23. Koehn, P.: Europarl: A Multilingual Corpus for Evaluation of Machine Translation, Draft (unpublished), http://www.isi.edu/~koehn/publications/europarl.ps

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J. (2004). One Size Fits All? A Simple Technique to Perform Several NLP Tasks. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_24

  • DOI: https://doi.org/10.1007/978-3-540-30228-5_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23498-2

  • Online ISBN: 978-3-540-30228-5

  • eBook Packages: Springer Book Archive
