Drive-by Language Identification

A Byproduct of Applied Prototype Semantics

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6008)

Abstract

Many effective and efficient algorithms for language identification exist, most of them based on supervised n-gram or word dictionary methods. In contrast, we propose a semi-supervised approach to language identification based on prototype semantics.

Our method is aimed primarily at noise-rich environments in which only very small text fragments are available for analysis and no training data exists; it can even estimate the probable language affiliations of single words.

We have integrated our prototype system into a larger web crawling and information management architecture and evaluated it in an experimental setup comprising datasets in 11 European languages.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Winnemöller, R. (2010). Drive-by Language Identification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_42

  • DOI: https://doi.org/10.1007/978-3-642-12116-6_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12115-9

  • Online ISBN: 978-3-642-12116-6

  • eBook Packages: Computer Science, Computer Science (R0)
