CatDetect, a framework for detecting Catalan tweets

Multimedia Tools and Applications

Abstract

This work deals with language detection, a field in which proposals range from lexical and morphological analysis to an increasing use of machine learning. The study focuses on Catalan, a minority language, which makes detecting tweets (messages written on the Twitter social network) more difficult. To address this, a Catalan-Twitter corpus was generated using lexical and morphological approaches and then used to create supervised models based on machine learning techniques. The models were evaluated to determine which obtains the best prediction score and is therefore the most suitable to use. We demonstrate that our proposal is successful on Twitter in the case of minority languages. The best model is intended for use on a website, where users can test the algorithm interactively through the front-end webpage or in the background through a web service exposed as a RESTful API.
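The model selection step described in the abstract can be illustrated with a minimal sketch. It assumes a binary task (1 for Catalan, 0 for any other language) and uses the F1 score as the "prediction score"; the abstract does not say which metric was used, so that choice, the model names, and the labels below are all hypothetical.

```python
from typing import Dict, List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Binary metrics where label 1 means 'Catalan' and 0 means 'other'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_model(y_true: List[int], predictions: Dict[str, List[int]]) -> str:
    """Return the name of the candidate with the highest F1 on the held-out labels."""
    return max(predictions, key=lambda name: precision_recall_f1(y_true, predictions[name])[2])


# Hypothetical held-out labels and per-model predictions (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 1, 0, 0, 0, 1, 1, 0],
    "model_b": [1, 1, 1, 0, 0, 0, 0, 0],
    "model_c": [1, 0, 0, 0, 1, 1, 1, 0],
}

winner = best_model(y_true, predictions)
```

Selecting on F1 rather than plain accuracy is one reasonable option when Catalan tweets are a small share of the stream, since accuracy can look high for a model that rarely predicts the minority class.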

Notes

  1. CatDetect: catdetect.udl.cat

  2. Stemming: the process of finding the stem of a word.

  3. Ranks NL: http://www.ranks.nl/stopwords/catalan

  4. LaTeL: http://latel.upf.edu/morgana/altres/pub/ca_stop.htm

  5. Trigram: a combination of three characters that is likely to appear in a sentence written in a given language. The Appendix contains the list of trigrams for Catalan.

  6. http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

  7. https://github.com/dennybritz/cnn-text-classification-tf

  8. Language Detection API: https://detectlanguage.com
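Stopword lists such as those referenced in notes 3 and 4 lend themselves to a simple lexical signal: the share of tokens in a message that are known Catalan stopwords. The sketch below is only an illustration of that idea and not the procedure used by CatDetect; the stopword set is a small hand-picked sample, and the function names are hypothetical.

```python
import re

# Small illustrative sample of Catalan stopwords (an assumption for this sketch);
# a full list would be loaded from a resource such as those in notes 3 and 4.
CATALAN_STOPWORDS = {
    "de", "la", "el", "els", "les", "i", "que", "amb",
    "per", "un", "una", "del", "al", "en", "no", "és",
}


def tokenize(text):
    """Lowercase the text and return runs of letters (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stopword_ratio(text, stopwords=CATALAN_STOPWORDS):
    """Fraction of tokens found in the stopword set; 0.0 for text with no tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in stopwords for token in tokens) / len(tokens)


catalan_score = stopword_ratio("El gat dorm amb la nena")
english_score = stopword_ratio("The cat sleeps on the sofa")
```

A ratio like this cannot separate Catalan from closely related languages that share many function words, so it is better treated as one feature among several than as a classifier on its own.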

Acknowledgments

This work was supported by the Ministerio de Economía y Competitividad under contract TIN2017-84553-C2-2-R. IT, JV, JM, JR and FS are members of the research group 2017-SGR363, funded by the Generalitat de Catalunya. The research leading to these results has received funding from RecerCaixa.

Author information

Corresponding author

Correspondence to Francesc Solsona.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Catalan trigrams

  • de

  • es

  • de

  • la

  • la

  • el

  • que

  • el

  • co

  • ent

  • s d

  • qu

  • i

  • en

  • er

  • a

  • ls

  • nt

  • pe

  • e l

  • a d

  • en

  • per

  • ci

  • ar

  • ue

  • al

  • se

  • est

  • at

  • es

  • ts

  • s

  • pr

  • aci

  • un

  • res

  • men

  • s e

  • del

  • s a

  • s p

  • re

  • les

  • l’

  • na

  • a l

  • ca

  • d’

  • els

  • a p

  • ia

  • ns

  • con

  • le

  • tat

  • a c

  • i d

  • a a

  • ra

  • a e

  • no

  • ant

  • al

  • t d

  • s i

  • di

  • ta

  • re

  • a s

  • com

  • s c

  • ita

  • ons

  • sta

  • ica

  • po

  • r a

  • in

  • pro

  • tre

  • pa

  • ues

  • amb

  • ion

  • des

  • un

  • ma

  • da

  • s s

  • a i

  • an

  • mb

  • am

  • l d

  • e d

  • va

  • pre

  • ter

  • e e

  • e c

  • a m

  • cia

  • una

  • i e

  • nci

  • tra

  • te

  • ona

  • os

  • t e

  • n e

  • l c

  • ca

  • cio

  • l p

  • tr

  • par

  • r l

  • t a

  • e p

  • aqu

  • nta

  • so

  • ame

  • era

  • r e

  • e s

  • ada

  • n a

  • s q

  • si

  • ha

  • als

  • tes

  • va

  • m

  • ici

  • nte

  • s l

  • s m

  • i a

  • or

  • mo

  • ist

  • ect

  • lit

  • m s

  • to

  • ir

  • a t

  • esp

  • ran

  • str

  • om

  • l s

  • st

  • nts

  • me

  • no

  • r d

  • d’a

  • l’a

  • ats

  • ria

  • s t

  • ta

  • sen

  • rs

  • eix

  • tar

  • s n

  • n l

  • tal

  • e a

  • t p

  • art

  • mi

  • ll

  • tic

  • ten

  • ser

  • aq

  • ina

  • ntr

  • a f

  • sti

  • ol

  • a q

  • for

  • ura

  • ers

  • ari

  • int

  • act

  • l’e

  • fi

  • r s

  • e t

  • tor

  • si

  • ste

  • rec

  • a r

  • fe

  • is

  • em

  • n d

  • car

  • bre

  • fo

  • vi

  • an

  • ali

  • i p

  • ix

  • ell

  • l m

  • pos

  • orm

  • l l

  • i l

  • ac

  • fer

  • s r

  • ess

  • eu

  • e m

  • ens

  • ara

  • eri

  • sa

  • ssi

  • us

  • ort

  • tot

  • ll

  • por

  • ora

  • ci

  • tan

  • ass

  • n c

  • ost

  • nes

  • rac

  • a u

  • ver

  • ont

  • ha

  • ti

  • itz

  • gra

  • t c

  • n

  • a v

  • ren

  • cat

  • nal

  • ri

  • qua

  • t l

  • do

  • t s

  • rma

  • ual

  • i s

  • s f

  • n p

  • s v

  • te

  • t i

  • ba

  • cte

  • tam

  • man

  • l t

  • ial

  • fa

  • ic

  • ve

  • ble

  • a n

  • all

  • tza

  • ies

  • le

  • omp

  • r c

  • nc

  • rti

  • it

  • rre

  • fic

  • any

  • on

  • sa

  • r p

  • tur
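A trigram list like the one above can be turned into a rough language signal by measuring how many of a text's character trigrams appear in it. The following is a minimal sketch under stated assumptions: the profile holds only a handful of three-letter entries copied from the list, whitespace is collapsed and kept so that trigrams may span word boundaries (the list appears to contain such entries), and the function names are hypothetical. It is not the scoring procedure of CatDetect.

```python
def char_trigrams(text):
    """Return all overlapping 3-character windows after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


def trigram_hit_rate(text, profile):
    """Fraction of the text's trigrams found in the profile; 0.0 if the text is too short."""
    grams = char_trigrams(text)
    if not grams:
        return 0.0
    return sum(gram in profile for gram in grams) / len(grams)


# A few unambiguous three-letter entries taken from the appendix list (sample only).
SAMPLE_PROFILE = {
    "que", "ent", "per", "est", "aci", "res", "men", "del",
    "les", "els", "con", "tat", "ant", "com", "amb", "una",
}

catalan_rate = trigram_hit_rate("una casa amb finestres", SAMPLE_PROFILE)
english_rate = trigram_hit_rate("the quick brown fox", SAMPLE_PROFILE)
```

With the full profile, a threshold on the hit rate, or the hit rate used as a feature for a supervised model, would be the natural next step; the sample profile here is too small for either.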

About this article

Cite this article

Plaza, S., Vilaplana, J., Mateo, J. et al. CatDetect, a framework for detecting Catalan tweets. Multimed Tools Appl 80, 10657–10677 (2021). https://doi.org/10.1007/s11042-020-10182-3

  • DOI: https://doi.org/10.1007/s11042-020-10182-3
