CatDetect, a framework for detecting Catalan tweets

Plaza, Sergi; Vilaplana, Jordi; Mateo, Jordi; Rius, Josep; Solsona, Francesc

doi:10.1007/s11042-020-10182-3

Published: 26 November 2020

Multimedia Tools and Applications, volume 80, pages 10657–10677 (2021)

Sergi Plaza¹,
Jordi Vilaplana²,
Jordi Mateo²,
Josep Rius³ &
Francesc Solsona² (ORCID: orcid.org/0000-0002-4830-9184)

Abstract

This work deals with language detection, a field in which new proposals range from lexical and morphological analysis to an increasing use of machine learning solutions. Here the study focuses on Catalan, a minority language, in the context of the Twitter social network, where detecting tweets (messages written on Twitter) in such a language is more difficult. To this end, a Catalan Twitter corpus was generated using lexical and morphological approaches and was then used to create supervised models based on machine learning techniques. The models were evaluated to see which one obtains the best prediction score and is therefore the most suitable. We demonstrate that our proposal is successful on Twitter in the case of minority languages. The best model is intended for use on a website, where users can test the algorithm interactively through the front-end web page and, in the background, through a web service exposed as a RESTful API.

Notes

CatDetect: catdetect.udl.cat
Stemming: the process carried out to find the stem of a word.
Ranks NL: http://www.ranks.nl/stopwords/catalan
LaTeL: http://latel.upf.edu/morgana/altres/pub/ca_stop.htm
Trigram: a combination of 3 characters that is likely to appear in a sentence written in a given language. The Appendix contains the list of trigrams for Catalan.
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
https://github.com/dennybritz/cnn-text-classification-tf
Language Detection API: https://detectlanguage.com
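
The trigram definition in the notes above can be illustrated with a short sketch. Assuming a small sample of texts, the hypothetical helper below counts overlapping three-character windows and returns the most frequent ones. The function name, the lowercase and whitespace normalization, and the top_n parameter are illustrative assumptions, not details taken from the article.

```python
from collections import Counter


def trigram_profile(texts, top_n=10):
    """Return the top_n most frequent character trigrams in texts.

    Hypothetical helper for illustration only: the normalization
    (lowercasing, collapsing runs of whitespace) is an assumption.
    """
    counts = Counter()
    for text in texts:
        normalized = " ".join(text.lower().split())
        # Slide a three-character window over the normalized text.
        counts.update(
            normalized[i:i + 3] for i in range(len(normalized) - 2)
        )
    return [gram for gram, _ in counts.most_common(top_n)]
```

Run over a large collection of texts in one language, a helper of this kind would yield a ranked list comparable in shape to the one in the Appendix.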

Acknowledgments

This work was supported by the Ministerio de Economía y Competitividad under contract TIN2017-84553-C2-2-R. IT, JV, JM, JR and FS are members of the research group 2017-SGR363, funded by the Generalitat de Catalunya. The research leading to these results has received funding from RecerCaixa.

Author information

Authors and Affiliations

GFT. Parc Científic i Tecnològic de Lleida, Building H1, Lleida, 25003, Spain
Sergi Plaza
Department of Computer Science, University of Lleida, Jaume II 69, Lleida, 25001, Spain
Jordi Vilaplana, Jordi Mateo & Francesc Solsona
Department of AEGERN, University of Lleida, Jaume II 73, Lleida, 25001, Spain
Josep Rius

Corresponding author

Correspondence to Francesc Solsona.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Catalan trigrams

de
es
de
la
la
el
que
el
co
ent
s d
qu
i
en
er
a
ls
nt
pe
e l
a d
en
per
ci
ar
ue
al
se
est
at
es
ts
s
pr
aci
un
res
men
s e
del
s a
s p
re
les
l’
na
a l
ca
d’
els
a p
ia
ns
con
le
tat
a c
i d
a a
ra
a e
no
ant
al
t d
s i
di
ta
re
a s
com
s c
ita
ons
sta
ica
po
r a
in
pro
tre
pa
ues
amb
ion
des
un
ma
da
s s
a i
an
mb
am
l d
e d
va
pre
ter
e e
e c
a m
cia
una
i e
nci
tra
te
ona
os
t e
n e
l c
ca
cio
l p
tr
par
r l
t a
e p
aqu
nta
so
ame
era
r e
e s
ada
n a
s q
si
ha
als
tes
va
m
ici
nte
s l
s m
i a
or
mo
ist
ect
lit
m s
to
ir
a t
esp
ran
str
om
l s
st
nts
me
no
r d
d’a
l’a
ats
ria
s t
ta
sen
rs
eix
tar
s n
n l
tal
e a
t p
art
mi
ll
tic
ten
ser
aq
ina
ntr
a f
sti
ol
a q
for
ura
ers
ari
int
act
l’e
fi
r s
e t
tor
si
ste
rec
a r
fe
is
em
n d
car
bre
fo
vi
an
ali
i p
ix
ell
l m
pos
orm
l l
i l
ac
fer
s r
ess
eu
e m
ens
ara
eri
sa
ssi
us
ort
tot
ll
por
ora
ci
tan
ass
n c
ost
nes
rac
a u
ver
ont
ha
ti
itz
gra
t c
n
a v
ren
cat
nal
ri
qua
t l
do
t s
rma
ual
i s
s f
n p
s v
te
t i
ba
cte
tam
man
l t
ial
fa
ic
ve
ble
a n
all
tza
ies
le
omp
r c
nc
rti
it
rre
fic
any
on
sa
r p
tur
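
A list like the one above can serve as a simple language profile. The following is a minimal sketch, not the CatDetect implementation: it measures the fraction of a text's character trigrams that occur in a small subset of the appendix list. The function names, the choice of subset, and the ratio score are assumptions made for illustration; only entries printed above as full three-character strings are used.

```python
# Illustrative subset of the appendix list (full three-character entries only).
CATALAN_TRIGRAMS = frozenset({
    "que", "ent", "per", "est", "aci", "res", "men", "del",
    "les", "els", "con", "tat", "ant", "com", "amb", "una",
})


def char_trigrams(text):
    """Split text into overlapping three-character windows."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


def catalan_trigram_ratio(text):
    """Fraction of the text's trigrams found in CATALAN_TRIGRAMS.

    Hypothetical score: texts with no trigrams (fewer than three
    characters after normalization) score 0.0.
    """
    grams = char_trigrams(text)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in CATALAN_TRIGRAMS)
    return hits / len(grams)
```

With this subset, a Catalan phrase such as "els amics que estan amb una persona" scores above zero while "the quick brown fox jumps" scores zero. A practical detector would need the complete list and a decision threshold chosen on labelled data.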

About this article

Cite this article

Plaza, S., Vilaplana, J., Mateo, J. et al. CatDetect, a framework for detecting Catalan tweets. Multimed Tools Appl 80, 10657–10677 (2021). https://doi.org/10.1007/s11042-020-10182-3

Received: 20 April 2020
Revised: 11 September 2020
Accepted: 11 November 2020
Published: 26 November 2020
Issue Date: March 2021
DOI: https://doi.org/10.1007/s11042-020-10182-3
