Abstract
This work addresses language detection, a field in which new proposals range from lexical and morphological analysis to an increasing use of machine learning. The study focuses on Catalan, a minority language, which makes it harder to detect tweets (messages written on the Twitter social network) in that language. To this end, a Catalan-Twitter corpus was generated using lexical and morphological approaches and then used to train supervised models based on machine learning techniques. The models were evaluated to determine which obtains the best prediction score and is therefore the most suitable for use. We show that our proposal is successful on Twitter in the case of minority languages. The best model is to be deployed on a website, where users can test the algorithm interactively through the front-end webpage or in the background through a web service exposed via a RESTful API.
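The abstract does not say which prediction score is used to compare the models. As a minimal sketch, assuming standard binary classification metrics and hypothetical labels (`"ca"` for Catalan, anything else for non-Catalan), the comparison could be computed as follows. The function name and labels are illustrative and are not taken from the article.

```python
def binary_scores(y_true, y_pred, positive="ca"):
    """Compute accuracy, precision, recall and F1 for one positive label.

    `positive` is a hypothetical label for Catalan tweets.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every division so degenerate inputs yield 0.0 instead of an error.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Calling `binary_scores` once per candidate model on the same held-out tweets would give comparable numbers, under the assumption that one of these metrics is the selection criterion.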
Notes
CatDetect: catdetect.udl.cat
Stemming: the process of reducing a word to its stem.
Ranks NL: http://www.ranks.nl/stopwords/catalan
Trigram: a sequence of three characters that is likely to appear in a sentence written in a given language. The Appendix contains the list of trigrams for Catalan.
Language Detection API: https://detectlanguage.com
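To make the trigram note concrete, the following is a minimal sketch of character-trigram extraction. The normalization (lowercasing, collapsing whitespace, padding with one space on each side) is an assumption made for illustration; the article's own preprocessing may differ. Both function names are hypothetical.

```python
from collections import Counter


def char_trigrams(text: str) -> list[str]:
    """Return every overlapping 3-character window of the normalized text."""
    # Lowercase, collapse whitespace runs, and pad with one space per side
    # so that word-boundary trigrams such as " de" and "de " are produced.
    normalized = " " + " ".join(text.lower().split()) + " "
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


def top_trigrams(texts: list[str], k: int = 10) -> list[tuple[str, int]]:
    """Count trigrams over several texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(char_trigrams(text))
    return counts.most_common(k)
```

Running `top_trigrams` over a large collection of texts in one language yields a frequency-ranked trigram list of the kind given in the Appendix.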
Acknowledgments
This work was supported by the Ministerio de Economía y Competitividad under contract TIN2017-84553-C2-2-R. IT, JV, JM, JR and FS are members of the research group 2017-SGR363, funded by the Generalitat de Catalunya. The research leading to these results has received funding from RecerCaixa.
Appendix: Catalan trigrams
- de
- es
- de
- la
- la
- el
- que
- el
- co
- ent
- s d
- qu
- i
- en
- er
- a
- ls
- nt
- pe
- e l
- a d
- en
- per
- ci
- ar
- ue
- al
- se
- est
- at
- es
- ts
- s
- pr
- aci
- un
- res
- men
- s e
- del
- s a
- s p
- re
- les
- l’
- na
- a l
- ca
- d’
- els
- a p
- ia
- ns
- con
- le
- tat
- a c
- i d
- a a
- ra
- a e
- no
- ant
- al
- t d
- s i
- di
- ta
- re
- a s
- com
- s c
- ita
- ons
- sta
- ica
- po
- r a
- in
- pro
- tre
- pa
- ues
- amb
- ion
- des
- un
- ma
- da
- s s
- a i
- an
- mb
- am
- l d
- e d
- va
- pre
- ter
- e e
- e c
- a m
- cia
- una
- i e
- nci
- tra
- te
- ona
- os
- t e
- n e
- l c
- ca
- cio
- l p
- tr
- par
- r l
- t a
- e p
- aqu
- nta
- so
- ame
- era
- r e
- e s
- ada
- n a
- s q
- si
- ha
- als
- tes
- va
- m
- ici
- nte
- s l
- s m
- i a
- or
- mo
- ist
- ect
- lit
- m s
- to
- ir
- a t
- esp
- ran
- str
- om
- l s
- st
- nts
- me
- no
- r d
- d’a
- l’a
- ats
- ria
- s t
- ta
- sen
- rs
- eix
- tar
- s n
- n l
- tal
- e a
- t p
- art
- mi
- ll
- tic
- ten
- ser
- aq
- ina
- ntr
- a f
- sti
- ol
- a q
- for
- ura
- ers
- ari
- int
- act
- l’e
- fi
- r s
- e t
- tor
- si
- ste
- rec
- a r
- fe
- is
- em
- n d
- car
- bre
- fo
- vi
- an
- ali
- i p
- ix
- ell
- l m
- pos
- orm
- l l
- i l
- ac
- fer
- s r
- ess
- eu
- e m
- ens
- ara
- eri
- sa
- ssi
- us
- ort
- tot
- ll
- por
- ora
- ci
- tan
- ass
- n c
- ost
- nes
- rac
- a u
- ver
- ont
- ha
- ti
- itz
- gra
- t c
- n
- a v
- ren
- cat
- nal
- ri
- qua
- t l
- do
- t s
- rma
- ual
- i s
- s f
- n p
- s v
- te
- t i
- ba
- cte
- tam
- man
- l t
- ial
- fa
- ic
- ve
- ble
- a n
- all
- tza
- ies
- le
- omp
- r c
- nc
- rti
- it
- rre
- fic
- any
- on
- sa
- r p
- tur
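As an illustration of how a trigram list like this one might be used, the sketch below scores a text by the fraction of its character trigrams found in a profile. This is a simple hit-ratio heuristic chosen for illustration, not the method described in the article. `CATALAN_PROFILE` holds only a few three-letter entries copied from the appendix, and the function name is hypothetical.

```python
# A small sample of entries from the appendix; a fuller profile would load
# the whole list. Only three-letter entries without spaces are used here.
CATALAN_PROFILE = {
    "que", "ent", "per", "est", "aci", "els",
    "amb", "eix", "aqu", "itz", "tat", "les",
}


def profile_hit_ratio(text: str, profile: set[str] = CATALAN_PROFILE) -> float:
    """Return the fraction of the text's trigrams that occur in the profile."""
    # Lowercase and collapse whitespace before sliding a 3-character window.
    normalized = " ".join(text.lower().split())
    trigrams = [normalized[i:i + 3] for i in range(len(normalized) - 2)]
    if not trigrams:
        return 0.0  # texts shorter than three characters carry no evidence
    hits = sum(1 for trigram in trigrams if trigram in profile)
    return hits / len(trigrams)
```

A higher ratio suggests, but does not prove, that a text is Catalan. Any decision threshold would have to be tuned on labelled data, and closely related languages share many of these trigrams.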
Cite this article
Plaza, S., Vilaplana, J., Mateo, J. et al. CatDetect, a framework for detecting Catalan tweets. Multimed Tools Appl 80, 10657–10677 (2021). https://doi.org/10.1007/s11042-020-10182-3
DOI: https://doi.org/10.1007/s11042-020-10182-3