Abstract
This paper presents an experiment to produce an automatically annotated corpus from a pre-existing annotated corpus and machine learning classification methods. A search for pre-existing annotated corpora in Brazilian Portuguese found six corpora, one of which was selected as the training dataset. A set of tweets was collected for a specific area of Recife (Pernambuco, Brazil) using keywords related to types of crime, reinforced with references to some places in that area. Preprocessing tasks were applied to both the pre-existing corpus and the collected tweets. Latent Dirichlet Allocation was applied for topic modeling, followed by Multinomial Naïve Bayes, Linear Support Vector Machines, and Logistic Regression for sentiment polarity classification. Cross-validation indicated that Linear Support Vector Machines was the most accurate of the three classification methods for the specific training set used, and this method was then used to create the new annotated corpus on the selected topic related to public security.
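The annotation workflow summarized above (train on a labeled corpus, check accuracy by cross-validation, then label new tweets) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it covers only Multinomial Naïve Bayes, one of the three classifiers named, and omits the topic modeling step and the other two classifiers. The English example sentences, the function names (`tokenize`, `train_nb`, `predict_nb`, `cross_validate`), and the smoothing value are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def train_nb(docs, labels, alpha=1.0):
    """Fit Multinomial Naive Bayes counts; alpha is the Laplace smoothing term."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, doc):
    """Return the label with the highest log posterior; unseen words are skipped."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label in sorted(model["class_counts"]):
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + model["alpha"] * vocab_size
        score = math.log(model["class_counts"][label] / total_docs)
        for tok in tokenize(doc):
            if tok in model["vocab"]:
                score += math.log((counts[tok] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=3):
    """Mean accuracy over k folds; fold i holds out items whose index % k == i."""
    accuracies = []
    for i in range(k):
        train_idx = [j for j in range(len(docs)) if j % k != i]
        test_idx = [j for j in range(len(docs)) if j % k == i]
        model = train_nb([docs[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        hits = sum(predict_nb(model, docs[j]) == labels[j] for j in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Hypothetical labeled corpus (invented examples, not data from the paper).
labeled = [
    ("police arrived fast great work", "positive"),
    ("robbery again terrible street", "negative"),
    ("safe neighborhood great patrol", "positive"),
    ("terrible assault near station", "negative"),
    ("great safe square tonight", "positive"),
    ("robbery and assault terrible night", "negative"),
]
docs = [text for text, _ in labeled]
labels = [label for _, label in labeled]

# Step 1: estimate accuracy on the labeled corpus.
mean_accuracy = cross_validate(docs, labels, k=3)

# Step 2: train on the full labeled corpus and annotate unlabeled texts.
model = train_nb(docs, labels)
unlabeled = ["terrible robbery near the square", "great patrol tonight"]
annotated = [(text, predict_nb(model, text)) for text in unlabeled]

print(f"mean cross-validation accuracy: {mean_accuracy:.2f}")
for text, label in annotated:
    print(f"{label}\t{text}")
```

In practice, a library such as scikit-learn would typically supply the vectorizer, the three classifiers, and the cross-validation routine; the hand-written version here only makes the individual steps visible.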
Acknowledgment
This paper was funded in part by the Coordination for the Improvement of Higher Education Personnel (Brazil) – Finance Code 001, and by the National Council for Scientific and Technological Development (Brazil).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
de Carvalho, V.D.H., Nepomuceno, T.C.C., Costa, A.P.C.S. (2020). An Automated Corpus Annotation Experiment in Brazilian Portuguese for Sentiment Analysis in Public Security. In: Moreno-Jiménez, J., Linden, I., Dargam, F., Jayawickrama, U. (eds) Decision Support Systems X: Cognitive Decision Support Systems and Technologies. ICDSST 2020. Lecture Notes in Business Information Processing, vol 384. Springer, Cham. https://doi.org/10.1007/978-3-030-46224-6_8
DOI: https://doi.org/10.1007/978-3-030-46224-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46223-9
Online ISBN: 978-3-030-46224-6
eBook Packages: Computer Science, Computer Science (R0)