Abstract
Microblogging sites are widely used as analysis avenues due to their peculiarities (promptness, short texts, etc.). Lately, researchers have focused mainly on classification performance rather than interpretability. When the problem requires transparency, it is necessary to build interpretable pipelines; even then, the resulting models are often too complex to be considered comprehensible, making it impossible for humans to understand the actual decisions. This paper presents a feature selection mechanism that is able to improve comprehensibility by using fewer but more meaningful features. Results show that our proposal performs best and is the most stable in terms of accuracy, generalisation and comprehensibility in the microblogging context.
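As generic context for the feature-selection setting the abstract describes (not the authors' discriminatory-expression method, whose details are in the paper itself), a minimal filter-style term selector for short labelled documents can be sketched in plain Python. It scores each term by a Laplace-smoothed log-odds of its document frequency across two classes and keeps the top-k most discriminative terms; all names and the scoring choice are illustrative assumptions:

```python
from collections import Counter
import math

def term_scores(docs, labels):
    """docs: list of token lists; labels: binary labels (0/1).
    Returns a discrimination score per term (higher = more class-specific)."""
    pos, neg = Counter(), Counter()
    for toks, y in zip(docs, labels):
        # Count document frequency, not raw term frequency
        (pos if y == 1 else neg).update(set(toks))
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    scores = {}
    for t in set(pos) | set(neg):
        p = (pos[t] + 1) / (n_pos + 2)  # Laplace-smoothed doc frequency in class 1
        q = (neg[t] + 1) / (n_neg + 2)  # ... and in class 0
        scores[t] = abs(math.log(p / q))  # discriminative in either direction
    return scores

def select_features(docs, labels, k):
    """Keep only the k most discriminative terms."""
    s = term_scores(docs, labels)
    return [t for t, _ in sorted(s.items(), key=lambda kv: -kv[1])[:k]]

docs = [["great", "movie"], ["great", "fun"], ["bad", "movie"], ["bad", "boring"]]
labels = [1, 1, 0, 0]
selected = select_features(docs, labels, 2)
```

In this toy corpus, class-specific terms such as "great" and "bad" outrank terms like "movie" that occur in both classes, which mirrors the abstract's goal of retaining fewer but more meaningful features.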
This work was financially supported by the Spanish Ministry of Economy and Competitiveness (MINECO), project FFI2016-79748-R, and co-financed by the European Social Fund (ESF). Manuel Francisco Aparicio was supported by the FPI 2017 predoctoral programme of the Spanish Ministry of Economy and Competitiveness (MINECO), grant reference BES-2017-081202.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Francisco, M., Castro, J.L. (2022). Discriminatory Expressions to Improve Model Comprehensibility in Short Documents. In: El Yacoubi, M., Granger, E., Yuen, P.C., Pal, U., Vincent, N. (eds) Pattern Recognition and Artificial Intelligence. ICPRAI 2022. Lecture Notes in Computer Science, vol 13363. Springer, Cham. https://doi.org/10.1007/978-3-031-09037-0_26
DOI: https://doi.org/10.1007/978-3-031-09037-0_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-09036-3
Online ISBN: 978-3-031-09037-0
eBook Packages: Computer Science (R0)