
Text Preprocessing for Shrinkage Regression and Topic Modeling to Analyse EU Public Consultation Data

Conference paper in: Computational Linguistics and Intelligent Text Processing (CICLing 2019)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13451)


Abstract

Most text categorization methods use a common representation based on the bag-of-words model. Using this representation for learning involves a preprocessing step comprising tasks such as stopword removal and stemming. The output of this step directly influences the quality of the learning task. This work compares different methods of preprocessing textual inputs for LASSO logistic regression and LDA topic modeling in terms of mean squared error (MSE). Logistic regression and topic modeling are used to predict a binary position, or stance, from the textual data extracted from two public consultations of the European Commission. Texts are preprocessed and then input into LASSO and topic modeling to explain or cluster the documents’ positions. For LASSO, stemming with POS-tagging is on average a better method than lemmatization and stemming without POS-tagging. In addition, tf-idf on average performs better than counts of distinct terms, and deleting terms that appear only once reduces the prediction errors. For LDA topic modeling, stemming gives a slightly lower MSE in most cases, but no significant difference between stemming and lemmatization was found.


Notes

  1. Consultations of the European Commission: https://ec.europa.eu/info/consultations_en.

  2. https://spacy.io/.

  3. https://pypi.org/project/nltk/.

  4. https://pypi.org/project/gensim/.

  5. https://scikit-learn.org/.

  6. Tf-idf is a statistical measure of how important a term is to a document in a collection, computed as \(tf\text{-}idf_{t,d} = tf_{t,d} \times idf_{t}\), where \(idf_{t} = \log\left(\frac{n_{documents}}{df_{t}}\right)\) and \(df_{t}\) is the number of documents containing term \(t\).
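The tf-idf formula in the note above can be computed directly. This is a small self-contained sketch using raw term counts for \(tf_{t,d}\) (a common convention; the paper does not specify its exact weighting variant), with hypothetical toy documents:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf per the formula in note 6:
    tf-idf_{t,d} = tf_{t,d} * idf_t, with idf_t = log(n_documents / df_t)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # df_t: number of documents containing term t
    df = Counter(t for doc in tokenized for t in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = ["data protection rules", "data privacy", "market rules"]
w = tf_idf(docs)
# "data" occurs in 2 of 3 documents, so its idf is log(3/2);
# "privacy" occurs in 1 document, so its idf is log(3).
```

A term appearing in every document gets idf \(= \log(1) = 0\), which is how tf-idf downweights uninformative, ubiquitous terms.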


Author information


Correspondence to Nada Mimouni.


Copyright information

© 2023 Springer Nature Switzerland AG

About this paper


Cite this paper

Mimouni, N., Yeung, T.YC. (2023). Text Preprocessing for Shrinkage Regression and Topic Modeling to Analyse EU Public Consultation Data. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_8


  • DOI: https://doi.org/10.1007/978-3-031-24337-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24336-3

  • Online ISBN: 978-3-031-24337-0

  • eBook Packages: Computer Science (R0)
