research-article

Text classification using Fuzzy TF-IDF and Machine Learning Models

Authors:

Mariem Bounabi,

Karim El Moutaouakil,

Khalid Satori

BDIoT '19: Proceedings of the 4th International Conference on Big Data and Internet of Things

Article No.: 18, Pages 1 - 6

https://doi.org/10.1145/3372938.3372956

Published: 07 January 2020

Abstract

The representation of information has an important impact on the text classification task. Several term-weighting methods have been proposed in the literature, of which term frequency-inverse document frequency (TF-IDF) is the best known in the text-processing field. The FTF-IDF is a vector representation in which the components of the TF-IDF are presented as inputs to a Fuzzy Inference System (FIS). In this work, we compare several machine learning algorithms, such as Naïve Bayes and its derivatives, SVM, and Random Forest classifiers, using the FTF-IDF representation. To improve the quality of these classifiers, we apply several attribute selection methods. The recognition rate of the tested systems is satisfactory: the system that combines the Naïve Bayes classifier, FTF-IDF term weighting, and the information gain attribute selection method reaches an accuracy of 98.7%.
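To make the idea of feeding TF and IDF into a fuzzy inference system more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the membership functions, the rule base, or the defuzzification method, so the three labels (`low`, `medium`, `high`), the `RULES` table, the normalisation of TF and IDF to [0, 1], and the zero-order Sugeno style weighted average are all assumptions chosen for illustration. The function names `memberships`, `fuzzy_weight`, and `ftf_idf` are hypothetical.

```python
import math
from collections import Counter


def memberships(x):
    """Assumed fuzzy partition of [0, 1] into three overlapping labels."""
    return {
        "low": max(0.0, 1.0 - 2.0 * x),
        "medium": max(0.0, 1.0 - abs(2.0 * x - 1.0)),
        "high": max(0.0, 2.0 * x - 1.0),
    }


# Hypothetical rule base: (TF label, IDF label) -> crisp output weight.
LEVEL = {"low": 0.0, "medium": 0.5, "high": 1.0}
RULES = {(t, i): (LEVEL[t] + LEVEL[i]) / 2.0 for t in LEVEL for i in LEVEL}


def fuzzy_weight(tf_norm, idf_norm):
    """Combine normalised TF and IDF with min as AND and a weighted average."""
    num = den = 0.0
    for tf_label, mu_tf in memberships(tf_norm).items():
        for idf_label, mu_idf in memberships(idf_norm).items():
            strength = min(mu_tf, mu_idf)  # rule firing strength
            num += strength * RULES[(tf_label, idf_label)]
            den += strength
    return num / den if den else 0.0


def ftf_idf(docs):
    """Return one {term: fuzzy weight} dict per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    log_n = math.log(n) if n > 1 else 1.0  # scales IDF into [0, 1]
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_tf = max(counts.values(), default=1)
        vectors.append({
            term: fuzzy_weight(count / max_tf, math.log(n / df[term]) / log_n)
            for term, count in counts.items()
        })
    return vectors


docs = [
    ["fuzzy", "logic", "fuzzy"],
    ["naive", "bayes", "logic"],
    ["random", "forest"],
]
weights = ftf_idf(docs)
print(weights[0])
```

In this toy corpus, "fuzzy" is frequent in the first document and appears in no other document, so it receives a higher weight than "logic", which is shared by two documents. The resulting vectors could then be passed to any classifier; the attribute selection step mentioned in the abstract is not shown here.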

Published In

BDIoT '19: Proceedings of the 4th International Conference on Big Data and Internet of Things

October 2019

476 pages

ISBN:9781450372404

DOI:10.1145/3372938

Conference Chairs:
Mohamed Lazaar, ENSIAS, Mohammed V University in Rabat, Morocco
Claude Duvallet, Le Havre University in Le Havre, France
Mohammed Al Achhab, ENSA, Abdelmalek Essaadi University in Tetuan, Morocco
Oussama Mahboub, ENSA, Abdelmalek Essaadi University in Tetuan, Morocco
Hassan Silkan, FS, Chouaib Doukkali University in El Jadida, Morocco

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

BDIoT'19

BDIoT'19: The 4th International Conference On Big Data and Internet of Things

October 23 - 24, 2019

Rabat, Morocco

Acceptance Rates

BDIoT '19 Paper Acceptance Rate 75 of 136 submissions, 55%;

Overall Acceptance Rate 75 of 136 submissions, 55%

Bibliometrics

Total Citations: 10
Total Downloads: 257
Downloads (Last 12 months): 37
Downloads (Last 6 weeks): 1

Reflects downloads up to 30 Jan 2025

Cited By

Mohammed R, Karim E, Elkari B, Hammouni A, Chellak S, Baizri H, Cheggour M (2025). Twitter-Sentiment Analysis of Moroccan Diabetic: A Comparison Study. Big Data and Internet of Things, pp. 880-896. Online publication date: 3-Jan-2025.
https://doi.org/10.1007/978-3-031-74491-4_68
Suramanka L, Hanskunatai A (2024). Multi-Label Classification of Foreign Tourists' Opinions on Thailand Tourism Development. Proceedings of the 2024 9th International Conference on Big Data and Computing, pp. 32-38. Online publication date: 24-May-2024.
https://dl.acm.org/doi/10.1145/3695220.3695227
Lai Y, Chen M (2023). Review of Survey Research in Fuzzy Approach for Text Mining. IEEE Access, 11, pp. 39635-39649. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3268165
Malla S, Kumar L, Alphonse P (2023). Novel fuzzy deep learning approach for automated detection of useful COVID-19 tweets. Artificial Intelligence in Medicine, 143:C. Online publication date: 18-Oct-2023.
https://dl.acm.org/doi/10.1016/j.artmed.2023.102627
Sidiropoulos G, Diamianos N, Apostolidis K, Papakostas G (2022). Text Classification Using Intuitionistic Fuzzy Set Measures—An Evaluation Study. Information, 13:5 (235). Online publication date: 5-May-2022.
https://doi.org/10.3390/info13050235
He H, Zhou G, Zhao S (2022). Exploring E-Commerce Product Experience Based on Fusion Sentiment Analysis Method. IEEE Access, 10, pp. 110248-110260. Online publication date: 2022.
https://doi.org/10.1109/ACCESS.2022.3214752
Fumanal-Idocin J, Takáč Z, Horanská L, Bustince H, Cordon O (2022). Fuzzy Clustering to Encode Contextual Information in Artistic Image Classification. Information Processing and Management of Uncertainty in Knowledge-Based Systems, pp. 355-366. Online publication date: 4-Jul-2022.
https://doi.org/10.1007/978-3-031-08974-9_28
Wattanakitrungroj N, Pinpo N, Tongman S (2021). Sentiment Polarity Classification using Minimal Feature Vectors and Machine Learning Algorithms. Proceedings of the 12th International Conference on Advances in Information Technology, pp. 1-8. Online publication date: 29-Jun-2021.
https://dl.acm.org/doi/10.1145/3468784.3469947
Bounabi M, Moutaouakil K, Satori K (2020). The Automatic option of inference rules for the fuzzy TF-IDF. 2020 IEEE 2nd International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS), pp. 1-6. Online publication date: 2-Dec-2020.
https://doi.org/10.1109/ICECOCS50124.2020.9314404
Pal K, Patel B (2020). Emotion Classification with Reduced Feature Set SGDClassifier, Random Forest and Performance Tuning. Computing Science, Communication and Security, pp. 95-108. Online publication date: 19-Jul-2020.
https://doi.org/10.1007/978-981-15-6648-6_8
