DOI: 10.1145/3372938.3372956
research-article

Text classification using Fuzzy TF-IDF and Machine Learning Models

Published: 07 January 2020

Abstract

The representation of information has an important impact on the text classification task. Several term-weighting methods have been proposed in the literature, of which term frequency-inverse document frequency (TF-IDF) is the best known in the text processing field. The FTF-IDF is a vector representation in which the components of the TF-IDF are presented as inputs to a Fuzzy Inference System (FIS). In this work, we compare several machine learning algorithms, such as Naïve Bayes and its derivatives, SVM, and Random Forest classifiers, using the FTF-IDF representation. To improve the quality of the classifiers, we apply some attribute selection methods. The recognition rate of the tested systems is satisfactory: the system based on the Naïve Bayes classifier, FTF-IDF term weighting, and the information gain attribute selection method achieves an accuracy of 98.7%.
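The pipeline named in the abstract can be illustrated with a minimal Python sketch: it computes the two TF-IDF components, feeds them to a small fuzzy inference step, and ranks terms by information gain. The abstract does not specify the membership functions, the rule base, or the data set, so everything concrete here is an assumption made for illustration: the toy corpus, the linear "low"/"high" memberships, the four rules with crisp outputs (a zero-order Sugeno-style scheme), and the function names. It is not the authors' FIS.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not from the paper): tokenized documents with labels.
docs = [
    (["cheap", "pills", "buy", "now"], "spam"),
    (["buy", "cheap", "watches"], "spam"),
    (["meeting", "agenda", "for", "monday"], "ham"),
    (["monday", "project", "meeting", "notes"], "ham"),
]

def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}

def inverse_document_frequency(corpus):
    """idf(t) = log(N / df(t)), where df(t) counts the documents containing t."""
    n = len(corpus)
    df = Counter(t for tokens, _ in corpus for t in set(tokens))
    return {t: math.log(n / d) for t, d in df.items()}

def fuzzy_weight(tf, idf, idf_max):
    """Assumed zero-order Sugeno-style FIS over tf and idf, both scaled to [0, 1]."""
    x = tf
    y = idf / idf_max if idf_max > 0 else 0.0
    # Assumed linear memberships: high(v) = v, low(v) = 1 - v.
    # Each rule: (firing strength using min as AND, crisp consequent).
    rules = [
        (min(x, y), 1.0),          # tf high AND idf high -> weight high
        (min(x, 1 - y), 0.5),      # tf high AND idf low  -> weight medium
        (min(1 - x, y), 0.5),      # tf low  AND idf high -> weight medium
        (min(1 - x, 1 - y), 0.0),  # tf low  AND idf low  -> weight low
    ]
    total = sum(strength for strength, _ in rules)
    # Defuzzify as the firing-strength-weighted average of the consequents.
    return sum(s * c for s, c in rules) / total if total else 0.0

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, corpus):
    """Reduction in class entropy from splitting on presence of `term`."""
    labels = [y for _, y in corpus]
    present = [y for tokens, y in corpus if term in tokens]
    absent = [y for tokens, y in corpus if term not in tokens]
    conditional = sum(
        len(part) / len(labels) * entropy(part)
        for part in (present, absent) if part
    )
    return entropy(labels) - conditional

idf = inverse_document_frequency(docs)
idf_max = max(idf.values())

# Classic TF-IDF and the fuzzy variant for the first document.
tf = term_frequency(docs[0][0])
tfidf = {t: tf[t] * idf[t] for t in tf}
ftfidf = {t: fuzzy_weight(tf[t], idf[t], idf_max) for t in tf}

# Attribute selection: keep the terms with the highest information gain.
vocabulary = sorted(idf)
ranked = sorted(vocabulary, key=lambda t: information_gain(t, docs), reverse=True)
selected = ranked[:4]
print(tfidf, ftfidf, selected, sep="\n")
```

In a full system, the selected attributes and their fuzzy weights would form the feature vectors passed to a classifier such as Naïve Bayes.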



Published In

BDIoT '19: Proceedings of the 4th International Conference on Big Data and Internet of Things
October 2019
476 pages
ISBN: 9781450372404
DOI: 10.1145/3372938
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Attribute Selection
  2. Fuzzy TF-IDF
  3. Machine learning models
  4. TF-IDF
  5. text classification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

BDIoT'19

Acceptance Rates

BDIoT '19 Paper Acceptance Rate 75 of 136 submissions, 55%;
Overall Acceptance Rate 75 of 136 submissions, 55%

Article Metrics

  • Downloads (Last 12 months): 39
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 27 Jan 2025

Cited By

  • (2025) Twitter-Sentiment Analysis of Moroccan Diabetic: A Comparison Study. Big Data and Internet of Things, pp. 880-896. DOI: 10.1007/978-3-031-74491-4_68. Online publication date: 3-Jan-2025
  • (2024) Multi-Label Classification of Foreign Tourists' Opinions on Thailand Tourism Development. Proceedings of the 2024 9th International Conference on Big Data and Computing, pp. 32-38. DOI: 10.1145/3695220.3695227. Online publication date: 24-May-2024
  • (2023) Review of Survey Research in Fuzzy Approach for Text Mining. IEEE Access, 11, pp. 39635-39649. DOI: 10.1109/ACCESS.2023.3268165. Online publication date: 2023
  • (2023) Novel fuzzy deep learning approach for automated detection of useful COVID-19 tweets. Artificial Intelligence in Medicine, 143:C. DOI: 10.1016/j.artmed.2023.102627. Online publication date: 18-Oct-2023
  • (2022) Text Classification Using Intuitionistic Fuzzy Set Measures—An Evaluation Study. Information, 13:5, p. 235. DOI: 10.3390/info13050235. Online publication date: 5-May-2022
  • (2022) Exploring E-Commerce Product Experience Based on Fusion Sentiment Analysis Method. IEEE Access, 10, pp. 110248-110260. DOI: 10.1109/ACCESS.2022.3214752. Online publication date: 2022
  • (2022) Fuzzy Clustering to Encode Contextual Information in Artistic Image Classification. Information Processing and Management of Uncertainty in Knowledge-Based Systems, pp. 355-366. DOI: 10.1007/978-3-031-08974-9_28. Online publication date: 4-Jul-2022
  • (2021) Sentiment Polarity Classification using Minimal Feature Vectors and Machine Learning Algorithms. Proceedings of the 12th International Conference on Advances in Information Technology, pp. 1-8. DOI: 10.1145/3468784.3469947. Online publication date: 29-Jun-2021
  • (2020) The Automatic option of inference rules for the fuzzy TF-IDF. 2020 IEEE 2nd International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS), pp. 1-6. DOI: 10.1109/ICECOCS50124.2020.9314404. Online publication date: 2-Dec-2020
  • (2020) Emotion Classification with Reduced Feature Set SGDClassifier, Random Forest and Performance Tuning. Computing Science, Communication and Security, pp. 95-108. DOI: 10.1007/978-981-15-6648-6_8. Online publication date: 19-Jul-2020
