ABSTRACT
The growth of applications in both scientific socialism and naturalism makes it increasingly difficult to assess whether a question is sincere. Such an assessment is essential for many marketing and financial companies. Many applications, especially those involving text and images, will be reconfigured beyond recognition, while others face potential extinction as a result of advances in technology, and in computer science in particular. Analyzing text and image data will therefore be necessary for extracting valuable insights. In this paper, we analyzed the Quora dataset obtained from Kaggle.com to filter insincere and spam content. We used different preprocessing algorithms and analysis models provided in PySpark. In addition, we analyzed how users write their posts by means of the proposed prediction models. Finally, we identified the most accurate of the selected algorithms for classifying questions on Quora. The Gradient Boosted Tree was the best model for questions on Quora, with an accuracy of 79.5%, followed by Long Short-Term Memory (LSTM) at 78.0%. Compared with the same models built in Scikit-Learn and with the GRU, BiLSTM, and BiGRU models, the models applied in PySpark could achieve better results in classifying questions on Quora.
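The accuracy figures reported above are the fraction of questions whose predicted label matches the true label. The following is a minimal sketch of that comparison in plain Python, not the paper's PySpark code. The labels, the predictions, and the names `model_a` and `model_b` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical ground truth: 1 = insincere, 0 = sincere (not the paper's data).
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]

# Hypothetical predictions from two illustrative classifiers.
predictions = {
    "model_a": [0, 0, 1, 0, 0, 1, 0, 1, 1, 0],
    "model_b": [0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
}

# Score every model on the same labels and pick the most accurate one.
scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.1%}")
print("best:", best_model)
```

Accuracy alone can be misleading when one class dominates the data, so a comparison like this is usually read alongside per-class metrics such as precision and recall.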