ABSTRACT
Natural language processing (NLP) is a field of computer science that focuses on understanding and analyzing textual data in a given language. Analyzing textual data is tedious and can produce erroneous results when the corpus contains unnecessary, noisy data. Stop words are considered noisy data; for English, predefined stop word lists already exist. Stop words in other languages, such as Cebuano and Filipino, are not yet supported by many NLP APIs. In the Philippines, users post on Facebook in several languages. In this study, a corpus of Facebook posts was used to detect stop words automatically. A neural network based on bidirectional long short-term memory (BiLSTM) was created, and Word2vec was used to provide word embeddings and representations from the corpus. Experimental results show that the model achieves 72% accuracy.
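The abstract does not give the network's layer sizes, training setup, or labeling scheme, so the following is only a minimal sketch of how a BiLSTM can score each token as a possible stop word. Everything in it is an assumption made for illustration: the names `LSTMCell` and `bilstm_stop_word_scores`, the dimensions, the sample tokens, the random vectors standing in for Word2vec embeddings, and the untrained random weights. It shows the forward pass only and is not the study's implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class LSTMCell:
    """One LSTM direction with untrained random weights (illustration only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate (input, forget, output, candidate),
        # applied to the concatenated [input, previous hidden state] vector.
        self.W = {g: rand_matrix(hidden_dim, input_dim + hidden_dim, rng) for g in "ifoc"}
        self.b = {g: [0.0] * hidden_dim for g in "ifoc"}

    def run(self, inputs):
        """Return the hidden state after each input vector, in input order."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in inputs:
            z = x + h  # list concatenation: [input, previous hidden state]
            pre = {g: [a + b for a, b in zip(matvec(self.W[g], z), self.b[g])] for g in "ifoc"}
            i = [sigmoid(v) for v in pre["i"]]
            f = [sigmoid(v) for v in pre["f"]]
            o = [sigmoid(v) for v in pre["o"]]
            cand = [math.tanh(v) for v in pre["c"]]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs


def bilstm_stop_word_scores(tokens, embeddings, fwd, bwd, out_w, out_b):
    """Score each token in (0, 1) from its left and right context."""
    vectors = [embeddings[t] for t in tokens]
    h_fwd = fwd.run(vectors)
    # Run the second cell over the reversed sentence, then realign to token order.
    h_bwd = bwd.run(vectors[::-1])[::-1]
    scores = []
    for hf, hb in zip(h_fwd, h_bwd):
        features = hf + hb  # concatenate forward and backward states
        scores.append(sigmoid(sum(w * x for w, x in zip(out_w, features)) + out_b))
    return scores


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 8, 4  # assumed sizes, not taken from the study

# Hypothetical tokenized post; random vectors stand in for Word2vec embeddings.
tokens = ["ang", "bata", "nagdula", "sa", "gawas"]
embeddings = {t: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for t in tokens}

fwd = LSTMCell(EMBED_DIM, HIDDEN_DIM, rng)
bwd = LSTMCell(EMBED_DIM, HIDDEN_DIM, rng)
out_w = [rng.uniform(-0.1, 0.1) for _ in range(2 * HIDDEN_DIM)]

scores = bilstm_stop_word_scores(tokens, embeddings, fwd, bwd, out_w, 0.0)
for token, score in zip(tokens, scores):
    print(f"{token}: {score:.3f}")
```

Because the weights are untrained, the printed scores carry no meaning. In practice the embeddings would come from a Word2vec model trained on the corpus, and the weights would be learned from labeled examples, typically with a deep learning library rather than hand-written cells.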