research-article

Stop words detection using a long short term memory recurrent neural network

Authors:

Ken D. Gorro
Department of Industrial Technology, Cebu Technological University, Philippines

Moustafa F. Ali
Department of Computer, Information Sciences and Mathematics, University of San Carlos, Philippines

Leodivino A. Lawas
Department of Information Technology, Cebu Technological University, Philippines

Anthony S. Ilano
Department of Fisheries, Cebu Technological University, Philippines

ICIT '21: Proceedings of the 2021 9th International Conference on Information Technology: IoT and Smart City, December 2021, Pages 199–202, https://doi.org/10.1145/3512576.3512612

Published: 11 April 2022

ABSTRACT

Natural language processing (NLP) is a field of computer science that focuses on understanding and analyzing textual data in any given language. Analyzing textual data is tedious and can lead to erroneous results because of unnecessary and noisy data in the corpus. Stop words are considered noisy data. For English, predefined lists of stop words already exist, but stop words in other languages, such as Cebuano and Filipino, are not yet supported by many NLP APIs. In the Philippines, users post on Facebook in several different languages. In this study, a corpus of Facebook posts was used to detect stop words automatically. A neural network was built on a Bidirectional Long Short-Term Memory (BiLSTM) architecture, and Word2vec was used to provide word embeddings and representations from the corpus. In the experiments, the model achieved 72% accuracy.
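To make the BiLSTM idea in the abstract concrete, here is a minimal sketch of the forward computation only, written in plain Python. It is not the authors' implementation: the abstract does not give layer sizes, the labelling scheme, or training details, so the per-token sigmoid output, the dimensions `EMBED_DIM` and `HIDDEN_DIM`, the toy sentence, and every function name below are assumptions made for illustration. Random vectors stand in for the Word2vec embeddings.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_lstm(input_dim, hidden_dim, rng):
    """Small random weights for the input (i), forget (f), candidate (g) and output (o) gates."""
    def gate():
        return {
            "W": [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                  for _ in range(hidden_dim)],
            "b": [0.0] * hidden_dim,
        }
    return {name: gate() for name in "ifgo"}

def lstm_step(params, x, h, c):
    """One LSTM time step: returns the new hidden state and cell state."""
    z = x + h  # concatenate the current input with the previous hidden state
    def gate(name, act):
        p = params[name]
        return [act(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(p["W"], p["b"])]
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def run_lstm(params, seq, hidden_dim):
    """Run an LSTM over a sequence of vectors and return the hidden state at every step."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    states = []
    for x in seq:
        h, c = lstm_step(params, x, h, c)
        states.append(h)
    return states

def bilstm(fwd, bwd, seq, hidden_dim):
    """Concatenate a left-to-right pass with a right-to-left pass, token by token."""
    forward = run_lstm(fwd, seq, hidden_dim)
    backward = run_lstm(bwd, seq[::-1], hidden_dim)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def stop_word_probs(states, w_out, b_out):
    """Per-token sigmoid output, read as P(token is a stop word)."""
    return [sigmoid(sum(w * s for w, s in zip(w_out, st)) + b_out) for st in states]

rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 8, 4  # assumed sizes, not taken from the paper
tokens = ["ang", "bata", "nagdula", "sa", "gawas"]  # toy sentence, not from the paper's corpus
# Random vectors stand in for Word2vec embeddings learned from the corpus.
embedding = {t: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for t in tokens}
fwd = init_lstm(EMBED_DIM, HIDDEN_DIM, rng)
bwd = init_lstm(EMBED_DIM, HIDDEN_DIM, rng)
w_out = [rng.uniform(-0.1, 0.1) for _ in range(2 * HIDDEN_DIM)]

states = bilstm(fwd, bwd, [embedding[t] for t in tokens], HIDDEN_DIM)
probs = stop_word_probs(states, w_out, 0.0)
for token, p in zip(tokens, probs):
    print(f"{token}: {p:.3f}")  # untrained weights, so these scores carry no meaning
```

Because the backward pass reads the sentence from right to left, each token's state reflects the words after it as well as the words before it. That is the property that distinguishes a bidirectional model from a one-directional LSTM. A real system would learn the weights from labelled data rather than leave them random.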

Published in

ICIT '21: Proceedings of the 2021 9th International Conference on Information Technology: IoT and Smart City
December 2021
584 pages
ISBN: 9781450384971
DOI: 10.1145/3512576

Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited