Abstract:
Apache Spark allows us to write a distributed version of any machine learning algorithm, which can be easily scaled up for a larger dataset on a cluster of commodity hardware. In this paper, we propose the hybridization of paragraph vectors with distributed, parallel versions of six well-known machine learning techniques for sentiment analysis. We employed a distributed implementation of a neural network language model to obtain paragraph vectors for a given corpus. On the paragraph vectors so obtained, we employed a host of distributed classification algorithms available in Apache Spark to perform sentiment classification. We considered two baseline methods for comparison: a Bag-of-Words based document-term matrix (DTM) and a hashing-trick based DTM. We experimented with a movie review dataset of size 992 MB. Among the six classifiers employed, MLP turned out to be statistically the same as GBT and SVM, while it statistically significantly outperformed the rest of the classifiers, yielding an area under the ROC curve (AUC) of 95.44%.
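The two baseline representations named in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's distributed implementation: the function names `bow_dtm` and `hashed_dtm`, the whitespace tokenizer, the toy corpus, the 16-column width, and the use of CRC32 as the hash function are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter


def bow_dtm(docs):
    """Bag-of-Words DTM: one column per distinct term seen in the corpus."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(doc.lower().split()).items():
            row[index[tok]] = count
        rows.append(row)
    return vocab, rows


def hashed_dtm(docs, num_features=16):
    """Hashing-trick DTM: column = hash(term) mod num_features, no vocabulary kept."""
    rows = []
    for doc in docs:
        row = [0] * num_features
        for tok in doc.lower().split():
            # Assumption: CRC32 is used only because it is deterministic and in the
            # standard library; it is not claimed to be the hash used in the paper.
            col = zlib.crc32(tok.encode("utf-8")) % num_features
            row[col] += 1  # distinct terms may collide and share a column
        rows.append(row)
    return rows


docs = ["a great great movie", "a dull movie"]  # hypothetical toy corpus
vocab, bow = bow_dtm(docs)
hashed = hashed_dtm(docs, num_features=16)
```

The Bag-of-Words matrix needs a pass over the whole corpus to build its vocabulary, and its width grows with the number of distinct terms. The hashed matrix has a fixed width and needs no shared vocabulary, which makes it easier to compute independently on each partition of a cluster, at the cost of possible collisions between terms.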
Published in: 2018 IEEE 17th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)
Date of Conference: 16-18 July 2018
Date Added to IEEE Xplore: 07 October 2018