Transportation sentiment analysis using word embedding and ontology-based topic modeling
Introduction
Recent advances in social media and textual resources have enabled information retrieval and sentiment analysis in data mining and natural language processing (NLP) [1]. However, extracting valuable information from online news articles and social media, such as Twitter, Facebook, and TripAdvisor, has become a new challenge for sentiment analysis. On one hand, the texts on social networks are unstructured and constantly increasing. On the other hand, online texts are short and contain a lot of slang, idioms, jargon, and dynamic topics.
Intelligent transportation systems (ITSs) need social network data in order to examine transportation services and support traffic control and management systems. In social media, information about transportation networks, such as traffic jams and accidents, appears regularly alongside unexpected text, so extracting these data and transforming them into valuable information for analysis is a challenging task.
Text mining has gained considerable attention among researchers, and has been proposed for automating information extraction from unstructured textual data. Rapid progress in NLP and machine learning (ML) has produced two frameworks for text mining: one-hot encoding and word embedding. Statistical learning models exhibit good performance in document representation. Bag-of-words (BoW) is the first and the most popular model to represent a document in the field of NLP [2]. This model represents a document as a dictionary that contains all words occurring in the document. The BoW model is easy to implement, works fast, and achieves good results with very little data. However, the dimensionality of the resulting vector is high, even for a single sentence, and the model neglects word order. Because BoW does not scale to large data, the performance of classifiers built on it is limited. Therefore, probabilistic and dimensionality-reduction approaches have been proposed to overcome the limitations of BoW, such as latent Dirichlet allocation (LDA), latent semantic indexing (LSI), and principal component analysis (PCA).
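The BoW representation described above can be illustrated with a minimal sketch. The corpus, the whitespace tokenizer, and the function names `build_vocabulary` and `bag_of_words` are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect every distinct token across the corpus, in sorted order."""
    return sorted({token for doc in documents for token in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """Map one document to a count vector aligned with the vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[token] for token in vocabulary]

# Toy corpus (hypothetical transportation-related posts).
docs = [
    "traffic jam on the bridge",
    "the bridge is closed",
    "jam cleared traffic moving",
]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]

# Each vector has one slot per vocabulary word, so its length grows with
# the corpus vocabulary even when a document is a single short sentence.
print(len(vocab), vectors[0])

# Word order is lost: both orderings give the same vector.
print(bag_of_words("traffic jam", vocab) == bag_of_words("jam traffic", vocab))
```

Even in this toy case, every document vector is as long as the whole vocabulary and mostly zeros, which is the dimensionality problem noted above.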
Word embedding is a distributed representation approach, which is an alternative to BoW [1], [2]. It represents each word with a low-dimensional vector that captures semantic meaning. To produce word vectors for corpus data, a word embedding model, such as word2vec, doc2vec, or GloVe, must be trained using a large amount of social media data. However, word embedding models have some limitations. Using a pre-trained, high-dimensional word embedding model for a small amount of data is not ideal. For document representation, the two estimation methods of a word-vector miss the context of documents. In addition, word embedding neglects information on sentiment in any given content.
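To make the idea of dense word vectors concrete, the following sketch compares words by cosine similarity and builds a document vector by averaging. Averaging is assumed here only as one common aggregation method; the source does not say which estimation methods it refers to. The three-dimensional vectors in `EMBEDDINGS` are hand-written placeholders, not output of a trained model.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only;
# a trained word2vec or GloVe model would supply real vectors.
EMBEDDINGS = {
    "bus": [0.9, 0.1, 0.0],
    "train": [0.8, 0.2, 0.1],
    "delayed": [0.1, 0.9, 0.2],
    "late": [0.2, 0.8, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def document_vector(tokens, embeddings):
    """Average the vectors of known tokens; None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Related words end up closer than unrelated ones.
print(cosine(EMBEDDINGS["bus"], EMBEDDINGS["train"]))
print(cosine(EMBEDDINGS["bus"], EMBEDDINGS["delayed"]))

# Averaging discards word order and document context.
print(document_vector(["bus", "delayed"], EMBEDDINGS)
      == document_vector(["delayed", "bus"], EMBEDDINGS))
```

The last comparison shows one way document-level context can be lost: any reordering of the same words yields an identical document vector, and nothing in the vector encodes sentiment.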
An LDA statistical model can automatically discover latent topics in a large volume of transportation data. LDA disregards word order and groups semantically related words into the same topic based on their representation in the documents. However, LDA has three main limitations that affect the classification results. First, the topics generated by LDA include irrelevant features when other transportation-related text is mixed in. Second, it produces very noisy topics from short text, and misses valuable topics because of the limited dataset. Third, it neglects the relation between topic and document when a document has low-probability words. Ontologies can address these limitations by helping LDA find appropriate topics along with features (words) in transportation data.
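One simple way to picture how an ontology could prune noisy topic words is sketched below. This is an illustration under stated assumptions, not the authors' procedure: the topic words and probabilities are invented, `ONTOLOGY_TERMS` stands in for the concepts of a transportation ontology, and `filter_topic` is a hypothetical helper.

```python
# Hypothetical output of an LDA topic: (word, probability) pairs.
topic = [
    ("traffic", 0.30),
    ("accident", 0.25),
    ("lol", 0.20),
    ("highway", 0.15),
    ("pizza", 0.10),
]

# Hypothetical set of concept labels from a transportation ontology.
ONTOLOGY_TERMS = {"traffic", "accident", "highway", "bus", "train"}

def filter_topic(topic_words, ontology_terms):
    """Keep only words known to the ontology and renormalize their weights."""
    kept = [(word, p) for word, p in topic_words if word in ontology_terms]
    total = sum(p for _, p in kept)
    if total == 0:
        return []
    return [(word, p / total) for word, p in kept]

filtered = filter_topic(topic, ONTOLOGY_TERMS)
for word, weight in filtered:
    print(f"{word}: {weight:.3f}")
```

In this sketch the slang and off-domain words are dropped and the remaining weights are rescaled to sum to one, so the topic is described only by domain-relevant features.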
The goal of the proposed system is to improve the performance of document representation and sentiment classification. The accuracy of sentiment classification depends on how the text in documents is represented. Existing text representation models include imprecise words that are not associated with the topics of the document, and neglect information on sentiment in the content. Therefore, we propose ontology- and LDA-based topic modeling and a word embedding system that represents texts precisely in order to improve the accuracy of sentiment classification. The proposed model was trained using datasets from different social media networks and evaluated with ML classifiers. The results show that the proposed approach represents documents correctly and improves the accuracy of sentiment classification. The main contributions of this research are the following.
- We propose a novel framework that retrieves the most relevant documents, reviews, and Tweets from social media and news articles.
- We propose ontology- and LDA-based topic modeling called topic2vec that extracts the most appropriate topics and features of a document, and neglects irrelevant words to enhance the document representation. The proposed ontology represents semantic knowledge that enriches an LDA model to extract more accurate features from transportation texts.
- We integrate topic2vec with word2vec and generate a word embedding model that represents each word in the document with semantic meanings and a low-dimensional vector.
- We propose a new fuzzy ontology-based lexicon method, which is used with six other lexicons to enhance the accuracy of the pre-trained word embedding model in sentiment classification tasks.
- We compare the performance of string2vec, word2vec, doc2vec, glove2vec, and lexicon2vec with our proposed model. We use ML algorithms to classify the data from these models and present the results. The comparison results help understand the limitations and advantages of the document representation models.
This paper is structured as follows. Section 2 presents discussions of sentiment analysis, topic modeling, and document representation models. Section 3 illustrates our proposed framework and the procedure of data collection and filtration. Section 4 provides information about topic modeling and word embedding. Section 5 presents the experimental results. Finally, Section 6 concludes our work.
Section snippets
Related work
This section looks at sentiment analysis, topic modeling, and word embedding approaches. First, we discuss the general standpoint of sentiment analysis, and then focus on the domain of social data related to transportation. We also present a brief review of topic modeling and deep learning-based word embedding approaches in sentiment classification.
Proposed approach
This section briefly introduces different methods that are applied to develop the proposed OLDA-based topic modeling and word embedding system. The main focus of the proposed approach is to enhance the performance of topic modeling, document representation, and sentiment classification. We used different techniques (namely LDA, the ontology, and deep learning) to represent words along with the most relevant topics for opinion classification. LDA is applied to find the statistical relationships
Topic modeling and word embedding
In this section, we employ LDA and ontology-based topic modeling to identify transportation-related topics in preprocessed data. After that, word embedding algorithms (word2vec and glove2vec) along with lexicon2vec are used to convert words in the corpus into a vector format. The whole scenario is shown in Fig. 2.
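As a rough illustration of how sentiment lexicon information might be combined with word vectors, the sketch below appends a lexicon polarity score to each word's embedding. Concatenation is an assumption made for this example; the preview text does not specify how lexicon2vec is combined with word2vec or glove2vec. `LEXICON`, `EMBEDDINGS`, and `word_features` are hypothetical names with hand-written values.

```python
# Hypothetical polarity lexicon: scores in [-1, 1].
LEXICON = {"delayed": -0.6, "crowded": -0.4, "comfortable": 0.7}

# Hypothetical 2-dimensional embeddings, hand-written for illustration only.
EMBEDDINGS = {
    "bus": [0.9, 0.1],
    "delayed": [0.1, 0.9],
    "comfortable": [0.3, 0.4],
}

def word_features(token, embeddings, lexicon, dim=2):
    """Return the word vector with the lexicon polarity appended.

    Unknown words get a zero vector; words missing from the lexicon
    get a neutral polarity of 0.0.
    """
    vector = embeddings.get(token, [0.0] * dim)
    polarity = lexicon.get(token, 0.0)
    return vector + [polarity]

for token in ["bus", "delayed", "metro"]:
    print(token, word_features(token, EMBEDDINGS, LEXICON))
```

The extra dimension gives a downstream classifier direct access to sentiment polarity, which plain word vectors do not encode.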
Experiments
The dataset used in the evaluation process was discussed in Section 3. The proposed approach was presented in Section 4. Here, the validation procedure is defined and the obtained results are discussed.
Conclusion
In this paper, we presented an ontology and LDA-based topic modeling and word embedding system to enhance the performance of document representation and sentiment classification, and to facilitate mobility users and ITSs. Various sensible issues are discussed, including valuable-information extraction, transformation of extracted data into useful knowledge, generation of topics and features using an ontology and LDA, representation of documents under different approaches, and integration of
Acknowledgment
This research was supported by the Ministry of Science, ICT and Future Planning (MSIP), South Korea, under the ITRC support program (IITP-2017-2014-0-00729) supervised by the Institute for Information & communications Technology Promotion (IITP).
References (75)
- et al. Social analytics: Learning fuzzy product ontologies for aspect-oriented sentiment analysis. Decis. Support Syst. (2014)
- et al. A hybrid model using logistic regression and wavelet transformation to detect traffic incidents. IATSS Res. (2016)
- et al. A fuzzy-based strategy for multi-domain sentiment analysis. Int. J. Approx. Reason. (2018)
- et al. Fuzzy ontology-based sentiment analysis of transportation and city feature reviews for safe traveling. Transp. Res. Part C (2017)
- et al. Consensus vote models for detecting and filtering neutrality in sentiment analysis. Inf. Fusion (2018)
- et al. A topic-enhanced word embedding for twitter sentiment classification. Inform. Sci. (2016)
- et al. Ontologies for transportation research: A survey. Transp. Res. Part C (2018)
- et al. W2VLDA: Almost unsupervised system for aspect based sentiment analysis. Expert Syst. Appl. (2018)
- et al. Content tree word embedding for document representation. Expert Syst. Appl. (2017)
- et al. Opinion mining based on fuzzy domain ontology and support vector machine: A proposal to automate online review classification. Appl. Soft Comput. (2016)
- Fuzzy ontology representation using OWL 2. Internat. J. Approx. Reason.
- Ontology-based annotation and retrieval of services in the cloud. Knowl.-Based Syst.
- Incremental term representation learning for social network analysis. Future Gener. Comput. Syst.
- Opinion mining based on fuzzy domain ontology and support vector machine: A proposal to automate online review classification. Appl. Soft Comput. J.
- From social media to public health surveillance: Word embedding based clustering method for twitter classification
- Distributed Representations of Sentences and Documents, Vol. 32
- Sentiment analysis on tweets about diabetes: An aspect-level approach. Comput. Math. Methods Med.
- Sentiment analysis: From opinion mining to human-agent interaction. IEEE Trans. Affect. Comput.
- Comparative evaluation of algorithms for sentiment analysis over social networking services. J. UCS
- Public Sentiment and Demand for Used Cars after A Large-Scale Disaster: Social Media Sentiment Analysis with Facebook Pages
- Acquiring and Exploiting Lexical Knowledge for Twitter Sentiment Analysis, Vol. 1994
- A novel classification approach based on Naïve Bayes for Twitter sentiment analysis, Vol. 11
- Merged Ontology and SVM-Based Information Extraction and Recommendation System for Social Robots, Vol. 5
- LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST)
- Sentiment Analysis on Twitter about the Use of City Public Transportation Using Support Vector Machine Method
- SentiWords: Deriving a high precision and high coverage lexicon for sentiment analysis. IEEE Trans. Affect. Comput.
- Opinion mining of online product reviews from traditional LDA topic clusters using feature ontology tree and sentiwordnet. Int. J. Educ. Manag. Eng.
- Weakly-supervised deep embedding for product review sentiment analysis. IEEE Trans. Knowl. Data Eng.
- A neural word embeddings approach for multi-domain sentiment analysis. IEEE Trans. Affective Comput.
- Transport overcrowding with internet data. IEEE Trans. Intell. Transp. Syst.
- Enhancing Transport Data Collection Through Social Media Sources: Methods, Challenges and Opportunities for Textual Data
- Text mining and topic modeling of compendiums of papers from transportation research board annual meetings. Transp. Res. Rec.: J. Transp. Res. Board
- Modelling road congestion using ontologies for big data analytics in smart cities
- Social Media Text Processing and Semantic Analysis for Smart Cities
- The IoT: Exciting possibilities for bettering lives: Special application scenarios. IEEE Consum. Electron. Mag.
- Sentiment analysis as a service: A social media based sentiment analysis framework