Elsevier

Knowledge-Based Systems

Volume 174, 15 June 2019, Pages 27-42

Transportation sentiment analysis using word embedding and ontology-based topic modeling

https://doi.org/10.1016/j.knosys.2019.02.033

Highlights

  • Social networks provide a new approach to collect data regarding transportation.

  • Sentiment analysis can make observations of social data to examine transportation.

  • Current text mining techniques are unable to generate the topics accurately.

  • Document representation is another challenging task in sentiment analysis.

  • We propose a new topic modeling and word embedding system for sentiment analysis.

Abstract

Social networks play a key role in providing a new approach to collecting information regarding mobility and transportation services. Sentiment analysis of this information can yield observations that help intelligent transportation systems (ITSs) examine traffic control and management systems. However, sentiment analysis faces two technical challenges: extracting meaningful information from social network platforms, and transforming the extracted data into valuable information. In addition, accurate topic modeling and document representation are further challenges in sentiment analysis. We propose an ontology and latent Dirichlet allocation (OLDA)-based topic modeling and word embedding approach for sentiment classification. The proposed system retrieves transportation content from social networks, removes irrelevant content to extract meaningful information, and generates topics and features from the extracted data using OLDA. It also represents documents using word embedding techniques, and then employs lexicon-based approaches to enhance the accuracy of the word embedding model. The proposed ontology and the intelligent model are developed using Web Ontology Language and Java, respectively. Machine learning classifiers are used to evaluate the proposed word embedding system. The method achieves an accuracy of 93%, which shows that the proposed approach is effective for sentiment classification.

Introduction

Recent advances in social media and textual resources have enabled information retrieval and sentiment analysis in data mining and natural language processing (NLP) [1]. However, extracting valuable information from online news articles and social media, such as Twitter, Facebook, and TripAdvisor, has become a new challenge for sentiment analysis. On one hand, the texts on social networks are unstructured and constantly increasing. On the other hand, online texts are short and contain a lot of slang, idioms, jargon, and dynamic topics.

Intelligent transportation systems (ITSs) need social network data in order to examine transportation services and support traffic control and management systems. In social media, information about transportation networks, such as traffic jams and accidents, appears regularly but in unpredictable text, and extracting these data and transforming them into valuable information for analysis is a challenging task.

Text mining has gained much attention among researchers, and has been proposed for automating information extraction from unstructured textual data. Rapid improvements in NLP and machine learning (ML) have produced two frameworks for text mining: one-hot encoding and word embedding. Statistical learning models exhibit good performance in document representation. Bag-of-words (BoW) is the first and the most popular model to represent a document in the field of NLP [2]. This model represents a document as a dictionary of all the words that occur in it. The BoW model is easy to implement, works fast, and achieves good results with very little data. However, the dimensionality of a BoW vector is high, even for a single sentence, and the model neglects word order. Since it is not capable of representing large-scale data, the performance of the classifiers could not be improved. Therefore, probabilistic and dimensionality-reduction approaches, such as latent Dirichlet allocation (LDA), latent semantic indexing (LSI), and principal component analysis (PCA), have been proposed to overcome the limitations of BoW.
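
The two BoW drawbacks named in this paragraph, vector length tied to vocabulary size and loss of word order, can be seen in a minimal sketch. The three sample sentences and the helper names `build_vocabulary` and `bag_of_words` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercase token in the corpus to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: index for index, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector with one slot per vocabulary entry."""
    counts = Counter(document.lower().split())
    # Dicts keep insertion order, so the slots follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]


# Illustrative sentences (assumed, not from the paper's dataset).
docs = [
    "bus delayed by traffic",
    "traffic delayed by bus",
    "train arrived on time",
]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

# The first two sentences differ only in word order, so BoW cannot tell them apart.
print(len(vocab), vectors[0] == vectors[1])
```

Every vector has as many slots as the corpus has distinct words, so the representation grows with the vocabulary even when a document is a single short sentence.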

Word embedding is a distributed representation approach, which is an alternative to BoW [1], [2]. It represents each word with a low-dimensional vector that captures semantic meaning. To produce word vectors for corpus data, a word embedding model, such as word2vec, doc2vec, or GloVe, must be trained on a large amount of social media data. However, word embedding models have some limitations. A pre-trained word embedding model with high dimensionality is poorly suited to a small amount of data. For document representation, the two estimation methods of a word-vector miss the context of documents. In addition, word embedding neglects information on sentiment in any given content.
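
The last limitation can be illustrated with a minimal sketch. It assumes hand-picked 3-dimensional vectors in which two words of opposite polarity ("terrible" and "great") lie close together, as can happen when such words occur in similar contexts; the values in `EMBEDDINGS` are hypothetical and do not come from any trained model, and averaging word vectors is only one common way to build a document vector, not necessarily the one used in the paper.

```python
import math

# Hypothetical 3-dimensional embeddings; a trained model would supply real ones.
EMBEDDINGS = {
    "traffic": [0.9, 0.1, 0.0],
    "jam": [0.8, 0.2, 0.1],
    "terrible": [0.1, 0.9, 0.0],
    "great": [0.1, 0.8, 0.1],
}


def document_vector(tokens, embeddings):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


negative = document_vector("traffic jam terrible".split(), EMBEDDINGS)
positive = document_vector("traffic jam great".split(), EMBEDDINGS)
similarity = cosine(negative, positive)

# With these assumed vectors the two opposite opinions are almost identical.
print(round(similarity, 3))
```

Under these assumptions the two documents are nearly indistinguishable to a classifier, which is the kind of gap that sentiment lexicons are meant to close.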

An LDA statistical model can automatically discover latent topics in a large volume of transportation data. LDA disregards word order and groups semantically related words into the same topic based on their representation in the documents. However, LDA has three main limitations that affect the classification results. First, the topics generated by LDA include irrelevant features when other transportation-related text is present. Second, it produces very noisy topics from short text, and misses valuable topics because of the limited dataset. Third, it neglects the relation between topic and document when a document has low-probability words. Ontologies can address these limitations and enhance the performance of LDA in finding appropriate topics, along with their features (words), in transportation data.
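
One simple way an ontology could remove irrelevant features from an LDA topic is sketched here. This is an assumption for illustration only and is not the authors' OLDA procedure, which this excerpt does not describe: the `ONTOLOGY` mapping, the concept names, the `filter_topic` helper, and the probabilities in `raw_topic` are all hypothetical.

```python
# Hypothetical ontology: maps a surface word to a transportation concept.
ONTOLOGY = {
    "bus": "Vehicle",
    "train": "Vehicle",
    "delay": "ServiceIssue",
    "accident": "TrafficEvent",
    "jam": "TrafficEvent",
}


def filter_topic(topic_word_probs, ontology):
    """Keep only words known to the ontology and renormalise their weights."""
    kept = {word: p for word, p in topic_word_probs.items() if word in ontology}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {word: p / total for word, p in kept.items()}


# Hypothetical topic as an LDA model might emit it (word -> probability).
raw_topic = {"jam": 0.30, "accident": 0.20, "lol": 0.25, "today": 0.15, "bus": 0.10}
clean_topic = filter_topic(raw_topic, ONTOLOGY)

# Noise words such as "lol" and "today" are dropped; the rest sum to 1 again.
print(sorted(clean_topic))
```

A filter like this only addresses the first limitation; recovering topics that LDA misses in short text would need the ontology to contribute words as well as remove them.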

The goal of the proposed system is to improve the performance of document representation and sentiment classification. The accuracy of sentiment classification depends on the representation of text in documents. Existing text representation models include imprecise words that are not associated with the topics of the document, and neglect information on sentiment in any given content. Therefore, we propose ontology- and LDA-based topic modeling and a word embedding system that represents texts precisely in order to improve the accuracy of sentiment classification. The proposed model was trained using datasets from different social media networks and evaluated with ML classifiers. The results show that the proposed approach is capable of correctly representing documents, and improves the accuracy of sentiment classification. The main contributions of this research are the following.

  • We propose a novel framework that retrieves the most relevant documents, reviews, and Tweets from social media and news articles.

  • We propose ontology- and LDA-based topic modeling called topic2vec that extracts the most appropriate topics and features of a document, and neglects irrelevant words to enhance the document representation. The proposed ontology represents semantic knowledge that enriches an LDA model to extract more accurate features from transportation texts.

  • We integrate topic2vec with word2vec and generate a word embedding model that represents each word in the document with semantic meanings and a low-dimensional vector.

  • We propose a new fuzzy ontology-based lexicon method, which is used with six other lexicons to enhance the accuracy of the pre-trained word embedding model in sentiment classification tasks.

  • We compare the performance of string2vec, word2vec, doc2vec, glove2vec, and lexicon2vec with our proposed model. We use ML algorithms to classify the data from these models and present the results. The comparison results help in understanding the limitations and advantages of the document representation models.

This paper is structured as follows. Section 2 presents discussions of sentiment analysis, topic modeling, and document representation models. Section 3 illustrates our proposed framework and the procedure of data collection and filtration. Section 4 provides information about topic modeling and word embedding. Section 5 presents the experimental results. Finally, Section 6 concludes our work.

Section snippets

Related work

This section looks at sentiment analysis, topic modeling, and word embedding approaches. First, we discuss the general standpoint of sentiment analysis, and then focus on the domain of social data related to transportation. We also present a brief review of topic modeling and deep learning-based word embedding approaches in sentiment classification.

Proposed approach

This section briefly introduces different methods that are applied to develop the proposed OLDA-based topic modeling and word embedding system. The main focus of the proposed approach is to enhance the performance of topic modeling, document representation, and sentiment classification. We used different techniques (namely LDA, the ontology, and deep learning) to represent words along with the most relevant topics for opinion classification. LDA is applied to find the statistical relationships

Topic modeling and word embedding

In this section, we employ LDA and ontology-based topic modeling to identify transportation-related topics in preprocessed data. After that, word embedding algorithms (word2vec and glove2vec) along with lexicon2vec are used to convert words in the corpus into a vector format. The whole scenario is shown in Fig. 2.
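
One plausible way to combine an embedding-based document vector with lexicon information is to append polarity features to it, as in the sketch that follows. The lexicon entries, the three chosen features, and the names `lexicon_features` and `augment` are assumptions for illustration; this preview does not specify how lexicon2vec is actually built.

```python
# Hypothetical polarity lexicon: word -> score in [-1, 1].
LEXICON = {"terrible": -0.8, "late": -0.4, "great": 0.7, "smooth": 0.5}


def lexicon_features(tokens, lexicon):
    """Return [positive count, negative count, summed polarity] for a document."""
    scores = [lexicon[tok] for tok in tokens if tok in lexicon]
    positives = sum(1 for score in scores if score > 0)
    negatives = sum(1 for score in scores if score < 0)
    return [float(positives), float(negatives), sum(scores)]


def augment(doc_vector, tokens, lexicon):
    """Append the lexicon features to an existing document vector."""
    return list(doc_vector) + lexicon_features(tokens, lexicon)


tokens = "the bus was terrible and late".split()
embedding_part = [0.6, 0.4, 0.03]  # stand-in for a word2vec or GloVe document vector
augmented = augment(embedding_part, tokens, LEXICON)

# Three embedding dimensions followed by three sentiment features.
print(len(augmented))
```

Concatenation keeps the semantic and sentiment signals in separate dimensions, so a downstream classifier can weight them independently.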

Experiments

The dataset used in the evaluation process was discussed in Section 3. The proposed approach was presented in Section 4. Here, the validation procedure is defined and the obtained results are discussed.

Conclusion

In this paper, we presented an ontology and LDA-based topic modeling and word embedding system to enhance the performance of document representation and sentiment classification, and to facilitate mobility users and ITSs. Various sensible issues are discussed, including valuable-information extraction, transformation of extracted data into useful knowledge, generation of topics and features using an ontology and LDA, representation of documents under different approaches, and integration of

Acknowledgment

This research was supported by the Ministry of Science, ICT and Future Planning (MSIP), South Korea, under the ITRC support program (IITP-2017-2014-0-00729) supervised by the Institute for Information & communications Technology Promotion (IITP).

References (75)

  • Bobillo F. et al. Fuzzy ontology representation using OWL 2. Internat. J. Approx. Reason. (2011)
  • Rodríguez-García M.Á. et al. Ontology-based annotation and retrieval of services in the cloud. Knowl.-Based Syst. (2014)
  • Peng H. et al. Incremental term representation learning for social network analysis. Future Gener. Comput. Syst. (2018)
  • Ali F. et al. Opinion mining based on fuzzy domain ontology and support vector machine: A proposal to automate online review classification. Appl. Soft Comput. J. (2016)
  • Dai X. et al. From social media to public health surveillance: Word embedding based clustering method for twitter classification
  • Le Q.V. et al. Distributed Representations of Sentences and Documents, Vol. 32 (2014)
  • Salas-Zárate M.D.P. et al. Sentiment analysis on tweets about diabetes: An aspect-level approach. Comput. Math. Methods Med. (2017)
  • Clavel C. et al. Sentiment analysis: From opinion mining to human-agent interaction. IEEE Trans. Affect. Comput. (2016)
  • Krouska A. et al. Comparative evaluation of algorithms for sentiment analysis over social networking services. J. UCS (2017)
  • Shibuya Y. Public Sentiment and Demand for Used Cars after A Large-Scale Disaster: Social Media Sentiment Analysis with Facebook Pages (2018)
  • A. Teixeira, Data extraction and preparation to perform a The example of a Facebook fashion brand page,...
  • Marquez F.B. Acquiring and Exploiting Lexical Knowledge for Twitter Sentiment Analysis, Vol. 1994 (2017)
  • Song J. et al. A novel classification approach based on Naïve Bayes for Twitter sentiment analysis, Vol. 11 (2017)
  • Ali F. et al. Merged Ontology and SVM-Based Information Extraction and Recommendation System for Social Robots, Vol. 5 (2017)
  • Chang C. et al. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) (2013)
  • Effendy V. et al. Sentiment Analysis on Twitter about the Use of City Public Transportation Using Support Vector Machine Method (2011)
  • Gatti L. et al. SentiWords: Deriving a high precision and high coverage lexicon for sentiment analysis. IEEE Trans. Affect. Comput. (2016)
  • Santosh D.T. et al. Opinion mining of online product reviews from traditional LDA topic clusters using feature ontology tree and sentiwordnet. Int. J. Educ. Manag. Eng. (2016)
  • Zhao W. et al. Weakly-supervised deep embedding for product review sentiment analysis. IEEE Trans. Knowl. Data Eng. (2017)
  • Dragoni M. et al. A neural word embeddings approach for multi-domain sentiment analysis. IEEE Trans. Affect. Comput. (2017)
  • Pereira F.C. et al. Transport overcrowding with internet data. IEEE Trans. Intell. Transp. Syst. (2015)
  • Grant-Muller S.M. et al. Enhancing Transport Data Collection Through Social Media Sources: Methods, Challenges and Opportunities for Textual Data (2014)
  • Das S. et al. Text mining and topic modeling of compendiums of papers from transportation research board annual meetings. Transp. Res. Rec.: J. Transp. Res. Board (2016)
  • Abberley L. et al. Modelling road congestion using ontologies for big data analytics in smart cities
  • Pereira J.F.F. Social Media Text Processing and Semantic Analysis for Smart Cities (2017)
  • Riazul Islam S.M. et al. The IoT: Exciting possibilities for bettering lives: Special application scenarios. IEEE Consum. Electron. Mag. (2016)
  • Ali K. et al. Sentiment analysis as a service: A social media based sentiment analysis framework