Elsevier

Knowledge-Based Systems

Volume 228, 27 September 2021, 107249

Multi-interest semantic changes over time in short-text microblogs

https://doi.org/10.1016/j.knosys.2021.107249

Highlights

  • User-representative profiling framework based on time-variational disseminated content.

  • Timeseries-based user interest modelling as categorical word embedding distributions.

  • Extraction of temporal semantic patterns in vectors for interests representation.

  • Semantic relevance validation on a demographically relevant control set.

  • Collective intelligence framework for short-text microblog third-party content providers.

Abstract

Consumption of content in short-text microblogs is driven to a large extent by the interests of individual users and their friendship networks. Given the dynamism of data throughput on such platforms, e.g., Twitter, prevailing conditions are bound to determine the nature of consumed or disseminated content. Therefore, semantic interests differ over time even for individual users. Detecting this semantic change over time is integral to mapping user profiles over a time period, especially in microblogs where only the extrinsic user profile identifiers provide metadata, and these seldom evolve. This is vital in serving relevant third-party content as well as in the computation of topical interest variations over time. In essence, topics that are currently relevant to a user on such a platform may not be representative of the same user’s interests a few months later. In our quest to identify the most user-representative interests at any given time, each topical term was modelled as the inner product between word embeddings and a time-based embedding representation of assigned topics at varied time periods. The model was fitted onto tweets as time-series documents. To validate the model, changes in the extracted user-representative interests over time were semantically weighed against a mirrored, time-variant dataset. Interest weights across the time-variant datasets were computed and validated in five sub-topics for a period spanning two and a half years. Linearity in the relationships between the test and validation sets could be identified, more so in emerging topics. A Pearson correlation coefficient as high as 0.871 was achieved in interest change verification over the tested period.

Introduction

Online content propagation has grown with the surge in citizen journalism. This is partly attributed to the increase in the number of devices, e.g., mobile phones, the availability of the internet, and the emergence of many social microblogging platforms like Twitter. Twitter, as a short-text microblog, has been instrumental in sharing near real-time data in the form of text, videos, hyperlinks and images. In essence, the number of disseminated tweets averages about 500 million per day, which roughly translates to 6000 tweets per second. On the platform, users are able to re-share disseminated tweets in the form of “retweets”, and to “comment” on and/or “like” the original tweet.

Generally, tweeters consume content on the platform based on their prevailing interests at the time. For example, in times of political campaigns, many demographically relevant tweeters are likely to express interest in political content. However, this interest is likely to decay over time when the political season is over. The same scenario could be replicated in a sports season, where support for teams fades as the season winds up. Modelling and extracting such user dissemination patterns and representative interests is a challenging task, especially for legacy systems. This is attributed to two factors: (i) the volatility of the data, given its dissemination throughput, and (ii) time-based variations in the nature of topics of interest.

The ability to extract and present time-sensitive and accurate user representative profiles from such evolving short texts is important in recommender systems design research. The design goal of recommender systems on such platforms is to personalize both the third-party content, and the user identification process. This in turn presents the most relevant users as follower–followee suggestions, as well as delivery of more personalized third-party content for the users. This personalization process is based on the extrinsic interests from the disseminated content.

The assumption in the design of the framework is that the semantics of disseminated content change over time for individual users. Therefore, we present a framework that is capable of discerning semantic user interests of short-text microblog users. This is challenging in short-text microblogs, as the level of expressivity is not always exhaustive due to character limitations per document/tweet. The text variable in a tweet’s metadata is limited to 280 characters though realistically, the average tweet length is much shorter. This is in addition to factors relating to throughput and content decay and gain over time on such platforms.

In designing such a framework, there was a need to first extract topical representations of the data at specific periods. Since the documents were short texts, vector representations worked well in discerning their semantic relevance. Each token in the pre-processed tweet was vectorized and modelled as the inner product between the vocabulary (word) embeddings and the topic embeddings across specific periods. Topics of interest at each period were captured, with the overall semantic representation being user interest weights across the dataset collection period. This way, semantic divergence across several topics of interest was accurately represented.
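The inner-product modelling described above can be illustrated with a small sketch. It assumes that the inner products are normalized with a softmax, in the style of embedded topic models; the text here does not confirm the normalization used. The embeddings, period labels and function names are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_distribution(word_embeddings, topic_embedding):
    """Distribution over the vocabulary for one topic at one time period.

    Each word's score is the inner product between its embedding and the
    topic embedding for that period (softmax normalization is an assumption).
    """
    words = list(word_embeddings)
    scores = [
        sum(w * t for w, t in zip(word_embeddings[word], topic_embedding))
        for word in words
    ]
    return dict(zip(words, softmax(scores)))

# Hypothetical 2-dimensional word embeddings, for illustration only.
WORD_EMBEDDINGS = {
    "election": [1.0, 0.0],
    "football": [0.0, 1.0],
    "weather": [0.5, 0.5],
}

# Hypothetical embeddings of the same topic at two time periods.
TOPIC_BY_PERIOD = {
    "2019-Q1": [2.0, 0.0],
    "2019-Q2": [0.0, 2.0],
}

distributions = {
    period: word_distribution(WORD_EMBEDDINGS, topic)
    for period, topic in TOPIC_BY_PERIOD.items()
}
```

Because the topic embedding changes per period while the word embeddings stay fixed, the same topic can weight different words at different times, which is the kind of drift the framework sets out to capture.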

With the personalization goal in mind, there is a need to develop a framework that is able to capture user-representative interests on such platforms while accounting for time. Such a framework can present semantic profiles to third-party content curators, informing them of how user interests evolved with time. This can serve specific purposes related to the generation of profile information for users of interest in certain topics at given times. Such interest patterns can then be used in recommendations of time-specific content, as well as related follower–followee networks. In the development of this framework, the following research questions are addressed:

  • Is it possible to extract user-representative interests in time-series based streaming short texts for profiling?

  • Are the patterns extracted from the short texts sufficient for making time-dependent user/content recommendations/predictions for generalized short-text content disseminators?

This work is inspired by research on computing the degree of interest in selected topics, as well as on multi-interest user profiling in short-text microblogs [1], [2]. In follow-back recommendations, user-representative interest recommendations were generated among short-text microblog users with shared interests. The social theory of homophily was applied in validating the semantic correlations among the users [3]. In multi-interest profiling, user interests across topics were quantified by generating a responsibility matrix depicting each user's interest level in each topic. Algorithmically, Expectation Maximization (EM) and Gaussian Mixture Models (GMM) were applied over the vector representations to extract soft clusters [4]. Furthermore, the semantic distance from each user's aggregated vector representation to a cluster was assumed to be the interest level in that topical cluster.
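As a rough illustration of the responsibility matrix mentioned above, the sketch below computes only the E-step of EM for a Gaussian mixture with fixed parameters. It assumes isotropic (spherical) covariances, and the cluster parameters, user vectors and names are hypothetical; it is not the implementation used in the cited work.

```python
import math

def log_isotropic_gaussian(x, mean, variance):
    """Log-density of an isotropic Gaussian (covariance = variance * I)."""
    dim = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * dim * math.log(2 * math.pi * variance) - sq_dist / (2 * variance)

def responsibilities(x, components):
    """E-step for one user vector: posterior probability of each cluster."""
    log_terms = [
        math.log(c["weight"]) + log_isotropic_gaussian(x, c["mean"], c["variance"])
        for c in components
    ]
    peak = max(log_terms)  # subtract the max before exponentiating, for stability
    unnormalized = [math.exp(t - peak) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical topical clusters (e.g. politics and sports) with fixed parameters.
COMPONENTS = [
    {"weight": 0.5, "mean": [1.0, 0.0], "variance": 0.5},
    {"weight": 0.5, "mean": [0.0, 1.0], "variance": 0.5},
]

# Hypothetical aggregated user vectors.
USER_VECTORS = {
    "user_a": [0.9, 0.1],
    "user_b": [0.5, 0.5],
}

# One row per user, one column per cluster; each row sums to 1.
responsibility_matrix = {
    user: responsibilities(vector, COMPONENTS)
    for user, vector in USER_VECTORS.items()
}
```

Each row can be read as a soft assignment: a user close to one cluster mean receives most of the mass there, while a user equidistant from both is split evenly.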

The points below highlight the novelty of this research. In addition, we emphasize how this work differs from other works at the end of Section 2.

  1. The combination of word and topical embeddings as a time-variational distributional model differs from other works in the user profiling domain. Conventional modelling in related works in this domain points to either word or topical representation models, not an amalgamation of the two.

  2. In our approach, time sensitivity is definitive in the representation of the generated user-representative profiles. Topics are inferred over word embeddings, with topical vectors generated per mini-batch of documents and per timestamp, which highlights the dynamism in identifying representative user interests in short texts. This time-sensitive approach differs from keyword, concept and hybrid profiling approaches that make use of external knowledge bases in the extraction of such interests. Vocabulary sparsity, volume, variety and velocity in short-text dissemination make those profiling approaches insufficient in this scenario.
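The per-timestamp, per-mini-batch organisation of documents described in the second point can be sketched as follows. The quarterly granularity, the batch size, the sample tweets and the function names are all assumptions made for illustration; the source states only that time variations are user-defined.

```python
from collections import defaultdict
from datetime import date

def quarter_label(day):
    """Map a date to a label such as '2019-Q1' (quarterly slices are an assumption)."""
    quarter = (day.month - 1) // 3 + 1
    return f"{day.year}-Q{quarter}"

def slice_by_quarter(tweets):
    """Group (date, text) pairs into chronologically ordered time slices."""
    slices = defaultdict(list)
    for day, text in tweets:
        slices[quarter_label(day)].append(text)
    return dict(sorted(slices.items()))

def mini_batches(documents, batch_size):
    """Yield consecutive mini-batches of documents from one time slice."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]

# Hypothetical pre-processed tweets with their posting dates.
TWEETS = [
    (date(2019, 1, 15), "campaign rally tonight"),
    (date(2019, 2, 3), "polls open early"),
    (date(2019, 3, 30), "results are in"),
    (date(2019, 4, 2), "season kicks off"),
    (date(2019, 8, 21), "transfer window closes"),
]

slices = slice_by_quarter(TWEETS)
batches = {label: list(mini_batches(docs, 2)) for label, docs in slices.items()}
```

A topic model would then be fitted or updated on each mini-batch within a slice, so that topical vectors are tied to the timestamp of the documents that produced them.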

In addition to the above, the following contributions were made in the paper:

  1. Formulation of a semantically representative user profiling framework that considers the disseminated content as time-based entities.

  2. Each word in the framework is modelled as a categorical distribution of word embeddings and a time-based representation of the word’s assigned topic.

  3. The model and data ingestion framework is tested on a generic set of tweets geolocated to an area over time. This ensured accuracy in validation of the end results using a true class dataset over certain periods in the collection. User-representative interests over a generic test set were computed by the methodology that qualitatively outperformed the other approaches across a range of measures.

  4. Quantitatively, semantic weights between the control and test sets in five sub-topics across ten quarters were measured, depicting their semantic correlations. Linearity in the correlations across the timestamps indicated the validity of our modelling approach.
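The correlation check in the last contribution can be sketched with a plain Pearson coefficient between two series of interest weights. The weights below are hypothetical placeholders for one sub-topic over ten quarters, not values from the paper, and the function name is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical interest weights for one sub-topic over ten quarters.
TEST_WEIGHTS = [0.10, 0.12, 0.18, 0.25, 0.31, 0.28, 0.22, 0.19, 0.15, 0.11]
CONTROL_WEIGHTS = [0.09, 0.14, 0.17, 0.27, 0.29, 0.30, 0.20, 0.18, 0.13, 0.12]

coefficient = pearson(TEST_WEIGHTS, CONTROL_WEIGHTS)
```

A coefficient close to 1 indicates that interest in the sub-topic rises and falls together in the test and control sets, which is the linearity the contribution refers to.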

The rest of the paper is organized as follows. Section 2 summarizes the background and related literature of our study. Our approach is described in Section 3. The experimental framework is presented in Section 4, with results discussed in Section 5. Discussion and application areas are elicited in Section 6.

Section snippets

Related works

Microblog users’ content consumption patterns vary over time. This is mostly influenced by events within specific timeframes, more so in short-text microblogs like Twitter. Designing a framework that is able to incorporate these semantic changes over time in user-representative profiling is pertinent. There are two approaches to the formulation of user profiles, especially for use in recommender systems. The design either follows the behavioural [5] or structural [6] patterns in the modelling

Our approach

In modelling evolving user interests, the proposed framework encompasses several processes related to the generation of topical interests, word embeddings and a time-variant distributional model. The topical interests are distributed over word representations as time-variational topics in the model. Time variations are user-defined, as the dataset consists of time-series documents. Latent Dirichlet Allocation (LDA) is used to generate the topical information at each timestamp [44]. Summaries of the most

Experimentation

This section validates the processes presented in Section 3. A few consecutive steps are followed in the modelling process in addition to the description of the time-variant dataset in the experimentation. The experimentation process aims to generate topics as interests that are sensitive to time spanning the dataset period. Therefore, the end result in this experimentation phase is a comparative evaluation of an agreement between the control set in Section 4.1.2 and the generated topics from

Results

The framework’s performance was measured quantitatively and qualitatively to ascertain that the results were corroborated in both dimensions. Quantitatively, topical quality across the timestamps was the tested measure, using the generated embeddings. The best-performing modelling algorithm was then adopted for further qualitative evaluations. These related to intra-topical changes over time and correlations in topical interests, a key measure in the user profiling process in streaming texts.

Discussion

The goal of our work has been to identify evolving topical interests in short texts and eventually build representative user profiles. Prior research on the evolution of interests in streaming platforms encompassed several factors. The use of external data, annotation, and a mixture of features to augment user interests in the profiling process has been studied [31], [32], [35]. Word embedding representations have also been utilized in the generation of dynamic user profiles [38].

Conclusion and future work

Microblogging platforms like Twitter help present intrinsic and extrinsic user profiles to third-party content providers. This is based, to a large extent, on the nature of content that users and their friendship networks consume over time. A framework that factors variational timestamps on topical embeddings was proposed. Several embedding techniques and baselines were modelled and tested for topical quality. FastText-based embeddings were selected for further computation based on their success

CRediT authorship contribution statement

Herman M. Wandabwa: Conceptualization, Methodology, Data curation, Formal analysis. M. Asif Naeem: Resources, Writing - review & editing, Supervision. Farhaan Mirza: Writing - review & editing, Supervision, Validation. Russel Pears: Supervision, Conceptualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (61)

  • Wu W. et al., Tracking spatio-temporal variation of geo-tagged topics with social media in China: A case study of 2016 hefei rainstorm, Int. J. Disas. Risk Reduc. (2020)
  • Şerban O. et al., Real-time processing of social media with SENTINEL: A syndromic surveillance system incorporating deep learning for health classification, Inf. Process. Manage. (2019)
  • Korenčić D. et al., Document-based topic coherence measures for news media text, Expert Syst. Appl. (2018)
  • H. Wandabwa, M.A. Naeem, F. Mirza, R. Pears, Follow-back recommendations for sports bettors: A Twitter-based approach,...
  • Wandabwa H. et al., Multi-interest user profiling in short text microblogs
  • Dempster A.P. et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Stat. Methodol. (1977)
  • Yin H. et al., Dynamic user modeling in social media systems, ACM Trans. Inform. Syst. (TOIS) (2015)
  • Yang P. et al., Comparison and modelling of country-level microblog user and activity in cyber-physical-social systems using weibo and Twitter data, ACM Trans. Intell. Syst. Technol. (TIST) (2019)
  • Ghaemi Z. et al., A varied density-based clustering approach for event detection from heterogeneous twitter data, ISPRS Int. J. Geo-Inf. (2019)
  • Aggarwal C.C., Content-based recommender systems
  • Piao G. et al., Inferring user interests in microblogging social networks: a survey, User Model. User-Adapted Interact. (2018)
  • Gauch S. et al., User profiles for personalized information access
  • Alsaeedi A., A survey of term weighting schemes for text classification, Int. J. Data Mining Modell. Manage. (2020)
  • P. Bhattacharya, M.B. Zafar, N. Ganguly, S. Ghosh, K.P. Gummadi, Inferring user interests in the twitter social...
  • Paul I. et al., Elites tweet? Characterizing the Twitter verified user network
  • J.R. Chowdhury, C. Caragea, D. Caragea, On identifying hashtags in disaster Twitter data, in: Proceedings of the AAAI...
  • Y. Wei, Z. Cheng, X. Yu, Z. Zhao, L. Zhu, L. Nie, Personalized hashtag recommendation for micro-videos, in: Proceedings...
  • Cui R. et al., Tweets can tell: activity recognition using hybrid gated recurrent neural networks, Soc. Netw. Anal. Min. (2020)
  • Zheng X. et al., Collecting event-related tweets from twitter stream, J. Assoc. Inform. Sci. Technol. (2019)
  • P. Dooley, B. Božić, Towards linked data for wikidata revisions and Twitter trending hashtags, in: Proceedings of the...

The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
