Multi-interest semantic changes over time in short-text microblogs
Introduction
Online content propagation has proliferated with the surge in citizen journalism. This is partly attributed to the growing number of devices such as mobile phones, the availability of the internet, and the emergence of many social microblogging platforms like Twitter. Twitter, as a short-text microblog, has been instrumental in sharing near real-time data in the form of text, videos, hyperlinks and images. The number of disseminated tweets averages about 500 million per day, which translates to approximately 6000 tweets per second. On the platform, users are able to re-share disseminated tweets as “retweets”, and to “comment” on and/or “like” the original tweet.
Generally, tweeters consume content on the platform based on their prevailing interests at the time. For example, during political campaigns, many demographically relevant tweeters are likely to express interest in political content. However, this interest is likely to decay once the political season is over. The same pattern can be seen in a sports season, where support for teams fades as the season winds up. Modelling and extracting such user dissemination patterns and representative interests is a challenging task, especially for legacy systems. This is attributed to two factors: (i) the volatility of the data, given its dissemination throughput, and (ii) time-based variations in the nature of the topics of interest.
The ability to extract and present time-sensitive and accurate user representative profiles from such evolving short texts is important in recommender systems design research. The design goal of recommender systems on such platforms is to personalize both the third-party content, and the user identification process. This in turn presents the most relevant users as follower–followee suggestions, as well as delivery of more personalized third-party content for the users. This personalization process is based on the extrinsic interests from the disseminated content.
The assumption in the design of the framework is that the semantics of disseminated content change over time for individual users. Therefore, we present a framework that is capable of discerning semantic user interests of short-text microblog users. This is challenging in short-text microblogs, as the level of expressivity is not always exhaustive due to character limitations per document/tweet. The text variable in a tweet’s metadata is limited to 280 characters though realistically, the average tweet length is much shorter. This is in addition to factors relating to throughput and content decay and gain over time on such platforms.
In designing such a framework, there was a need to first extract topical representations of the data at specific periods. Since the documents were short texts, vector representations worked well in discerning their semantic relevance. Each token in the pre-processed tweet was vectorized and modelled as the inner product between the vocabulary embeddings and the topic embeddings across specific periods. Topics of interest were captured at each period, with the overall semantic representation being the user interest weights across the dataset collection period. This way, semantic divergence across several topics of interest was accurately represented.
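The inner-product formulation described above can be illustrated with a minimal sketch. It assumes that the scores between word embeddings and a period-specific topic embedding are normalized with a softmax to give a distribution over the vocabulary; the normalization, the function names and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_distribution(word_embeddings, topic_embedding):
    """Distribution over the vocabulary for one topic at one period.

    Each score is the inner product of a word embedding with the
    topic embedding; the softmax step is an assumption of this sketch.
    """
    scores = [
        sum(w * t for w, t in zip(vec, topic_embedding))
        for vec in word_embeddings
    ]
    return softmax(scores)

# Hypothetical toy vocabulary and 2-dimensional embeddings.
vocab = ["election", "vote", "goal", "match"]
word_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Hypothetical embedding of the same topic slot at two periods.
topic_by_period = {"2019-Q1": [2.0, 0.0], "2019-Q2": [0.0, 2.0]}

dists = {
    period: word_distribution(word_embeddings, topic)
    for period, topic in topic_by_period.items()
}
# Most probable word per period.
top_words = {
    period: vocab[max(range(len(vocab)), key=lambda i: dist[i])]
    for period, dist in dists.items()
}
```

In this toy setup the topic embedding drifts between the two periods, so the most probable word shifts from a political term to a sports term, which is the kind of time-dependent change in interest that the framework is meant to capture.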
With the goal of personalization, there is a need to develop a framework that is able to capture user-representative interests on such platforms while accounting for time. Such a framework can present semantic profiles to third-party content curators, showing how user interests evolved over time. This can serve specific purposes related to the generation of profile information for users interested in certain topics at given times. Such interest patterns can then be used in recommending time-specific content, as well as related follower–followee networks. In the development of this framework, the following research questions are addressed:
- Is it possible to extract user-representative interests in time-series based streaming short texts for profiling?
- Are the extracted patterns in the short text sufficient for making time-dependent user/content recommendations and predictions for generalized short-text content disseminators?
This work is inspired by research on computing the degree of interest in selected topics, as well as on multi-interest user profiling in short-text microblogs [1], [2]. In follow-back recommendations, user-representative interest recommendations were generated among short-text microblog users with shared interests. The social theory of homophily was applied in validating the semantic correlations among the users [3]. In multi-interest profiling, user interests across topics were quantified by generating a responsibility matrix depicting each user's interest level in each topic. Algorithmically, Expectation Maximization (EM) and Gaussian Mixture Models (GMM) were applied over the vector representations to extract soft clusters [4]. Furthermore, the semantic distance between each user's aggregated vector representation and a cluster was assumed to be the user's interest level in that topical cluster.
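A responsibility matrix of the kind mentioned above can be sketched as follows. This is only the E-step of EM for a Gaussian mixture, under simplifying assumptions that are not taken from the cited work: cluster means are fixed, mixing weights are equal, and all clusters share one isotropic variance. The function name and toy vectors are hypothetical.

```python
import math

def responsibilities(user_vectors, cluster_means, variance=1.0):
    """Soft cluster memberships, one row per user.

    With equal mixing weights and a shared isotropic variance, the
    Gaussian normalizing constants cancel, so each responsibility
    depends only on the squared distance to the cluster mean.
    """
    rows = []
    for vec in user_vectors:
        log_dens = [
            -sum((x - m) ** 2 for x, m in zip(vec, mean)) / (2.0 * variance)
            for mean in cluster_means
        ]
        peak = max(log_dens)  # subtract the max for numerical stability
        weights = [math.exp(value - peak) for value in log_dens]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

# Hypothetical aggregated user vectors and two topical cluster means.
user_vectors = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
cluster_means = [[0.0, 0.0], [2.0, 2.0]]

resp = responsibilities(user_vectors, cluster_means)
```

Each row of `resp` sums to one and can be read as a user's interest level across the topical clusters. A full EM fit would alternate this step with re-estimating the means, variances and mixing weights.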
The points below highlight the novelty of this research. In addition, we emphasize how this work differs from other works at the end of Section 2.
1. The combination of word and topical embeddings as a time-variational distributional model differs from other works in the user profiling domain. Conventional modelling in related works in this domain points to either word or topical representation models, and not an amalgamation of the two.
2. In our approach, time sensitivity is central to the representation of the generated user-representative profiles. Topics are inferred over word embeddings, with topical vectors generated per mini-batch of documents and per timestamp, which captures the dynamic nature of representative user interests in short texts. This time-sensitive approach differs from keyword, concept and hybrid profiling approaches that rely on external knowledge bases to extract such interests. Vocabulary sparsity, together with the volume, variety and velocity of short-text dissemination, makes those profiling approaches insufficient in this scenario.
In addition to the above, the following contributions were made in the paper:
1. Formulation of a semantically representative user profiling framework that considers the disseminated content as time-based entities.
2. Each word in the framework is modelled as a categorical distribution of word embeddings and a time-based representation of the word’s assigned topic.
3. The model and data ingestion framework is tested on a generic set of tweets geolocated to an area over time. This ensured accuracy in validation of the end results using a true class dataset over certain periods in the collection. User-representative interests over a generic test set were computed by the methodology that qualitatively outperformed the other approaches across a range of measures.
4. Quantitatively, semantic weights between the control and test sets in five sub-topics across ten quarters were measured, depicting their semantic correlations. Linearity in the correlations across the timestamps indicated the validity of our modelling approach.
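The correlation check in the last contribution can be illustrated with a short sketch. It assumes the Pearson coefficient as the measure of linear agreement, which the text above does not specify, and the ten quarterly weights for one sub-topic are made-up values used purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical semantic weights for one sub-topic over ten quarters.
control_weights = [0.10, 0.15, 0.22, 0.30, 0.28, 0.35, 0.41, 0.38, 0.45, 0.50]
test_weights = [0.12, 0.14, 0.25, 0.27, 0.30, 0.33, 0.43, 0.36, 0.47, 0.52]

r = pearson(control_weights, test_weights)
```

A coefficient close to 1 would indicate that the test-set weights track the control-set weights linearly across the quarters; repeating the computation per sub-topic gives one coefficient for each of the five sub-topics.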
The rest of the paper is organized as follows. Section 2 summarizes the background and related literature of our study. Our approach is described in Section 3. The experimental framework is presented in Section 4, with results discussed in Section 5. Discussion and application areas are elicited in Section 6.
Section snippets
Related works
Microblog users’ content consumption patterns vary over time. This is mostly influenced by events within specific timeframes, more so in short-text microblogs like Twitter. Designing a framework that is able to incorporate these semantic changes over time in user-representative profiling is pertinent. There are two approaches to the formulation of user profiles, especially for use in recommender systems. The design either follows the behavioural [5] or structural [6] patterns in the modelling
Our approach
In modelling evolving user interests, the proposed framework encompasses several processes related to the generation of topical interests, word embeddings and a time-variant distributional model. The topical interests are distributed over word representations as time-variational topics in the model. Time variations are user-defined, as the dataset consists of time-series documents. Latent Dirichlet Allocation (LDA) is used to generate the topical information at each timestamp [44]. Summaries of the most
Experimentation
This section validates the processes presented in Section 3. A few consecutive steps are followed in the modelling process in addition to the description of the time-variant dataset in the experimentation. The experimentation process aims to generate topics as interests that are sensitive to time spanning the dataset period. Therefore, the end result in this experimentation phase is a comparative evaluation of an agreement between the control set in Section 4.1.2 and the generated topics from
Results
The framework’s performance was measured quantitatively and qualitatively to ascertain that the results were corroborated in both dimensions. Quantitatively, topical quality across the timestamps was measured using the generated embeddings. The best-performing modelling algorithm was then adopted for further qualitative evaluations. These related to intra-topical changes over time and correlations in topical interests, a key measure in the user profiling process in streaming texts.
Discussion
The goal of our work has been to identify evolving topical interests in short texts and eventually build representative user profiles. Prior research on the evolution of interests in streaming platforms encompassed several factors. The use of external data, annotation, and a mixture of features to augment user interests in the profiling process has been studied [31], [32], [35]. Word embedding representations have also been utilized in the generation of dynamic user profiles [38].
Conclusion and future work
Microblogging platforms like Twitter help present intrinsic and extrinsic user profiles to third-party content providers. This is, to a large extent, based on the nature of the content that users and their friendship networks consume over time. A framework that factors variational timestamps into topical embeddings was proposed. Several embedding techniques and baselines were modelled and tested for topical quality. FastText-based embeddings were selected for further computation based on their success
CRediT authorship contribution statement
Herman M. Wandabwa: Conceptualization, Methodology, Data curation, Formal analysis. M. Asif Naeem: Resources, Writing - review & editing, Supervision. Farhaan Mirza: Writing - review & editing, Supervision, Validation. Russel Pears: Supervision, Conceptualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (61)
- et al. Homophily, group size, and the diffusion of political information in social networks: Evidence from Twitter. J. Publ. Econom. (2016)
- et al. Forum latent Dirichlet allocation for user interest discovery. Knowl.-Based Syst. (2017)
- et al. Hashtag homophily in twitter network: Examining a controversial cause-related marketing campaign. Comput. Hum. Behav. (2020)
- et al. Modeling user interest in social media using news media and wikipedia. Inf. Syst. (2017)
- et al. Personalized recommendation based on hierarchical interest overlapping community. Inform. Sci. (2019)
- et al. Mining user interest based on personality-aware hybrid filtering in social networks. Knowl.-Based Syst. (2020)
- et al. Hybrid approach for detection of malicious profiles in twitter. Comput. Electr. Eng. (2019)
- et al. Modeling temporal dynamics of user interests in online social networks. Procedia Comput. Sci. (2015)
- et al. A graph-oriented model for hierarchical user interest in precision social marketing. Electron. Commer. Res. Appl. (2019)
- et al. User preferences modeling using dirichlet process mixture model for a content-based recommender system. Knowl.-Based Syst. (2019)
- Tracking spatio-temporal variation of geo-tagged topics with social media in China: A case study of 2016 hefei rainstorm. Int. J. Disas. Risk Reduc.
- Real-time processing of social media with SENTINEL: A syndromic surveillance system incorporating deep learning for health classification. Inf. Process. Manage.
- Document-based topic coherence measures for news media text. Expert Syst. Appl.
- Multi-interest user profiling in short text microblogs.
- Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol.
- Dynamic user modeling in social media systems. ACM Trans. Inform. Syst. (TOIS)
- Comparison and modelling of country-level microblog user and activity in cyber-physical-social systems using weibo and Twitter data. ACM Trans. Intell. Syst. Technol. (TIST)
- A varied density-based clustering approach for event detection from heterogeneous twitter data. ISPRS Int. J. Geo-Inf.
- Content-based recommender systems.
- Inferring user interests in microblogging social networks: a survey. User Model. User-Adapted Interact.
- User profiles for personalized information access.
- A survey of term weighting schemes for text classification. Int. J. Data Mining Modell. Manage.
- Elites tweet? Characterizing the Twitter verified user network.
- Tweets can tell: activity recognition using hybrid gated recurrent neural networks. Soc. Netw. Anal. Min.
- Collecting event-related tweets from twitter stream. J. Assoc. Inform. Sci. Technol.
The code (and data) in this article has been certified as Reproducible by Code Ocean: (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.