Buzzword detection in the scientific scenario

doi:10.1016/j.patrec.2015.09.017

Pattern Recognition Letters

Volume 69, 1 January 2016, Pages 42-48

https://doi.org/10.1016/j.patrec.2015.09.017

Highlights

• Buzzword detection through a time-series analysis.
• Identification of buzzwords in the DBLP database.
• Use of clustering techniques in trend detection.
• Evaluation of terms identified as buzzwords.

Abstract

This paper addresses a relatively new concept: the buzzword. Buzzwords are fashionable words that keep gaining popularity until a tipping point is reached, after which their popularity declines. Our goal in this study is to identify buzzwords through their frequency of occurrence over the years, using two clustering techniques: k-means and the self-organizing map (SOM). We used the DBLP database to run experiments with data from published papers in an attempt to find terms that could be classified as buzzwords according to this definition. The clusters generated by k-means and SOM are very similar, which suggests that the buzzwords were correctly identified. We were able to identify terms such as “android” and “mapreduce”, which were clearly buzzwords for 2012, as well as terms such as “pomdp”, which was not an obvious buzzword. As a contribution, we highlight common characteristics identified for buzzwords and compare the two buzzword-detection methods analyzed in this paper.

Introduction

All fields of research have topics that are the major focus of study in their communities. Sometimes such topics arouse interest gradually and eventually become the most discussed topic in an area; at other times, a topic's first appearance already coincides with the period in which it is most intensively explored. Knowing in advance which words will become buzzwords can help enterprises make strategic decisions about which fields are promising and deserve more attention, and thus secure a possible pioneering position in certain areas of knowledge.

Buzzwords are new terms or phrases (neologisms) created in one language that acquire great popularity as fashionable words [22]. Informally, a buzzword is a word or phrase related to a specialized field or group at a particular time, or in a particular context, used mostly to impress lay persons. These terms make it possible to identify the latest trends around the world, that is, what the population is discussing most or which topics are of greatest interest at a given moment. Buzzword detection provides important information, especially in the areas of marketing, business, politics, and intelligence [21], [30]. Therefore, it is very useful to identify these words as early as possible.

What distinguishes buzzwords from most new terms in a language is their exponential growth. It is difficult to predict whether a new word used by a community is destined to become a common term in dictionaries or is heading towards a tipping point from which it will decline. Neuman et al. [22] cite the example of the term “Web 2.0”, created in 2001 by Tim O’Reilly to describe a turning point for the Web. After a year and a half, the term had gained huge popularity, being quoted in Google more than 9.5 million times and, in 2009, the number of citations in Google had reached 422 million [22]. Currently, however, the term “Web 2.0” seems to have lost popularity.

The popularity of buzzwords comes from their use in media such as TV, magazines, newspapers, and social networks. However, there is usually a smaller group that uses these terms before they become popular with the masses. In other words, buzzwords emerge from a restricted community and gradually spread to other communities, to then become widely known among most people. By identifying this type of behavior, it is possible to find potential buzzwords.

The term “buzzword” has its historical use related to the language of the business and technology sector [20]. Some studies about it have been done, mainly in the blogosphere [7], [19], [21], [22], [30], or with a view to finding ways to model bursts of topics [4], [12], [24]. This focus is justified because blogs are sources of information in which users can express their opinions and interests in real time and thus reflect the most current trends. They also provide an ideal place to study the dynamics of a language’s environment. Studies on buzzword detection evaluate the possibility of a topic becoming popular by considering its temporal variation in the text of blogs, which allows researchers to observe the emergence of new topics and the concentration of interests over time [30]. Thus, a common approach in detecting buzzwords in blogs is to evaluate the growth rate of a topic's citations in such communities.
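The growth-rate idea mentioned above can be illustrated with a minimal sketch. The yearly counts below are invented for illustration, and `growth_rates` and `peak_index` are hypothetical helper names; the cited studies may define growth differently, so this is only one plausible reading (relative year-over-year change, with the peak taken as a stand-in for the tipping point).

```python
def growth_rates(counts):
    """Relative year-over-year change; None where the previous count is zero."""
    rates = []
    for prev, curr in zip(counts, counts[1:]):
        rates.append((curr - prev) / prev if prev else None)
    return rates


def peak_index(counts):
    """Index of the first maximum, used here as a stand-in for the tipping point."""
    return counts.index(max(counts))


# Invented counts of mentions for one term over five consecutive periods.
mentions = [2, 4, 8, 16, 12]
rates = growth_rates(mentions)
peak = peak_index(mentions)
```

In this toy series the term doubles three times and then drops by a quarter, which is the rise-then-decline shape the text associates with buzzwords.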

An unexplored field in which it would be interesting to study buzzwords is the scientific scenario. Since it is possible to see trends in the development of innovations in this scenario, there is also a high propensity for the emergence of buzzwords. According to [3], buzzwords frequently appear in the titles of conference papers and in comments and questions addressed to conference speakers. This occurs mainly because of the strong relationship between innovations and buzzwords [3]. Moreover, in the academic context, a buzzword can represent the interests of a community in relation to a particular subject, and the frequency at which a particular term is used by the academic community in scientific publications should be tracked over the years.

The identification of new buzzwords in the scientific field may indicate the rise of a new research or business area. To detect the emergence of these words, we can conduct technological forecasting studies, in which it would be possible to predict the impacts of a given innovation. Early buzzword detection is an important contribution to the decision-making process and market trend analysis.

The organization of the rest of this paper is as follows: Section 2 explains how we obtained and prepared the corpus for the experiments; Section 3 describes the clustering experiments; Section 4 analyzes and compares the results; and, finally, the conclusions of this work are presented in Section 5.

Section snippets

DBLP

For the present study, the DBLP database [15] was used. The DBLP project is currently maintained by the Universität Trier, in Germany. This database consists of more than 2,947,000 documents of bibliographic information in the computer science area, including conference papers, journals, series, books, and even Master’s and Doctorate degree theses.

Preprocessing

The preprocessing stage included data cleaning, data transcription to a database, and the selection of articles for use in this work.
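As a rough illustration of how cleaned title data could be turned into yearly term frequencies, the sketch below tokenizes titles and counts, per year, how many titles contain each term. The stop-word list, the tokenization rule, and the sample records are assumptions made for this example; the excerpt does not describe the authors' actual cleaning steps in detail.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the list actually used is not given in this excerpt.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens, dropping stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", title.lower()) if t not in STOP_WORDS]


def yearly_term_counts(records):
    """Map each term to a Counter of {year: number of titles containing the term}."""
    counts = defaultdict(Counter)
    for year, title in records:
        for term in set(tokenize(title)):  # count each term once per title
            counts[term][year] += 1
    return counts


# Invented (year, title) records standing in for entries selected from DBLP.
records = [
    (2010, "MapReduce for large-scale graph processing"),
    (2011, "Scheduling in MapReduce clusters"),
    (2011, "An Android application for the MapReduce model"),
    (2012, "Energy profiling of Android applications"),
]
counts = yearly_term_counts(records)
```

Counting each term once per title is a design choice of this sketch; counting every occurrence would also be reasonable.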

Formatting tags

Clustering

The initial analysis presented in this work was the clustering of words extracted from the titles of articles. The goal of this analysis was to identify the buzzwords by evaluating the behavior of the cluster in which they had been grouped. A word’s frequency over the years can help to detect buzzwords. Generally, the use of these terms increases sharply at the point in time when they are buzzwords; see, for example, the word “mapreduce” (Fig. 2).
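To make the idea of clustering words by their frequency over the years concrete, here is a minimal pure-Python k-means sketch. The term names and counts are invented, the peak normalization is an assumption of this example, and the paper's actual features, number of clusters, and tooling are not stated in this excerpt. The SOM variant is not shown.

```python
import random


def normalize(series):
    """Scale a yearly series so its peak equals 1, comparing shape rather than volume."""
    peak = max(series)
    return [v / peak for v in series] if peak else list(series)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented yearly counts over five consecutive years for four hypothetical terms.
yearly_counts = {
    "term_a": [0, 0, 1, 5, 20],      # sharp late rise
    "term_b": [0, 1, 2, 8, 30],      # sharp late rise
    "term_c": [10, 11, 10, 12, 11],  # roughly steady
    "term_d": [50, 48, 52, 49, 51],  # roughly steady
}
terms = list(yearly_counts)
points = [normalize(yearly_counts[t]) for t in terms]
labels, centroids = kmeans(points, k=2)
cluster_of = dict(zip(terms, labels))
```

The centroid of the cluster holding `term_a` and `term_b` rises steeply toward the final year, which is the kind of cluster-level curve one would inspect when looking for buzzword candidates.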

The dispersion curve over the years does not

Discussion

An interesting fact we noticed during our experiments is that several words were identified as candidates for 2012’s buzzwords by both the k-means and SOM algorithms. By analyzing these potential buzzwords, we could see words like “pomdp” and “pomdps”. The frequency over the years of these words is shown in Fig. 6. Indeed, the dispersion of the terms “pomdp” and “pomdps” is quite similar to that which is expected from a buzzword.

After searching whether or not this term is a buzzword, we

Conclusion

The aim of this work was to study the annual occurrence of terms found in the titles of articles in order to identify buzzwords. Through comparative analysis of the methods used, it was found that the results generated in relation to clustering executions performed with the k-means and the SOM approaches were consistent. The dispersion curves generated for the centroids of the clusters containing potential buzzwords for each year are similar for both approaches and describe growth behavior that

Acknowledgments

We would like to acknowledge the financial support from CNPq and Capes.

References (31)

J. Hoey et al. Automated handwashing assistance for persons with dementia using video and a partially observable Markov decision process. Comput. Vis. Image Underst. (2010)
M. Ley. Dblp: Some lessons learned. Proc. VLDB Endow. (2009)
C. Thornton et al. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013)
J. Yi. Detecting buzz from time-sequenced document streams. Proceedings of the 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service, 2005. EEE’05. (2005)
R. Agate et al. Autonomous safety decision-making in intelligent robotic systems in the uncertain environments. Proceedings of Annual Meeting of the North American Fuzzy Information Processing Society, 2008. NAFIPS 2008 (2008)
E.N. Chaturvedi et al. An improvement in k-mean clustering algorithm using better time and accuracy. Int. J. Program. Lang. Appl. (2013)
E. Chell et al. Handbook of Research on Small Business and Entrepreneurship (2014)
Q. Diao et al. Finding bursty topics from microblogs. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1 (2012)
L.d. Faria. Prospecção tecnológica em materiais: aumento da eficiência do tratamento bibliometrico: aplicação na análise de tratamentos de superfície resistentes ao desgaste. 2001. 213 f (2001)
J.E. Gewehr et al. Bioweka: extending the weka framework for bioinformatics. Bioinformatics (2007)
N. Glance et al. Blogpulse: Automated trend discovery for weblogs. Proceedings of the WWW 2004 workshop on the weblogging ecosystem: Aggregation, analysis and dynamics (2004)
V.L. Guedes et al. Bibliometria: uma ferramenta estatística para a gestão da informação e do conhecimento, em sistemas de informação, de comunicação e de avaliação científica e tecnológica. Encontro Nacional de Ciência da Informação (2005)
M. Hall et al. The weka data mining software: An update. SIGKDD Explor. Newsl. (2009)
J.E. Hirsch. An index to quantify an individual’s scientific research output (2005)
J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov. (2003)


^☆: This paper has been recommended for acceptance by Jie Zou.
