An empirical study of the structure of relevant keywords in a search engine using the minimum spanning tree

doi:10.1016/j.eswa.2011.09.147

Expert Systems with Applications

Volume 39, Issue 4, March 2012, Pages 4432-4443

https://doi.org/10.1016/j.eswa.2011.09.147

Abstract

This paper provides a comprehensive study of the structure of relevant keywords in a search engine using the minimum spanning tree (MST) approach. In constructing the MSTs, we introduce a novel metric that measures the distance between keywords by integrating the Pearson correlation and the query-based cosine similarity. From this work, we made several meaningful observations about the networks of relevant keywords. First, keyword networks in a search engine exhibit the small-world effect and the scale-free property. Second, only a few of the relevant keywords, taken in order of popularity, are positively correlated, and there is no significantly positive or negative relationship among the rest. Third, the degree of searching activity for relevant keywords varies depending on whether they are branded or non-branded keywords, as well as on the characteristics of the product category. Fourth, the mean correlation coefficient for keyword impressions increases during the slow season. Finally, both k_max and the betweenness centrality are higher for high-involvement products than for low-involvement products.

Introduction

Online advertising is a form of promotion that is a key component of marketing strategies to increase sales in online marketplaces as well as search engines. It delivers marketing messages to attract consumers. A successful form of online advertising is sponsored search (Wu et al., 2009). It is a very effective medium for advertising because it allows precise targeting of advertisements to users (Yang, Chen, & Chen, 2007). Sponsored search, or Search Engine Marketing (SEM), is a “pay-for-placement” service in which advertisers pay to have text-only ads displayed alongside organic search results (Abhishek & Hosanagar, 2007). Billions of dollars are spent each year on sponsored search. According to a study by eMarketer, Internet advertising spend is expected to grow from $16.4 billion in 2006 to $36.5 billion in 2011, and 40% of this ad spend occurs on sponsored search (eMarketer). In sponsored search advertising, advertisers have to decide how to spread their budget across target keywords. Since advertisers typically have a fixed daily budget that should not be exceeded, an advertiser must allocate the budget as productively as possible by selecting which keywords to use and then deciding how much to allocate to each keyword (Ozluk & Cholette, 2007). There have been many studies on the allocation of an advertising budget to maximize product sales (Borgs et al., 2007; Freuchter & Dou, 2005; Holthausen & Assmus, 1982). They have mainly focused on optimal budget allocation among a large set of heterogeneous keywords. Meanwhile, it is often difficult even to select a set of keywords from all the possible keywords that someone might type in order to find a product. Advertisers therefore usually select keywords that are relevant to their products and then purchase those keywords from paid listing providers (Faber, Lee, & Nan, 2004).
In this respect, the top relevant keywords in a product category can be used for the purpose of online advertising. In fact, major search engines including Google automatically display a list of relevant keywords that match the characters a user types into the search box. Furthermore, Abhishek and Hosanagar (2007) suggested a method to generate keywords for search engine advertising using the semantic similarity between terms. In their model, semantic similarity is obtained by computing the TF-IDF (Term Frequency-Inverse Document Frequency) from the retrieved documents associated with a query word. However, advertising keywords generated from a seed keyword do not guarantee the profitability of advertising, which was the original objective of their study, because users may have different keyword search preferences and may therefore choose a different seed keyword.

In this paper, we apply network theory to the analysis of relevant keywords in sponsored search because a network topology provides efficient ways of understanding its structural properties. Several attempts have been made to apply network theory to finding relevant keywords. For instance, Liu, Weichselbraun, Scharl, and Chang (2005) adopted the augmented semantic network for identifying the network’s most relevant keywords. However, there has been no research on the construction of relevant keywords by linking search keywords based on the minimum spanning tree (MST). We examine the network properties of the search keywords and interpret the network topology in marketing terms. The remainder of the paper is organized as follows. Section 2 discusses previous work on constructing keyword correlation graphs. Section 3 introduces a methodology for constructing a network of search keywords based on their daily impressions and on web logs identifying interactions with the other keywords. Section 4 presents some statistics and empirical results extracted from the MSTs of relevant keywords. Finally, Section 5 presents the conclusion and future work.

Section snippets

Previous research on constructing keyword correlation graphs

Some previous studies on the relevancy of keywords attempted to visualize the structure of search keywords based on their correlation with each other. Bansal and Koudas (2007) introduced BlogScope (www.blogscope.net), an analysis and visualization tool for the blogosphere. It shows a list of relevant keywords using keyword correlation analysis so that users can easily jump from a keyword to related keywords. In the keyword correlation analysis, the notion of the correlation c(a, b)

Methodology

We construct a network of search keywords; each node (keyword) has a different number of links (connections) and weights (correlations). Since empirical correlation matrices are of great importance in data analysis in order to extract the underlying information contained in “experimental” signals and time series (Laloux, Cizeau, Bouchaud, & Potters, 1999), we use the cross-correlations in the change of impressions between search keywords. The data consist of daily impressions for each keyword
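The general idea of turning cross-correlations of impression changes into a tree can be sketched as follows. This is only an illustration under stated assumptions, not the paper's method: the excerpt does not give the exact form of the proposed metric (which also integrates query-based cosine similarity), so the sketch uses plain Pearson correlation of day-over-day log changes and the commonly used distance d = sqrt(2(1 - rho)) as a stand-in. The keyword names and impression counts are made up, and impressions are assumed to be strictly positive and non-constant.

```python
import math

def changes(series):
    """Day-over-day log changes of a daily-impression series (assumed > 0)."""
    return [math.log(b / a) for a, b in zip(series, series[1:])]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mst(impressions):
    """Kruskal's algorithm on the stand-in distance d = sqrt(2 * (1 - rho))."""
    names = sorted(impressions)
    delta = {k: changes(impressions[k]) for k in names}
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = pearson(delta[a], delta[b])
            # max() guards against tiny negative values from rounding
            edges.append((math.sqrt(max(0.0, 2.0 * (1.0 - rho))), a, b))
    edges.sort()

    parent = {k: k for k in names}  # union-find forest

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    tree = []
    for d, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge joins two components, so keep it
            parent[ra] = rb
            tree.append((a, b, d))
    return tree

# Hypothetical daily impressions for three keywords
impressions = {
    "laptop":   [100, 120, 110, 130, 125],
    "notebook": [80, 95, 90, 104, 100],
    "camera":   [50, 45, 60, 40, 55],
}

for a, b, d in mst(impressions):
    print(f"{a} - {b}: distance {d:.3f}")
```

With n keywords the resulting tree has n - 1 edges, and strongly correlated keywords end up adjacent because their distance is smallest.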

Small-world property

Many real-world networks have large clustering coefficients and short average path lengths; networks with these two characteristics are called small-world networks (Souma, Fujiwara, & Aoyama, 2003). They also usually have a power-law degree distribution, and such networks are called scale-free networks (Barabasi & Albert, 1999). In fact, it has become clear that many social networks can be understood as small-world networks and scale-free networks. Even though many real-world networks possess
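The two quantities that define the small-world property can be computed directly from an adjacency structure. The sketch below is a generic illustration on a small hypothetical undirected graph, not the paper's data or code; it uses the mean local clustering coefficient (nodes with fewer than two neighbours contribute zero) and the mean shortest-path length over reachable pairs, found by breadth-first search.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # number of neighbour pairs that are themselves connected
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all ordered reachable pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(round(clustering_coefficient(adj), 4))  # 0.5833
print(round(average_path_length(adj), 4))     # 1.3333
```

A network is usually described as small-world when its clustering coefficient is much larger than that of a comparable random graph while its average path length stays similarly short.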

Conclusion

In recent years, sponsored search has become an important and fast-growing revenue source for Internet search engines (Feng, Bhargava, & Pennock, 2007). Advertisers need to bid on multiple keywords relevant to their business when buying sponsored search ads, because their constrained budget must be allocated across multiple keywords. Before bidding on multiple keywords, they have to extract a set of keywords relevant to their business for effective online advertising. There have been many

References (29)

Newman, M. (2005). A measure of betweenness centrality based on random walks. Social Networks.
Sieczka, P., et al. (2009). Correlations in commodity markets. Physica A: Statistical Mechanics and its Applications.
Souma, W., et al. (2003). Complex networks and economics. Physica A: Statistical Mechanics and its Applications.
Abhishek, V., & Hosanagar, K. (2007). Keyword generation for search engine advertising using semantic similarity...
Bansal, N., & Koudas, N. (2007). Searching the blogosphere. In...
Barabasi, A., et al. (1999). Emergence of scaling in random networks. Science.
Barrat, A., et al. (2004). The architecture of complex weighted networks. Proceedings of the National Academy of Sciences.
Borgs, C., Chayes, J., & Etesami, O. (2007). Dynamics of bid optimization in online advertisement auctions. In...
Brandes, U. (2001). A faster algorithm for betweenness centrality. Journal of Mathematical Sociology.
Chen, C., Jiang, J., & Hsiao, F. (2008). Two-phased information search and evaluation in e-consumers decision process...
Faber, R., et al. (2004). Advertising and the consumer information environment online. American Behavioral Scientist.
Feng, J., et al. (2007). Implementing sponsored search in web search engines: Computational evaluation of alternative mechanisms. INFORMS Journal on Computing.
Freuchter, G. E., et al. (2005). Optimal budget allocation over time for keyword ads in web portals. Journal of Optimization Theory and Applications.
Guimera, R., et al. (2005). The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles. Proceedings of the National Academy of Sciences of the United States of America.

