An empirical study of the structure of relevant keywords in a search engine using the minimum spanning tree

https://doi.org/10.1016/j.eswa.2011.09.147

Abstract

This paper provides a comprehensive study of the structure of relevant keywords in a search engine using the minimum spanning tree (MST) approach. In constructing the MSTs, we introduce a novel metric for the distance between keywords that integrates the Pearson correlation with the query-based cosine similarity. From this work, we make several meaningful observations about the networks of relevant keywords. First, keyword networks in a search engine exhibit the small-world effect and the scale-free property. Second, only a few of the relevant keywords, taken in order of popularity, are positively correlated; the rest show no significantly positive or negative relationship. Third, the degree of searching activity for relevant keywords varies depending on whether they are branded or non-branded keywords, as well as on the characteristics of the product category. Fourth, the mean correlation coefficient for keyword impressions increases during the slow season. Finally, both kmax and the betweenness centrality are higher for high-involvement products than for low-involvement products.

Introduction

Online advertising is a form of promotion that is a key component of marketing strategies to increase sales in online marketplaces as well as search engines. It delivers marketing messages to attract consumers. A successful form of online advertising is sponsored search (Wu et al., 2009). It is a very effective medium for advertising because it allows precise targeting of advertisements to users (Yang, Chen, & Chen, 2007). Sponsored search, also called Search Engine Marketing (SEM), is a “pay-for-placement” service in which advertisers pay to have text-only ads displayed alongside organic search results (Abhishek & Hosanagar, 2007). Billions of dollars are spent each year on sponsored search. According to a study by eMarketer, Internet advertising spend is expected to grow from $16.4 billion in 2006 to $36.5 billion in 2011, and 40% of this ad spend occurs on sponsored search (eMarketer). In sponsored search advertising, advertisers have to decide how to spread their budget across target keywords. Since advertisers typically have a fixed daily budget that should not be exceeded, an advertiser must allocate the budget as productively as possible by selecting which keywords to use and then deciding how much to allocate to each keyword (Ozluk & Cholette, 2007). There have been many studies on the allocation of an advertising budget to maximize product sales (Borgs et al., 2007; Freuchter and Dou, 2005; Holthausen and Assmus, 1982). They have mainly focused on optimal budget allocation among a large set of heterogeneous keywords. Meanwhile, it is often difficult even to select a set of keywords from all the possible keywords that users might type in order to find a product. Advertisers therefore usually select keywords that are relevant to their products and then purchase those keywords from paid listing providers (Faber, Lee, & Nan, 2004).
In this respect, the top relevant keywords in a product category can be used for the purpose of online advertising. In fact, major search engines, including Google, automatically display a list of relevant keywords that match the characters typed so far as a user enters a keyword query. Furthermore, Abhishek and Hosanagar (2007) suggested a method to generate keywords for search engine advertising using semantic similarity between terms. The semantic similarity of the proposed model is obtained by computing the TF-IDF (term frequency-inverse document frequency) from the retrieved documents associated with a query word. However, keywords generated from a seed keyword do not guarantee the profitability of advertising, which was the original objective of their study, because users differ in their keyword search behavior and may therefore choose a different seed keyword.
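
To make the TF-IDF idea concrete, the following is a minimal sketch, not the implementation of Abhishek and Hosanagar (2007). It assumes each candidate term comes with a bag of tokens taken from documents retrieved for that term, uses one common unsmoothed TF-IDF variant, and compares terms by cosine similarity. The toy token lists and the function names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each term to a sparse TF-IDF vector built from its retrieved tokens.

    docs: dict mapping a term to the list of tokens retrieved for it.
    Weight = (relative term frequency) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(tok for tokens in docs.values() for tok in set(tokens))
    vectors = {}
    for term, tokens in docs.items():
        counts = Counter(tokens)
        vectors[term] = {
            tok: (c / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical retrieved text for three candidate keywords (illustration only).
retrieved = {
    "laptop": "portable computer with keyboard and screen".split(),
    "notebook": "thin portable computer with screen".split(),
    "camera": "device with lens for taking photos".split(),
}
vectors = tfidf_vectors(retrieved)
sim_laptop_notebook = cosine(vectors["laptop"], vectors["notebook"])
sim_laptop_camera = cosine(vectors["laptop"], vectors["camera"])
```

In this toy data the only token shared by "laptop" and "camera" occurs in every document, so its inverse document frequency is zero and it contributes nothing to the similarity.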

In this paper, we apply network theory to the analysis of relevant keywords in sponsored search because a network topology provides efficient ways of understanding its structural properties. Several attempts have been made to apply network theory to finding relevant keywords. For instance, Liu, Weichselbraun, Scharl, and Chang (2005) adopted the augmented semantic network for identifying the network’s most relevant keywords. However, there has been no research on the construction of relevant keywords by linking search keywords based on the minimum spanning tree (MST). We examine the network properties of the search keywords and interpret the network topology in marketing terms. The remainder of the paper is organized as follows. In Section 2, we discuss previous work on constructing keyword correlation graphs. Section 3 introduces methodologies for constructing a network of search keywords based on their daily impressions and on web logs identifying their interactions with the other keywords. Section 4 presents statistics and empirical results extracted from the MSTs of relevant keywords. Finally, Section 5 presents the conclusions and future work of the study.

Section snippets

Previous research on constructing keyword correlation graphs

Some previous studies on the relevancy of keywords attempted to visualize the structure of search keywords based on their correlation with each other. Bansal and Koudas (2007) introduced BlogScope (www.blogscope.net), an analysis and visualization tool for the blogosphere. It shows a list of relevant keywords using keyword correlation analysis so that users can easily jump from a keyword to related keywords. In the keyword correlation analysis, the notion of the correlation c(a, b)

Methodology

We construct a network of search keywords; each node (keyword) has a different number of links (connections) and weights (correlations). Since empirical correlation matrices are of great importance in data analysis for extracting the underlying information contained in “experimental” signals and time series (Laloux, Cizeau, Bouchaud, & Potters, 1999), we use the cross-correlations in the change of impressions between search keywords. The data consist of daily impressions for each keyword
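
As an illustration of the general pipeline described above (daily impressions, cross-correlations of their changes, then an MST), the following sketch may help. It is not the metric proposed in this paper, which also integrates the query-based cosine similarity. It assumes log-changes of strictly positive impressions, a plain Pearson correlation, the transform d = sqrt(2(1 - rho)) often used to turn correlations into distances, and Kruskal's algorithm. The impression counts and all function names are hypothetical.

```python
import math

def daily_changes(series):
    """Log-change between consecutive daily impression counts (assumed choice)."""
    return [math.log(b / a) for a, b in zip(series, series[1:])]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def distance(rho):
    """Assumed correlation-to-distance transform: 0 when rho = 1, 2 when rho = -1."""
    return math.sqrt(2.0 * max(0.0, 1.0 - rho))

def minimum_spanning_tree(names, dist):
    """Kruskal's algorithm with union-find; dist maps index pairs (i, j) to distances."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    tree = []
    for (i, j), d in sorted(dist.items(), key=lambda item: item[1]):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two components, so it creates no cycle
            parent[root_i] = root_j
            tree.append((names[i], names[j], d))
    return tree

# Hypothetical daily impressions for four keywords (illustration only).
impressions = {
    "laptop": [100, 120, 110, 150, 160, 140],
    "notebook": [80, 95, 90, 118, 130, 112],
    "tablet": [60, 55, 70, 65, 60, 75],
    "camera": [40, 42, 39, 45, 50, 44],
}
names = list(impressions)
changes = [daily_changes(impressions[name]) for name in names]
dist = {
    (i, j): distance(pearson(changes[i], changes[j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
}
mst_edges = minimum_spanning_tree(names, dist)
```

For N keywords the resulting tree keeps N - 1 of the N(N - 1)/2 possible links, retaining the shortest distances that still connect every keyword.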

Small-world property

Many real-world networks have large clustering coefficients and short average path lengths, and networks with these two characteristics are called small-world networks (Souma, Fujiwara, & Aoyama, 2003). They also usually have power-law degree distributions, and such networks are called scale-free networks (Barabasi & Albert, 1999). In fact, it has become clear that many social networks can be understood as small-world networks and scale-free networks. Even though many real-world networks possess
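
The two quantities that define the small-world property can be computed directly from an adjacency list. The sketch below is an illustration on a hypothetical four-node undirected graph, not on the keyword networks studied in this paper. It assumes an unweighted, connected graph, uses the mean of the local clustering coefficients, and obtains shortest paths by breadth-first search.

```python
from collections import deque

def clustering_coefficient(adj):
    """Mean local clustering coefficient; nodes with fewer than two neighbours count as 0."""
    total = 0.0
    for node, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            continue
        # Count links that exist between pairs of this node's neighbours.
        links = sum(1 for u in neighbours for v in neighbours if u < v and v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs, via BFS."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical undirected graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
mean_clustering = clustering_coefficient(adj)   # (1 + 1 + 1/3 + 0) / 4
mean_path_length = average_path_length(adj)     # 8 hops over 6 unordered pairs
```

A network is usually called small-world when its clustering coefficient is much larger than that of a random graph with the same number of nodes and links, while its average path length remains comparably short.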

Conclusion

In recent years, sponsored search has become an important and fast-growing revenue source for Internet search engines (Feng, Bhargava, & Pennock, 2007). Advertisers need to bid on multiple keywords relevant to their business when buying sponsored search ads because their constrained budget must be allocated across multiple keywords. Before bidding on multiple keywords, they have to extract a set of keywords relevant to their business for effective online advertising. There have been many

References (29)

  • Newman, M. (2005). A measure of betweenness centrality based on random walks. Social Networks.
  • Sieczka, P., et al. (2009). Correlations in commodity markets. Physica A: Statistical Mechanics and its Applications.
  • Souma, W., et al. (2003). Complex networks and economics. Physica A: Statistical Mechanics and its Applications.
  • Abhishek, V., & Hosanagar, K. (2007). Keyword generation for search engine advertising using semantic similarity...
  • Bansal, N., & Koudas, N. (2007). Searching the blogosphere. In...
  • Barabasi, A., et al. (1999). Emergence of scaling in random networks. Science.
  • Barrat, A., et al. (2004). The architecture of complex weighted networks. Proceedings of the National Academy of Sciences.
  • Borgs, C., Chayes, J., & Etesami, O. (2007). Dynamics of bid optimization in online advertisement auctions. In...
  • Brandes, U. (2001). A faster algorithm for betweenness centrality. Journal of Mathematical Sociology.
  • Chen, C., Jiang, J., & Hsiao, F. (2008). Two-phased information search and evaluation in e-consumers decision process...
  • Faber, R., et al. (2004). Advertising and the consumer information environment online. American Behavioral Scientist.
  • Feng, J., et al. (2007). Implementing sponsored search in web search engines: Computational evaluation of alternative mechanisms. INFORMS Journal on Computing.
  • Freuchter, G.E., et al. (2005). Optimal budget allocation over time for keyword ads in web portals. Journal of Optimization Theory and Applications.
  • Guimera, R., et al. (2005). The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles. Proceedings of the National Academy of Sciences of the United States of America.