Discovering and understanding word level user intent in Web search queries

doi:10.1016/j.websem.2014.07.010

Journal of Web Semantics

Volume 30, January 2015, Pages 22-38

Abstract

Identifying and interpreting user intent are fundamental to semantic search. In this paper, we investigate the association of intent with individual words of a search query. We propose that words in queries can be classified as either content or intent, where content words represent the central topic of the query, while users add intent words to make their requirements more explicit. We argue that intelligent processing of intent words can be vital to improving the result quality, and in this work we focus on intent word discovery and understanding. Our approach towards intent word detection is motivated by the hypothesis that query intent words satisfy certain distributional properties in large query logs similar to function words in natural language corpora. Following this idea, we first establish the effectiveness of our corpus distributional features, namely, word co-occurrence counts and entropies, towards function word detection for five natural languages. Next, we show that reliable detection of intent words in queries is possible using these same features computed from query logs. To make the distinction between content and intent words more tangible, we additionally provide operational definitions of content and intent words as those words that should match, and those that need not match, respectively, in the text of relevant documents. In addition to a standard evaluation against human annotations, we also provide an alternative validation of our ideas using clickthrough data. Concordance of the two orthogonal evaluation approaches provides further support to our original hypothesis of the existence of two distinct word classes in search queries. Finally, we provide a taxonomy of intent words derived through rigorous manual analysis of large query logs.

Introduction

Semantic search has attracted a good amount of research in recent years [1], [2], [3]. The goal of semantic search is to improve the result relevance by appropriately understanding user intent and using intelligent document retrieval techniques to leverage the knowledge of this intent. Thus, the ability to identify user intent is one of the first steps in semantic search. Most often, the search query is a translation of the user’s intent into a short sequence of keywords. This imposes great value on every word in the query from the aspect of a semantic search engine. Past research has mostly focused on inferring the intent of the query as a whole, and the most generic intent classes were found to be informational, navigational and transactional [4], [5], [6]. In this research, we take a deeper look at query intent, zooming in on individual words as possible indicators of user intent.

From an information retrieval (IR) perspective, the equivalence of a Web search query with an unordered sequence of words (or a “bag-of-words”) has long been challenged, with research on term dependence [7], [8], [9] and term proximity models [10], [11], [12], [13], [14] showing significant improvements in retrieval performance. Extending this idea of the presence of a query structure further, we propose that words or multiword units in queries belong to two classes: content words, which represent the central topics of queries, and intent words, which are articulated by users to refine their information needs concerning the content words. The class of content units includes, but is not restricted to, named entities (like brad pitt, titanic and aurora borealis); anything that is capable of being the topic of a query would be the content unit in the context of that query. For example, blood pressure, marriage laws and magnum opus are legitimate examples of content words or units. Intent words or intent units, on the other hand, present vital clues to the search engine regarding the specific information sought by the user about the content units. For instance, intent units like home page, pics and meaning all specify unique information requests about the content units. The queries brad pitt website, brad pitt news and brad pitt videos all represent very different user intents. It is not hard to see that while content units need to be matched inside document text for relevance, the knowledge of intent units can be leveraged in other ways to improve user satisfaction. For example, words like pics, videos and map can all trigger relevant content formats to directly appear on the result page. Words like near and cheap may be used to sort result objects in the desired order. These ideas motivate us to focus on the discovery and understanding of query intent units in this research.
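To make the different treatment of content and intent units concrete, the following is a minimal sketch and not the authors' system. It assumes the query has already been split into units, and it uses two small hand-written lexicons (`FORMAT_TRIGGERS` and `SORT_KEYS`, both hypothetical names introduced here) built from the examples in the paragraph above. The function `plan_query` is likewise an illustrative name.

```python
# Hypothetical lexicons, hand-written for illustration only; a real
# system would mine intent units from query logs instead.
FORMAT_TRIGGERS = {"pics": "images", "videos": "videos", "map": "map"}
SORT_KEYS = {"near": "distance", "cheap": "price"}


def plan_query(units):
    """Separate units to match in document text from actions implied by intent units."""
    plan = {"match": [], "show_format": [], "sort_by": []}
    for unit in units:
        if unit in FORMAT_TRIGGERS:
            # Intent unit that can trigger a content format on the result page.
            plan["show_format"].append(FORMAT_TRIGGERS[unit])
        elif unit in SORT_KEYS:
            # Intent unit that can determine the ordering of result objects.
            plan["sort_by"].append(SORT_KEYS[unit])
        else:
            # Everything else is treated as content and must match document text.
            plan["match"].append(unit)
    return plan


pics_plan = plan_query(["brad pitt", "pics"])
cheap_plan = plan_query(["cheap", "hotels"])
```

A context-free lookup like this is deliberately naive: as the paper notes later, a unit such as video can act as content in some queries, so a practical labeler has to take the rest of the query into account.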

Appropriately understanding the distinction between the two classes of words and concretizing these notions of intent and content required rigorous manual analysis of large volumes of query logs on our part. During this process, we observed that intent units share corpus distributional properties similar to function words of natural language (NL). NLs generally contain two categories of words: content and function [15]. In English, nouns, verbs, adjectives and most adverbs constitute the class of content words. On the other hand, pronouns, determiners, prepositions, conjunctions, interjections and other particles are classified as function words. While content words express meaning or semantic content, function words express important grammatical relationships between various words within a sentence, and themselves have little lexical meaning. The distinction between content and function words, thus, plays an important role in characterizing the syntactic properties of sentences [16], [17], [18]. Distributional postulates that are valid for function word detection, like the co-occurrence patterns of function words being more diverse and unbiased than those of content words, seemed to be valid for query intent units as well. Following these leads, we first segmented queries to identify possible multiword units using a state-of-the-art query segmentation algorithm [19], and computed the relevant distributional properties, namely, co-occurrence counts and entropies, for the obtained query units. We found that the units which exhibit high values of these indicators indeed satisfy our notions about the class of intent units. Subsequently, we systematically evaluated our findings against human annotations and clickthrough data (which represent functional evidence of user intent) and substantiated our hypotheses.
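The two indicators named above can be illustrated with a short sketch. This is one plausible reading, not necessarily the exact definition used in the paper: the co-occurrence count of a unit is taken here as the number of distinct units it appears with in the same query, and the entropy is the Shannon entropy (in bits) of its co-occurrence distribution. The function name `cooccurrence_stats` and the toy log are assumptions made for this example; the queries are assumed to be segmented already.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_stats(segmented_queries):
    """Return {unit: (number of distinct co-occurring units, entropy in bits)}."""
    neighbours = defaultdict(Counter)
    for units in segmented_queries:
        distinct = set(units)
        for u in distinct:
            for v in distinct:
                if u != v:
                    neighbours[u][v] += 1

    stats = {}
    for unit, counts in neighbours.items():
        total = sum(counts.values())
        # Shannon entropy of the co-occurrence distribution: sum of p * log2(1/p).
        entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
        stats[unit] = (len(counts), entropy)
    return stats


# Toy, hand-segmented query log (illustrative only, not real data).
toy_log = [
    ["brad pitt", "videos"],
    ["titanic", "videos"],
    ["us open", "videos"],
    ["aurora borealis", "videos"],
    ["brad pitt", "news"],
    ["brad pitt", "website"],
]
stats = cooccurrence_stats(toy_log)
```

In this toy log, videos co-occurs evenly with four different units and so gets both the largest count and the highest entropy, which is the kind of diverse, unbiased co-occurrence pattern the paragraph associates with intent units. On a log this small the numbers carry no statistical weight; the sketch only shows how the quantities are computed.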

In hindsight, we understand that while NL function words have little describable meaning (like in, of and what) and only serve to specify relationships among content words, well-defined semantic interpretations can be attributed to most intent words (like map, pics and videos). Intent words, even though they effectively lack purpose without one or more content words in the same query, carry weight of their own within the query. Thus, content and intent units play slightly different roles in the query from the roles of content and function words in NL sentences. It simply turns out that function words in NL and intent words in queries share similar statistical behavior. Function words and intent words are still not fully comparable, and an important difference between the two is that the definition of a function word is not context-dependent, whereas intent words can also behave as content words depending on the context (Section 4).

The objective of this paper is to identify and characterize intent words in Web search queries, words that are explicit indicators of user intent, and it is organized as follows. In Section 2, we begin with a verification of the efficacy of corpus-based distributional statistics towards function word identification and through rigorous experimentation over five languages, discover that co-occurrence counts and entropies are the most robust indicators of function words in NL. Having convinced ourselves of the power of co-occurrence statistics in detecting function words across diverse languages, we apply similar techniques to discover intent units in Web search queries (Section 3). This is followed by a simple algorithm to label intent units in the context of individual queries and subsequent evaluations using human annotations and clickthrough data (Section 4). Observing that co-occurrence statistics locate quite a diverse set of intent units, we attempt to provide a taxonomy of such units based on their relationships with content words that we believe can be very useful in semantic search (Section 5). Finally, we present concluding remarks and open directions for future work (Section 6).

Section snippets

Distributional properties of NL function words

Function words play a crucial role in many Natural Language Processing (NLP) applications. They are used as features for unsupervised POS induction and also provide vital clues for grammar checking and machine translation. In this section, we first re-examine the popular hypothesis that the most frequent words in a language are the function words. By function words or units we refer to all the closed-class lexical items in a language, e.g., pronouns, determiners, prepositions, conjunctions,

Intent units of Web search queries

Web search queries are issued by users to communicate their information needs to search engines. Thus, their function is similar to that of languages [25], [26]. Past research [27], [28], [29] suggests that Web queries have a distinct structure where the units are not always single words but segments comprising one or more words. For example, not all permutations of the query nokia n96 gprs config telstra australia are meaningful: only three permutable units make sense, which are nokia n96, gprs config

Labeling intent units in query context

A segment can act as content or intent in a query depending upon the context. For example, while the segment video behaves as an intent unit in most queries, like, (us open) (video) (specifying that the desired content type is a video), it is the content unit in the query (definition of) (video). Thus, a labeling scheme is practically useful only if it can label segments as content or intent within a query, and not just in a context-agnostic standalone fashion. We note here that this is not

A taxonomy of intent units in Web search queries

Roles of units: In order to better understand the roles of intent units in queries, we went through the list of intent units and several hundreds of queries in which they occur. Our study reveals that intent units in Web search queries can be broadly thought of as performing one of two tasks, namely, restrict or rank. The restrict task is concerned with filtering the pool of relevant documents from which the final results are presented. The rank task determines the order in which the final

Conclusions and future work

In this paper, we have proposed that intent units can act as indicators of user intent in Web search queries. We have shown that co-occurrence distributions of units can be leveraged for unsupervised mining of intent units from query logs. We have established the effectiveness of our method by using similar techniques for detecting function words in NL text, which share similar corpus distributional properties with intent words of search queries. As our techniques do not use any specific

Acknowledgments

The first author was supported by Microsoft Corporation and Microsoft Research India under the Microsoft Research India Ph.D. Fellowship Award (Grant No. 8901901/IITKGP-Fship-2011). We would like to thank Nikita Mishra, Amritayan Nayak, Dastagiri Reddy and Anusha Suresh (Indian Institute of Technology Kharagpur, India), and Pavan Kumar and Rohit Kumar (National Institute of Technology Durgapur, India), for valuable inputs at various stages of this work. Finally, we thank anonymous reviewers of

References (58)

B.J. Jansen et al., Determining the informational, navigational, and transactional intent of Web queries, Inf. Process. Manage. (2008)
P. Haase et al., Semantic Wiki search
D. Tumer et al., An empirical evaluation on semantic search performance of keyword-based and semantic search engines: Google, Yahoo, MSN and Hakia
S. Ferré et al., Semantic search: Reconciling expressive querying and exploratory search
A. Broder, A taxonomy of Web search, SIGIR Forum (2002)
U. Lee et al., Automatic identification of user goals in Web search
J. Gao et al., Dependence language model for information retrieval
D. Metzler et al., A Markov random field model for term dependencies
M. Bendersky et al., Two-stage query segmentation for information retrieval
T. Tao et al., An exploration of proximity measures in information retrieval
R. Cummins et al., Learning in a pairwise term–term proximity framework for information retrieval
R. Song et al., Viewing term proximity from a different perspective
J. Bai et al., Investigation of partial query proximity in Web search
B. He et al., Modeling term proximity for probabilistic information retrieval models, Inform. Sci. (2011)
J.L. Morgan et al., Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition: [chapters Presented at a Conference Held Feb. 19-21, 1993, Brown University, Providence, RI] (1996)
R. Jackendoff, X-Bar Syntax (1977)
N. Chomsky, Barriers (1986)
N. Fukui, M. Speas, Specifiers and projection, in: MIT Working papers in Linguistics, Vol. 8, No. 128,...
R. Saha Roy et al.
G. Salton et al., Introduction to Modern Information Retrieval (1986)
R. Baeza-Yates et al., Modern Information Retrieval (1999)
C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. (1948)
G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing (1971)
P. Koehn, Europarl: A parallel corpus for statistical machine translation, in: MT Summit, vol. 5, 2005, pp....
J. Huang et al., Exploring Web scale language models for search query processing
R. Saha Roy et al., Are Web search queries an evolving protolanguage?
S. Bergsma, Q.I. Wang, Learning noun phrase query segmentation, in: EMNLP-CoNLL’07, 2007, pp....
M. Manshadi et al., Semantic tagging of Web search queries
M. Hagen et al., Query segmentation revisited

¹: This work was done while the author was at Microsoft Research India.