Query Log Mining for Inferring User Tasks and Needs

Mehrotra, Rishabh; Yilmaz, Emine

doi:10.1007/978-3-319-46131-1_36


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9853)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases


Abstract

Search behavior, and information-seeking behavior more generally, is often motivated by tasks. These tasks prompt search processes that are often lengthy, iterative, and intermittent, and that are characterized by distinct stages, shifting goals, and multitasking. Current search systems do not adequately support users tackling complex tasks, so the cognitive burden of keeping track of such tasks falls on the searcher. In this note, we summarize our recent efforts towards extracting search tasks from search logs. Building on recent advances in Bayesian nonparametrics and distributional semantics, we propose novel algorithms to extract tasks and subtasks from a query collection. The models discussed can inform the design of the next generation of task-based search systems that leverage users' task behavior for better support and personalization.




1 Introduction

Search behavior, and more generally information-seeking behavior, is often motivated by tasks. These tasks prompt search processes that are often lengthy, iterative, and intermittent, and that are characterized by distinct stages, shifting goals, and multitasking. Current search engines do not adequately support complex tasks (e.g. planning a trip, surveying a topic), so the cognitive burden of keeping track of such tasks and completing them falls on the searcher. Ideally, a search engine should be able to decipher the underlying reason that led the user to submit a query (i.e., the actual task that caused the query to be issued) and use this knowledge of the actual information need to guide the user towards completing the task.

In this research, we hypothesize that a comprehensive understanding of users' tasks would enable better support and recommendations based on their contextual information and, as a result, help users accomplish their tasks. As part of the proposed research, we consider the challenge of extracting tasks from a given collection of search log data and present task extraction techniques that rely on recent advances in Bayesian nonparametrics and word embeddings. We evaluate these techniques using crowdsourced judgments as well as labelled ground-truth data.

2 Task Based Information Retrieval

Our efforts at developing task-based retrieval systems have focussed on three major themes: (i) understanding searchers' behavior, (ii) developing task extraction techniques, and (iii) showing the benefits of task information via improved personalization. We next describe each of them in detail.

2.1 Understanding Searcher’s Task Behavior

While a major share of prior work has considered search sessions as the focal unit of analysis for seeking behavioral insights [7–9], search tasks are emerging as a competing perspective in this space. In recent work [1], we quantify the multi-tasking behavior of web search users and show that over 50% of search sessions have more than 2 tasks. Further, we provide a method to categorize users as focused, multi-taskers, or supertaskers depending on their level of task-multiplicity, and show that the search effort expended varies across these groups. Additionally, in follow-up work [3], we relate users' multitasking propensities to tasks and topics. Specifically, we analyze the user-disposition, topic, and user-interest level heterogeneities that are prevalent in search task behavior. We find that not only do users have varying propensities to multi-task, they also search for distinct topics across single-task and multi-task sessions. These findings provide useful insights about task-multiplicity in an online search environment and hold potential value for search engines that wish to personalize and support users' search experiences based on their task behavior.

2.2 Extracting Hierarchies

An important first step in developing task-based systems is task extraction. In recently published work [4], we considered the challenge of extracting hierarchies of search tasks and their associated subtasks from a given search log, using only the log data and no manual annotation. We present an efficient Bayesian nonparametric model for discovering task hierarchies and propose a tree-based Bayesian hierarchical task construction algorithm to discover the rich hierarchical structure embedded within search logs. Our model organises the queries into a nested hierarchy T of tasks/subtasks, with all queries in one node at the root and singleton queries at the leaves. We interpret a tree T as a mixture of partitions over the group of queries Q. We define the probability of such a group of queries as:

$$\begin{aligned} p(Q|T) = \sum _{\phi (T)} p(\phi (T)) \, p(Q|\phi (T)) \end{aligned}$$

(1)

where $p(\phi (T))$ is the mixing proportion of partition $\phi (T)$, and $p(Q|\phi (T))$ is the probability of the group of queries Q given the partitioning $\phi (T)$. In general, the number of partitions consistent with T can be exponentially large. To make computations tractable, we define the mixture model in such a way that $p(Q|T)$ can be computed using dynamic programming over T:

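$$\begin{aligned} p(Q | T) = \pi _T f(Q) + (1 - \pi _T) \prod _{T_i \in ch(T)} p(leaves(T_i)|T_i) \end{aligned}$$

(2)

Here $ch(T)$ denotes the children of T and $leaves(T_i)$ the queries at the leaves of subtree $T_i$. Following the usual reading of this recursion, $\pi _T$ is the prior probability that all queries under T are kept together in a single task rather than split among the children, and $f(Q)$ is the marginal likelihood of the queries Q under a single task.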
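The recursion in Eq. (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Tree` class, the fixed value `pi = 0.5`, and the scoring function `toy_f` are hypothetical stand-ins, not the model from [4]. In particular, `toy_f` is a crude placeholder for the single-task likelihood $f(Q)$ that merely rewards queries sharing a term.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Tree:
    """A node of the task hierarchy; a leaf holds exactly one query."""
    children: List["Tree"] = field(default_factory=list)
    query: str = ""   # set only on leaves
    pi: float = 0.5   # assumed value of pi_T (hypothetical, not from the paper)

    def leaves(self) -> List[str]:
        if not self.children:
            return [self.query]
        return [q for child in self.children for q in child.leaves()]


def toy_f(queries: List[str]) -> float:
    """Hypothetical stand-in for f(Q): higher when all queries share a term."""
    shared = set.intersection(*(set(q.split()) for q in queries))
    return (0.2 if shared else 0.05) ** len(queries)


def tree_likelihood(tree: Tree, f: Callable[[List[str]], float]) -> float:
    """p(Q|T) = pi_T f(Q) + (1 - pi_T) * prod_i p(leaves(T_i)|T_i)."""
    queries = tree.leaves()
    if not tree.children:
        # A leaf admits only one partition, so its likelihood is f itself.
        return f(queries)
    split = math.prod(tree_likelihood(c, f) for c in tree.children)
    return tree.pi * f(queries) + (1.0 - tree.pi) * split


related = Tree(children=[Tree(query="flights to paris"),
                         Tree(query="cheap flights paris")])
unrelated = Tree(children=[Tree(query="flights to paris"),
                           Tree(query="python tutorial")])
print(tree_likelihood(related, toy_f), tree_likelihood(unrelated, toy_f))
```

Because each subtree's likelihood is computed once and reused by its parent, the cost is linear in the number of nodes even though the number of partitions consistent with T may be exponential.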

Initially, each query is regarded as a tree on its own. At each step, the algorithm selects two trees $T_i$ and $T_j$ and merges them into a new tree $T_m$. Unlike binary hierarchical clustering, we allow three possible merging operations: (i) Join: $T_m = \lbrace T_i, T_j\rbrace $, so that $T_m$ has exactly two children; (ii) Absorb: $T_m = \lbrace children(T_i) \cup T_j\rbrace $, i.e., $T_j$ joins the children of $T_i$, forming a tree with more than two children; and (iii) Collapse: $T_m = \lbrace children(T_i) \cup children(T_j)\rbrace $, i.e., the children of both trees are combined at the same level. This allows each task to be composed of an arbitrary number of subtasks rather than restricting tasks to binary splits.

The tree is built in a bottom-up greedy agglomerative fashion, and the algorithm finishes when just one tree remains. At each iteration a pair of trees in the forest F is chosen to be merged by considering the pair and type of merger that yields the largest Bayes factor improvement over the current model. Further details of the work are available in our research paper [4].
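The greedy construction can be sketched as follows. This is a minimal illustration, not the algorithm of [4]: trees are represented as nested tuples with query strings at the leaves, `PI` and `toy_f` are hypothetical placeholders for $\pi _T$ and $f(Q)$, and the merge score is assumed to be the likelihood ratio $p(Q_m|T_m) / (p(Q_i|T_i)\,p(Q_j|T_j))$, which is one common way to realise a Bayes factor criterion.

```python
import math
from itertools import combinations

PI = 0.5  # assumed pi_T for every internal node (hypothetical)


def leaves(t):
    """Queries at the leaves of tree t (a leaf is a plain string)."""
    return [t] if isinstance(t, str) else [q for c in t for q in leaves(c)]


def toy_f(queries):
    """Hypothetical stand-in for f(Q): higher when all queries share a term."""
    shared = set.intersection(*(set(q.split()) for q in queries))
    return (0.2 if shared else 0.05) ** len(queries)


def likelihood(t):
    """Recursive p(Q|T) as in Eq. (2)."""
    if isinstance(t, str):
        return toy_f([t])
    split = math.prod(likelihood(c) for c in t)
    return PI * toy_f(leaves(t)) + (1 - PI) * split


def candidate_merges(ti, tj):
    """Yield (operation, merged tree) for join, absorb and collapse."""
    yield "join", (ti, tj)
    if not isinstance(ti, str):
        yield "absorb", ti + (tj,)   # tj joins the children of ti
    if not isinstance(tj, str):
        yield "absorb", tj + (ti,)   # symmetric case
    if not isinstance(ti, str) and not isinstance(tj, str):
        yield "collapse", ti + tj    # children of both at one level


def build_hierarchy(queries):
    """Greedy bottom-up agglomeration until a single tree remains."""
    forest = list(queries)  # each query starts as its own tree
    while len(forest) > 1:
        best = None
        for i, j in combinations(range(len(forest)), 2):
            base = likelihood(forest[i]) * likelihood(forest[j])
            for _, merged in candidate_merges(forest[i], forest[j]):
                score = likelihood(merged) / base  # assumed merge score
                if best is None or score > best[0]:
                    best = (score, i, j, merged)
        _, i, j, merged = best
        forest = [t for k, t in enumerate(forest) if k not in (i, j)]
        forest.append(merged)
    return forest[0]


queries = ["cheap flights paris", "flights paris london",
           "hotel rome", "rome hotel deals"]
hierarchy = build_hierarchy(queries)
print(hierarchy)
```

This sketch recomputes likelihoods from scratch for every candidate, which is wasteful; a practical implementation would cache the likelihood of each tree in the forest.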

2.3 Decomposing Complex Search Tasks

Quite often, search tasks (e.g. planning a trip) are complex and conceptually decompose into a set of sub-tasks (e.g. booking flights, finding places of interest, etc.), each of which requires the user to issue multiple further queries. Given a collection of on-task queries (extracted using a standard task extraction algorithm), we proposed a distance-dependent Chinese restaurant process (dd-CRP) model to extract these sub-tasks.

In our sub-task extraction problem, each task is associated with a dd-CRP, and its tables are embellished with IID draws from a base distribution over mixture component parameters. Let $z_i$ denote the ith query assignment, i.e., the index of the query to which the ith query is linked. Let $d_{ij}$ denote the distance measurement between queries i and j, let D denote the set of all distance measurements between queries, and let f be a decay function. The distance-dependent CRP independently draws the query assignments to sub-tasks conditioned on the distance measurements,

$$\begin{aligned} p(z_i = j | D,\alpha ) \propto {\left\{ \begin{array}{ll} f(d_{ij}) &{} \text {if } j \ne i \\ \alpha &{} \text {if } j = i \end{array}\right. } \end{aligned}$$

Here, $d_{ij}$ is an externally specified distance between queries i and j, and $\alpha $ determines the probability that a query (a customer, in the restaurant metaphor) links to itself rather than to another query. Given a decay function f, distances between queries D, scaling parameter $\alpha $, and an exchangeable Dirichlet distribution with parameter $\lambda $, N queries of M words each are drawn as follows:

1. For $i \in [1, N]$, draw $z_i \sim \text{dist-CRP}(\alpha , f, D)$.
2. For $i \in [1, N]$,
   (a) If $z_i \notin R^{*}_{q_{1:N}}$, set the parameter for the ith query to $\theta _i = \theta _{q_i}$. Otherwise draw the parameter from the base distribution, $\theta _i \sim \text{Dirichlet}(\lambda )$.
   (b) Draw the ith query terms, $w_i \sim \text{Mult}(M, \theta _i)$.
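Step 1 of the process above, together with the way link assignments induce sub-tasks, can be sketched in Python. This is an illustrative sketch only: the Jaccard distance over query terms and the exponential decay $f(d) = e^{-d/a}$ are assumptions chosen for the example, and the function names are hypothetical. Step 2 (drawing the query terms) is omitted. Queries that are connected through their links are treated as belonging to the same sub-task.

```python
import math
import random


def jaccard_distance(a, b):
    """Assumed distance d_ij: 1 minus the Jaccard overlap of query terms."""
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)


def sample_links(dist, alpha, decay, rng):
    """Draw z_i with p(z_i = j) proportional to f(d_ij), or alpha if j == i."""
    n = len(dist)
    links = []
    for i in range(n):
        weights = [alpha if j == i else decay(dist[i][j]) for j in range(n)]
        links.append(rng.choices(range(n), weights=weights)[0])
    return links


def links_to_subtasks(links):
    """Group queries into sub-tasks: connected components of the link graph."""
    parent = list(range(len(links)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    groups = {}
    for i in range(len(links)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


queries = ["flights to paris", "cheap flights paris",
           "paris museums", "louvre museums tickets"]
dist = [[jaccard_distance(a, b) for b in queries] for a in queries]
rng = random.Random(0)
links = sample_links(dist, alpha=0.1, decay=lambda d: math.exp(-d / 0.3), rng=rng)
print(links, links_to_subtasks(links))
```

A small `alpha` makes self-links rare, so queries tend to chain together into fewer, larger sub-tasks; a large `alpha` pushes each query towards its own sub-task.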

Further details of the work are available in our research paper [2].

2.4 Task Based Personalization

To demonstrate the usefulness of a task-based system, in recent work [5, 6] we presented a novel approach that couples users' topical interest information with their search task information and their term usage behavior to learn a joint user representation. We demonstrated that coupling users' task information with their topical interests indeed helps build better user models. We show through extensive experimentation that our task-based method outperforms existing user representation methods based on query terms and on topical interests. By evaluating our approach on a variety of personalisation tasks, including collaborative query recommendation, cluster-based recommendation, and user cohort analysis, we demonstrate that the proposed methods result in better user profiles.

3 Conclusion

In this note, we offered insights about the shift in focus from sessions to tasks and presented a brief summary of our recent work aimed at extracting tasks from search logs. We believe that task-based personalization and recommendation have the potential to shape the future of user interaction systems in the upcoming era of the intelligent Web, and that much remains to be done on this emerging topic. Key problems to investigate in the future include using task-based systems for improved recommendations and better predicting users' contextual needs for proactive recommendations.

References

1. Mehrotra, R., Bhattacharya, P., Yilmaz, E.: Characterizing users’ multi-tasking behavior in web search. In: Proceedings of the ACM on Conference on Human Information Interaction and Retrieval (2016)
2. Mehrotra, R., Bhattacharya, P., Yilmaz, E.: Deconstructing complex search tasks: a bayesian nonparametric approach for extracting sub-tasks. In: Proceedings of NAACL-HLT, pp. 599–605 (2016)
3. Mehrotra, R., Bhattacharya, P., Yilmaz, E.: Sessions; tasks & topics - uncovering behavioral heterogeneities in online search behavior. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM (2016)
4. Mehrotra, R., Yilmaz, E.: Towards hierarchies of search tasks & subtasks. In: WWW (2015)
5. Mehrotra, R., Yilmaz, E.: Terms, topics & tasks: enhanced user modelling for better personalization. In: Proceedings of the International Conference on the Theory of Information Retrieval, pp. 131–140. ACM (2015)
6. Mehrotra, R., Yilmaz, E., Verma, M.: Task-based user modelling for personalization via probabilistic matrix factorization. In: RecSys Posters (2014)
7. Odijk, D., White, R.W., Hassan Awadallah, A., Dumais, S.T.: Struggling and success in web search. In: CIKM (2015)
8. White, R.W., Bennett, P.N., Dumais, S.T.: Predicting short-term interests using activity-based search context. In: CIKM (2010)
9. Xiang, B., Jiang, D., Pei, J., Sun, X., Chen, E., Li, H.: Context-aware ranking in web search. In: SIGIR (2010)


Author information

Authors and Affiliations

Department of Computer Science, University College London, London, UK
Rishabh Mehrotra & Emine Yilmaz


Corresponding author

Correspondence to Rishabh Mehrotra.

Editor information

Editors and Affiliations

Department of Computer Science, KU Leuven, Leuven, Belgium
Bettina Berendt
Deloitte GmbH, München, Germany
Björn Bringmann
Laboratoire Hubert Curien, Jean Monnet University, Saint-Etienne, France
Élisa Fromont
Allianz SE, Munich, Germany
Gemma Garriga
Max-Planck-Institute for Informatics, Saarbrücken, Germany
Pauli Miettinen
Aalto University School of Science, Espoo, Finland
Nikolaj Tatti
Siemens AG & Lud. Max. Univ. of Munich, Munich, Germany
Volker Tresp


Cite this paper

Mehrotra, R., Yilmaz, E. (2016). Query Log Mining for Inferring User Tasks and Needs. In: Berendt, B., et al. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol 9853. Springer, Cham. https://doi.org/10.1007/978-3-319-46131-1_36


DOI: https://doi.org/10.1007/978-3-319-46131-1_36
Published: 03 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46130-4
Online ISBN: 978-3-319-46131-1
eBook Packages: Computer Science, Computer Science (R0)
