Automatic new topic identification using multiple linear regression

https://doi.org/10.1016/j.ipm.2005.10.002

Abstract

The purpose of this study is to provide automatic new topic identification in search engine query logs, and to estimate the effect of the statistical characteristics of search engine queries on new topic identification. By applying multiple linear regression and multi-factor ANOVA to a sample data log from the Excite search engine, we demonstrate that the statistical characteristics of Web search queries, such as time interval, search pattern and position of a query in a user session, have a significant effect on shifts to a new topic. Multiple linear regression is also a successful tool for estimating topic shifts and continuations. The findings of this study provide statistical evidence for the relationship between the non-semantic characteristics of Web search queries and the occurrence of topic shifts and continuations.

Introduction

Search engines are becoming a major tool for accessing information on the Web for many people. For this reason, it is important to study the behavior of search engine users. One dimension of a search engine user’s profile is content-based behavior. Currently, search engines are not designed to differentiate according to the user’s profile and the content that the user is interested in, and personalization and context of search engine use have been recognized as major research challenges in a number of workshops (Liu, Croft, Oh, & Hart, 2004). However, exploiting the user’s interest in various topics has the potential to improve Web retrieval systems (Goker, 1997; Talja et al., 1999). The capability of understanding, or at least estimating, the user’s interests could be a significant step towards the development of intelligent search engines.

One of the main elements in developing an intelligent search engine is new topic identification. New topic identification is discovering when the user has switched from one topic to another during a single search session. If the search engine is aware that the user’s new query is on the same topic as the previous query, it could provide results from the document cluster relevant to the previous query; alternatively, if the user is on a new topic, it could resort to searching other document clusters. Consequently, search engines can decrease the time and effort required to process the query. In addition, custom-tailored graphical user interfaces can be offered to the Web search engine user if topic changes are estimated correctly by the search engine. Ozmutlu, Ozmutlu, and Spink (2003a) mention that users interested in different topics could benefit more from IR systems designed according to their searching needs. Some features of custom-tailored IR systems, which are more sensitive to users’ various information needs, are mapped out by Ozmutlu et al. (2003a). If topic identification were performed successfully, search engines could offer sophisticated graphical user interfaces that (a) enable the reformulation of multiple queries on different or related topics, and facilitate task switching, i.e., allowing the tracking, storing and manipulating of retrieved results and printouts related to different topics over multiple searches, (b) provide the ability to create multiple sets of working notes related to different or related search topics, i.e., sketching and note creation tools, (c) enable Web users to submit and track multiple queries concurrently on different or related topics, (d) allow for searching multiple search engines or collections concurrently on multiple topics, (e) review search histories from various searches and topics, and provide the ability to create clusters of retrieved information related to different or related topics.

There are few studies on query clustering and new topic identification; these are presented in more detail in the related research section. The studies generally analyzed the queries semantically. Semantic analysis of queries is a promising line of research, but it is a complicated task, so its success to date is unclear. In our previous studies, we applied content-ignorant methodologies for automatic new topic identification, such as Dempster–Shafer Theory and genetic algorithms (Ozmutlu and Cavdur, 2005a; Ozmutlu et al., submitted for publication) and neural networks (Ozmutlu and Cavdur, 2005b; Ozmutlu et al., 2004a). These methodologies rely on the statistical characteristics of the queries, such as the time between query submissions and the reformulation of subsequent queries, instead of the meaning of the queries. The initial indications of the relation between statistical characteristics of queries and topic change were shown by Spink et al. (2002b), Goker and He (2000) and He and Goker (2000). However, none of these studies demonstrates the statistical significance of the relationship between the non-semantic characteristics of queries and the timing of topic shifts and continuations.

In this study, we aim to estimate topic shifts in search engine query logs using multiple linear regression and demonstrate the statistical significance of the relationship between non-semantic characteristics of query logs and topic shifts/continuations. Using the characteristics of the search queries as independent factors and the existence of topic shifts as the dependent factor, multiple linear regression is applied to investigate the relationship between statistical characteristics and topic shifts. We also apply ANOVA to examine the structure of the variance of the topic shifts with respect to the statistical characteristics of the search queries. These studies will be helpful in identifying whether there is a relationship between statistical characteristics of the search queries and topic shifts/continuations. If such a relationship exists, content-ignorant methodologies can be expected to be successful.
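To make the regression setup described above more concrete, the following is a minimal sketch of fitting a multiple linear regression by ordinary least squares in plain Python. It is not the authors' implementation: the function names, the numeric coding of the three query characteristics, and the toy data are all hypothetical, and the toy responses are generated noise-free from made-up coefficients so that the fit simply recovers them. Interaction terms are left out for brevity.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b using Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so row operations apply to both sides.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_multiple_linear_regression(
    features: Sequence[Sequence[float]], y: Sequence[float]
) -> List[float]:
    """Return [intercept, b1, b2, ...] from the normal equations (X^T X) b = X^T y."""
    rows = [[1.0, *f] for f in features]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data: (time interval code, search pattern code, query position).
toy_features = [
    (1, 1, 1), (2, 1, 2), (1, 2, 3), (3, 2, 1),
    (2, 3, 2), (3, 3, 3), (1, 3, 1), (3, 1, 2),
]
# Synthetic, noise-free responses from made-up coefficients (not the study's values).
toy_y = [1.0 + 0.5 * ti - 0.25 * sp + 0.1 * qn for ti, sp, qn in toy_features]

coefficients = fit_multiple_linear_regression(toy_features, toy_y)
```

With real query logs the response would be noisy, and a statistics package that also reports significance tests would normally be preferable to hand-rolled normal equations.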

We first present the literature review related to topic identification, followed by the description of the methodology, the results and the conclusion.

Section snippets

Related research

Many researchers have conducted large-scale studies of search engine data logs, such as Silverstein et al. (1999), Cooley et al. (1999), Spink et al. (1999, 2001, 2002a), Ozmutlu et al. (2002b, 2003b, 2003c) and Ozmutlu and Spink (2002). Most of the studies are based on statistical or linguistic characteristics of the search queries (Pu, Chuang, Shui-Lung, & Yang, 2002). The number of studies on content analysis is small, the reason generally being

Research question

The research question in this study is whether there is a statistical relationship between topic shifts within consecutive queries and the characteristics of search engine user queries. In addition, we aim to provide successful estimation of topic shifts in consecutive queries within a user session. In order to perform these tasks, we apply multiple linear regression (Montgomery, 1991) to a search engine query log. We also apply ANOVA to examine the structure of the variance of the topic

Results and discussion

The multiple linear regression equation, where topic shifts are the dependent factor and the characteristics of the query log are the independent factors, is as follows:

$$Y = 0.99262 - 0.026733\,TI + 0.00799\,SP - 0.0001427\,QN + 0.020393\,TI \cdot SP - 0.0000419\,TI \cdot QN + 0.00026\,SP \cdot QN$$
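As a rough illustration, the regression equation above can be evaluated directly for a given query. The sketch below assumes that TI, SP and QN stand for the time interval, search pattern and position of the query in the session (following the abstract), and that the paired symbols such as TISP denote interaction (product) terms. How each factor is numerically coded, and how a value of Y is mapped to a shift or continuation, is not given in this excerpt, so the function only returns the raw value of Y; the function name is hypothetical.

```python
# Coefficients copied from the regression equation in the text.
INTERCEPT = 0.99262
COEF_TI = -0.026733
COEF_SP = 0.00799
COEF_QN = -0.0001427
COEF_TI_SP = 0.020393
COEF_TI_QN = -0.0000419
COEF_SP_QN = 0.00026


def regression_response(ti: float, sp: float, qn: float) -> float:
    """Evaluate Y for numeric codes of TI, SP and QN.

    Assumption: TISP, TIQN and SPQN in the source are two-way
    interaction terms, i.e. products of the coded factors.
    """
    return (
        INTERCEPT
        + COEF_TI * ti
        + COEF_SP * sp
        + COEF_QN * qn
        + COEF_TI_SP * ti * sp
        + COEF_TI_QN * ti * qn
        + COEF_SP_QN * sp * qn
    )


# With all factors coded as zero, only the intercept remains.
baseline = regression_response(0, 0, 0)
```

Keeping the coefficients as named constants makes it easy to check each term against the printed equation.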

Using this regression equation, it may be possible to identify topic shifts and continuations in a Web search query log. To test the validity of the regression equation, and hence perform the hypothesis test in Eq. (11), the F value for the regression

Comparison with other methods of classification

In order to measure the performance of the regression approach in estimating topic shifts, it is beneficial to compare its estimation power to that of other methodologies. We therefore compare the regression approach to a popular statistical learning method: support vector machines (SVM).

The estimation of topic shifts and continuation can be seen as a problem of text classification. Currently, SVMs are the most accurate classifiers for text (Chakrabarti, 2003). The main principle of

Conclusion

This study uses multiple linear regression and multi-factor ANOVA to identify the relationships between topic shifts and the non-semantic characteristics of the search queries, and to estimate topic shifts and continuations successfully. The non-semantic characteristics of the search queries are the time interval of queries, the search pattern of queries and the order of a query in a search session.
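Two of the three characteristics named above, the time interval between queries and the order of a query in the session, can be derived from timestamps alone. The following sketch assumes a hypothetical session format of (timestamp in seconds, query text) pairs; it is not the Excite log format, and the names are illustrative. Search pattern is omitted because its categories are not defined in this excerpt.

```python
from typing import List, NamedTuple, Tuple


class QueryFeatures(NamedTuple):
    position: int         # 1-based order of the query in the session
    time_interval: float  # seconds elapsed since the previous query
    query: str


def session_features(session: List[Tuple[float, str]]) -> List[QueryFeatures]:
    """Compute non-semantic features for every query after the first.

    The first query has no predecessor, so no interval (and no topic
    shift or continuation) is defined for it.
    """
    ordered = sorted(session, key=lambda record: record[0])
    features = []
    for index in range(1, len(ordered)):
        previous_time, _ = ordered[index - 1]
        current_time, text = ordered[index]
        features.append(
            QueryFeatures(
                position=index + 1,
                time_interval=current_time - previous_time,
                query=text,
            )
        )
    return features


# Hypothetical three-query session.
example = session_features([(0.0, "jaguar car"), (30.0, "jaguar price"), (300.0, "weather")])
```

Note that the query text is carried along only for reference; the features themselves ignore its meaning, in line with the content-ignorant approach.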

Hypothesis testing showed that the multiple linear regression equation is statistically valid,

References (68)

  • Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.
  • Chai, K. M. A., Ng, H. T., & Chieu, H. L. (2002). Bayesian online classifiers for text classification and filtering. In...
  • Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. Available from...
  • Chakrabarti, S. (2003). Mining the Web.
  • Cooley, R., et al. (1999). Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems.
  • Feng, A., & Allan, J. (2005). Hierarchical topic detection in TDT. CIIR Technical Report, University of...
  • Goker, A. (1997). Context learning in Okapi. Journal of Documentation.
  • Goker, A., & He, D. (2000). Analysing Intranet logs to determine session boundaries for user-oriented learning. In...
  • He, D., & Goker, A. (2000). Detecting session boundaries from Web user logs. In Proceedings of the BCS-IRSG 22nd annual...
  • Hu, X., Bandhakavi, S., & Zhai, C. (2003). Error analysis of difficult TREC topics. In Proceedings of 26th ACM...
  • Jin, R., Sin, L., & Zhai, C. (2003). Preference-based graphic models for collaborative filtering. In Proceedings of the...
  • Joachims, T. (1998). Text categorization with support vector machines. In Proceedings of the 10th European conference...
  • Kelly, D., et al. (2004). A user-centered approach to evaluating topic models. Lecture Notes in Computer Science.
  • Kumaran, G., & Allan, J. (2004). Text classification and named entities for new event detection. In Proceedings of 27th...
  • Kumaran, G., & Allan, J. (2005). Using names and topics for new event detection. In Proceedings of human language...
  • Larkey, L. S., Feng, F., Connell, M., & Lavrenko, V. (2004). Language-specific models in multilingual topic tracking....
  • Lawrie, D., Croft, W. B., & Rosenberg, A. (2001). Finding topic words for hierarchical summarization. In Proceedings of...
  • Li, W., & McCallum, A. (2005). Semi-supervised sequence modeling with syntactic topic models. In Proceedings of the...
  • Liu, X., Croft, W. B., Oh, P., & Hart, D. (2004). Automatic recognition of reading levels from user queries. In:...
  • Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining....
  • Metzler, D., et al. (2005). Analysis of statistical question classification for fact-based questions. Information Retrieval.
  • Metzler, D., & Croft, W. B. (2005b). A Markov random field model for term dependencies. In Proceedings of the 28th...
  • Miwa. (2001). User situations and multiple levels of users goals in information problem solving processes of AskERIC...
  • Montgomery, D. C. (1991). Design and analysis of experiments.