Automatic new topic identification using multiple linear regression

doi:10.1016/j.ipm.2005.10.002

Information Processing & Management

Volume 42, Issue 4, July 2006, Pages 934-950

https://doi.org/10.1016/j.ipm.2005.10.002

Abstract

The purpose of this study is to provide automatic new topic identification in search engine query logs and to estimate the effect of statistical characteristics of search engine queries on new topic identification. By applying multiple linear regression and multi-factor ANOVA to a sample data log from the Excite search engine, we demonstrate that the statistical characteristics of Web search queries, such as time interval, search pattern, and position of a query in a user session, affect shifts to a new topic. Multiple linear regression is also a successful tool for estimating topic shifts and continuations. The findings of this study provide statistical evidence for the relationship between the non-semantic characteristics of Web search queries and the occurrence of topic shifts and continuations.

Introduction

Search engines are becoming a major tool for accessing information on the Web for many people. For this reason, it is important to study the behavior of search engine users. One dimension of a search engine user's profile is content-based behavior. Currently, search engines are not designed to differentiate according to the user’s profile and the content that the user is interested in, and personalization and context of search engine use have been recognized as major research challenges in a number of workshops (Liu, Croft, Oh, & Hart, 2004). However, exploiting the user’s interest in various topics has the potential to improve Web retrieval systems (Goker, 1997; Talja et al., 1999). The capability of understanding, or at least estimating, the user's interests could be a significant step towards the development of intelligent search engines.

One of the main elements in developing an intelligent search engine is new topic identification: discovering when the user has switched from one topic to another during a single search session. If the search engine is aware that the user’s new query is on the same topic as the previous query, it could provide results from the document cluster relevant to the previous query; alternatively, if the user is on a new topic, it could resort to searching other document clusters. Consequently, search engines can decrease the time and effort required to process the query. In addition, custom-tailored graphical user interfaces could be offered to the Web search engine user if topic changes were estimated correctly by the search engine. Ozmutlu, Ozmutlu, and Spink (2003a) mention that users interested in different topics could benefit more from IR systems designed according to their searching needs, and they map out some features of custom-tailored IR systems that are more sensitive to users’ various information needs. If topic identification were performed successfully, search engines could offer sophisticated graphical user interfaces that (a) enable the reformulation of multiple queries on different or related topics and facilitate task switching, i.e., allow the tracking, storing, and manipulating of retrieved results and printouts related to different topics over multiple searches, (b) provide the ability to create multiple sets of working notes related to different or related search topics, i.e., sketching and note creation tools, (c) enable Web users to submit and track multiple queries concurrently on different or related topics, (d) allow for searching multiple search engines or collections concurrently on multiple topics, and (e) support reviewing search histories from various searches and topics, and provide the ability to create clusters of retrieved information related to different or related topics.

There are few studies on query clustering and new topic identification; they are presented in more detail in the related research section. These studies generally analyzed the queries semantically. Semantic analysis of queries is a promising line of research, but it is a complicated task, and its success to date is unclear. In our previous studies, we applied content-ignorant methodologies for automatic new topic identification, such as Dempster–Shafer Theory and genetic algorithms (Ozmutlu and Cavdur, 2005a; Ozmutlu et al., submitted for publication) and neural networks (Ozmutlu and Cavdur, 2005b; Ozmutlu et al., 2004a). These methodologies rely on the statistical characteristics of the queries, such as the time between query submissions and the reformulation of the subsequent queries, instead of the meaning of the queries. The initial indications of the relation between statistical characteristics of queries and topic change were shown in Spink et al. (2002b), Goker and He (2000), and He and Goker (2000). However, none of these studies demonstrates the statistical significance of the relationship between the non-semantic characteristics of queries and the timing of topic shifts and continuations.

In this study, we aim to estimate topic shifts in search engine query logs using multiple linear regression and to demonstrate the statistical significance of the relationship between non-semantic characteristics of query logs and topic shifts/continuations. Using the characteristics of the search queries as independent factors and the existence of topic shifts as the dependent factor, multiple linear regression is applied to investigate the relationship between statistical characteristics and topic shifts. We also apply ANOVA to examine the structure of the variance of the topic shifts with respect to the statistical characteristics of the search queries. These analyses help identify whether there is a relationship between statistical characteristics of the search queries and topic shifts/continuations. If such a relationship exists, content-ignorant methodologies can be expected to be successful.
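To make the regression setup concrete, the following is a minimal sketch of fitting a multiple linear regression with three factors and their two-way interactions by ordinary least squares. It is an illustration under stated assumptions, not the study's procedure: the data are synthetic and noise-free, the factor levels and the coefficients in `TRUE_BETA` are made up, and the names `design_row`, `solve_linear_system`, and `fit_ols` are hypothetical helpers. The labels `ti`, `sp`, and `qn` are assumed to stand for time interval, search pattern, and query position.

```python
from itertools import product


def design_row(ti, sp, qn):
    """Intercept, three main effects, and three two-way interactions."""
    return [1.0, ti, sp, qn, ti * sp, ti * qn, sp * qn]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical coefficients and factor levels, chosen only for illustration.
TRUE_BETA = [1.0, -0.03, 0.01, -0.0001, 0.02, -0.00004, 0.0003]
grid = list(product([1, 2, 3], [1, 2, 3, 4], [1, 2, 5, 10]))
X = [design_row(ti, sp, qn) for ti, sp, qn in grid]
Y = [sum(b * x for b, x in zip(TRUE_BETA, row)) for row in X]

beta_hat = fit_ols(X, Y)
print([round(b, 6) for b in beta_hat])
```

Because the synthetic responses contain no noise, the fitted coefficients should match `TRUE_BETA` up to rounding. With real query-log data the fit would be followed by significance tests on the coefficients, which this sketch does not attempt.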

We first present the literature review related to topic identification, followed by the description of the methodology, the results, and the conclusion.

Section snippets

Related research

Many researchers have worked on large-scale studies of search engine data logs, such as Silverstein et al. (1999), Cooley et al. (1999), Spink et al. (1999), Spink et al. (2001), Spink et al. (2002a), Ozmutlu et al. (2002b), Ozmutlu et al. (2003b), Ozmutlu et al. (2003c), and Ozmutlu and Spink (2002). Most of the studies are based on statistical or linguistic characteristics of the search queries (Pu, Chuang, Shui-Lung, & Yang, 2002). Few studies address content analysis, the reason generally being

Research question

The research question in this study is whether there is a statistical relationship between topic shifts within consecutive queries and characteristics of search engine user queries. In addition, we aim to provide successful estimation of topic shifts in consecutive queries within a user session. In order to perform these tasks, we apply multiple linear regression (Montgomery, 1991) on a search engine query log. We also apply ANOVA to examine the structure of the variance of the topic

Results and discussion

The multiple linear regression equation, where topic shifts are the dependent factor and the characteristics of the query log are the independent factors, is as follows: $Y = 0.99262 - 0.026733\,TI + 0.00799\,SP - 0.0001427\,QN + 0.020393\,TI \times SP - 0.0000419\,TI \times QN + 0.00026\,SP \times QN$
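As a small illustration, the regression equation can be evaluated directly for a given query. The sketch below copies the coefficients from the equation above; everything else is an assumption. In particular, `TI`, `SP`, and `QN` are taken to denote the time interval, search pattern, and query position named in the abstract, the numeric coding of these factors is not given in this excerpt, and the function name `regression_response` is hypothetical. The rule for turning $Y$ into a shift or continuation decision is also not shown here, so the sketch stops at computing $Y$.

```python
# Coefficients copied from the regression equation in the text.
COEFFS = {
    "intercept": 0.99262,
    "TI": -0.026733,
    "SP": 0.00799,
    "QN": -0.0001427,
    "TI*SP": 0.020393,
    "TI*QN": -0.0000419,
    "SP*QN": 0.00026,
}


def regression_response(ti, sp, qn):
    """Evaluate Y for one query; the coding of ti, sp, qn is assumed."""
    return (
        COEFFS["intercept"]
        + COEFFS["TI"] * ti
        + COEFFS["SP"] * sp
        + COEFFS["QN"] * qn
        + COEFFS["TI*SP"] * ti * sp
        + COEFFS["TI*QN"] * ti * qn
        + COEFFS["SP*QN"] * sp * qn
    )


# Example call with placeholder factor levels (not taken from the paper).
print(regression_response(1, 1, 1))
```

Keeping the coefficients in a dictionary keyed by term makes it easy to check each value against the printed equation.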

Using this regression equation, it may be possible to identify topic shifts and continuations in a Web search query log. To test the validity of the regression equation, hence perform the hypothesis test in Eq. (11), the F value for the regression

Comparison with other methods of classification

To measure the performance of the regression approach in estimating topic shifts, it is useful to compare its estimation power to that of other methodologies. We therefore compare the regression approach to a popular statistical learning method: support vector machines (SVM).

The estimation of topic shifts and continuation can be seen as a problem of text classification. Currently, SVMs are the most accurate classifiers for text (Chakrabarti, 2003). The main principle of

Conclusion

This study uses multiple linear regression and multi-factor ANOVA to identify the relationships between topic shifts and the non-semantic characteristics of the search queries, and to successfully estimate topic shifts and continuations. The non-semantic characteristics of the search queries are the time interval of queries, the search pattern of queries, and the order of a query in a search session.

Hypothesis testing showed that the multiple linear regression equation is statistically valid,

References (68)

D. He et al. (2002). Combining evidence for automatic Web session identification. Information Processing and Management.

B.J. Jansen et al. (2000). Real life, real users, and real needs: A study and analysis of user queries on the Web. Information Processing and Management.

H.C. Ozmutlu et al. (2005). Application of automatic topic identification on Excite Web search engine data logs. Information Processing and Management.

S. Ozmutlu et al. (2004). A day in the life of Web searching: An exploratory study. Information Processing and Management.

S. Ozmutlu et al. (2005). A real-time methodology for minimizing mean flowtime in FMSs with routing flexibility: Threshold-based alternate routing. European Journal of Operational Research.

S. Ozmutlu et al. (2003). Trends in multimedia web searching: 1997–2001. Information Processing and Management.

S. Talja et al. (1999). The production of ‘context’ in information seeking research: A metatheoretical view. Information Processing and Management.

J. Allan. Modeling topics for detection and tracking.

Beeferman, D., & Berger, A. (2000). Agglomerative clustering of a search engine query log. In Proceedings of the 6th...

Beitzel, S. M., Jensen, E. C., Chowdhury, A., Grossman, D., & Frieder, O. (2004). Efficiency and scaling: Hourly...

C.J.C. Burges (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.

Chai, K. M. A., Ng, H. T., & Chieu, H. L. (2002). Bayesian online classifiers for text classification and filtering. In...

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. Available from...

S. Chakrabarti (2003). Mining the Web.

R. Cooley et al. (1999). Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems.

Feng, A., & Allan, J. (2005). Hierarchical topic detection in TDT. CIIR Technical Report, University of...

A. Goker (1997). Context learning in Okapi. Journal of Documentation.

Goker, A., & He, D. (2000). Analysing Intranet logs to determine session boundaries for user-oriented learning. In...

He, D., & Goker, A. (2000). Detecting session boundaries from Web user logs. In Proceedings of the BCS-IRSG 22nd annual...

Hu, X., Bandhakavi, S., & Zhai, C. (2003). Error analysis of difficult TREC topics. In Proceedings of 26th ACM...

Jin, R., Sin, L., & Zhai, C. (2003). Preference-based graphic models for collaborative filtering. In Proceedings of the...

Joachims, T. (1998). Text categorization with support vector machines. In Proceedings of the 10th European conference...

D. Kelly et al. (2004). A user-centered approach to evaluating topic models. Lecture Notes in Computer Science.

Kumaran, G., & Allan, J. (2004). Text classification and named entities for new event detection. In Proceedings of 27th...

Kumaran, G., & Allan, J. (2005). Using names and topics for new event detection. In Proceedings of human language...

Larkey, L. S., Feng, F., Connell, M., & Lavrenko, V. (2004). Language-specific models in multilingual topic tracking....

Lawrie, D., Croft, W. B., & Rosenberg, A. (2001). Finding topic words for hierarchical summarization. In Proceedings of...

Li, W., & McCallum, A. (2005). Semi-supervised sequence modeling with syntactic topic models. In Proceedings of the...

Liu, X., Croft, W. B., Oh, P., & Hart, D. (2004). Automatic recognition of reading levels from user queries. In:...

Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: An exploration of temporal text mining....

D. Metzler et al. (2005). Analysis of statistical question classification for fact-based questions. Information Retrieval.

Metzler, D., & Croft, W. B. (2005b). A Markov random field model for term dependencies. In Proceedings of the 28th...

Miwa. (2001). User situations and multiple levels of users goals in information problem solving processes of AskERIC...

D.C. Montgomery (1991). Design and analysis of experiments.
