Automatic new topic identification using multiple linear regression
Introduction
Search engines are becoming a major tool for accessing information on the Web, which makes it important to study the behavior of search engine users. One dimension of a search engine user’s profile is content-based behavior. Currently, search engines are not designed to differentiate according to the user’s profile and the content that the user is interested in, and personalization and context of search engine use have been recognized as major research challenges in a number of workshops (Liu, Croft, Oh, & Hart, 2004). However, exploiting the user’s interest in various topics has the potential to improve Web retrieval systems (Goker, 1997; Talja et al., 1999). The capability of understanding, or at least estimating, user interests could be a significant step towards the development of intelligent search engines.
One of the main elements in developing an intelligent search engine is new topic identification, that is, discovering when the user has switched from one topic to another during a single search session. If the search engine is aware that the user’s new query is on the same topic as the previous query, it could provide results from the document cluster relevant to the previous query; if the user is on a new topic, it could resort to searching other document clusters. Consequently, search engines can decrease the time and effort required to process the query. In addition, custom-tailored graphical user interfaces can be offered to the Web search engine user if topic changes are estimated correctly by the search engine. Ozmutlu, Ozmutlu, and Spink (2003a) mention that users interested in different topics could benefit more from IR systems designed according to their searching needs, and they map out some features of custom-tailored IR systems that are more sensitive to users’ various information needs. If topic identification were performed successfully, search engines could offer sophisticated graphical user interfaces that (a) enable the reformulation of multiple queries on different or related topics and facilitate task switching, i.e., allow the tracking, storing and manipulating of retrieved results and printouts related to different topics over multiple searches, (b) provide the ability to create multiple sets of working notes related to different or related search topics, i.e., sketching and note creation tools, (c) enable Web users to submit and track multiple queries concurrently on different or related topics, (d) allow for searching multiple search engines or collections concurrently on multiple topics, and (e) let users review search histories from various searches and topics and create clusters of retrieved information related to different or related topics.
Few studies address query clustering and new topic identification; they are presented in more detail in the related research section. These studies generally analyzed the queries semantically. Semantic analysis of queries is a promising line of research, but it is a complicated task, and its success to date is unclear. In our previous studies, we applied content-ignorant methodologies for automatic new topic identification, such as Dempster–Shafer Theory and genetic algorithms (Ozmutlu and Cavdur, 2005a; Ozmutlu et al., submitted for publication) and neural networks (Ozmutlu and Cavdur, 2005b; Ozmutlu et al., 2004a). These methodologies rely on the statistical characteristics of the queries, such as the time between query submissions and the reformulation of subsequent queries, instead of the meaning of the queries. The initial indications of a relation between statistical characteristics of queries and topic change were shown in Spink et al. (2002b), Goker and He (2000) and He and Goker (2000). However, none of these studies demonstrates the statistical significance of the relationship between the non-semantic characteristics of queries and the timing of topic shifts and continuations.
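To make the idea of content-ignorant characteristics concrete, the following is a minimal sketch of how such features might be extracted from one user session. The session data, the function names, and the term-overlap counts (used here as a rough stand-in for query reformulation) are illustrative assumptions, not the feature definitions used in the study.

```python
from datetime import datetime


def tokenize(query):
    """Lowercase a query and split it into a set of terms."""
    return set(query.lower().split())


def query_features(session):
    """Return one feature dict per consecutive query pair.

    `session` is a list of (timestamp, query) tuples in submission order.
    The features compare query strings only; no meaning is interpreted.
    """
    features = []
    for index in range(1, len(session)):
        prev_time, prev_query = session[index - 1]
        curr_time, curr_query = session[index]
        prev_terms, curr_terms = tokenize(prev_query), tokenize(curr_query)
        features.append({
            "position": index + 1,  # 1-based order of the current query
            "seconds_since_previous": (curr_time - prev_time).total_seconds(),
            "shared_terms": len(prev_terms & curr_terms),
            "new_terms": len(curr_terms - prev_terms),
        })
    return features


# Hypothetical three-query session (made-up data).
session = [
    (datetime(2001, 1, 1, 10, 0, 0), "cheap flights istanbul"),
    (datetime(2001, 1, 1, 10, 0, 40), "cheap flights istanbul london"),
    (datetime(2001, 1, 1, 10, 25, 0), "linear regression tutorial"),
]
rows = query_features(session)
for row in rows:
    print(row)
```

In this made-up session, the second query follows quickly and reuses every earlier term, while the third arrives after a long gap and shares none, which is the kind of pattern a content-ignorant method would try to exploit.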
In this study, we aim to estimate topic shifts in search engine query logs using multiple linear regression and to demonstrate the statistical significance of the relationship between non-semantic characteristics of query logs and topic shifts/continuations. Using the characteristics of the search queries as independent factors and the existence of topic shifts as the dependent factor, multiple linear regression is applied to investigate the relationship between statistical characteristics and topic shifts. We also apply ANOVA to examine the structure of the variance of the topic shifts with respect to the statistical characteristics of the search queries. These analyses help identify whether there is a relationship between statistical characteristics of the search queries and topic shifts/continuations. If such a relationship exists, content-ignorant methodologies can be expected to be successful.
We first present the literature review related to topic identification, followed by the methodology, the results and the conclusion.
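As an illustration of the general technique, the sketch below fits a multiple linear regression by ordinary least squares and computes the overall F statistic for the regression. The data are invented, the two predictors (minutes since the previous query and number of shared terms) are assumed stand-ins for query characteristics, and the function names are hypothetical; this is not the study's model or its coefficients.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk; return (coefficients, F statistic)."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n, p = len(y), len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / n
    ss_reg = sum((f - mean_y) ** 2 for f in fitted)
    ss_res = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
    f_stat = (ss_reg / (p - 1)) / (ss_res / (n - p))
    return beta, f_stat


# Made-up observations: (minutes since previous query, shared terms).
rows = [(0.5, 2), (1.0, 3), (20.0, 0), (0.7, 1),
        (35.0, 0), (2.0, 2), (15.0, 1), (40.0, 0)]
y = [0, 0, 1, 0, 1, 0, 1, 1]  # 1 = topic shift, 0 = continuation

beta, f_stat = fit_ols(rows, y)
print("coefficients:", beta)
print("F statistic:", f_stat)
```

The F statistic would be compared against the critical value of the F distribution with k and n − k − 1 degrees of freedom (here 2 and 5) to judge whether the regression as a whole is significant.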
Related research
Many researchers have conducted large-scale studies of search engine data logs, such as Silverstein et al., 1999, Cooley et al., 1999, Spink et al., 1999, Spink et al., 2001, Spink et al., 2002a, Ozmutlu et al., 2002b, Ozmutlu et al., 2003b, Ozmutlu et al., 2003c, and Ozmutlu and Spink, 2002. Most of the studies are based on statistical or linguistic characteristics of the search queries (Pu, Chuang, Shui-Lung, & Yang, 2002). Few studies address content analysis, the reason generally being
Research question
The research question in this study is whether there is a statistical relationship between topic shifts within consecutive queries and the characteristics of search engine user queries. In addition, we aim to provide successful estimation of topic shifts in consecutive queries within a user session. In order to perform these tasks, we apply multiple linear regression (Montgomery, 1991) on a search engine query log. We also apply ANOVA to examine the structure of the variance of the topic
Results and discussion
The multiple linear regression equation, where topic shifts are the dependent factor and the characteristics of the query log are the independent factors, is as follows:
Using this regression equation, it may be possible to identify topic shifts and continuations in a Web search query log. To test the validity of the regression equation, and hence perform the hypothesis test in Eq. (11), the F value for the regression
Comparison with other methods of classification
In order to measure the performance of the regression approach in estimating topic shifts, it is useful to compare its estimation power to that of other methodologies. We therefore compare the regression approach to a popular statistical learning method: support vector machines (SVM).
The estimation of topic shifts and continuation can be seen as a problem of text classification. Currently, SVMs are the most accurate classifiers for text (Chakrabarti, 2003). The main principle of
Conclusion
This study uses multiple linear regression and multiple factor ANOVA to identify the relationships between topic shifts and the non-semantic characteristics of the search queries, and to estimate topic shifts and continuations successfully. The non-semantic characteristics of the search queries are the time interval between queries, the search pattern of queries and the order of a query in a search session.
Hypothesis testing showed that the multiple linear regression equation is statistically valid,
References (68)
- et al. (2002). Combining evidence for automatic Web session identification. Information Processing and Management.
- et al. (2000). Real life, real users, and real needs: A study and analysis of user queries on the Web. Information Processing and Management.
- et al. (2005). Application of automatic topic identification on excite web search engine data logs. Information Processing and Management.
- et al. (2004). A day in the life of Web searching: An exploratory study. Information Processing and Management.
- et al. (2005). A real-time methodology for minimizing mean flowtime in FMSs with routing flexibility: Threshold-based alternate routing. European Journal of Operational Research.
- et al. (2003). Trends in multimedia web searching: 1997–2001. Information Processing and Management.
- et al. (1999). The production of ‘context’ in information seeking research: A metatheoretical view. Information Processing and Management.
- Modeling topics for detection and tracking.
- Beeferman, D., & Berger, A. (2000). Agglomerative clustering of a search engine query log. In Proceedings of the 6th...
- Beitzel, S. M., Jensen, E. C., Chowdhury, A., Grossman, D., & Frieder, O. (2004). Efficiency and scaling: Hourly...
- A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.
- Mining the Web.
- Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems.
- Context learning in Okapi. Journal of Documentation.
- A user-centered approach to evaluating topic models. Lecture Notes in Computer Science.
- Analysis of statistical question classification for fact-based questions. Information Retrieval.
- Design and analysis of experiments.