Discovering better navigation sequences for the session construction problem

doi:10.1016/j.datak.2011.11.005

Data & Knowledge Engineering

Volume 73, March 2012, Pages 58-72

https://doi.org/10.1016/j.datak.2011.11.005

Abstract

In this paper, we propose a novel page view based session model and session construction method to address the Web Usage Mining (WUM) problem. Unlike the simple session models, where sessions are sequences of web pages requested from the server (or served from a browser/proxy cache) and viewed in the browser (which may not guarantee a direct relationship between subsequent web pages in the session), we define a more realistic session model in which a session is a set of paths traversed in the web graph that corresponds to a user navigation performed by following links on web pages. We define the session construction process from raw server logs as a new graph problem and present a novel algorithm, Smart-SRA (Smart Session Reconstruction Algorithm), to solve this problem efficiently. An experimental evaluation based on data collected from real web access scenarios showed that Smart-SRA produces more accurate user sessions than the session construction methods found in the literature.

Introduction

In the last fifteen years, the World Wide Web (WWW) has become one of the largest information sources, with amazing growth in the number of web sites, the number of web pages, and the multimedia content provided (e.g., pictures, music and videos). The term “Web” usually refers to hypertext information transmitted via HTML pages, PDF files and other such documents. As in classical data mining, web mining aims to discover significant patterns from the WWW. The data available on the WWW can be mined mainly in three different dimensions: web content mining, web structure mining and web usage mining.

This work investigates the third dimension of web mining, namely, web usage mining (WUM) [8], [15], [26], [29], [31], which deals with extracting interesting knowledge from web log data produced by web servers. WUM has various application areas, such as predicting future requests [21], [22], web personalization [2], navigation sequence clustering [25], and providing guidance learned from user access behaviors [19]. Several e-commerce web sites use these applications to enhance purchasing experiences and customer satisfaction and thus increase their profits.

The success of the WUM applications mentioned above significantly depends on session construction because the quality (i.e., accuracy) of the constructed sessions affects later phases of WUM, such as pattern discovery. If page-view requests are not grouped correctly by session construction methods (in the first phase of web usage mining), the performance of these WUM applications suffers. To address these issues, we focus on the session construction problem alone.

The typical raw data for WUM are obtained from the access logs of web servers. In any web server, when a user agent (Internet Explorer, Mozilla, Safari, etc.) requests a URL in the web server's domain, the information related to that request is recorded in the web server's access log file. Most access log files keep their data in the Common Log Format (CLF), where each page view request is recorded as a line in the access log file. Each CLF record is a tuple containing the following attributes:

• Client machine's IP address
• Access date and time
• Request method (GET or POST)
• URL of the page accessed
• HTTP protocol version (HTTP 1.0, HTTP 1.1)
• Success or return code
• Number of bytes transmitted

For session construction, the IP address, request time, and requested URL are the only data needed from the user web access log file to obtain the users' navigation paths. Most of the session construction methods consider only these three fields when processing web server logs. Session construction heuristics differ because, during session processing, some use time information and others use the user's navigation information [30], [32].
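As a rough illustration of the three fields mentioned above, the following sketch parses one CLF record and keeps only the IP address, request time, and requested URL. The regular expression, the function name `parse_clf_line`, and the sample log line are assumptions made for this example; they are not taken from the paper.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "method url protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return (ip, timestamp, url) for one CLF record, or None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    # CLF timestamps look like 10/Oct/2000:13:55:36 -0700
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("ip"), timestamp, match.group("url")


# Hypothetical sample record, used only to exercise the parser.
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_clf_line(sample)
```

Grouping the resulting tuples by IP address and sorting them by timestamp would give the per-client access streams that reactive session construction methods take as input.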

Producing accurate user sessions and navigation patterns is not an easy task since the HTTP protocol is stateless and connectionless. Additionally, in reactive session construction [9], [10], where it is impossible to know (or generate) client data (e.g., cookies) to identify individual users, all users behind a proxy server have the same IP address and will thus be seen as a single client on the server side. These problems can be handled by proactive strategies [16], [27], e.g., using cookies or adding client-specific information into each page request using dynamic server page code. However, to employ proactive strategies, the internal structure and content of web pages must be changed, either by inserting JavaScript code (called page tagging) or with dynamic server page code. Several web sites use session tracking systems (Web Analytics Tools) provided by external services, usually by including third-party JavaScript code in their web pages. In this case, usage data are forwarded to the third party's servers and processed there. However, some site owners may prefer to avoid proactive approaches because of security concerns or reluctance to modify a web site's internal structure; they instead process only the raw server logs containing access requests. We therefore focus this work on reactive approaches and propose a new session construction method to meet these demands. The contributions of this paper are listed below:

• We categorize previous session construction methods and explain the drawbacks of their session models. We then propose a new session model, called the “link-based session model”, and introduce its formal properties.
• We propose a new session construction method, called Smart-SRA (Smart Session Reconstruction Algorithm). It generates link-based web user sessions by inserting missing link information (pages served from the client/proxy cache). We also prove that the sessions produced by Smart-SRA satisfy the properties defined in the link-based session model.
• We perform extensive experiments on a real data set to determine the accuracy and quality of sessions constructed by Smart-SRA. Our experiments show that Smart-SRA produces at least 30% more accurate sessions than the best-known reactive session construction methods. We also conclude that link-based sessions significantly improve page prediction performance.

This paper is organized as follows. In Section 2, we summarize the session construction heuristics studied in the literature and describe the drawbacks of these methods. Section 3 introduces the link-based session model and provides the motivation for this work. Section 4 introduces the Smart-SRA algorithm and gives a detailed description. We present the experimental results in Section 5. Finally, our conclusions are discussed in Section 6.

Section snippets

Time-oriented heuristics

Time-oriented heuristics [9] are based on limitations of total session time or page-stay time. They are divided into three categories, according to the threshold values they use:

• In the first time-oriented heuristic, the total duration of a session is limited by a predefined upper bound, usually 30 min according to [6]. In this type of session reconstruction, a new page can be appended to the current session if the time difference does not violate this total session duration. Otherwise, a new
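
The first time-oriented heuristic can be sketched as follows. This is a minimal illustration, assuming that a request falling outside the bound opens a new session and that the input is one user's requests already sorted by time; the function name `split_by_total_duration` and the sample requests are hypothetical.

```python
from datetime import datetime, timedelta


def split_by_total_duration(requests, max_duration=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, url) requests into sessions.

    A request joins the current session only if it falls within max_duration
    of that session's first request; otherwise it starts a new session
    (an assumption of this sketch).
    """
    sessions = []
    for timestamp, url in requests:
        # sessions[-1][0][0] is the timestamp of the current session's first request
        if sessions and timestamp - sessions[-1][0][0] <= max_duration:
            sessions[-1].append((timestamp, url))
        else:
            sessions.append([(timestamp, url)])
    return sessions


# Hypothetical access stream of a single client.
start = datetime(2012, 3, 1, 9, 0)
requests = [
    (start, "/index.html"),
    (start + timedelta(minutes=10), "/courses.html"),
    (start + timedelta(minutes=29), "/staff.html"),
    (start + timedelta(minutes=45), "/index.html"),
]
sessions = split_by_total_duration(requests)
```

With the 30 min bound, the first three requests stay in one session and the request at minute 45 begins a second one. Note that the split depends only on timestamps, so nothing guarantees that consecutive pages in a session are connected by a hyperlink.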

The link-based session model and motivation for this work

As web users surf the web, they may navigate to new web pages by selecting links on the current page. They can also return to a previously visited page with the browser's “back” button or links on the current page. Previously visited pages are often provided by the browser cache or proxy servers to reduce the network traffic and/or serve page requests more quickly. In general, forward movements by web users correspond to information searches. During forward movements, the contents of two
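
To make the idea of navigating by following links concrete, the sketch below checks whether every consecutive pair of pages in a session is joined by a hyperlink in the web graph. It is not the Smart-SRA algorithm; the function name `follows_links`, the page labels, and the small site topology are assumptions invented for this example.

```python
def follows_links(session, web_graph):
    """Return True if every consecutive page pair in session is joined by a hyperlink."""
    return all(
        nxt in web_graph.get(cur, set())
        for cur, nxt in zip(session, session[1:])
    )


# Hypothetical site topology: page -> set of pages it links to.
web_graph = {
    "P1": {"P2", "P3"},
    "P2": {"P4"},
    "P3": set(),
    "P4": set(),
}

# A path obtained purely by following links.
link_path = ["P1", "P2", "P4"]

# A time-ordered request sequence: the user went back from P2 to P1
# (served from the cache, so not logged) and then followed the link to P3.
time_ordered = ["P1", "P2", "P3"]

link_path_ok = follows_links(link_path, web_graph)
time_ordered_ok = follows_links(time_ordered, web_graph)
```

Here `link_path` passes the check, while `time_ordered` fails because P2 has no link to P3, which is the kind of indirect relationship between subsequent pages that simple session models can contain.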

Smart-SRA

In this section, we propose a new algorithm, named Smart-SRA (Smart Session Reconstruction Algorithm), for producing link-based sessions from raw server log files. Smart-SRA uses four important rules of the link-based session model while processing page views in server logs. Smart-SRA notably eliminates backward browser movements and it preserves the timestamp order of web pages. The two main phases of Smart-SRA are explained below:

• In the first phase, the access streams of web users are

Experimental results

This section begins with a description of the accuracy metric for comparing different session construction methods. In the second subsection, we compare the accuracies of the sessions generated both by Smart-SRA and previous heuristics using real-world data. In the third subsection, we compare the referrer-based version and the original version of Smart-SRA in accuracy and session length using large-scale data collected from server logs of www.ceng.metu.edu.tr. In the last section, we present a

Conclusions

In this study, we introduced a new session construction heuristic, called Smart-SRA, to address the web usage mining problem. Our work makes several novel contributions to solving this problem, such as the classification of session construction methods, using link information for session construction and evaluation methods for comparing different session construction methods. We verified the quality and accuracy of the sessions generated by the Smart-SRA algorithm in experiments on both small-

References (36)

Murat Ali Bayir et al. Integration of topological measures for eliminating non-specific interactions in protein interaction networks. Discrete Applied Mathematics (2009)
Lara D. Catledge et al. Characterizing browsing strategies in the world-wide web. Computer Networks and ISDN Systems (1995)
Federico Michele Facca et al. Mining interesting knowledge from weblogs: a survey. Data & Knowledge Engineering (2005)
Haibin Liu et al. Combined mining of web server logs and web contents for classifying user navigation patterns and predicting users' future requests. Data & Knowledge Engineering (2007)
Sungjune Park et al. Sequence-based clustering for web usage mining: a new experimental framework and ANN-enhanced k-means algorithm. Data & Knowledge Engineering (2008)
Yongqiao Xiao et al. Efficient mining of traversal patterns. Data & Knowledge Engineering (2001)
Rakesh Agrawal et al. Fast algorithms for mining association rules in large databases
Ranieri Baraglia et al. Dynamic personalization of web sites without user intervention. Communications of the ACM (February 2007)
José Borges et al. Generating dynamic higher-order Markov models in web usage mining
Sylvain Brohée et al. Evaluation of clustering algorithms for protein–protein interaction networks. BMC Bioinformatics (2006)
Ming-Syan Chen et al. Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering (1998)
Robert Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of...
Robert Cooley et al. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems (1999)
Robert Cooley et al. Discovery of interesting usage patterns from web data
Robert F. Dell et al. Web user session reconstruction using integer programming
Robert F. Dell et al. Fast combinatorial algorithm for web user session reconstruction
Robert F. Dell et al. Web user session reconstruction with back button browsing
Debora Donato. The web as a graph: how far we are. ACM Transactions on Internet Technology (2007)

Cited by (25)

Identifying web sessions with simulated annealing
2014, Expert Systems with Applications
Citation Excerpt :
The ideal situation would consider the existence of an algorithm that can process the information in real time by requiring a short computing time. A novel algorithm for solving the WSP was presented in Bayir, Toroslu, Demirbas, and Cosar (2012). The algorithm is based on graph modeling of the sessions that are constructed considering maximal path length, hyperlink topology and back button browsing.
Delivery of efficient service through a web site makes it compulsory in the redesigning stage to take into account the behavior of the users, which can be studied by means of a web log file that partially records information about user visits. The reconstruction of all of the sequences of pages that are visited by users who browse a web site is known as the web sessionization problem, and it has been formulated by means of an integer programming model; however, because a web log can accumulate a large amount of information, it is necessary to reconstruct the sessions over a period of weeks or months, thus the solution to this problem requires a long computational processing time. This paper presents a heuristic approach based on simulated annealing for the sessionization problem. Using this approach, it has been possible to reduce the processing time up to 166 times compared to the time that is required for the integer programming model. Furthermore, the metaheuristic solution finds new optimum values, which achieve increases on the order of 17% in the best cases.
Web usage mining for analysing elder self-care behavior patterns
2013, Expert Systems with Applications
Citation Excerpt :
Web Usage Mining is an area of Web Mining that deals with extracting interesting and useful knowledge from logging information produced by Web servers (Facca & Lanzi 2005; Sajid, Zafar, & Asghar, 2010; Wang & Lee, 2011). Many researchers have applied Web usage mining for characterizing usage based on navigation patterns (Bayir, Toroslu, Demirbas, & Cosar 2012; Chen, Bhowmick, & Nejdl, 2009), for behavior prediction (Dimopoulos, Makris, Panagis, Theodoridis, & Tsakalidis, 2010), for personalized recommendation (Mobasher, Cooley, & Srivastava, 2000; Park, Kim, Choi, & Kim, 2012; Pierrakos, Paliouras, Papatheodorou, & Spyropoulos, 2003) and for web service improvement (Carmona et al., 2012). The main purpose of this study is to apply data mining techniques, including statistical analysis, clustering, association rules and sequential pattern discovery, for mining Web usage information from ComCare server logs to understand elder self-care behavior patterns.
The rapid growth of the elderly population has increased the need to support elders in maintaining independent and healthy lifestyles in their homes rather than through more expensive and isolated care facilities. Self-care can improve the competence of elderly participants in managing their own health conditions without leaving home. The main purpose of this study is to understand the self-care behavior of elderly participants in a developed self-care service system that provides self-care service and to analyze the daily self-care activities and health status of elders who live at home alone.
To understand elder self-care patterns, log data from actual cases of elder self-care service were collected and analysed by Web usage mining. This study analysed 3391 sessions of 157 elders for the month of March, 2012. First, self-care use cycle, time, function numbers, and the depth and extent (range) of services were statistically analysed. Association rules were then used for data mining to find relationships between these functions of self-care behavior. Second, data from interest-based representation schemes were used to construct elder sessions. The ART2-enhanced K-means algorithm was then used to mine cluster patterns. Finally, sequential profiles for elder self-care behavior patterns were captured by applying sequence-based representation schemes in association with Markov models and ART2-enhanced K-means clustering algorithms for sequence behavior mining cluster patterns for the elders. The analysis results can be used for research in medicine, public health, nursing and psychology and for policy-making in the health care domain.
IRPDP_HT2: a scalable data pre-processing method in web usage mining using Hadoop MapReduce
2023, Soft Computing
Predictive Behavior Modeling Through Web Graphs: Enhancing Next Page Prediction Using Dynamic Link Repository
2023, Proceedings - 2023 22nd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2023
Maximal paths recipe for constructing Web user sessions
2022, World Wide Web
IRPDP_HT2: A Scalable Data Pre-processing Method in Web Usage Mining using Hadoop-MapReduce
2022, Research Square


Murat Ali Bayir is currently a member of technical staff at Google Inc. He received his Ph.D. degree from the CSE Department of SUNY at Buffalo in 2010. He received his BS and MS degrees from the Computer Engineering Department of Middle East Technical University, with a minor degree in Math, in 2003 and 2006, respectively. His main research areas are Data Mining, Mobile Computing, Graph Theory and Social Network applications. He has published papers in international conferences, workshops and journals including WWW, WISE, WoWMoM, IEEE Transactions on SMC, Discrete Applied Mathematics, and Elsevier Mobile and Pervasive Computing. His MS thesis, titled “A New Reactive Method for Processing Web Usage Data”, led to an industrial research project sponsored by the National Science Foundation of Turkey.

Ismail H. Toroslu has been with the Computer Engineering Department of the Middle East Technical University since 1993. Prof. Toroslu received his Ph.D. (1993) degree in Computer Science from Northwestern University and B.S. (1987) and M.S. (1989) degrees in Computer Engineering from the Middle East Technical University and Bilkent University, respectively. Between 2000 and 2002, he was a visiting associate professor at the University of Central Florida. Dr. Toroslu's research focuses on Data Mining, Database Systems, Graph Theory, Logic Programming and Algorithms. He has published several papers in prestigious conferences and journals including WWW, ICDE, VLDB, IEEE TKDE, Information Systems and Bioinformatics. Dr. Toroslu received an IBM Faculty Award in 2009. He was on the organizing committee of ICDE 2007 and was program committee co-chair of ISCIS 2009 and ISCIS 2010.

Murat Demirbas is currently an Associate Professor at the CSE Department of SUNY Buffalo. He received his Master's and Ph.D. degrees in Computer Science from The Ohio State University in 2000 and 2004. While at The Ohio State University, Murat was involved in the development and deployment of a large-scale wireless sensor network, “Line in the Sand”, for detection, classification, and tracking, which paved the way to the “ExScal” network with 1000 nodes. After a one-year post-doc with the Theory of Computing Group at MIT, Murat joined the Computer Science and Engineering Department of SUNY Buffalo. His research interests are in the areas of distributed systems, social networks and mobile computing. Murat received an NSF CAREER award in 2008, an Exceptional Scholars-Young Investigator award in 2010 and Google Research Awards in 2010 and 2011.

Ahmet Cosar received his BS, MS, and PhD degrees, all in computer engineering, from Middle East Technical University (METU), Bilkent University, and the University of Minnesota, respectively. He has been a faculty member in the METU Computer Engineering department since 1996. His research interests are in distributed databases, data mining, e-commerce, and web-based software architectures. Dr. Cosar has also worked as a visiting faculty member at the University of Sharjah (UAE) and Manas University (Kyrgyzstan) and has taught a course at the American University of Central Asia.

¹: The work was done during PhD Study at SUNY at Buffalo; the author is currently with Google Inc.

²: Author is supported by The Scientific and Technical Research Council of Turkey with project no 109E239.
