Elsevier

Data & Knowledge Engineering

Volume 73, March 2012, Pages 58-72

Discovering better navigation sequences for the session construction problem

https://doi.org/10.1016/j.datak.2011.11.005

Abstract

In this paper, we propose a novel page view based session model and session construction method to address the Web Usage Mining (WUM) problem. Unlike the simple session models, where sessions are sequences of web pages requested from the server (or served from a browser/proxy cache) and viewed in the browser (which may not guarantee a direct relationship between subsequent web pages in the session), we define a more realistic session model in which a session is a set of paths traversed in the web graph that corresponds to a user navigation performed by following links on web pages. We define the session construction process from raw server logs as a new graph problem and present a novel algorithm, Smart-SRA (Smart Session Reconstruction Algorithm), to solve this problem efficiently. An experimental evaluation based on data collected from real web access scenarios showed that Smart-SRA produces more accurate user sessions than the session construction methods found in the literature.

Introduction

In the last fifteen years, the World Wide Web (WWW) has become one of the largest information sources, with rapid growth in the number of web sites, the number of web pages, and the multimedia content provided (e.g., pictures, music and videos). The term “Web” usually refers to hypertext information transmitted via HTML pages, PDF files and other such documents. As in classical data mining, web mining aims to discover significant patterns from the WWW. The data available on the WWW can be mined along three main dimensions: web content mining, web structure mining and web usage mining.

This work investigates the third dimension of web mining, namely, web usage mining (WUM) [8], [15], [26], [29], [31], which deals with extracting interesting knowledge from web log data produced by web servers. WUM has various application areas, such as predicting future requests [21], [22], web personalization [2], navigation sequence clustering [25], and providing guidance learned from user access behaviors [19]. Several e-commerce web sites use these applications to enhance purchasing experiences and customer satisfaction and thus increase their profits.

The success of the WUM applications mentioned above significantly depends on session construction because the quality (i.e., accuracy) of the constructed sessions affects later phases of WUM, such as pattern discovery. If page-view requests are not grouped correctly by session construction methods (in the first phase of web usage mining), the performance of these WUM applications suffers. To address these issues, we focus on the session construction problem alone.

The typical raw data for WUM are obtained from the access logs of web servers. When a user agent (Internet Explorer, Mozilla, Safari, etc.) requests a URL in a web server's domain, the information related to that request is recorded in the web server's access log file. Most access log files keep their data in the Common Log Format (CLF), where each page view request is recorded as a line in the access log file. Each CLF record is a tuple containing the following attributes:

  • Client machine's IP address

  • Access date and time

  • Request method (GET or POST)

  • URL of the page accessed

  • HTTP version (HTTP/1.0, HTTP/1.1)

  • Success or return code

  • Number of bytes transmitted

For session construction, the IP address, request time, and requested URL are the only data needed from the user web access log file to obtain the users' navigation paths. Most of the session construction methods consider only these three fields when processing web server logs. Session construction heuristics differ because, during session processing, some use time information and others use the user's navigation information [30], [32].
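As a minimal sketch of how these three fields might be pulled out of a CLF record, the following Python snippet parses one log line. The regular expression, the function name `parse_clf_line`, and the sample line (including its IP address and URL) are illustrative assumptions, not part of the original work.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "method url protocol" status bytes
# (pattern is an illustrative assumption; real logs may need a more tolerant one)
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return (ip, timestamp, url) for one CLF record, or None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    # CLF timestamps look like 10/Oct/2011:13:55:36 +0200
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("ip"), timestamp, match.group("url")


# Hypothetical sample record, made up for this example.
sample = '192.0.2.10 - - [10/Oct/2011:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326'
record = parse_clf_line(sample)
print(record)
```

Grouping the resulting tuples by IP address and sorting them by timestamp would give the per-client request streams that session construction heuristics start from.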

Producing accurate user sessions and navigation patterns is not an easy task since HTTP is stateless and connectionless. Additionally, in reactive session construction [9], [10], where it is impossible to know (or generate) client data (e.g., cookies) to identify individual users, all users behind a proxy server have the same IP address and will thus be seen as a single client on the server side. These problems can be handled by proactive strategies [16], [27], e.g., using cookies or adding client-specific information into each page request using dynamic server page code. However, to employ proactive strategies, the internal structure and content of web pages must be changed, either by inserting JavaScript code (called page tagging) or with dynamic server page code. Several web sites use session tracking systems (Web Analytics Tools) provided by external services, usually by including third-party JavaScript code in their web pages. In this case, usage data are forwarded to the third party's servers and processed there. However, some site owners may prefer to avoid proactive approaches because of security concerns or a reluctance to modify a web site's internal structure; they instead process only the raw server logs containing access requests. We therefore focus this work on reactive approaches and propose a new session construction method to meet these demands. The contributions of this paper are listed below:

  • We categorize previous session construction methods and explain the drawbacks of their session models. We then propose a new session model, called the “link-based session model”, and introduce its formal properties.

  • We propose a new session construction method, called Smart-SRA (Smart Session Reconstruction Algorithm). It generates link-based web user sessions by inserting missing link information (pages served from the client/proxy cache). We also prove that the sessions produced by Smart-SRA satisfy the properties defined in the link-based session model.

  • We perform extensive experiments on a real data set to determine the accuracy and quality of sessions constructed by Smart-SRA. Our experiments show that Smart-SRA produces at least 30% more accurate sessions than the best-known reactive session construction methods. We also conclude that link-based sessions significantly improve page prediction performance.

This paper is organized as follows. In Section 2, we summarize the session construction heuristics studied in the literature and describe the drawbacks of these methods. Section 3 introduces the link-based session model and provides the motivation for this work. Section 4 introduces the Smart-SRA algorithm and gives a detailed description. We present the experimental results in Section 5. Finally, our conclusions are discussed in Section 6.

Section snippets

Time-oriented heuristics

Time-oriented heuristics [9] are based on limitations of total session time or page-stay time. They are divided into three categories, according to the threshold values they use:

  • In the first time-oriented heuristic, the total duration of a session is limited by a predefined upper bound, usually 30 min according to [6]. In this type of session reconstruction, a new page can be appended to the current session if the time difference does not violate this total session duration. Otherwise, a new
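A rough sketch of this first heuristic, under stated assumptions: the requests belong to a single client and are already sorted by time, and a request that would exceed the total-duration bound starts a new session. The function name `split_by_total_duration` and the sample URLs are hypothetical.

```python
from datetime import datetime, timedelta


def split_by_total_duration(requests, max_duration=timedelta(minutes=30)):
    """Split one client's time-ordered (timestamp, url) requests into sessions.

    A request joins the current session only if it falls within
    max_duration of that session's first request (assumed rule).
    """
    sessions = []
    current = []
    for timestamp, url in requests:
        # Close the current session once the total-duration bound is exceeded.
        if current and timestamp - current[0][0] > max_duration:
            sessions.append(current)
            current = []
        current.append((timestamp, url))
    if current:
        sessions.append(current)
    return sessions


# Hypothetical request stream for one client.
start = datetime(2011, 10, 10, 13, 0)
requests = [
    (start, "/index.html"),
    (start + timedelta(minutes=10), "/courses.html"),
    (start + timedelta(minutes=45), "/people.html"),
]
sessions = split_by_total_duration(requests)
print([[url for _, url in session] for session in sessions])
```

The 30-minute default mirrors the bound mentioned above; the threshold is a parameter so other values can be tried.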

The link-based session model and motivation for this work

As web users surf the web, they may navigate to new web pages by selecting links on the current page. They can also return to a previously visited page with the browser's “back” button or links on the current page. Previously visited pages are often provided by the browser cache or proxy servers to reduce the network traffic and/or serve page requests more quickly. In general, forward movements by web users correspond to information searches. During forward movements, the contents of two
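To illustrate only the basic idea stated in the abstract, that consecutive pages in a session should be directly related by a hyperlink, the sketch below checks whether a page sequence is a path in a web graph. The adjacency-set representation, the function name `is_link_path`, and the example pages are assumptions made for illustration; the model's formal properties are not reproduced here.

```python
def is_link_path(pages, web_graph):
    """Return True if every consecutive pair of pages is joined by a hyperlink.

    web_graph maps each page URL to the set of URLs it links to
    (a hypothetical representation chosen for this sketch).
    """
    return all(
        nxt in web_graph.get(cur, set())
        for cur, nxt in zip(pages, pages[1:])
    )


# Hypothetical site topology.
web_graph = {
    "/index.html": {"/courses.html", "/people.html"},
    "/courses.html": {"/course1.html"},
    "/people.html": set(),
}

print(is_link_path(["/index.html", "/courses.html", "/course1.html"], web_graph))
print(is_link_path(["/courses.html", "/people.html"], web_graph))
```

The second sequence fails the check because `/courses.html` has no link to `/people.html`, which is the kind of unrelated page pair a purely time-based session can contain.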

Smart-SRA

In this section, we propose a new algorithm, named Smart-SRA (Smart Session Reconstruction Algorithm), for producing link-based sessions from raw server log files. Smart-SRA uses four important rules of the link-based session model while processing page views in server logs. Smart-SRA notably eliminates backward browser movements and it preserves the timestamp order of web pages. The two main phases of Smart-SRA are explained below:

  • In the first phase, the access streams of web users are

Experimental results

This section begins with a description of the accuracy metric for comparing different session construction methods. In the second subsection, we compare the accuracies of the sessions generated both by Smart-SRA and previous heuristics using real-world data. In the third subsection, we compare the referrer-based version and the original version of Smart-SRA in accuracy and session length using large-scale data collected from server logs of www.ceng.metu.edu.tr. In the last section, we present a

Conclusions

In this study, we introduced a new session construction heuristic, called Smart-SRA, to address the web usage mining problem. Our work makes several novel contributions to solving this problem, such as the classification of session construction methods, using link information for session construction and evaluation methods for comparing different session construction methods. We verified the quality and accuracy of the sessions generated by the Smart-SRA algorithm in experiments on both small-


References (36)

  • Ming-Syan Chen et al. Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering (1998)

  • Robert Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of...

  • Robert Cooley et al. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems (1999)

  • Robert Cooley et al. Discovery of interesting usage patterns from web data

  • Robert F. Dell et al. Web user session reconstruction using integer programming

  • Robert F. Dell et al. Fast combinatorial algorithm for web user session reconstruction

  • Robert F. Dell et al. Web user session reconstruction with back button browsing

  • Debora Donato. The web as a graph: how far we are. ACM Transactions on Internet Technology (2007)

Murat Ali Bayir is currently a member of technical staff at Google Inc. He received his Ph.D. degree from the CSE Department of SUNY at Buffalo in 2010. He received his BS and MS degrees from the Computer Engineering Department of Middle East Technical University, with a minor degree in Math, in 2003 and 2006, respectively. His main research areas are Data Mining, Mobile Computing, Graph Theory and Social Network applications. He has published papers in international conferences, workshops and journals including WWW, WISE, WoWMoM, IEEE Transactions on SMC, Discrete Applied Mathematics, and Elsevier Mobile and Pervasive Computing. His MS thesis, titled “A New Reactive Method for Processing Web Usage Data”, has led to an industrial research project sponsored by National Science Foundation of Turkey.

Ismail H. Toroslu has been with the Computer Engineering Department of the Middle East Technical University since 1993. Prof. Toroslu received his Ph.D. (1993) degree in Computer Science from Northwestern University and B.S. (1987) and M.S. (1989) degrees in Computer Engineering from the Middle East Technical University and Bilkent University, respectively. Between 2000 and 2002, he was a visiting associate professor at the University of Central Florida. Dr. Toroslu's research focuses on Data Mining, Database Systems, Graph Theory, Logic Programming and Algorithms. He has published several papers in prestigious conferences and journals including WWW, ICDE, VLDB, IEEE TKDE, Information Systems and Bioinformatics. Dr. Toroslu received an IBM Faculty Award in 2009. He was on the organizing committee of ICDE 2007 and was program committee co-chair of ISCIS 2009 and ISCIS 2010.

Murat Demirbas is currently an Associate Professor at the CSE Department of SUNY Buffalo. He received his Master's and Ph.D. degrees in Computer Science from The Ohio State University in 2000 and 2004. While at The Ohio State University, Murat was involved in the development and deployment of a large-scale wireless sensor network, “Line in the Sand”, for detection, classification, and tracking, which paved the way to the “ExScal” network with 1000 nodes. After a one-year post-doc with the Theory of Computing Group at MIT, Murat joined the Computer Science and Engineering Department of SUNY Buffalo. His research interests are in the areas of distributed systems, social networks and mobile computing. Murat received an NSF CAREER award in 2008, an Exceptional Scholars-Young Investigator award in 2010 and a Google Research Award in 2010 and 2011.

Ahmet Cosar received his BS, MS, and PhD degrees, all in computer engineering, from Middle East Technical University (METU), Bilkent University, and University of Minnesota, respectively. He has been a faculty member in the METU Computer Engineering department since 1996. His research interests are in distributed databases, data mining, e-commerce, and web-based software architectures. Dr. Cosar has also worked as a visiting faculty member at the University of Sharjah (UAE) and Manas University (Kyrgyzstan), and has taught a course at the American University of Central Asia.

1 The work was done during the author's PhD study at SUNY at Buffalo; the author is currently with Google Inc.

2 The author is supported by The Scientific and Technical Research Council of Turkey under project no. 109E239.
