Efficient mining of traversal patterns

https://doi.org/10.1016/S0169-023X(01)00039-8

Abstract

A new problem of mining traversal patterns from Web access logs is introduced. The traversal patterns are defined to keep duplicates as well as consecutive ordering within the sessions. An efficient algorithm is then proposed. The algorithm is online, allowing the user to see incremental results with respect to the part of the database scanned so far. The algorithm also adapts to large databases through dynamic compressions and effective pruning. Finally, the algorithm is evaluated through experiments with real Web logs.

Introduction

Mining of Web access patterns made by Web users has recently attracted much research interest [5], [6], [9], [16]. Mining of this clickstream data can be based on one person's usage, or it can be performed by a site examining how all users access its pages. The former technique can be used to develop user profiles that help improve the precision of search engines. The latter can assist Web masters in creating a more user-friendly set of Web pages and in improving Web server performance through prefetching or pushing. Since we are primarily interested in this type of Web mining, we illustrate it with Example 1.

Example 1

The Web master at ABC Corp. finds out that a high percentage of users have the following pattern of reference to pages: 〈A,B,A,C〉. This means that a user accesses page A, then page B, then goes back to page A, and finally to page C. Based on this observation, he determines that a direct link from page B to page C is needed. He then adds this link.

Notice that in this example we are interested in sequences of page references, including any backward traversals. Here a backward traversal is a reference to a page visited earlier. We are also interested in contiguous page references, not just the set of pages that are visited. Such page references can be obtained from the raw Web logs. A typical log entry on an Apache Web server contains the source IP address, the timestamp, the page accessed, and the status and size of the page, as shown below.

216.250.143.112 - - [29/Feb/2000:02:05:00 -0600] "GET /cse/ HTTP/1.1" 200 3043

From the raw Web logs, we can extract access sessions, which consist of the list of pages visited by the users in a time interval. Traversal patterns which are accessed frequently by the users are then mined from the access sessions. More details about data preparation for traversal pattern mining are given in Section 6 of this paper.
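
To make this concrete, the following C++ sketch shows one simple way to turn raw log lines like the one above into access sessions: the first token is taken as the source IP, the requested page is read from the quoted request, and hits from the same IP separated by more than a fixed timeout start a new session. The parsing shortcuts, the names parseLine and sessionize, and the 30-minute timeout are our own illustrative assumptions; the paper's actual data preparation is described in Section 6.

#include <map>
#include <sstream>
#include <string>
#include <vector>

// One parsed hit: who requested which page, and when (in seconds).
struct Hit {
    std::string ip;
    long seconds = 0;   // timestamp reduced to seconds for gap checks
    std::string page;
};

// Rough parse of one Apache common-log line such as the example above:
// the first token is the IP, the page is the second token of the quoted
// request.  Timestamp parsing is omitted here and passed in for brevity.
Hit parseLine(const std::string& line, long seconds) {
    Hit h;
    h.seconds = seconds;
    std::istringstream in(line);
    in >> h.ip;                                        // e.g. 216.250.143.112
    std::string::size_type q1 = line.find('"');
    std::string::size_type q2 = line.find('"', q1 + 1);
    if (q1 != std::string::npos && q2 != std::string::npos) {
        std::istringstream req(line.substr(q1 + 1, q2 - q1 - 1));
        std::string method;
        req >> method >> h.page;                       // e.g. GET /cse/
    }
    return h;
}

// Group time-ordered hits into per-IP sessions: a gap larger than
// `timeout` seconds (30 minutes here, an assumed value) starts a new one.
std::vector<std::vector<std::string>> sessionize(const std::vector<Hit>& hits,
                                                 long timeout = 30 * 60) {
    std::vector<std::vector<std::string>> sessions;
    std::map<std::string, long> lastSeen;      // ip -> last timestamp
    std::map<std::string, size_t> openIdx;     // ip -> index of its open session
    for (const Hit& h : hits) {
        auto it = lastSeen.find(h.ip);
        if (it == lastSeen.end() || h.seconds - it->second > timeout) {
            openIdx[h.ip] = sessions.size();   // start a new session for this IP
            sessions.emplace_back();
        }
        sessions[openIdx[h.ip]].push_back(h.page);
        lastSeen[h.ip] = h.seconds;
    }
    return sessions;
}

A session such as 〈A,B,A,C〉 from Example 1 would then appear as one element of the vector returned by sessionize.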

In this paper we investigate techniques to discover frequently used contiguous sequences of page references, which we call maximal frequent sequences (MFS). Although there has been previous research into the mining of traversal patterns, the mining of this type of pattern is new. In this paper we first define the problem and then relate it to previous research. In Section 3 we provide an overview of our algorithm, called online adaptive traversal (OAT) pattern mining, to mine MFS. Two major features of this algorithm, i.e., being both online and adaptive, are described in Section 4 (Online pattern mining) and Section 5 (Adaptive pattern mining), respectively. In Section 6 we investigate the performance of the algorithm as implemented against a real set of traversal data. We then conclude the paper.
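
As a reference point for the pattern definition only (this is not the OAT algorithm, and the name naiveMFS and the once-per-session support counting are our own assumptions), the brute-force baseline below enumerates every contiguous subsequence of every session, keeping duplicates and order, and then retains those that are frequent and maximal.

#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <vector>

using Session = std::vector<std::string>;   // e.g. {"A", "B", "A", "C"}

// Brute-force baseline for illustration only: a sequence is frequent if it
// occurs as a contiguous subsequence in at least `minsup` sessions, and
// maximal if no longer frequent sequence contains it contiguously.
std::vector<Session> naiveMFS(const std::vector<Session>& sessions, int minsup) {
    std::map<Session, int> support;
    for (const Session& s : sessions) {
        std::set<Session> seen;                        // count once per session
        for (size_t i = 0; i < s.size(); ++i)
            for (size_t len = 1; i + len <= s.size(); ++len)
                seen.insert(Session(s.begin() + i, s.begin() + i + len));
        for (const Session& sub : seen) ++support[sub];
    }

    std::vector<Session> frequent;
    for (const auto& kv : support)
        if (kv.second >= minsup) frequent.push_back(kv.first);

    // Keep a frequent sequence only if no longer frequent sequence contains
    // it as a contiguous subsequence.
    std::vector<Session> maximal;
    for (const Session& a : frequent) {
        bool contained = false;
        for (const Session& b : frequent) {
            if (b.size() <= a.size()) continue;
            for (size_t i = 0; !contained && i + a.size() <= b.size(); ++i)
                contained = std::equal(a.begin(), a.end(), b.begin() + i);
            if (contained) break;
        }
        if (!contained) maximal.push_back(a);
    }
    return maximal;
}

The exponential cost of this enumeration over long sessions is exactly what motivates the suffix-tree-based, online and adaptive OAT algorithm outlined in the following sections.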

Section snippets

Problem statement and related work

In this section we first provide definitions needed to more formally state the problem. We then discuss how this problem relates to previous research in the area of mining traversal patterns.

Overview of algorithm

We propose an online adaptive algorithm, OAT, to detect MFS. The online property ensures that the sessions can be examined and mined incrementally as they arrive. To be online, the algorithm adopts a suffix tree data structure. The adaptive property allows the algorithm to use the available main memory efficiently: if there is not enough main memory, the algorithm reduces its memory requirements and continues. To be adaptive, the algorithm uses two pruning techniques and compresses the suffix tree.

Online pattern mining

The online feature of the algorithm includes online construction of the suffix tree and online extraction of patterns, which are described in Sections 4.1 (Online suffix tree construction) and 4.2 (Online pattern extraction), respectively.
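
The fragment below illustrates the online idea with a simplified, uncompressed suffix trie that carries a support count per node; it is not the compressed generalized suffix tree used in the paper, and the class and member names are our own. Each arriving session is folded in immediately by inserting all of its suffixes, so the support of any contiguous pattern can be read off at any point during the scan.

#include <map>
#include <memory>
#include <string>
#include <vector>

// A simplified, uncompressed suffix trie with per-node support counts.
struct TrieNode {
    int support = 0;        // number of sessions containing this path
    int lastSession = -1;   // guard so one session is counted at most once
    std::map<std::string, std::unique_ptr<TrieNode>> children;
};

class OnlineSuffixTrie {
public:
    // Insert every suffix of the session; each node on the way represents a
    // contiguous subsequence, and its support is incremented once per session.
    void addSession(const std::vector<std::string>& session, int sessionId) {
        for (size_t start = 0; start < session.size(); ++start) {
            TrieNode* node = &root_;
            for (size_t i = start; i < session.size(); ++i) {
                std::unique_ptr<TrieNode>& child = node->children[session[i]];
                if (!child) child = std::make_unique<TrieNode>();
                node = child.get();
                if (node->lastSession != sessionId) {
                    node->lastSession = sessionId;
                    ++node->support;
                }
            }
        }
    }

    // Support of an arbitrary contiguous pattern at any point in the scan.
    int support(const std::vector<std::string>& pattern) const {
        const TrieNode* node = &root_;
        for (const std::string& page : pattern) {
            auto it = node->children.find(page);
            if (it == node->children.end()) return 0;
            node = it->second.get();
        }
        return node->support;
    }

private:
    TrieNode root_;
};

Because the structure is updated session by session, the currently frequent sequences can be extracted after any prefix of the log has been scanned, which is what the online property refers to.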

Adaptive pattern mining

Whenever there is insufficient main memory, the algorithm reduces its main memory requirement as shown in function OAT-Compress in Function 2. The reduction is achieved in three ways: local pruning, cumulative pruning, and compression of the suffix tree. We call each such main memory reduction a compression, and we call the scanned part of the database between two consecutive compressions a partition. The notation in Table 3 is used in the following description.
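
The sketch below, which reuses the TrieNode type from the previous fragment, gives only a flavor of such a memory reduction: subtrees whose support has fallen below a fraction of the sessions scanned so far are discarded. The threshold rule and the name prune are our own placeholders and should not be read as the paper's local pruning, cumulative pruning, or suffix-tree compression, which are defined precisely via the notation of Table 3.

// Generic count-based pruning pass over the trie from the previous sketch
// (an illustration only, not the paper's OAT-Compress).  Subtrees whose
// support is below `minFraction` of the sessions scanned so far are dropped
// to free memory; in practice such schemes are paired with an error bound
// so that patterns pruned too early can still be accounted for.
void prune(TrieNode* node, int sessionsScanned, double minFraction) {
    for (auto it = node->children.begin(); it != node->children.end(); ) {
        TrieNode* child = it->second.get();
        if (child->support < minFraction * sessionsScanned) {
            it = node->children.erase(it);              // drop the whole subtree
        } else {
            prune(child, sessionsScanned, minFraction); // recurse into kept child
            ++it;
        }
    }
}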

Experimental results

We report on experiments using the Web logs collected on our School of Engineering Web server. We first describe the data preparation, then report the response times and the sizes of the suffix trees. All experiments were conducted on a Dell PC with a 650 MHz Pentium III processor and 128 MB of RAM, running Red Hat Linux 6.2. OAT was implemented in C++.

Conclusion

We introduced a new problem of mining traversal patterns, which keep duplicates as well as consecutive ordering in the sessions. We then proposed an online and adaptive algorithm for this problem. Suffix trees are used for efficient counting, and adaptability to large databases is achieved through dynamic compressions and effective pruning. Experiments with real Web logs demonstrated the effectiveness of our approach.

Future research will include looking at applying the OAT technique to uncover

References (19)

  • R. Agrawal et al., Mining association rules between sets of items in large databases
  • R. Agrawal et al., Fast algorithms for mining association rules in large databases
  • R. Agrawal et al., Mining sequential patterns
  • P. Bieganski et al., Generalized suffix trees for biological sequence data: applications and implementation
  • A.G. Buchner et al., Navigation pattern discovery from internet data
  • M.-S. Chen et al., Efficient data mining for path traversal patterns, IEEE Trans. Knowledge Data Eng. (1998)
  • C. Hidber, Online association rule mining
  • L.C. Kwong Hui, Color set size problem with application to string matching
  • B. Lan et al., Making web servers pushier

There are more references available in the full text version of this article.

Yongqiao Xiao received his Ph.D. degree in Computer Science from Southern Methodist University in 2000, M.S. degree in Computer Science from Zhongshan University in 1995, and BS degree in Accounting and Information Systems from Renmin University of China in 1992, respectively. He is currently working with Net Perceptions, Inc. His research interests include Data Mining, Clickstream Analysis, Parallel Computing, and Electronic Commerce.

Margaret (Maggie) H. Dunham received the B.A. and M.S. degrees in mathematics from Miami University, Oxford, Ohio, and the Ph.D. degree in computer science from Southern Methodist University in 1970, 1972, and 1984, respectively. From August 1984 to the present, she has been first an assistant professor, an associate professor, and now a Full Professor in the department of Computer Science and Engineering at Southern Methodist University in Dallas. Professor Dunham's research interests encompass Main Memory Databases, Data Mining, Temporal Databases, and Mobile Computing. Dr. Dunham served as editor of the ACM SIGMOD Record from 1986 to 1988. She has served on the program and organizing committees for many ACM and IEEE conferences. She served as guest editor for a special section of IEEE Transactions on Knowledge and Data Engineering devoted to Main Memory Databases as well as a special issue of the ACM SIGMOD Record devoted to Mobile Computing in databases. She served as the general conference chair for the ACM SIGMOD/PODS held in Dallas in May 2000. She is currently an associate editor for IEEE Transactions on Knowledge and Data Engineering. She has published over seventy technical papers in such research areas as database concurrency control and recovery, database machines, main memory databases, and mobile computing. Professor Dunham lives in Dallas with husband Jim, daughters Stephanie and Kristina, and cat Missy.

This material is based upon work supported by the National Science Foundation under Grant No. IIS-9820841.
