Abstract
In this study, we experiment with several multiobjective evolutionary algorithms to determine a suitable approach for clustering Web user sessions, which consist of sequences of Web pages visited by users. Our experimental results show that approaches based on multiobjective evolutionary algorithms are successful for sequence clustering. We verify our findings with a commonly used cluster validity index, and the results for this index indicate that the clustering solutions are of high quality. As a case study, the obtained clusters are then used to represent usage patterns in a Web recommender system. The experiments show that these approaches can be applied successfully to generate clustering solutions that lead to high recommendation accuracy in the recommender model used in this paper.
Notes
This term will be used interchangeably with “user session”.
The path that has the highest similarity to the active user session is defined as the best matching path.
References
Bleuler S, Laumanns M, Thiele L, Zitzler E (2003) PISA—a platform and programming language independent interface for search algorithms. In: Proceeding of evolutionary multi-criterion optimization (EMO 2003). Lecture notes in computer science, vol 2632. Springer, Berlin, pp 494–508
Branke J, Deb K, Dierolf H, Osswald M (2004) Finding knees in multi-objective optimization. In: Proceedings of the parallel problem solving from nature (PPSN 2004), pp 722–731
Charter K, Schaeffer J, Szafron D (2000) Sequence alignment using FastLSA. In: Proceedings of international conference on mathematics and engineering techniques in medicine and biological sciences, pp 48–57
Cheng C-K, Wei YA (1991) An improved two-way partitioning algorithm with stable performance. IEEE Trans Comput-Aided Design Integr Circuits Syst 10:1502–1511
Coello Coello CA, Lamont GB, Van Veldhuizen DA (2007) Evolutionary algorithms for solving multi-objective problems, 2nd edn. Springer, Berlin
Cole RM (1998) Clustering with genetic algorithms. Master’s thesis, University of Western Australia, Nedlands 6907, Australia
Conover W (1999) Practical nonparametric statistics, 3rd edn. Wiley, New York
Cooley R, Mobasher B, Srivastava J (1999) Data preparation for mining world wide web browsing patterns. J Knowl Inform Syst 1(1):5–32
Corne DW, Jerram NR, Knowles JD, Oates MJ (2001) PESA-II: region based selection in evolutionary multiobjective optimization. In: Proceedings of the genetic and evolutionary computation conference (GECCO-2001). Morgan Kaufmann, Menlo Park, pp 283–290
Davies D, Bouldin D (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227
Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
Demir GN, Uyar AŞ, Gündüz Öğüdücü Ş (2007) Graph-based sequence clustering through multiobjective evolutionary algorithms for web recommender systems. In: Proceedings of the genetic and evolutionary computation conference (GECCO-2007). ACM, New York, pp 1943–1950
Demir GN, Göksedef M, Uyar AŞ (2007) Effects of session representation models on the performance of web recommender systems. In: Proceedings of the workshop on data mining and business intelligence, pp 931–936
Ding C, He X, Zha H, Gu M, Simon H (2001) A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings of the IEEE international conference on data mining, pp 107–114
Du J, Korkmaz E, Alhajj R, Barker K (2004) Novel clustering approach that employs genetic algorithm with new representation scheme and multiple objectives. In: Proceedings of the 6th international conference on data warehousing and knowledge discovery (DAWAK 2004). Lecture notes in computer science, vol 3181. Springer, Berlin, pp 219–233
Eiben AE, Smith JE (2003) Introduction to evolutionary computing. Springer, Berlin
Faceli K, de Carvalho ACPLF, de Souto MCP (2007) Multi-objective clustering ensemble. Int J Hybrid Intell Syst 4(3):145–156
Garcia S, Molina D, Lozano M, Herrera F (2008) A study on the use of nonparametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC 2005 special session on real parameter optimization. J Heuristics. doi:10.1007/s10732-008-9080-4
Göksedef M, Gündüz Öğüdücü Ş (2007) A consensus recommender for web users. In: Proceedings of the 3rd international conference on advanced data mining and applications. Lecture notes in artificial intelligence, vol 4632. Springer, Berlin, pp 287–299
Gündüz Ş, Özsu MT (2003) A web page prediction model based on click-stream tree representation of user behavior. In: Proceedings of ninth ACM international conference on knowledge discovery and data mining (KDD), pp 535–540
Gündüz Ş, Özsu MT (2006) Incremental click-stream tree model: learning from new users for web page prediction. Distributed Parallel Databases 19(1):5–27
Gündüz Öğüdücü Ş, Uyar AŞ (2004) A graph based clustering method using a hybrid evolutionary algorithm. WSEAS Trans Math 3(3):731–736
Günter S, Bunke H (2003) Validation indices for graph clustering. Pattern Recognit Lett 24(8):1107–1113
Handl J, Knowles J (2005) Multiobjective clustering around medoids. In: Proceedings of the congress on evolutionary computation (CEC-2005). IEEE, New York, pp 632–639
Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11(1):56–76
Hartuv E, Shamir R (2000) A clustering algorithm based on graph connectivity. Inform Process Lett 76:175–181
Horn J, Nafpliotis N, Goldberg DE (1994) A niched Pareto genetic algorithm for multiobjective optimization. In: Proceedings of the congress on evolutionary computation (CEC-1994). IEEE, New York, pp 82–87
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Kaufman L, Rousseeuw PJ (1987) Clustering by means of medoids. In: Statistical data analysis based on the L1 norm and related methods, pp 405–416
Kim S (2003) Computational biology and genome informatics. World Scientific, Singapore
Kim S, Lee J (2006) BAG: a graph theoretic sequence clustering algorithm. Int J Data Min Bioinform 1(2):178–200
Knowles J, Thiele L, Zitzler E (2006) A tutorial on the performance assessment of stochastic multiobjective optimizers. TIK Report 214, Computer Engineering and Networks Laboratory (TIK). ETH Zurich
Korkmaz E (2006) A two-level clustering method using linear linkage encoding. In: Proceedings of the parallel problem solving from nature (PPSN 2006). Lecture notes in computer science, vol 4193. Springer, Berlin, pp 681–690
Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47(260):583–621
Law MHC, Topchy AP, Jain AK (2004) Multiobjective data clustering. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 424–430
Manouselis N, Costopoulou C (2007) Analysis and classification of multi-criteria recommender systems. World Wide Web 10(4):415–441
Mobasher B, Dai H, Luo T, Nakagawa M (2002) Discovery of aggregate usage profiles for web personalization. Data Min Knowl Discov 6(1):61–82
Mohr G, Kimpton M, Stack M, Ranitovic I (2004) Introduction to Heritrix: an open source archival quality web crawler. In: Proceedings of the 4th international web archiving workshop
Ozyer T, Liu Y, Alhajj R, Barker K (2004) Multi-objective genetic algorithm based clustering approach and its application to gene expression data. In: Proceedings of the advances in information systems (ADVIS 2004). Lecture notes in computer science, vol 3261. Springer, Berlin, pp 451–461
Park YJ, Song MS (1998) A genetic algorithm for clustering problems. In: Proceedings of the 3rd annual conference on genetic programming, pp 568–575
Perugini S, Gonçalves MA, Fox EA (2004) Recommender systems research: a connection-centric survey. J Intell Inform Syst 23(2):107–143
Rosen KH (1991) Discrete mathematics and its applications, 2nd edn. McGraw-Hill, New York
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905
Speer N, Spieth C, Zell A (2005) Biological cluster validity indices based on the gene ontology. In: Proceedings of advances in intelligent data analysis VI: 6th international symposium on intelligent data analysis (IDA 2005). Lecture notes in computer science, vol 3646. Springer, Berlin, pp 429–439
Srivastava J, Cooley R, Deshpande M, Tan P-N (2000) Web usage mining: Discovery and applications of usage patterns from web data. ACM SIGKDD Explor Newsl 1(2):12–23
Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
Uyar AŞ, Gündüz Öğüdücü Ş (2005) A new graph-based evolutionary approach to sequence clustering. In: Proceedings of fourth international conference of machine learning and applications, pp 273–278
Yan TW, Jacobsen M, Garcia-Molina H, Dayal U (1996) From user access patterns to dynamic hypertext linking. In: Proceedings of the fifth world wide web conference (WWW5), pp 1007–1014
Zitzler E, Laumanns M, Thiele L (2001) SPEA2: improving the strength Pareto evolutionary algorithm for multiobjective optimization. In: Proceedings of the EUROGEN 2001—evolutionary methods for design, optimisation and control with applications to industrial problems, pp 95–100
Zitzler E, Thiele L (1999) Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto evolutionary algorithm. IEEE Trans Evol Comput 3(4):257–271
Acknowledgments
This work was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) EEEAG project 105E162. The authors would like to thank Murat Göksedef for his help in the recommendation engine and H. Turgut Uyar for his useful suggestions and careful reading of the manuscript.
Appendix: Data preparation and cleaning
A Web server log is an important source for performing Web usage mining because it explicitly records the browsing behavior of site visitors. The data recorded in the server logs reflects the (possibly concurrent) access of a Web site by multiple users. These log files can be stored in various formats such as common log or extended log formats.
An entry in common log format consists of (1) the user’s IP address; (2) the access date and time; (3) the request method (GET, POST, ...); (4) the URL of the page accessed; (5) the protocol (HTTP 1.0, HTTP 1.1, ...); (6) the return code; and (7) the number of bytes transmitted. A few lines of a typical access log in the common log file format for a sample Web site are presented in Table 11.
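The seven fields listed above can be extracted with a single regular expression. The following is a minimal sketch, not the parser used in this study: the names `CLF_PATTERN` and `parse_clf_line` and the sample log line are hypothetical, and the pattern assumes the usual layout in which two placeholder fields (often "-") sit between the IP address and the timestamp.

```python
import re
from datetime import datetime

# One common-log-format line. The two \S+ fields after the IP address are
# the identity and user-id placeholders, which are usually "-".
CLF_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf_line(line):
    """Return the fields of one log entry as a dict, or None if malformed."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Example timestamp: 10/Oct/2005:13:55:36 +0300
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Servers write "-" when no bytes were transmitted.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical log line, invented for illustration (not taken from Table 11).
sample = ('192.0.2.10 - - [10/Oct/2005:13:55:36 +0300] '
          '"GET /index.html HTTP/1.0" 200 2326')
entry = parse_clf_line(sample)
print(entry["ip"], entry["method"], entry["url"], entry["status"])
```

Returning `None` for malformed lines lets a caller skip them instead of aborting on a single damaged entry.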
An extended log format file is a variant of the common log format file that simply adds two extra fields to the end of the line, the referrer and the user agent fields.
The information provided by the Web server can be used to construct a data model consisting of several abstractions, such as users, pages, click-streams and server sessions. In order to provide some consistency in the definition of these terms, the World Wide Web Consortium (W3C) has published a draft of Web term definitions relevant to analyzing Web usage. A Web user is a single individual who accesses files from one or more Web servers through a browser. A page file is a file that is served to a user through the Hypertext Transfer Protocol (HTTP). The set of page files that contribute to a single display in a Web browser constitutes a Web page. A browser is a client-side software application that interprets Hypertext Markup Language (HTML), the markup language of the Web, into the words and graphics that the user sees when viewing a Web page. The click-stream is the sequence of pages followed by a user. A server session consists of the set of pages that a user requests from a single Web server during a single visit to that Web site.
However, a log file does not contain all of the information required for Web usage mining. Regardless of the application, data preparation and cleaning steps should be completed in order to create server sessions. Data preparation and cleaning tasks performed in this study consist of the following steps: (1) data cleaning; and (2) user and session identification. These preprocessing steps are the same for any Web usage mining problem and fundamental methods have been well studied in Cooley et al. (1999) and Srivastava et al. (2000).
In the data cleaning step, irrelevant log entries are eliminated first; these correspond to URLs of embedded objects with filename suffixes such as gif, jpeg, jpg, GIF, JPEG and JPG. Next, the URLs in the log file are normalized so that syntactically different URLs that represent the same Web page can be identified. Most Web servers treat a request for a directory as a request for a default file, e.g., “index.html” or “home.html”. In this case, a common form for each Web page should be chosen. This can be done using a Web crawler. Web crawlers start by parsing a specified Web page, noting any hypertext links on that page that point to other Web pages. They then parse those pages for new links, and so on, recursively. Only links that point to Web pages within the site are added to the list of pages to explore (Mohr et al. 2004). Comparing the content of pages provides a way to determine which different URLs belong to the same Web page. The visiting page time, which is defined as the time difference between consecutive page requests, is calculated for each page.
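The two cleaning operations, dropping embedded objects and mapping default files to a common form, can be sketched as follows. This is an illustrative simplification rather than the procedure used in this study: it replaces the crawler-based content comparison with a fixed list of default file names, it discards query strings, and the names `EMBEDDED_SUFFIXES`, `DEFAULT_FILES`, `is_embedded_object`, `normalize_url` and `clean` are hypothetical.

```python
from urllib.parse import urlsplit

# Suffixes of embedded objects named in the text, compared case-insensitively.
EMBEDDED_SUFFIXES = (".gif", ".jpeg", ".jpg")
# Assumed default file names; a real site may use others.
DEFAULT_FILES = ("index.html", "home.html")


def is_embedded_object(url):
    """True if the URL path ends with one of the embedded-object suffixes."""
    return urlsplit(url).path.lower().endswith(EMBEDDED_SUFFIXES)


def normalize_url(url):
    """Map a request for a default file to its directory form.

    Assumption: the query string and fragment are dropped, so only the
    path identifies a page.
    """
    path = urlsplit(url).path
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_FILES:
        path = head + "/"
    return path or "/"


def clean(urls):
    """Drop embedded objects, then normalize the remaining URLs."""
    return [normalize_url(u) for u in urls if not is_embedded_object(u)]


requested = ["/index.html", "/logo.GIF", "/docs/", "/docs/home.html"]
print(clean(requested))
```

With this rule, "/docs/" and "/docs/home.html" collapse to the same page identifier, which is the effect the normalization step aims for.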
In cases where cookies are not available in the Web logs, a heuristic method can be used that treats each unique IP address as a user. A new session is created when a new IP address is encountered or when the visiting page time exceeds 30 min for the same IP address. Thus, a session consists of an ordered sequence of page visits. It is important to note that these are only heuristics to identify users and user sessions.
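The session heuristic can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation used in this study: requests are given as hypothetical `(ip, timestamp, url)` tuples, and the names `SESSION_TIMEOUT` and `build_sessions` are invented for the example.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # threshold stated in the text


def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (ip, timestamp, url) requests into ordered page sequences.

    A new session starts when an IP address is seen for the first time or
    when the gap since its previous request exceeds the timeout.
    """
    sessions = []      # list of (ip, [urls]) in order of session start
    last_seen = {}     # ip -> (time of last request, pages of open session)
    for ip, when, url in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous[0] > timeout:
            pages = []
            sessions.append((ip, pages))
        else:
            pages = previous[1]
        pages.append(url)
        last_seen[ip] = (when, pages)
    return sessions


# Invented sample requests for illustration.
t0 = datetime(2005, 10, 10, 13, 0)
requests = [
    ("192.0.2.10", t0, "/"),
    ("192.0.2.20", t0 + timedelta(minutes=2), "/x"),
    ("192.0.2.10", t0 + timedelta(minutes=5), "/p"),
    ("192.0.2.10", t0 + timedelta(minutes=50), "/q"),
]
sessions = build_sessions(requests)
for ip, pages in sessions:
    print(ip, pages)
```

In the sample, the last request from 192.0.2.10 arrives 45 min after the previous one, so it opens a second session for that address.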
Cite this article
Demir, G.N., Uyar, A.Ş. & Gündüz-Öğüdücü, Ş. Multiobjective evolutionary clustering of Web user sessions: a case study in Web page recommendation. Soft Comput 14, 579–597 (2010). https://doi.org/10.1007/s00500-009-0428-y
DOI: https://doi.org/10.1007/s00500-009-0428-y