
Multiobjective evolutionary clustering of Web user sessions: a case study in Web page recommendation

  • Original Paper
Soft Computing

Abstract

In this study, we experiment with several multiobjective evolutionary algorithms to determine a suitable approach for clustering Web user sessions, which consist of sequences of Web pages visited by the users. Our experimental results show that the multiobjective evolutionary algorithm-based approaches are successful for sequence clustering. We look at a commonly used cluster validity index to verify our findings. The results for this index indicate that the clustering solutions are of high quality. As a case study, the obtained clusters are then used in a Web recommender system for representing usage patterns. As a result of the experiments, we see that these approaches can successfully be applied for generating clustering solutions that lead to a high recommendation accuracy in the recommender model we used in this paper.


Notes

  1. This term will be used interchangeably with “user session”.

  2. http://www.tik.ee.ethz.ch/sop/pisa/

  3. The path that has the highest similarity to the active user session is defined as the best matching path.

  4. http://www.bidb.itu.edu.tr/

  5. http://www.ce.itu.edu.tr/

  6. http://ita.ee.lbl.gov/html/contrib/Sask-HTTP.html

  7. http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

  8. http://www.w3.org/

References

  • Bleuler S, Laumanns M, Thiele L, Zitzler E (2003) PISA—a platform and programming language independent interface for search algorithms. In: Proceeding of evolutionary multi-criterion optimization (EMO 2003). Lecture notes in computer science, vol 2632. Springer, Berlin, pp 494–508

  • Branke J, Deb K, Dierolf H, Osswald M (2004) Finding knees in multi-objective optimization. In: Proceedings of the parallel problem solving from nature (PPSN 2004), pp 722–731

  • Charter K, Schaeffer J, Szafron D (2000) Sequence alignment using FastLSA. In: Proceedings of international conference on mathematics and engineering techniques in medicine and biological sciences, pp 48–57

  • Cheng C-K, Wei YA (1991) An improved two-way partitioning algorithm with stable performance. IEEE Trans Comput-Aided Design Integr Circuits Syst 10:1502–1511


  • Coello Coello CA, Lamont GB, Van Veldhuizen DA (2007) Evolutionary algorithms for solving multi-objective problems, 2nd edn. Springer, Berlin

  • Cole RM (1998) Clustering with genetic algorithms. Master’s thesis, University of Western Australia, Nedlands 6907, Australia

  • Conover W (1999) Practical nonparametric statistics, 3rd edn. Wiley, New York

  • Cooley R, Mobasher B, Srivastava J (1999) Data preparation for mining world wide web browsing patterns. J Knowl Inform Syst 1(1):5–32


  • Corne DW, Jerram NR, Knowles JD, Oates MJ (2001) PESA-II: region based selection in evolutionary multiobjective optimization. In: Proceedings of the genetic and evolutionary computation conference (GECCO-2001). Morgan Kaufmann, Menlo Park, pp 283–290

  • Davies D, Bouldin D (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227


  • Deb K, Pratap A, Agrawal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197


  • Demir GN, Uyar AŞ, Gündüz Öğüdücü Ş (2007) Graph-based sequence clustering through multiobjective evolutionary algorithms for web recommender systems. In: Proceedings of the genetic and evolutionary computation conference (GECCO-2007). ACM, New York, pp 1943–1950

  • Demir GN, Göksedef M, Uyar AŞ (2007) Effects of session representation models on the performance of web recommender systems. In: Proceedings of the workshop on data mining and business intelligence, pp 931–936

  • Ding C, He X, Zha H, Gu M, Simon H (2001) A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings of the IEEE international conference on data mining, pp 107–114

  • Du J, Korkmaz E, Alhajj R, Barker K (2004) Novel clustering approach that employs genetic algorithm with new representation scheme and multiple objectives. In: Proceedings of the 6th international conference on data warehousing and knowledge discovery (DAWAK 2004). Lecture notes in computer science, vol 3181. Springer, Berlin, pp 219–233

  • Eiben AE, Smith JE (2003) Introduction to evolutionary computing. Springer, Berlin

  • Faceli K, de Carvalho ACPLF, de Souto MCP (2007) Multi-objective clustering ensemble. Int J Hybrid Intell Syst 4(3):145–156


  • Garcia S, Molina D, Lozano M, Herrera F (2008) A study on the use of nonparametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC 2005 special session on real parameter optimization. J Heuristics. doi:10.1007/s10732-008-9080-4

  • Göksedef M, Gündüz Öğüdücü Ş (2007) A consensus recommender for web users. In: Proceedings of the 3rd international conference on advanced data mining and applications. Lecture notes in artificial intelligence, vol 4632. Springer, Berlin, pp 287–299

  • Gündüz Ş, Özsu MT (2003) A web page prediction model based on click-stream tree representation of user behavior. In: Proceedings of ninth ACM international conference on knowledge discovery and data mining (KDD), pp 535–540

  • Gündüz Ş, Özsu MT (2006) Incremental click-stream tree model: learning from new users for web page prediction. Distributed Parallel Databases 19(1):5–27


  • Gündüz Öğüdücü Ş, Uyar AŞ (2004) A graph based clustering method using a hybrid evolutionary algorithm. WSEAS Trans Math 3(3):731–736


  • Günter S, Bunke H (2003) Validation indices for graph clustering. Pattern Recognit Lett 24(8):1107–1113


  • Handl J, Knowles J (2005) Multiobjective clustering around medoids. In: Proceedings of the congress on evolutionary computation (CEC-2005). IEEE, New York, pp 632–639

  • Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11(1):56–76


  • Hartuv E, Shamir R (2000) A clustering algorithm based on graph connectivity. Inform Process Lett 76:175–181


  • Horn J, Nafpliotis N, Goldberg DE (1994) A niched pareto genetic algorithm for multiobjective optimization. In: Proceedings of the congress on evolutionary computation (CEC-1994). IEEE, New York, pp 82–87

  • Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323


  • Kaufman L, Rousseeuw PJ (1987) Clustering by means of medoids. In: Statistical data analysis based on the L1 norm and related methods, pp 405–416

  • Kim S (2003) Computational biology and genome informatics. World Scientific, Singapore

  • Kim S, Lee J (2006) BAG: a graph theoretic sequence clustering algorithm. Int J Data Min Bioinform 1(2):178–200


  • Knowles J, Thiele L, Zitzler E (2006) A tutorial on the performance assessment of stochastic multiobjective optimizers. TIK Report 214, Computer Engineering and Networks Laboratory (TIK). ETH Zurich

  • Korkmaz E (2006) A two-level clustering method using linear linkage encoding. In: Proceedings of the parallel problem solving from nature (PPSN 2006). Lecture notes in computer science, vol 4193. Springer, Berlin, pp 681–690

  • Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47(260):583–621


  • Law MHC, Topchy AP, Jain AK (2004) Multiobjective data clustering. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 424–430

  • Manouselis N, Costopoulou C (2007) Analysis and classification of multi-criteria recommender systems. World Wide Web 10(4):415–441


  • Mobasher B, Dai H, Luo T, Nakagawa M (2002) Discovery of aggregate usage profiles for web personalization. Data Min Knowl Discov 6(1):61–82


  • Mohr G, Kimpton M, Stack M, Ranitovic I (2004) Introduction to Heritrix: an open source archival quality web crawler. In: Proceedings of the 4th international web archiving workshop

  • Ozyer T, Liu Y, Alhajj R, Barker K (2004) Multi-objective genetic algorithm based clustering approach and its application to gene expression data. In: Proceedings of the advances in information systems (ADVIS 2004). Lecture notes in computer science, vol 3261. Springer, Berlin, pp 451–461

  • Park YJ, Song MS (1998) A genetic algorithm for clustering problems. In: Proceedings of the 3rd annual conference on genetic programming, pp 568–575

  • Perugini S, Gonçalves MA, Fox EA (2004) Recommender systems research: a connection-centric survey. J Intell Inform Syst 23(2):107–143


  • Rosen KH (1991) Discrete mathematics and its applications, 2nd edn. McGraw-Hill, New York


  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65


  • Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell (PAMI) 22(8):888–905


  • Speer N, Spieth C, Zell A (2005) Biological cluster validity indices based on the gene ontology. In: Proceedings of advances in intelligent data analysis VI: 6th international symposium on intelligent data analysis (IDA 2005). Lecture notes in computer science, vol 3646. Springer, Berlin, pp 429–439

  • Srivastava J, Cooley R, Deshpande M, Tan P-N (2000) Web usage mining: Discovery and applications of usage patterns from web data. ACM SIGKDD Explor Newsl 1(2):12–23


  • Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617


  • Uyar AŞ, Gündüz Öğüdücü Ş (2005) A new graph-based evolutionary approach to sequence clustering. In: Proceedings of fourth international conference of machine learning and applications, pp 273–278

  • Yan TW, Jacobsen M, Garcia-Molina H, Dayal U (1996) From user access patterns to dynamic hypertext linking. In: Proceedings of the fifth world wide web conference (WWW5), pp 1007–1014

  • Zitzler E, Laumanns M, Thiele L (2001) SPEA2: improving the strength pareto evolutionary algorithm for multiobjective optimization. In: Proceedings of the EUROGEN 2001—evolutionary methods for design, optimisation and control with applications to industrial problems, pp 95–100

  • Zitzler E, Thiele L (1999) Multiobjective evolutionary algorithms: a comparative case study and the strength pareto evolutionary algorithm. IEEE Trans Evol Comput 3(4):257–271



Acknowledgments

This work was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) EEEAG project 105E162. The authors would like to thank Murat Göksedef for his help in the recommendation engine and H. Turgut Uyar for his useful suggestions and careful reading of the manuscript.

Author information


Corresponding author

Correspondence to A. Şima Uyar.

Appendix: Data preparation and cleaning


A Web server log is an important source for performing Web usage mining because it explicitly records the browsing behavior of site visitors. The data recorded in the server logs reflects the (possibly concurrent) access of a Web site by multiple users. These log files can be stored in various formats such as common log or extended log formats.

An entry in common log format consists of (1) the user’s IP address; (2) the access date and time; (3) the request method (GET, POST, ...); (4) the URL of the page accessed; (5) the protocol (HTTP 1.0, HTTP 1.1, ...); (6) the return code; and (7) the number of bytes transmitted. A few lines of a typical access log in the common log format for a sample Web site are presented in Table 11.

Table 11 Sample server logs in common log format

An extended log format file is a variant of the common log format file that adds two extra fields to the end of each line: the referrer and the user agent.
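The field layout described above can be illustrated with a small parser. This is a minimal sketch and not the preprocessing code used in the study: the regular expression, the name `parse_log_line` and the sample line in the usage note are assumptions made for illustration. The pattern also skips the two identity fields (ident and authenticated user) that conventionally sit between the IP address and the timestamp in common log format files, which the field list above does not enumerate.

```python
import re
from datetime import datetime

# Hypothetical pattern: the seven fields listed above, plus the optional
# quoted referrer and user agent fields of the extended format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '                  # IP address; ident and user are skipped
    r'\[(?P<time>[^\]]+)\] '                 # access date and time
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'     # return code and bytes transmitted
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps are assumed to look like 10/Oct/2000:13:55:36 -0700.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means that no bytes were transmitted.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry
```

For a made-up line such as `127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326`, the function returns the IP address, method, URL, protocol, return code and size as separate fields; the referrer and agent entries are `None` unless the line is in the extended format.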

The information provided by the Web server can be used to construct a data model consisting of several abstractions, such as users, pages, click-streams and server sessions. To provide some consistency in the definition of these terms, the World Wide Web Consortium (W3C; see Note 8) has published a draft of Web term definitions relevant to analyzing Web usage. A Web user is a single individual who accesses files from one or more Web servers through a browser. A page file is a file that is served to a user through the Hypertext Transfer Protocol (HTTP). The set of page files that contribute to a single display in a Web browser constitutes a Web page. A browser is a client-side software application that interprets Hypertext Markup Language (HTML), the markup language of the Web, into the words and graphics that the user sees when viewing a Web page. The click-stream is the sequence of pages followed by a user. A server session consists of a set of pages that a user requests from a single Web server during a single visit to that Web site.

However, a log file does not contain all of the information required for Web usage mining. Regardless of the application, data preparation and cleaning steps should be completed in order to create server sessions. The data preparation and cleaning tasks performed in this study consist of the following steps: (1) data cleaning; and (2) user and session identification. These preprocessing steps are the same for any Web usage mining problem, and the fundamental methods have been well studied in Cooley et al. (1999) and Srivastava et al. (2000).

In the data cleaning step, irrelevant log entries are eliminated first; these correspond to URLs of embedded objects with filename suffixes such as gif, jpeg, GIF, JPEG, jpg and JPG. Next, the URLs in the log file are normalized in order to identify Web pages that are represented by syntactically different URLs. Most Web servers treat a request for a directory as a request for a default file, e.g., “index.html” or “home.html”. In this case, a common form for each Web page should be chosen. This can be done using a Web crawler. Web crawlers start by parsing a specified Web page, noting any hypertext links on that page that point to other Web pages. They then parse those pages for new links, and so on, recursively. Only links that point to Web pages within the site can be added to the list of pages to explore using a Web crawler (Mohr et al. 2004). Comparing the content of pages provides a way to determine which different URLs belong to the same Web page. Finally, the visiting page time, which is defined as the time difference between consecutive page requests, is calculated for each page.
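The filtering and normalization described above can be sketched as follows. This is a simplified illustration under stated assumptions and not the procedure used in the study: it replaces the crawler-based content comparison with a plain string rule that maps default files onto their directory, and it discards query strings, which may merge pages that a real site treats as distinct. The names `is_embedded_object`, `normalize_url` and `clean_urls` are hypothetical.

```python
from urllib.parse import urlsplit

# Suffixes of embedded objects; matching is case-insensitive, which covers
# both the lower-case and upper-case variants.
IMAGE_SUFFIXES = (".gif", ".jpeg", ".jpg")
# Default file names assumed to be equivalent to their directory.
DEFAULT_FILES = ("index.html", "home.html")

def is_embedded_object(url):
    """True if the URL path ends with one of the embedded-object suffixes."""
    return urlsplit(url).path.lower().endswith(IMAGE_SUFFIXES)

def normalize_url(url):
    """Map a URL to a common form: path only, default file names dropped."""
    path = urlsplit(url).path or "/"   # assumption: the query string is ignored
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_FILES:
        path = head + "/"
    return path

def clean_urls(urls):
    """Drop embedded objects and normalize the remaining URLs, keeping order."""
    return [normalize_url(u) for u in urls if not is_embedded_object(u)]
```

Under these rules, `/docs/`, `/docs/index.html` and `/docs/home.html` all map to `/docs/`, while a request for `/img/logo.GIF` is dropped. A string rule of this kind cannot detect two unrelated URLs that serve the same content, which is the case the crawler-based comparison addresses.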

In cases where cookies are not available in the Web logs, a heuristic can be used that treats each unique IP address as a user. A new session is created when a new IP address is encountered or when the visiting page time exceeds 30 min for the same IP address. Thus, a session consists of an ordered sequence of page visits. It is important to note that these are only heuristics for identifying users and user sessions.
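The session identification heuristic above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study; the function name `build_sessions` and the input shape, a list of (IP address, timestamp, URL) tuples taken from the cleaned log, are assumptions.

```python
from collections import defaultdict
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Split (ip, timestamp, url) requests into per-IP sessions.

    A new session starts for every new IP address, and whenever the time
    between two consecutive requests of the same IP exceeds the timeout.
    Returns a list of (ip, [url, ...]) pairs.
    """
    by_ip = defaultdict(list)
    for ip, when, url in requests:
        by_ip[ip].append((when, url))

    sessions = []
    for ip, visits in by_ip.items():
        visits.sort(key=lambda visit: visit[0])   # order page visits by time
        current = [visits[0][1]]
        for (prev_time, _), (when, url) in zip(visits, visits[1:]):
            if when - prev_time > timeout:        # gap exceeds 30 min
                sessions.append((ip, current))
                current = []
            current.append(url)
        sessions.append((ip, current))
    return sessions
```

For example, three requests from one IP address at 09:00, 09:10 and 09:50 yield two sessions, because the 40 min gap before the third request exceeds the timeout. A gap of exactly 30 min does not start a new session, following the wording "exceeds 30 min". Users who share an IP address, for instance behind a proxy, are merged by this heuristic, which is why it is only an approximation.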


Demir, G.N., Uyar, A.Ş. & Gündüz-Öğüdücü, Ş. Multiobjective evolutionary clustering of Web user sessions: a case study in Web page recommendation. Soft Comput 14, 579–597 (2010). https://doi.org/10.1007/s00500-009-0428-y


  • DOI: https://doi.org/10.1007/s00500-009-0428-y
