Semantic Based Weighted Web Session Clustering Using Adapted K-Means and Hierarchical Agglomerative Algorithms

Authors

  • Sowmya HK, Department of Information Science and Engineering, New Horizon College of Engineering, Affiliated to Visvesvaraya Technological University, Bengaluru, India https://orcid.org/0000-0002-1082-659X
  • R. J. Anandhi, Department of Information Science and Engineering, New Horizon College of Engineering, Affiliated to Visvesvaraya Technological University, Bengaluru, India

DOI:

https://doi.org/10.13052/jwe1540-9589.2125

Keywords:

sessionization, dissimilarity matrix, session weight, session cluster, cluster evaluation.

Abstract

The World Wide Web contains a large number of pages and URLs that offer users a vast amount of content. As the volume of information grows, analysing users' browsing behaviour becomes increasingly important. Web usage mining techniques are applied to web server logs to analyse this behaviour. Identifying user sessions is one of the key and most demanding tasks in the pre-processing stage of web usage mining. This paper focuses on two important shortcomings of existing session identification methods such as Time based and Referrer based sessionization. The first concerns the comparison of the current request's referrer field with the URL of the previous request. The second concerns session creation: depending on the threshold values for page stay time and session time, requests are either split into new sessions or merged into a single session. The authors therefore developed an enhanced semantic distance based session identification algorithm that addresses these issues of traditional session identification methods. The enhanced semantic based method has an accuracy of 84 percent, which is higher than that of the Time based and Time-Referrer based session identification approaches. The authors also used adapted K-Means and Hierarchical Agglomerative clustering algorithms to improve the prediction of user browsing patterns. Clusters were found using a weighted dissimilarity matrix, which is calculated from two key parameters: page weight and session weight. The Dunn Index and Davies-Bouldin Index are then used to evaluate the clusters. Experimental results show that purer and more accurate session clusters are formed when the adapted clustering algorithms are applied to the weighted sessions rather than to the sessions obtained from traditional sessionization algorithms. The accuracy of the semantic session clusters is higher than that of the clusters of sessions obtained using traditional sessionization.
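
To make the threshold issue concrete, the following is a minimal sketch of the traditional Time based sessionization that the abstract criticises, not the authors' semantic method. The function name `time_based_sessions`, the 10-minute page stay limit, and the 30-minute session limit are assumptions chosen for illustration; the abstract does not state the threshold values used in the paper.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds for illustration; the paper's values are not given here.
PAGE_STAY_LIMIT = timedelta(minutes=10)
SESSION_LIMIT = timedelta(minutes=30)


def time_based_sessions(requests,
                        page_stay_limit=PAGE_STAY_LIMIT,
                        session_limit=SESSION_LIMIT):
    """Split one user's time-ordered (timestamp, url) requests into sessions.

    A new session starts when the gap since the previous request exceeds
    page_stay_limit, or when the time since the first request of the
    current session exceeds session_limit.
    """
    sessions = []
    current = []
    for ts, url in requests:
        if current:
            gap = ts - current[-1][0]        # time spent on the previous page
            duration = ts - current[0][0]    # age of the current session
            if gap > page_stay_limit or duration > session_limit:
                sessions.append(current)
                current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions


# Usage: a 17-minute pause between "/b" and "/c" forces a session break,
# even if the user was still reading the same content.
t0 = datetime(2022, 1, 4, 9, 0)
log = [
    (t0, "/"),
    (t0 + timedelta(minutes=5), "/a"),
    (t0 + timedelta(minutes=8), "/b"),
    (t0 + timedelta(minutes=25), "/c"),
    (t0 + timedelta(minutes=27), "/d"),
]
print([[url for _, url in s] for s in time_based_sessions(log)])
# → [['/', '/a', '/b'], ['/c', '/d']]
```

Changing either threshold changes where the break falls, which illustrates why purely time based splitting can separate or merge requests regardless of what the pages are about.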

Author Biographies

Sowmya HK, Department of Information Science and Engineering, New Horizon College of Engineering, Affiliated to Visvesvaraya Technological University, Bengaluru, India

Sowmya HK received the Bachelor of Engineering degree in Computer Science and Engineering from Kurunji Venkataramana Gowda College of Engineering in 2004 and the Master of Engineering degree in Computer Science and Engineering from University Visvesvaraya College of Engineering, Bangalore University, in 2010. She is currently working as Senior Assistant Professor at the Department of Artificial Intelligence and Machine Learning, New Horizon College of Engineering, Visvesvaraya Technological University. She is currently pursuing a Ph.D. at the Department of Information Science and Engineering, New Horizon College of Engineering, Visvesvaraya Technological University. Her research areas include Web Usage Mining, Data Mining, and Deep Learning.

R. J. Anandhi, Department of Information Science and Engineering, New Horizon College of Engineering, Affiliated to Visvesvaraya Technological University, Bengaluru, India

R. J. Anandhi received the Bachelor of Engineering degree in Computer Science and Engineering from Government College of Technology, Coimbatore, in 1991. She received the Master of Technology degree in Computer Science and Engineering from Pondicherry Central University in 1995 and completed her Ph.D. at Dr. MGR University, Chennai, in 2011. She is currently working as Professor and Head of the Department of Information Science & Engineering at New Horizon College of Engineering, Bengaluru. Her research areas include Data Mining, NLP, and Cloud Computing.

Published

2022-01-04

How to Cite

HK, S., & Anandhi, R. J. (2022). Semantic Based Weighted Web Session Clustering Using Adapted K-Means and Hierarchical Agglomerative Algorithms. Journal of Web Engineering, 21(02), 239–264. https://doi.org/10.13052/jwe1540-9589.2125

Section

Articles