ABSTRACT
While scalable data mining methods are expected to cope with massive Web data, coping with evolving trends in noisy data continuously, without unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic, single-pass setting can be cast within the framework of mining evolving data streams. In this paper, we explore the task of mining mass user profiles by discovering evolving Web session clusters in a single pass with a recently proposed scalable immune-based clustering approach (TECNO-STREAMS), and study the effect of the choice of similarity measure on the mining process and on the interpretation of the mined patterns. We propose a simple similarity measure that explicitly couples the precision and coverage criteria to the early learning stages. It further defines the affinity of the data to the learned profiles or summaries as the minimum of their coverage and precision, thereby requiring that the learned profiles be simultaneously precise and complete, with no compromises. In our experiments, we study the task of mining evolving user profiles from Web clickstream data (web usage mining) in a single pass and under different trend sequencing scenarios. The results show that, compared to the cosine similarity measure, the proposed similarity measure, explicitly based on precision and coverage, allows the discovery of more correct profiles at the same precision or recall quality levels.
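To make the contrast between the two measures concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a session and a profile are each represented as a plain set of URLs, that precision is the fraction of the profile's URLs found in the session, and that coverage is the fraction of the session's URLs found in the profile. These definitions and all function names are illustrative assumptions; the abstract only states that affinity is the minimum of coverage and precision.

```python
from math import sqrt

def precision(session: set, profile: set) -> float:
    # Assumed definition: share of the profile's URLs that occur in the session.
    return len(session & profile) / len(profile) if profile else 0.0

def coverage(session: set, profile: set) -> float:
    # Assumed definition: share of the session's URLs that the profile accounts for.
    return len(session & profile) / len(session) if session else 0.0

def min_precision_coverage(session: set, profile: set) -> float:
    # Affinity as the minimum of the two criteria: high only if both are high.
    return min(precision(session, profile), coverage(session, profile))

def cosine(session: set, profile: set) -> float:
    # Cosine similarity for binary (set-valued) vectors.
    if not session or not profile:
        return 0.0
    return len(session & profile) / sqrt(len(session) * len(profile))

# Hypothetical data: a four-page session and a one-page profile.
session = {"/a", "/b", "/c", "/d"}
profile = {"/a"}

print(min_precision_coverage(session, profile))  # 0.25
print(cosine(session, profile))                  # 0.5
```

Under these assumed definitions, the cosine of two binary vectors is the geometric mean of precision and coverage, so it is never smaller than their minimum. In the example the profile is perfectly precise but covers only a quarter of the session: the cosine still reports 0.5, while the minimum-based affinity reports 0.25.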
Index Terms
- Using retrieval measures to assess similarity in mining dynamic web clickstreams