DOI: 10.1145/1526709.1526732
research-article

Smart Miner: a new framework for mining large scale web usage data

Published: 20 April 2009

Abstract

In this paper, we propose Smart-Miner, a novel framework for the web usage mining problem that uses link information to produce accurate user sessions and frequent navigation patterns. Unlike the simple session concepts of time-based and navigation-based approaches, where a session is a sequence of web pages requested from the server or viewed in the browser, a Smart-Miner session is a set of paths traversed in the web graph that correspond to a user's navigation among web pages. We model session construction as a new graph problem and use a new algorithm, Smart-SRA, to solve it efficiently. For the pattern discovery phase, we have developed an efficient version of the Apriori-All technique that uses the structure of the web graph to improve performance. In experiments on both real and simulated data, Smart-Miner produced web usage patterns at least 30% more accurate than those of other approaches, including previous session construction methods. We also studied the effect of having referrer information in the web server logs and show that different versions of Smart-SRA produce similar results. As a further contribution, we have implemented a distributed version of the Smart-Miner framework using the Map/Reduce paradigm. We conclude that our scalable framework can efficiently process terabytes of web server logs belonging to multiple web sites.
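
The difference between a time-based session and a link-based one can be illustrated with a small sketch. The abstract does not specify Smart-SRA, so the code below is not that algorithm; it is a simplified greedy heuristic that only shows the idea of using the web graph's links when grouping requests. The function name `link_based_sessions`, the `links` adjacency mapping, and the `max_gap` threshold are all hypothetical names chosen for this illustration.

```python
from typing import Dict, List, Set, Tuple

Request = Tuple[str, float]  # (page, timestamp in seconds)


def link_based_sessions(
    requests: List[Request],
    links: Dict[str, Set[str]],
    max_gap: float = 600.0,  # assumed threshold, not taken from the paper
) -> List[List[str]]:
    """Group one user's requests into paths that follow hyperlinks.

    Simplified illustration only (NOT Smart-SRA): a request extends an
    existing path when the path's last page links to the requested page
    and the time gap is at most max_gap; otherwise it starts a new path.
    """
    sessions: List[List[Request]] = []
    for page, ts in sorted(requests, key=lambda r: r[1]):
        best = None
        for session in sessions:
            last_page, last_ts = session[-1]
            linked = page in links.get(last_page, set())
            if linked and ts - last_ts <= max_gap:
                # Prefer the path whose last request is the most recent.
                if best is None or last_ts > best[-1][1]:
                    best = session
        if best is None:
            sessions.append([(page, ts)])
        else:
            best.append((page, ts))
    return [[p for p, _ in s] for s in sessions]


if __name__ == "__main__":
    # Hypothetical site: A links to B and C, B links to D.
    links = {"A": {"B", "C"}, "B": {"D"}, "C": set()}
    requests = [("A", 0.0), ("B", 10.0), ("C", 20.0), ("D", 30.0)]
    print(link_based_sessions(requests, links))
    # prints [['A', 'B', 'D'], ['C']]
```

With the same 600-second threshold, a purely time-based split would keep all four requests in one sequence, A, B, C, D, even though no link leads from B to C or from C to D. The sketch separates C because no existing path can reach it by a link; the approach described in the abstract goes further and models each session as a set of paths through the web graph.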



Published In

WWW '09: Proceedings of the 18th international conference on World wide web
April 2009
1280 pages
ISBN:9781605584874
DOI:10.1145/1526709

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. graph mining
  2. map/reduce
  3. parallel data mining
  4. web usage mining
  5. web user modeling

Qualifiers

  • Research-article

Conference

WWW '09
Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 14 Feb 2025

Cited By

  • (2023) IRPDP_HT2: a scalable data pre-processing method in web usage mining using Hadoop MapReduce. Soft Computing 27:12 (7907-7923). DOI: 10.1007/s00500-023-08019-w. Online publication date: 24-Mar-2023
  • (2018) Revealing connectivity structural patterns among web objects based on co-clustering of bipartite request dependency graph. Wireless Networks 24:2 (439-451). DOI: 10.1007/s11276-016-1345-5. Online publication date: 1-Feb-2018
  • (2017) How many ways to use CiteSpace? A study of user interactive events over 14 months. Journal of the Association for Information Science and Technology 68:5 (1234-1256). DOI: 10.1002/asi.23770. Online publication date: 1-May-2017
  • (2016) A Methodology for Weblog Data Analysis Using Hadoop Map Reduce and Pig. i-manager's Journal on Cloud Computing 3:1 (13). DOI: 10.26634/jcc.3.1.8074. Online publication date: 2016
  • (2016) Early phase software effort estimation model. 2016 Symposium on Colossal Data Analysis and Networking (CDAN) (1-8). DOI: 10.1109/CDAN.2016.7570914. Online publication date: Mar-2016
  • (2016) User-based approach for finding various results in web usage mining. 2016 Symposium on Colossal Data Analysis and Networking (CDAN) (1-6). DOI: 10.1109/CDAN.2016.7570867. Online publication date: Mar-2016
  • (2016) Improving the prediction of page access by using semantically enhanced clustering. Journal of Intelligent Information Systems 47:1 (165-192). DOI: 10.1007/s10844-016-0398-3. Online publication date: 1-Aug-2016
  • (2014) Improving the performance of Hadoop Hive by sharing scan and computation tasks. Journal of Cloud Computing: Advances, Systems and Applications 3:1 (1-11). DOI: 10.1186/s13677-014-0012-6. Online publication date: 1-Dec-2014
  • (2013) Modeling Data for Enterprise Systems with Memories. Journal of Database Management 24:2 (1-12). DOI: 10.4018/jdm.2013040101. Online publication date: 1-Apr-2013
  • (2013) Web Usage Mining: Discovering Usage Patterns for Web Applications. Advanced Techniques in Web Intelligence-2 (75-104). DOI: 10.1007/978-3-642-33326-2_4. Online publication date: 2013
