
Applying language modeling to session identification from database trace logs

  • Regular Paper
  • Knowledge and Information Systems

Abstract

A database session is a sequence of requests presented to the database system by a user or an application to achieve a certain task. Session identification is an important step in discovering useful patterns from database trace logs. The discovered patterns can be used to improve the performance of database systems by prefetching predicted queries, rewriting the current query or conducting effective cache replacement.

In this paper, we apply a new session identification method based on statistical language modeling to database trace logs. The application reveals several open problems with the language-modeling-based method: how to select values for the parameters of the language model, how to evaluate the accuracy of the session identification result, and how to learn a language model without well-labeled training data. Each of these issues must be resolved for the method to be applied successfully, and we propose solutions to all of them. In particular, we propose new methods for determining the entropy threshold and the order of the language model, and we present new performance measures that better evaluate the accuracy of the identified sessions. Furthermore, we introduce three types of learning methods, namely learning from labeled data, learning from semi-labeled data and learning from unlabeled data, so that language models can be learned from different types of training data. Finally, we report experimental results that show the effectiveness of the language-model-based method for identifying sessions from the trace logs of an OLTP database application and the TPC-C Benchmark.
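The abstract does not spell out the algorithm, so the following is only a minimal sketch of how an entropy threshold and a fixed model order could drive session identification. It assumes a bigram model (order 2) with add-one smoothing, trained on labeled sessions, and it places a boundary when appending a request raises the entropy of the running sequence by more than a relative threshold. The function names, the request names, the smoothing choice and the threshold value are all hypothetical illustrations, not the authors' implementation.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical marker for the beginning of a session


def train_bigram(sessions):
    """Count bigrams over labeled sessions (each a list of request ids).

    Assumes at least one non-empty training session.
    """
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for session in sessions:
        seq = [START] + list(session)
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(vocab)


def entropy(seq, model):
    """Average negative log2 probability per request (add-one smoothed)."""
    context_counts, bigram_counts, vocab_size = model
    padded = [START] + list(seq)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        total -= math.log2(p)
    return total / len(seq)


def split_sessions(trace, model, threshold):
    """Cut the trace where the relative entropy increase exceeds `threshold`."""
    sessions, current = [], []
    for request in trace:
        if current:
            before = entropy(current, model)
            after = entropy(current + [request], model)
            if (after - before) / before > threshold:
                sessions.append(current)  # close the session at the jump
                current = []
        current.append(request)
    if current:
        sessions.append(current)
    return sessions


# Illustrative data only: two made-up session types from a labeled log.
order_entry = ["login", "query", "update", "logout"]
reporting = ["connect", "report", "disconnect"]
model = train_bigram([order_entry] * 5 + [reporting] * 5)
print(split_sessions(order_entry + reporting + order_entry, model, threshold=0.2))
```

In this toy setting the entropy of the running sequence falls as familiar requests are appended and jumps when an unseen transition appears, which is the signal the threshold tests. Choosing the threshold and the model order on real trace logs is exactly the parameter-selection problem the abstract raises, and a small relative threshold like the one above would not necessarily transfer.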



Author information

Corresponding author

Correspondence to Xiangji Huang.

Additional information

Xiangji Huang joined York University as an Assistant Professor in July 2003 and became a tenured Associate Professor in May 2006. Previously, he was a Post Doctoral Fellow at the School of Computer Science, University of Waterloo, Canada. He did his Ph.D. in Information Science at City University in London, England, with Professor Stephen E. Robertson. Before entering his Ph.D. program, he worked as a lecturer for 4 years at Wuhan University. He also worked for three and a half years in the financial industry in Canada doing E-business, where he was awarded a CIO Achievement Award. He has published more than 50 refereed papers in journals, book chapters and conference proceedings. His Master's (M.Eng.) and Bachelor's (B.Eng.) degrees were in Computer Organization & Architecture and Computer Engineering, respectively. His research interests include information retrieval, data mining, natural language processing, bioinformatics and computational linguistics.

Qingsong Yao is a Ph.D. student in the Department of Computer Science and Engineering at York University, Toronto, Canada. His research interests include database management systems and query optimization, data mining, information retrieval, natural language processing and computational linguistics. He earned his Master's degree in Computer Science from the Institute of Software, Chinese Academy of Sciences, in 1999 and his Bachelor's degree in Computer Science from Tsinghua University.

Aijun An is an associate professor in the Department of Computer Science and Engineering at York University, Toronto, Canada. She received her Bachelor's and Master's degrees in Computer Science from Xidian University in China. She received her PhD degree in Computer Science from the University of Regina in Canada in 1997. She worked at the University of Waterloo as a postdoctoral fellow from 1997 to 1999 and as a research assistant professor from 1999 to 2001. She joined York University in 2001. She has published more than 60 papers in refereed journals and conference proceedings. Her research interests include data mining, machine learning, and information retrieval.

About this article

Cite this article

Huang, X., Yao, Q. & An, A. Applying language modeling to session identification from database trace logs. Knowl Inf Syst 10, 473–504 (2006). https://doi.org/10.1007/s10115-006-0015-9

  • DOI: https://doi.org/10.1007/s10115-006-0015-9
