Segmenting User Sessions in Search Engine Query Logs Leveraging Word Embeddings

  • Conference paper

Digital Libraries for Open Knowledge (TPDL 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11799)

Abstract

Segmenting user sessions in search engine query logs is important for understanding information needs and assessing how well they are satisfied, for improving the quality of search engine rankings, and for better targeting content to particular users. Most previous methods rely on human judgments to train supervised learning algorithms, and/or apply global thresholds on temporal proximity and on simple lexical similarity metrics. This paper proposes a novel unsupervised method that improves on the current state of the art by leveraging additional heuristics and similarity metrics derived from word embeddings. Specifically, we extend a previous approach that combines temporal and lexical similarity measurements, integrating semantic similarity components that use pre-trained FastText embeddings. The paper reports on experiments with an AOL query dataset used in previous studies, containing a total of 10,235 queries from 215 unique users, grouped into 4,253 sessions (2.4 queries per session on average). The results attest to the effectiveness of the proposed method, which outperforms a large set of baselines that also correspond to unsupervised techniques.
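
To make the general idea concrete, the following is a minimal sketch of unsupervised session segmentation that combines a temporal gap with lexical and semantic similarity. It is not the method evaluated in the paper: the function names, the 30-minute gap, the 0.5 similarity threshold, and the rule for combining the two similarities are all illustrative assumptions, and the tiny hand-written `TOY_VECTORS` table merely stands in for pre-trained FastText embeddings.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Query = Tuple[float, str]  # (timestamp in seconds, query text)

# Hypothetical 2-d vectors standing in for pre-trained word embeddings.
TOY_VECTORS: Dict[str, Sequence[float]] = {
    "cheap": (0.9, 0.1),
    "flights": (1.0, 0.0),
    "airfare": (0.95, 0.05),
    "lisbon": (0.8, 0.2),
    "pasta": (0.0, 1.0),
    "recipe": (0.1, 0.9),
}


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lower-cased query."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: str, b: str) -> float:
    """Lexical similarity: Jaccard coefficient over character 3-grams."""
    x, y = char_ngrams(a), char_ngrams(b)
    return len(x & y) / len(x | y)


def mean_vector(text: str) -> List[float]:
    """Average the toy vectors of the known words in a query."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine(a: str, b: str) -> float:
    """Semantic similarity: cosine between averaged word vectors."""
    u, v = mean_vector(a), mean_vector(b)
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def segment(queries: List[Query], max_gap: float = 1800.0,
            min_sim: float = 0.5) -> List[List[Query]]:
    """Split one user's time-ordered queries into sessions.

    A query continues the current session only if it is close in time to
    the previous query AND similar to it, lexically or semantically.
    Both thresholds are illustrative assumptions.
    """
    sessions: List[List[Query]] = []
    for query in queries:
        if sessions:
            prev_time, prev_text = sessions[-1][-1]
            gap = query[0] - prev_time
            sim = max(jaccard(prev_text, query[1]), cosine(prev_text, query[1]))
            if gap <= max_gap and sim >= min_sim:
                sessions[-1].append(query)
                continue
        sessions.append([query])
    return sessions
```

In this sketch, "cheap flights lisbon" followed a minute later by "lisbon airfare" stays in one session because the averaged vectors are close even though few character 3-grams overlap, which is the kind of case a purely lexical metric tends to miss. Taking the maximum of the two similarities is only one simple way to combine them.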

Notes

  1. https://www.nytimes.com/2006/08/09/technology/09aol.html

  2. https://github.com/PedroG1515/Segmenting-User-Sessions

  3. https://support.google.com/analytics/answer/2731565

  4. https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.wmdistance

Acknowledgements

This work was supported by Fundação para a Ciência e Tecnologia (FCT), through project GoLocal (CMUP-ERI/TIC/0046/2014) and also through the INESC-ID multi-annual funding from the PIDDAC program (UID/CEC/50021/2019).

Author information

Corresponding author

Correspondence to Pedro Gomes.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Gomes, P., Martins, B., Cruz, L. (2019). Segmenting User Sessions in Search Engine Query Logs Leveraging Word Embeddings. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham. https://doi.org/10.1007/978-3-030-30760-8_17

  • DOI: https://doi.org/10.1007/978-3-030-30760-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30759-2

  • Online ISBN: 978-3-030-30760-8

  • eBook Packages: Computer Science, Computer Science (R0)
