ABSTRACT
Query term weighting is a fundamental task in information retrieval, and the most popular term weighting schemes are based primarily on statistical analysis of term occurrences within the document collection. In this work we study how term weighting may benefit from syntactic analysis of the corpus. Focusing on community question answering (CQA) sites, we treat the syntactic function of terms within CQA texts as an important factor affecting their relative importance for retrieval. We analyze a large log of web queries that landed on the Yahoo Answers site and show that the tendency of a document word to appear in a landing (click-through) query varies strongly with its syntactic function. Based on this observation, we propose a novel term weighting method that uses the syntactic information available for each query term occurrence in the document, on top of term occurrence statistics. The relative importance of each feature is learned via a learning-to-rank algorithm that utilizes a click-through query log. We evaluate the new weighting scheme manually, using editorial data, and automatically, over the query log. Our experimental results show consistent improvement in retrieval when syntactic information is taken into account.
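To make the idea concrete, the following is a minimal sketch of occurrence-level weighting by syntactic function, not the method proposed in the paper. It assumes each document token already carries a syntactic role label, and it combines a smoothed IDF with a per-role weight. The role names, the values in `ROLE_WEIGHTS`, and the functions `idf` and `score` are hypothetical and chosen only for illustration; in the setting described above, such weights would be learned from a click-through log rather than set by hand.

```python
import math

# Hypothetical per-role weights (assumption for illustration only).
# In the abstract's setting these would be learned by a learning-to-rank
# algorithm from click-through data, not hand-set.
ROLE_WEIGHTS = {"subject": 1.5, "object": 1.2, "modifier": 0.8, "other": 1.0}


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency; always positive."""
    df = doc_freq.get(term, 0)
    return math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))


def score(query_terms, doc_tokens, doc_freq, num_docs, role_weights=ROLE_WEIGHTS):
    """Score one document against a query.

    doc_tokens is a list of (term, syntactic_role) pairs. Each occurrence of
    a query term contributes the weight of its role instead of a flat count
    of 1, so the same term counts more in some syntactic positions.
    """
    total = 0.0
    for q in set(query_terms):
        weighted_tf = sum(
            role_weights.get(role, 1.0)  # unknown roles fall back to 1.0
            for term, role in doc_tokens
            if term == q
        )
        total += idf(q, doc_freq, num_docs) * weighted_tf
    return total


# Usage: the same term in two different syntactic roles.
doc_a = [("laptop", "subject"), ("battery", "modifier")]
doc_b = [("laptop", "modifier"), ("battery", "subject")]
doc_freq = {"laptop": 2, "battery": 2}
score_a = score(["laptop"], doc_a, doc_freq, num_docs=10)
score_b = score(["laptop"], doc_b, doc_freq, num_docs=10)
```

Under these assumed weights, `doc_a` scores higher than `doc_b` for the query `laptop`, because the term occurs as a subject in the first document and as a modifier in the second.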
Index Terms
- Improving Term Weighting for Community Question Answering Search Using Syntactic Analysis