Training query filtering for semi-supervised learning to rank with pseudo labels


Abstract

Semi-supervised learning is a machine learning paradigm that creates pseudo labels from unlabeled data for learning a ranking model when only limited or no training examples are available. However, the effectiveness of semi-supervised learning in information retrieval (IR) can be hindered by low-quality pseudo labels, hence the need for training query filtering that removes low-quality queries. In this paper, we consider two application scenarios with respect to the availability of human labels. First, for applications without any labeled data, a clustering-based approach is proposed to select high-quality training queries. This approach follows the empirical observation that the relevant documents of high-quality training queries are highly coherent. Second, for applications with limited labeled data, a classification-based approach is proposed. This approach learns a weak classifier that uses query features to predict the retrieval performance gain of a given training query. The queries with high predicted performance gains are selected for the subsequent transduction process, which creates the pseudo labels for learning to rank algorithms. Experimental results on the standard LETOR dataset show that our proposed approaches outperform strong baselines.
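
As a rough illustration of the two filtering strategies outlined above, the following sketch assumes TF-IDF cosine similarity as the coherence measure and logistic regression as a stand-in for the weak classifier; all function names, thresholds, and feature choices are hypothetical and not taken from the paper.

```python
# Minimal, hypothetical sketch of the two training-query filters described in
# the abstract. Thresholds, feature sets, and names are illustrative only.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity


# --- Scenario 1: no labels available -> coherence-based (clustering) filter ---

def coherence(docs):
    """Mean pairwise cosine similarity of the documents' TF-IDF vectors."""
    if len(docs) < 2:
        return 0.0
    tfidf = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(tfidf)
    pairs = np.triu_indices(len(docs), k=1)  # distinct pairs only
    return float(sims[pairs].mean())


def filter_by_coherence(pseudo_relevant, threshold=0.3):
    """Keep training queries whose pseudo-relevant documents are coherent.

    pseudo_relevant: dict of query id -> list of document texts that the
    transduction step treats as relevant (e.g. the top-k retrieved documents).
    """
    return [q for q, docs in pseudo_relevant.items() if coherence(docs) >= threshold]


# --- Scenario 2: limited labels -> weak classifier over query features ---

def filter_by_predicted_gain(train_feats, train_gain, cand_feats, cand_ids):
    """Fit a weak classifier on the small labelled set, then keep the candidate
    queries it predicts will yield a retrieval performance gain.

    train_feats: (n, d) query-feature matrix for the labelled queries
    train_gain:  binary labels, 1 if adding the query improved retrieval
    cand_feats:  (m, d) query-feature matrix for the unlabelled candidates
    """
    clf = LogisticRegression(max_iter=1000).fit(train_feats, train_gain)
    keep = clf.predict(cand_feats) == 1
    return [qid for qid, k in zip(cand_ids, keep) if k]
```

The surviving queries would then feed the transduction step that produces pseudo labels for the downstream learning to rank algorithm; the coherence threshold and the choice of classifier are tuning decisions, not prescriptions from the paper.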



Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (61103131/61472391), the Beijing Natural Science Foundation (4142050), and the SRF for ROCS, SEM.

Author information

Corresponding author

Correspondence to Ben He.


About this article


Cite this article

Zhang, X., He, B. & Luo, T. Training query filtering for semi-supervised learning to rank with pseudo labels. World Wide Web 19, 833–864 (2016). https://doi.org/10.1007/s11280-015-0363-z

