Skip to main content
Log in

A novel approach for ranking web documents based on query-optimized personalized pagerank

  • Regular Paper
  • Published:
International Journal of Data Science and Analytics Aims and scope Submit manuscript

Abstract

Ranking plays an important role in the search process of web documents on a huge corpus. This not only reduces the searching time but also provides useful documents to the users. In this paper, we extend our earlier query-optimized PageRank approach by combining the TF-IDF and personalized PageRank algorithm to generate a robust ranking mechanism. In our earlier approach, we modeled a ranking scheme by considering the link structures of the documents along with their content. A novel feature selection technique named as ‘Term-term correlation-based feature selection’ (TCFS) is also proposed which removes all noise terms from the document before the ranking process starts. We believe that by incorporating TCFS and personalized PageRank of the documents along with their relevance will improve the retrieval results. The aim is to modify the link structure based on the similarity score between the content of the document and the user query. Experimental results show that the proposed feature selection technique can outperform the conventional feature selection techniques, and the performance of the combined TF-IDF and personalized PageRank approach is promising compared to the traditional approaches.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

Notes

  1. http://www.worldwidewebsize.com/

  2. the query either having one top term or two top terms

  3. http://snowball.tartarus.org/algorithms/porter/stemmer.html

  4. the threshold is decided by the experiment

  5. http://www.dmoz.org

  6. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/

  7. http://www.daviddlewis.com/resources/testcollections/reuters21578/

  8. http://qwone.com/~jason/20Newsgroups/

  9. http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/

  10. http://wiki.dbpedia.org/Datasets

  11. decided experimentally

  12. decided experimentally

References

  1. Agichtein, E., Brill, E., Dumais, S.: Improving web search ranking by incorporating user behavior information. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp 19–26 (2006)

  2. Andersen, R., Borgs, C., Chayes, J., Hopcraft, J., Mirrokni, V.S., Teng, S.H.: Local computation of pagerank contributions. In: Algorithms and Models for the Web-Graph, Springer, pp 150–165 (2007)

  3. Arun, K., Govindan, V., Kumar, S.M.: On integrating re-ranking and rank list fusion techniques for image retrieval. Int. J. Data Sci. Analytics 4(1), 53–81 (2017)

    Article  Google Scholar 

  4. Aslam, J.A., Montague, M.: Models for metasearch. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp 276–284 (2001)

  5. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy c-means clustering algorithm. Computers Geosci. 10(2), 191–203 (1984)

    Article  Google Scholar 

  6. Bougouin, A., Boudin, F., Daille, B.: Topicrank: Graph-based topic ranking for keyphrase extraction. In: International Joint Conference on Natural Language Processing (IJCNLP), pp 543–551 (2013)

  7. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., pp 43–52 (1998)

  8. Chahal, P., Singh, M., Kumar, S.: An efficient web page ranking for semantic web. J. Inst. Eng. India Ser B 95(1), 15–21 (2014)

    Article  Google Scholar 

  9. Chen, L., Kulasiri, D., Samarasinghe, S.: A novel data-driven boolean model for genetic regulatory networks. Front. Physiol. 9, 1328 (2018)

    Article  Google Scholar 

  10. Chirita, P.A., Diederich, J., Nejdl, W.: Mailrank: Using ranking for spam detection. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, ACM, pp 373–380 (2005)

  11. Collins, M.: Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, pp 489–496 (2002)

  12. Craswell, N., Hawking, D.: Overview of the trec-2002 web track. In: TREC, pp 78–92 (2002)

  13. Dali, L., Fortuna, B., Duc, TT., Mladenić, D.: Query-independent learning to rank for rdf entity search. In: Extended Semantic Web Conference, Springer, pp 484–498 (2012)

  14. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th International Conference on World Wide Web, ACM, pp 519–528 (2003)

  15. Derhami, V., Khodadadian, E., Ghasemzadeh, M., Bidoki, A.M.Z.: Applying reinforcement learning for web pages ranking algorithms. Appl. Soft Comput. 13(4), 1686–1692 (2013)

    Article  Google Scholar 

  16. Diaconis, P., Graham, R.L.: Spearman’s footrule as a measure of disarray. J. R. Stat. Soc. Ser. B Methodological 39, 262–268 (1977)

    MathSciNet  MATH  Google Scholar 

  17. Du, Y., Hai, Y.: Semantic ranking of web pages based on formal concept analysis. J. Syst. Softw. 86(1), 187–197 (2013)

    Article  Google Scholar 

  18. Ekstrand, M.D., Riedl, J.T., Konstan, J.A.: Collaborative filtering recommender systems. Found. Trends Human-Computer Interact. 4(2), 81–173 (2011)

    Article  Google Scholar 

  19. Fafalios, P., Kasturia, V., Nejdl, W.: Ranking archived documents for structured queries on semantic layers. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, ACM, pp. 155–164 (2018)

  20. Gao, Y., Xu, Y., Li, Y.: Pattern-based topics for document modelling in information filtering. IEEE Trans. Knowl. Data Eng. 27(6), 1629–1642 (2015)

    Article  Google Scholar 

  21. Gugnani, S., Roul, R.K.: Triple indexing: an efficient technique for fast phrase query evaluation. Int. J. Computer Appl. 87(13), 9–13 (2014)

    Google Scholar 

  22. Gugnani, S., Bihany, T., Roul, R.K.: A complete survey on web document ranking. Int. J. Computer Appl. ICACEA 975, 8887 (2014)

    Google Scholar 

  23. Guo, Z., Zhang, L., Zhang, D.: A completed modeling of local binary pattern operator for texture classification. IEEE Trans. Image Process. 19(6), 1657–1663 (2010)

    Article  MathSciNet  Google Scholar 

  24. Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., Zhao, L.: Latent dirichlet allocation (lda) and topic modeling: models, applications, a survey. Multimedia Tools Appl. 78(11), 15169–15211 (2019)

    Article  Google Scholar 

  25. Khodaei, A., Shahabi, C., Li, C.: Skif-p: a point-based indexing and ranking of web documents for spatial-keyword search. Geoinformatica 16(3), 563–596 (2012)

    Article  Google Scholar 

  26. Kwak, N., Choi, C.H.: Input feature selection by mutual information based on parzen window. IEEE Trans. Pattern Anal. Mach. Intell. 24(12), 1667–1671 (2002)

    Article  Google Scholar 

  27. Langville, A.N., Meyer, C.D.: Deeper inside pagerank. Internet Math. 1(3), 335–380 (2004)

    Article  MathSciNet  Google Scholar 

  28. Liu, T.Y., et al.: Learning to rank for information retrieval. Found. Trends® Inf. Retr. 3(3), 225–331 (2009)

    Article  Google Scholar 

  29. Lv, Y., Zhai, C.: Positional language models for information retrieval. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp 299–306 (2009)

  30. Meymandpour, R., Davis, J.G.: A semantic similarity measure for linked data: an information content-based approach. Knowl.-Based Syst. 109, 276–293 (2016)

    Article  Google Scholar 

  31. Mirzal, A.: Clustering and latent semantic indexing aspects of the singular value decomposition. Int. J. Inf. Decision Sci. 8(1), 53–72 (2016)

    Google Scholar 

  32. Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, pp. 115–124 (2005)

  33. Pang, L., Lan, Y., Guo, J., Xu, J., Xu, J., Cheng, X.: Deeprank: a new deep architecture for relevance ranking in information retrieval. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, pp. 257–266 (2017)

  34. Pasquinelli, M.: Google’s pagerank algorithm: a diagram of cognitive capitalism and the rentier of the common intellect. Deep Search: The Politics of Search Beyond Google pp. 152–163 (2009)

  35. Pon, R.K., Cardenas, A.F., Buttler, D., Critchlow, T.: Tracking multiple topics for finding interesting articles. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 560–569 (2007)

  36. Qin, T., Liu, T.Y., Zhang, X.D., Wang, D.S., Xiong, W.Y., Li, H.: Learning to rank relational objects and its application to web search. In: Proceedings of the 17th International Conference on World Wide Web, ACM, pp. 407–416 (2008)

  37. Robertson, S.E., Walker, S.: Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Springer, New York, pp. 232–241 (1994)

  38. Roul, R.K.: Detecting spam web pages using multilayer extreme learning machine. Int. J. Big Data Intell. 5(1–2), 49–61 (2018a)

    Article  Google Scholar 

  39. Roul, R.K.: An effective approach for semantic-based clustering and topic-based ranking of web documents. Int. J. Data Sci. Analytics 5(4), 269–284 (2018b)

    Article  Google Scholar 

  40. Roul, R.K., Arora, K.: A nifty review to text summarization-based recommendation system for electronic products. Soft. Comput. 23(24), 13183–13204 (2019)

    Article  Google Scholar 

  41. Roul, R.K., Rai, P.: A new feature selection technique combined with elm feature space for text classification. In: Proceedings of the 13th International Conference on Natural Language Processing, pp. 285–292 (2016)

  42. Roul, R.K., Sahoo, J.K.: Query-optimized pagerank: a novel approach. In: Advances in Intelligent Systems and Computing 711, Springer, pp. 673–683 (2017)

  43. Roul, R.K., Sahoo, J.K.: Sentiment analysis and extractive summarization based recommendation system. In: Computational Intelligence in Data Mining, Springer, pp. 473–487 (2020)

  44. Roul, R.K., Gugnani, S., Kalpeshbhai, S.M.: Clustering based feature selection using extreme learning machines for text classification. In: 2015 Annual IEEE India Conference (INDICON), IEEE, pp. 1–6 (2015)

  45. Roul, R.K., Asthana, S.R., Kumar, G.: Spam web page detection using combined content and link features. Int. J. Data Min. Modell. Manag. 8(3), 209–222 (2016a)

    Google Scholar 

  46. Roul, R.K., Bhalla, A., Srivastava, A.: Commonality-rarity score computation: a novel feature selection technique using extended feature space of elm for text classification. In: Proceedings of the 8th Annual Meeting of the Forum on Information Retrieval Evaluation, pp. 37–41 (2016b)

  47. Roul, R.K., Asthana, S.R., Kumar, G.: Study on suitability and importance of multilayer extreme learning machine for classification of text data. Soft Comput. 21, 4239 (2017a)

    Article  Google Scholar 

  48. Roul, R.K., Sahoo, J.K., Goel, R.: Deep learning in the domain of multi-document text summarization. PReMI, LNCS 10597, 575–581 (2017b)

    Google Scholar 

  49. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

    Article  Google Scholar 

  50. Santos, I., Laorden, C., Sanz, B., Bringas, P.G.: Enhanced topic-based vector space model for semantics-aware spam filtering. Expert Syst. Appl. 39(1), 437–444 (2012)

    Article  Google Scholar 

  51. Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., Wang, Z.: A novel feature selection algorithm for text categorization. Expert Syst. Appl. 33(1), 1–5 (2007)

    Article  Google Scholar 

  52. Song, Y., Pan, S., Liu, S., Zhou, M.X., Qian, W.: Topic and keyword re-ranking for LDA-based topic modeling. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, ACM, pp. 1757–1760 (2009)

  53. Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Documentation 28(1), 11–21 (1972)

    Article  Google Scholar 

  54. Spink, A., Wolfram, D., Jansen, M.B., Saracevic, T.: Searching the web: the public and their queries. J. Am. Soc. Inform. Sci. Technol. 52(3), 226–234 (2001)

    Article  Google Scholar 

  55. Tao, T., Zhai, C.: Regularized estimation of mixture models for robust pseudo-relevance feedback. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 162–169 (2006)

  56. Vuurens, J.B., de Vries, A.P.: Distance matters! cumulative proximity expansions for ranking documents. Inf. Retr. 17(4), 380–406 (2014)

    Article  Google Scholar 

  57. Wang, Y., Lu, J., Chen, J., Li, Y.: Crawling ranked deep web data sources. World Wide Web 20(1), 89–110 (2017)

    Article  Google Scholar 

  58. Xu, J., Cao, Y., Li, H., Zhao, M.: Ranking definitions with supervised learning methods. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, ACM, pp. 811–819 (2005)

  59. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. ICML 97, 412–420 (1997)

    Google Scholar 

  60. Yulianti, E., Chen, R.C., Scholer, F., Croft, W.B., Sanderson, M.: Ranking documents by answer-passage quality. In: Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 335–344 (2018)

  61. Zhai, C., Cohen, W.W., Lafferty, J.: Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In: ACM SIGIR Forum, ACM vol. 49, pp. 2–9 (2015)

  62. Zhao, J., Yun, Y.: A proximity language model for information retrieval. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 291–298 (2009)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rajendra Kumar Roul.

Ethics declarations

Conflict of interest

The corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Roul, R.K., Sahoo, J.K. A novel approach for ranking web documents based on query-optimized personalized pagerank. Int J Data Sci Anal 11, 37–55 (2021). https://doi.org/10.1007/s41060-020-00232-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s41060-020-00232-2

Keywords

Navigation