Abstract
Ranking plays an important role in searching for web documents in a huge corpus: it reduces search time and surfaces useful documents for users. In this paper, we extend our earlier query-optimized PageRank approach by combining TF-IDF with the personalized PageRank algorithm to produce a robust ranking mechanism. In the earlier approach, we modeled a ranking scheme that considers the link structure of the documents along with their content. We also propose a novel feature selection technique, ‘Term-term correlation-based feature selection’ (TCFS), which removes noise terms from the documents before the ranking process starts. We believe that incorporating TCFS and the personalized PageRank of the documents, along with their relevance, will improve the retrieval results. The aim is to modify the link structure based on the similarity score between the content of a document and the user query. Experimental results show that the proposed feature selection technique can outperform conventional feature selection techniques, and that the performance of the combined TF-IDF and personalized PageRank approach is promising compared to traditional approaches.
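As a rough illustration of the general idea of combining TF-IDF relevance with personalized PageRank, the sketch below scores each document against the query with TF-IDF cosine similarity and then uses those scores as the teleportation (personalization) vector of a PageRank power iteration. This is not the method of the paper: the paper modifies the link structure itself and applies TCFS first, whereas this simplified stand-in only biases the random jump. All function names, the smoothed IDF formula, the damping factor of 0.85, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs, query):
    """Build TF-IDF dicts for tokenized documents and for the query.

    Uses a smoothed IDF (an assumption; the paper's exact weighting is not given here).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vec(tokens):
        tf = Counter(tokens)
        # Terms unseen in the corpus are ignored.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    return [vec(d) for d in docs], vec(query)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def personalized_pagerank(links, personalization, damping=0.85, iters=100, tol=1e-10):
    """Power iteration where links[i] lists the documents that i links to."""
    n = len(links)
    total = sum(personalization)
    # Fall back to a uniform jump if no document matches the query.
    p = [x / total for x in personalization] if total > 0 else [1.0 / n] * n
    rank = p[:]
    for _ in range(iters):
        new = [(1 - damping) * p[i] for i in range(n)]
        for src, outs in enumerate(links):
            if outs:
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
            else:
                # Dangling document: redistribute its mass according to p.
                for i in range(n):
                    new[i] += damping * rank[src] * p[i]
        delta = sum(abs(new[i] - rank[i]) for i in range(n))
        rank = new
        if delta < tol:
            break
    return rank


def rank_documents(docs, links, query):
    """Return (document indices sorted best-first, PageRank scores)."""
    doc_vecs, q_vec = tfidf_vectors(docs, query)
    sims = [cosine(v, q_vec) for v in doc_vecs]
    scores = personalized_pagerank(links, sims)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores


if __name__ == "__main__":
    # Hypothetical toy corpus and link graph.
    docs = [["web", "search", "ranking"], ["cooking", "recipes"],
            ["ranking", "web", "documents"], ["sports", "news"]]
    links = [[2], [0], [0], [1]]
    order, scores = rank_documents(docs, links, ["web", "ranking"])
    print(order, [round(s, 4) for s in scores])
```

Using the similarity scores as the jump distribution means documents that neither match the query nor are linked from matching documents receive little or no rank, which is one simple way to make a link-based score query-dependent.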
Notes
the query has either one or two top terms
the threshold is decided experimentally
decided experimentally
decided experimentally
Ethics declarations
Conflict of interest
The corresponding author states that there is no conflict of interest.
About this article
Cite this article
Roul, R.K., Sahoo, J.K. A novel approach for ranking web documents based on query-optimized personalized pagerank. Int J Data Sci Anal 11, 37–55 (2021). https://doi.org/10.1007/s41060-020-00232-2
DOI: https://doi.org/10.1007/s41060-020-00232-2