Abstract
In this paper we describe learning algorithms for Web page rank prediction. We consider linear regression models and combinations of regression with probabilistic clustering and Principal Components Analysis (PCA). These models are learned from time-series data sets and can predict the ranking of a set of Web pages at a future time. The first algorithm uses separate linear regression models. It is then extended with probabilistic clustering based on the EM algorithm: fitting a mixture of regression models allows Web pages to be grouped together. A different method combines linear regression with PCA so that dependencies between different Web pages can be exploited. All methods are evaluated on real data sets obtained from the Internet Archive, Wikipedia and Yahoo! ranking lists. We also study the temporal robustness of the prediction framework. Overall, the system constitutes a set of tools for high-accuracy page rank prediction that can be used for efficient resource management by search engines.
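To make the basic idea of regression on rank time series concrete, the following is a minimal sketch and not the authors' implementation. It assumes each page has a score at every time step, uses the previous `order` scores as features, and fits a single least-squares model shared by all pages; the abstract does not specify the features, the window length, or how the separate models are organised. The helper names (`make_windows`, `fit_linear`, `predict_next`) and the synthetic data are hypothetical.

```python
import numpy as np

def make_windows(scores, order):
    """Turn a (n_pages, n_steps) score matrix into autoregressive (X, y) pairs."""
    n_pages, n_steps = scores.shape
    X, y = [], []
    for t in range(order, n_steps):
        X.append(scores[:, t - order:t])  # previous `order` scores of every page
        y.append(scores[:, t])            # score to be predicted at step t
    return np.vstack(X), np.concatenate(y)

def fit_linear(X, y):
    """Least-squares weights; a bias term is appended as the last column."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_next(scores, w, order):
    """Predict each page's score one step ahead from its last `order` values."""
    last = scores[:, -order:]
    Xb = np.hstack([last, np.ones((last.shape[0], 1))])
    return Xb @ w

# Hypothetical data: noiseless linear trends, one per page (assumption).
rng = np.random.default_rng(0)
n_pages, n_steps, order = 50, 12, 2
start = rng.uniform(0.0, 1.0, size=(n_pages, 1))
slope = rng.uniform(-0.02, 0.02, size=(n_pages, 1))
scores = start + slope * np.arange(n_steps)

# Hold out the final time step as the "future" to predict.
train, target = scores[:, :-1], scores[:, -1]
X, y = make_windows(train, order)
w = fit_linear(X, y)
pred = predict_next(train, w, order)

# A predicted ranking is obtained by sorting pages by predicted score.
predicted_ranking = np.argsort(-pred)
true_ranking = np.argsort(-target)
```

On real data the scores are noisy, so the predicted and true rankings would be compared with a rank correlation or a gain-based measure rather than expected to match exactly.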
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zacharouli, P., Titsias, M., Vazirgiannis, M. (2009). Web Page Rank Prediction with PCA and EM Clustering. In: Avrachenkov, K., Donato, D., Litvak, N. (eds) Algorithms and Models for the Web-Graph. WAW 2009. Lecture Notes in Computer Science, vol 5427. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-95995-3_9
DOI: https://doi.org/10.1007/978-3-540-95995-3_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-95994-6
Online ISBN: 978-3-540-95995-3
eBook Packages: Computer Science, Computer Science (R0)