ABSTRACT
The objective of Web forums is to create a shared space for open communication and discussion of specific topics and issues. The tremendous amount of information behind forum sites is not yet fully utilized. Most links between forum pages are created automatically, so link-based ranking algorithms cannot be applied effectively. In this paper, we propose a novel ranking algorithm that introduces content information into link-based methods as implicit links. The basic idea is derived from a more focused random surfer: the surfer is more likely to jump to a page that is similar to the one he is currently reading. In this manner, we can introduce content similarities into the link graph as a personalization bias. Our method, named Fine-grained Rank (FGRank), can be computed efficiently based on an automatically generated topic hierarchy. Unlike topic-sensitive PageRank, our method needs to compute only a single PageRank score for each page. Another contribution of this paper is a very efficient algorithm for automatically generating a topic hierarchy and mapping each page in a large-scale collection onto the computed hierarchy. The experimental results show that the proposed method can improve retrieval performance, and reveal that the content-based link graph is also important compared with the hyperlink graph.
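The focused-surfer idea can be illustrated with a small sketch. This is not the authors' FGRank implementation: it omits the topic hierarchy entirely and instead assumes, purely for illustration, that each page has a term-weight vector and that the surfer's jump probability is proportional to cosine similarity with the current page. The names `cosine` and `content_biased_rank` and the damping value of 0.85 are hypothetical choices, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_biased_rank(links, vectors, damping=0.85, iters=100):
    """PageRank-style scores where the random jump favours similar pages.

    links:   dict mapping a page index to a list of distinct out-link indices
    vectors: one term-weight vector per page (assumed representation)
    """
    n = len(vectors)

    # Per-page jump distribution: proportional to content similarity with
    # every other page; fall back to a uniform jump if nothing is similar.
    jump = []
    for i in range(n):
        sims = [cosine(vectors[i], vectors[j]) if j != i else 0.0
                for j in range(n)]
        total = sum(sims)
        jump.append([s / total for s in sims] if total > 0 else [1.0 / n] * n)

    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            out = links.get(i, [])
            for j in range(n):
                if out:
                    follow = 1.0 / len(out) if j in out else 0.0
                else:
                    # A page with no out-links always takes the content jump.
                    follow = jump[i][j]
                nxt[j] += rank[i] * (damping * follow
                                     + (1.0 - damping) * jump[i][j])
        rank = nxt
    return rank


if __name__ == "__main__":
    # Toy forum: pages 0 and 1 share a topic, as do pages 2 and 3.
    toy_links = {0: [1], 1: [2], 2: [0], 3: [0]}
    toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(content_biased_rank(toy_links, toy_vectors))
```

Because each page's jump distribution sums to one, the scores remain a probability distribution over pages, and every page still receives a single score, in the spirit of the single-score property described above. Computing all pairwise similarities is quadratic in the number of pages, which is the kind of cost a topic hierarchy would presumably help avoid.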
Index Terms
- Building implicit links from content for forum search