ABSTRACT
Capturing the "aboutness" of documents has been a key research focus throughout the history of automated textual information processing. In this work, we represent aboutness using the words and phrases that best reflect the central topics of a document. We present a machine learning approach that learns to score and rank the words and phrases in a document according to their relevance to that document. We use implicit user feedback available in search engine click logs to characterize the user-perceived notion of term relevance. Using a small set of manually generated training data, we show that surrogate training data derived from click logs correlates well with the manual data, which eliminates the need to create training data manually, a process that is both expensive and fundamentally difficult for such a task. Further, our learning model uses a diverse set of features that capitalize heavily on the structural and visual properties of web documents. In our extensive experiments, we pay particular attention to tail web pages and show that our approach, trained mainly on head web pages, generalizes and performs well on all kinds of documents. Across several evaluation methods using manually generated summaries and term relevance judgments, our system shows a 25% improvement over other aboutness solutions.
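The abstract does not say how click logs are turned into term relevance labels. As a rough illustration of the general idea only, the sketch below assumes a hypothetical log of (query, URL, click count) records and scores each candidate term of a page by the share of that page's clicks that came from queries containing the term. The function names, the log format, and the scoring rule are all assumptions made for this example, not the method used in the paper.

```python
from collections import defaultdict


def click_term_scores(click_log, doc_terms):
    """Hypothetical surrogate labels from click data.

    click_log: iterable of (query, url, clicks) tuples (assumed format).
    doc_terms: dict mapping url -> set of candidate words/phrases.
    Returns: dict url -> {term: fraction of the url's clicks whose
             query contains the term as a whole word or phrase}.
    """
    total_clicks = defaultdict(int)
    term_clicks = defaultdict(lambda: defaultdict(int))
    for query, url, clicks in click_log:
        total_clicks[url] += clicks
        padded = f" {query.lower()} "  # pad so only whole tokens match
        for term in doc_terms.get(url, ()):
            if f" {term.lower()} " in padded:
                term_clicks[url][term] += clicks
    return {
        url: {t: term_clicks[url][t] / total for t in doc_terms.get(url, ())}
        for url, total in total_clicks.items()
        if total > 0
    }


def rank_terms(scores):
    """Order terms by descending score, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))


# Toy data, invented for illustration.
log = [
    ("cheap flights paris", "u1", 3),
    ("paris hotels", "u1", 1),
    ("weather", "u1", 1),
]
terms = {"u1": {"paris", "cheap flights", "hotel"}}
scores = click_term_scores(log, terms)
ranking = rank_terms(scores["u1"])
```

Scores of this kind could serve as training targets for a ranking model over document features, which is the role the abstract assigns to click-derived data. The actual label definition and learner would have to come from the full paper.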
Index Terms
Learning document aboutness from implicit user feedback and document structure