DOI: 10.1145/1645953.1646002
research-article

Learning document aboutness from implicit user feedback and document structure

Published: 02 November 2009

ABSTRACT

Capturing the "aboutness" of documents has been a key research focus throughout the history of automated textual information processing. In this work, we represent aboutness using words and phrases that best reflect the central topics of a document. We present a machine learning approach that learns to score and rank words and phrases in a document according to their relevance to the document. We use implicit user feedback available in search engine click logs to characterize the user-perceived notion of term relevance. Using a small set of manually generated training data, we show that surrogate training data derived from click logs correlates well with the manual data, thus eliminating the need to create training data manually, which is both expensive and fundamentally difficult for such a task. Further, we use a diverse set of features in our learning model that capitalize heavily on the structural and visual properties of web documents. In our extensive experimentation, we pay particular attention to tail web pages and show that our approach, trained mainly on head web pages, generalizes and performs well on all kinds of documents. In several evaluation methods using manually generated summaries and term relevance judgments, our system shows a 25% improvement over other aboutness solutions.
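
The idea of using click logs as a surrogate for term relevance can be illustrated with a minimal sketch. It is not the method from the paper: the abstract does not give the labeling scheme, so the scoring rule here (the share of a document's clicks that came from queries containing a candidate term) is an assumption, as are the names `click_term_scores` and `rank_terms` and the `(query, doc_id, clicks)` log layout.

```python
from collections import defaultdict


def click_term_scores(click_log, doc_terms):
    """Score candidate terms per document from a click log.

    click_log: iterable of (query, doc_id, clicks) tuples (assumed layout).
    doc_terms: dict mapping doc_id to a set of candidate words or phrases.
    Returns a dict mapping doc_id to {term: fraction of clicks whose query
    contained the term}. This rule is an illustrative assumption.
    """
    total = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for query, doc_id, clicks in click_log:
        if doc_id not in doc_terms:
            continue
        total[doc_id] += clicks
        # Pad with spaces so that only whole words or phrases match.
        padded = " " + " ".join(query.lower().split()) + " "
        for term in doc_terms[doc_id]:
            if " " + term.lower() + " " in padded:
                hits[doc_id][term] += clicks
    return {
        doc_id: {t: hits[doc_id][t] / total[doc_id] for t in doc_terms[doc_id]}
        for doc_id in total
        if total[doc_id] > 0
    }


def rank_terms(scores):
    """Return terms ordered by descending score, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))


# Toy data, invented for illustration only.
log = [
    ("cheap flights paris", "d1", 8),
    ("paris hotels", "d1", 2),
    ("flights", "d2", 5),
]
terms = {"d1": {"paris", "cheap flights", "weather"}, "d2": {"flights"}}
scores = click_term_scores(log, terms)
ranking = rank_terms(scores["d1"])
```

Scores of this kind could serve as surrogate labels for a learned ranker; the features and the learner described in the abstract are outside the scope of this sketch.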


          Published in

          CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
          November 2009
          2162 pages
          ISBN:9781605585123
          DOI:10.1145/1645953

          Copyright © 2009 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
