DOI: 10.1145/1498759.1498827
research-article

Predicting the readability of short web summaries

Published: 09 February 2009

Abstract

Readability is a crucial presentation attribute that web summarization algorithms consider when generating a query-biased web summary. Readability also forms an important component of real-time monitoring of commercial search-engine results, since the readability of web summaries affects clickthrough behavior, as shown in recent studies, and thus affects user satisfaction and advertising revenue.
The standard approach to computing readability is to first collect a corpus of random queries and their corresponding search result summaries, and then have a human judge each summary for its readability. An average readability score is then reported. This process is time-consuming and expensive. Moreover, manual evaluation cannot be used in the real-time summary generation process. In this paper we propose a machine learning approach to the problem. We use the corpus described above and extract summary features that we think may characterize readability. We then estimate a model (gradient boosted decision tree) that predicts human judgments given the features. This model can then be used in real time to estimate the readability of new (unseen) web search summaries and can also be used in the summary generation process.
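As an illustration only, the idea of fitting boosted trees to editorial judgments can be sketched in a few lines of Python. This is not the authors' implementation: the two features (fraction of capitalized words, number of ellipses), the toy judgments on a 1 to 5 scale, and the function names `fit_stump`, `fit_gbdt` and `predict` are all assumptions made for the sketch. It uses depth-one trees (stumps) and squared-error loss, the simplest instance of gradient boosting.

```python
from statistics import mean

def fit_stump(X, residuals):
    """Find the one-feature threshold split that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must put data on both sides
            lm, rm = mean(left), mean(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    # (feature index, threshold, left value, right value), or None if no split
    return None if best is None else best[1:]

def fit_gbdt(X, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits current residuals."""
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, lv, rv = stump
        stumps.append(stump)
        preds = [p + learning_rate * (lv if row[j] <= t else rv)
                 for row, p in zip(X, preds)]
    return base, stumps, learning_rate

def predict(model, row):
    """Sum the shrunken stump outputs on top of the mean judgment."""
    base, stumps, lr = model
    return base + lr * sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps)

# Hypothetical toy data: [fraction of capitalized words, number of ellipses]
X = [[0.1, 0], [0.2, 1], [0.8, 3], [0.9, 4], [0.15, 0], [0.7, 2]]
y = [5, 4, 2, 1, 5, 2]  # invented editorial judgments, 5 = most readable

model = fit_gbdt(X, y)
print(predict(model, [0.1, 0]), predict(model, [0.9, 4]))
```

Once trained, such a model scores an unseen summary with a handful of comparisons and additions, which is what makes real-time use plausible.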
We present results on approximately 5000 editorial judgments collected over the course of a year and show examples where the model predicts the quality well and where it disagrees with human judgments. We compare the results of the model to previous models of readability, most notably Collins-Thompson-Callan, Fog and Flesch-Kincaid, and see that our model shows substantially better correlation with editorial judgments as measured by Pearson's correlation coefficient. The learning algorithm also provides us with the relative importance of the features used.
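For orientation, the baseline formulas and the evaluation measure named above can be sketched as follows. The Flesch-Kincaid grade level and Gunning Fog index use their standard published coefficients; the syllable counter is a rough vowel-group heuristic and the Fog "complex word" test (three or more syllables) omits the usual exclusions, so both are simplifying assumptions. The function names are illustrative and not taken from the paper.

```python
import math
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def text_stats(text):
    """Return (sentence count, word count, per-word syllable counts)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return len(sentences), len(words), [count_syllables(w) for w in words]

def flesch_kincaid_grade(text):
    """0.39 * words/sentence + 11.8 * syllables/word - 15.59."""
    n_sent, n_words, syl = text_stats(text)
    return 0.39 * n_words / n_sent + 11.8 * sum(syl) / n_words - 15.59

def gunning_fog(text):
    """0.4 * (words/sentence + 100 * complex words / words)."""
    n_sent, n_words, syl = text_stats(text)
    complex_words = sum(1 for s in syl if s >= 3)  # simplified definition
    return 0.4 * (n_words / n_sent + 100 * complex_words / n_words)

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

simple = "The cat sat on the mat."
dense = "Comprehensive documentation necessitates meticulous organization."
print(flesch_kincaid_grade(simple), flesch_kincaid_grade(dense))
print(gunning_fog(simple), gunning_fog(dense))
```

To compare a readability measure against editorial judgments in the spirit of the evaluation described above, one would compute `pearson_r(scores, judgments)` over the judged summaries and prefer the measure with the stronger correlation.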

References

[1]
The R project for statistical computing. http://r-project.org.
[2]
E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding high-quality content in social media. In Proc. of WSDM, 2008.
[3]
A. Aula. Enhancing the readability of search result summaries. In Proc. of HCI, 2004.
[4]
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proc. 22nd Intl. Conference on Machine Learning, pages 89--96, 2005.
[5]
J. Burstein, K. Kukich, S. Wolff, C. Lu, M. Chodorow, L. Braden-Harder, and M. D. Harris. Automated scoring using a hybrid feature identification technique. In Proc. of the 17th Intl. Conference on Computational Linguistics, 1998.
[6]
C. L. A. Clarke, E. Agichtein, S. Dumais, and R. W. White. The influence of caption features on clickthrough patterns in web search. In Proc. 30th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 135--142, 2007.
[7]
K. Collins-Thompson and J. Callan. A language modeling approach to predicting reading difficulty. In Proceedings of HLT/NAACL, 2004.
[8]
J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189--1232, 2001. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf.
[9]
J. H. Friedman. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367--378, 2001. http://www-stat.stanford.edu/~jhf/ftp/stobst.pdf.
[10]
R. Gunning. The technique of clear writing. McGraw-Hill, 1952.
[11]
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, NY, 2001.
[12]
A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:4--37, 2000.
[13]
J. Jeon, W. B. Croft, J. H. Lee, and S. Park. A framework to predict the quality of answers with non-textual features. In Proc. of 29th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 228--235, 2006.
[14]
T. Joachims. Optimizing search engines using clickthrough data. In Proc. 8th Ann. Intl. ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, pages 133--142, 2002.
[15]
M. D. Kickmeier and D. Albert. The effects of scanability on information search: An online experiment. In Proc. of HCI, 2003.
[16]
J. P. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S. Chissom. Derivation of new readability formulas for navy enlisted personnel. Technical report, Millington, Tenn, Naval Air Station, 1975. Tech Report Research Branch Report 8-75.
[17]
G. Legge. Psychophysics of Reading in Normal and Low Vision. Lawrence Erlbaum Associates, 2006.
[18]
P. Li, C. J. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In Proc. 21st Advances in Neural Information Processing Systems, 2007.
[19]
S. F. Liang, S. Devlin, and J. Tait. Evaluating web search result summaries. In European Conference in IR Research, pages 96--106, 2006.
[20]
G. H. McLaughlin. SMOG grading: A new readability formula. Journal of Reading, 12:639--646, 1969.
[21]
R. Nallapati. Discriminative models for information retrieval. In Proc. 27th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 64--71, 2004.
[22]
H. Obendorf and H. Weinreich. Comparing link marker visualization techniques: Changes in reading behavior. In Proc. of 12th Intl. Conference on the World Wide Web, pages 736--745, 2003.
[23]
D. R. Radev and W. Fan. Automatic summarization of search engine hit lists. In Proc. of ACL, 2000.
[24]
K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372--422, 1998.
[25]
G. Ridgeway. Generalized boosted models: A guide to the gbm package. http://i-pensieri.com/gregr/papers/gbm-vignette.pdf.
[26]
G. Ridgeway. The state of boosting. Computing Science and Statistics, 31:172--181, 1999. http://www.i-pensieri.com/gregr/papers/interface99.pdf.
[27]
D. E. Rose, D. M. Orr, and R. G. P. Kantamneni. Summary attributes and perceived search quality. In Proc. of Intl. Conference on the World Wide Web, 2007.
[28]
K. Ryan. Fathom. http://search.cpan.org/dist/Lingua-EN-Fathom.
[29]
L. Si and J. Callan. A statistical model for scientific readability. In Proc. of the 10th Intl. Conference on Information and Knowledge Management, 2001.
[30]
F. Song and W. B. Croft. A general language model for information retrieval. In Proc. 8th Intl. Conf. on Information and Knowledge Management, pages 316--321, 1999.
[31]
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, 2002.
[32]
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proc. 24th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 334--342, 2001.
[33]
Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, and G. Sun. A general boosting method and its application to learning ranking functions for web search. In Proc. 21st Advances in Neural Information Processing Systems, 2007.

    Published In

    WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
    February 2009
    314 pages
    ISBN:9781605583907
    DOI:10.1145/1498759
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. gradient boosted decision trees
    2. readability
    3. realtime
    4. summarization

    Qualifiers

    • Research-article

    Conference

    WSDM'09
    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Cited By

    • (2025) Enhancing Snippet Visualizations to Improve Web Search. International Journal of Human–Computer Interaction. DOI: 10.1080/10447318.2024.2443267 (1-20). Online publication date: 15-Jan-2025
    • (2024) Improving Web Readability Using Video Content: A Relevance-Based Approach. Applied Sciences. DOI: 10.3390/app142311055, 14:23 (11055). Online publication date: 27-Nov-2024
    • (2024) A topic relevance-aware click model for web search. Journal of Intelligent & Fuzzy Systems. DOI: 10.3233/JIFS-236894, 46:4 (8961-8974). Online publication date: 18-Apr-2024
    • (2024) How Readability Cues Affect Children's Navigation of Search Engine Result Pages. Proceedings of the 23rd Annual ACM Interaction Design and Children Conference. DOI: 10.1145/3628516.3655818 (62-69). Online publication date: 17-Jun-2024
    • (2024) The Efficacy Potential of Cyber Security Advice as Presented in News Articles. Interacting with Computers. DOI: 10.1093/iwc/iwae048. Online publication date: 10-Oct-2024
    • (2023) Readability Classification with Wikipedia Data and All-MiniLM Embeddings. Artificial Intelligence Applications and Innovations. AIAI 2023 IFIP WG 12.5 International Workshops. DOI: 10.1007/978-3-031-34171-7_30 (369-380). Online publication date: 2-Jun-2023
    • (2021) Teens’ Conceptual Understanding of Web Search Engines: The Case of Google Search Engine Result Pages (SERPs). Human-Computer Interaction. Design and User Experience Case Studies. DOI: 10.1007/978-3-030-78468-3_18 (253-270). Online publication date: 3-Jul-2021
    • (2020) Extractive Snippet Generation for Arguments. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. DOI: 10.1145/3397271.3401186 (1969-1972). Online publication date: 25-Jul-2020
    • (2019) Abstractive Sentence Compression with Event Attention. Applied Sciences. DOI: 10.3390/app9193949, 9:19 (3949). Online publication date: 20-Sep-2019
    • (2019) Readability of web content. 2019 14th Iberian Conference on Information Systems and Technologies (CISTI). DOI: 10.23919/CISTI.2019.8760889 (1-4). Online publication date: Jun-2019
