research-article

Predicting the readability of short web summaries

Authors:

David Orr

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining

Pages 202 - 211

https://doi.org/10.1145/1498759.1498827

Published: 09 February 2009

Abstract

Readability is a crucial presentation attribute that web summarization algorithms consider when generating a query-biased web summary. Readability also forms an important component in real-time monitoring of commercial search-engine results, since the readability of web summaries affects clickthrough behavior, as shown in recent studies, and thus affects user satisfaction and advertising revenue.

The standard approach to computing readability is to first collect a corpus of random queries and their corresponding search result summaries, and then have a human judge each summary for its readability. An average readability score is then reported. This process is time-consuming and expensive. Moreover, manual evaluation cannot be used in real-time summary generation. In this paper we propose a machine learning approach to the problem. We use the corpus described above and extract summary features that we think may characterize readability. We then estimate a model (a gradient boosted decision tree) that predicts human judgments given the features. This model can then be used in real time to estimate the readability of new (unseen) web search summaries, and can also be used in the summary generation process.
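As an illustration of the pipeline described above, the following is a minimal sketch of gradient boosting with one-split regression trees (stumps) under squared-error loss, written in plain Python. It is not the authors' implementation: the feature extractor `summary_features`, the three features it computes, the toy summaries, and their scores are all hypothetical, chosen only to show how features and human judgments feed a boosted model.

```python
from statistics import mean

def summary_features(text):
    """Hypothetical features; the abstract does not list the real ones."""
    words = text.split()
    n = max(len(words), 1)
    return [
        len(words),                       # token count
        sum(len(w) for w in words) / n,   # average token length
        text.count("..."),                # number of ellipses
    ]

def fit_stump(X, residuals):
    """Find the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return None if best is None else best[1:]

def fit_gbdt(X, y, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals (squared-loss gradient)."""
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no feature can be split any further
            break
        j, t, lv, rv = stump
        stumps.append(stump)
        preds = [p + learning_rate * (lv if row[j] <= t else rv)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate

def predict(model, row):
    base, stumps, lr = model
    return base + lr * sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps)

# Made-up summaries with made-up editorial scores (higher = more readable).
toy = [
    ("Python is a programming language that lets you work quickly.", 5.0),
    ("buy cheap ... click here ... best price ... free ...", 1.0),
    ("The museum opens daily at nine and admission is free.", 4.5),
    ("home ... login ... contact ... sitemap ... about us ...", 1.5),
]
X = [summary_features(text) for text, _ in toy]
y = [score for _, score in toy]
model = fit_gbdt(X, y)
```

Once fitted, `predict(model, summary_features(new_summary))` scores an unseen summary without a human judge, which is the property that makes the approach usable at generation time. A production system would use deeper trees and a held-out evaluation set rather than four training examples.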

We present results on approximately 5000 editorial judgments collected over the course of a year and show examples where the model predicts the quality well and where it disagrees with human judgments. We compare the results of the model to previous models of readability, most notably Collins-Thompson-Callan, Fog and Flesch-Kincaid, and see that our model shows substantially better correlation with editorial judgments as measured by Pearson's correlation coefficient. The learning algorithm also provides us with the relative importance of the features used.
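For readers unfamiliar with the baselines named above, here is a small sketch of the standard published Flesch-Kincaid grade level and Gunning Fog formulas, together with Pearson's correlation coefficient as the comparison metric. The syllable counter is a rough vowel-group heuristic assumed for illustration only, and the helper names are hypothetical; the paper's own tooling may compute these quantities differently.

```python
import math
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def text_stats(text):
    """Return (sentence count, word count, per-word syllable counts)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return sentences, len(words), [count_syllables(w) for w in words]

def flesch_kincaid_grade(text):
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (sum(syl) / w) - 15.59

def gunning_fog(text):
    # 0.4 * ((words / sentences) + 100 * (complex words / words)),
    # where a complex word has three or more syllables.
    s, w, syl = text_stats(text)
    complex_words = sum(1 for n in syl if n >= 3)
    return 0.4 * ((w / s) + 100 * complex_words / w)

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

To compare a readability measure with editorial judgments in this style, one would compute its score for each judged summary and call `pearson(scores, judgments)`. Both formulas depend heavily on sentence length, so very short or fragmentary web summaries are a hard input for them.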

References

[1]

The R project for statistical computing. http://r-project.org.

[2]

E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding high-quality content in social media. In Proc. of WSDM, 2008.

[3]

A. Aula. Enhancing the readability of search result summaries. In Proc. of HCI, 2004.

[4]

C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proc. 22nd Intl. Conference on Machine Learning, pages 89--96, 2005.

[5]

J. Burstein, K. Kukich, S. Wolff, C. Lu, M. Chodorow, L. Braden-Harder, and M. D. Harris. Automated scoring using a hybrid feature identification technique. In Proc. of the 17th Intl. Conference on Computational Linguistics, 1998.

[6]

C. L. A. Clarke, E. Agichtein, S. Dumais, and R. W. White. The influence of caption features on clickthrough patterns in web search. In Proc. 30th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 135--142, 2007.

[7]

K. Collins-Thompson and J. Callan. A language modeling approach to predicting reading difficulty. In Proceedings of HLT/NAACL, 2004.

[8]

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189--1232, 2001. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf.

[9]

J. H. Friedman. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367--378, 2001. http://www-stat.stanford.edu/~jhf/ftp/stobst.pdf.

[10]

R. Gunning. The technique of clear writing. McGraw-Hill, 1952.

[11]

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, NY, 2001.

[12]

A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:4--37, 2000.

[13]

J. Jeon, W. B. Croft, J. H. Lee, and S. Park. A framework to predict the quality of answers with non-textual features. In Proc. of 29th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 228--235, 2006.

[14]

T. Joachims. Optimizing search engines using clickthrough data. In Proc. 8th Ann. Intl. ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, pages 133--142, 2002.

[15]

M. D. Kickmeier and D. Albert. The effects of scanability on information search: An online experiment. In Proc. of HCI, 2003.

[16]

J. P. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S. Chissom. Derivation of new readability formulas for navy enlisted personnel. Technical report, Millington, Tenn., Naval Air Station, 1975. Tech Report Research Branch Report 8-75.

[17]

G. Legge. Psychophysics of Reading in Normal and Low Vision. Lawrence Erlbaum Associates, 2006.

[18]

P. Li, C. J. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In Proc. 21st Advances in Neural Information Processing Systems, 2007.

[19]

S. F. Liang, S. Devlin, and J. Tait. Evaluating web search result summaries. In European Conference in IR Research, pages 96--106, 2006.

[20]

G. H. McLaughlin. SMOG grading: A new readability formula. Journal of Reading, 12:639--646, 1969.

[21]

R. Nallapati. Discriminative models for information retrieval. In Proc. 27th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 64--71, 2004.

[22]

H. Obendorf and H. Weinreich. Comparing link marker visualization techniques: Changes in reading behavior. In Proc. of 12th Intl. Conference on the World Wide Web, pages 736--745, 2003.

[23]

D. R. Radev and W. Fan. Automatic summarization of search engine hit lists. In Proc. of ACL, 2000.

[24]

K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372--422, 1998.

[25]

G. Ridgeway. Generalized boosted models: A guide to the gbm package. http://i-pensieri.com/gregr/papers/gbm-vignette.pdf.

[26]

G. Ridgeway. The state of boosting. Computing Science and Statistics, 31:172--181, 1999. http://www.i-pensieri.com/gregr/papers/interface99.pdf.

[27]

D. E. Rose, D. M. Orr, and R. G. P. Kantamneni. Summary attributes and perceived search quality. In Proc. of Intl. Conference on the World Wide Web, 2007.

[28]

K. Ryan. Fathom. http://search.cpan.org/dist/Lingua-EN-Fathom.

[29]

L. Si and J. Callan. A statistical model for scientific readability. In Proc. of the 10th Intl. Conference on Information and Knowledge Management, 2001.

[30]

F. Song and W. B. Croft. A general language model for information retrieval. In Proc. 8th Intl. Conf. on Information and Knowledge Management, pages 316--321, 1999.

[31]

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, 2002.

[32]

C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proc. 24th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 334--342, 2001.

[33]

Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, and G. Sun. A general boosting method and its application to learning ranking functions for web search. In Proc. 21st Advances in Neural Information Processing Systems, 2007.


Index Terms

Predicting the readability of short web summaries
1. Information systems
  1. Information retrieval

Published In

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining

February 2009

314 pages

ISBN: 9781605583907

DOI: 10.1145/1498759

Editors:
Ricardo Baeza-Yates (Yahoo! Research, Spain)
Paolo Boldi (Universita degli Studi di Milano, Italy)
Berthier Ribeiro-Neto (Google Engineering, Brazil & CS Dept., Univ. Fed. de Minas Gerais, Brazil)
B. Barla Cambazoglu (Yahoo! Research)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Yahoo! Research
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
Nokia
Google Inc.
SIGIR: ACM Special Interest Group on Information Retrieval
Microsoft

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WSDM'09

WSDM'09: Second ACM International Conference on Web Search and Web Data Mining

February 9 - 12, 2009

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Cited By

Taieb-Maimon M (2025). Enhancing Snippet Visualizations to Improve Web Search. International Journal of Human–Computer Interaction (1-20). Online publication date: 15-Jan-2025
https://doi.org/10.1080/10447318.2024.2443267
Elahi E, Morato J, Iglesias A (2024). Improving Web Readability Using Video Content: A Relevance-Based Approach. Applied Sciences 14:23 (11055). Online publication date: 27-Nov-2024
https://doi.org/10.3390/app142311055
Jianping L, Yingfei W, Jian W, Meng W, Xintao C (2024). A topic relevance-aware click model for web search. Journal of Intelligent & Fuzzy Systems 46:4 (8961-8974). Online publication date: 18-Apr-2024
https://doi.org/10.3233/JIFS-236894
Pinney C, Bettencourt B, Fails J, Kennington C, Wright K, Pera M (2024). How Readability Cues Affect Children's Navigation of Search Engine Result Pages. Proceedings of the 23rd Annual ACM Interaction Design and Children Conference (62-69). Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1145/3628516.3655818
Quinlan M, Ceross A, Simpson A (2024). The Efficacy Potential of Cyber Security Advice as Presented in News Articles. Interacting with Computers. Online publication date: 10-Oct-2024
https://doi.org/10.1093/iwc/iwae048
Vergou E, Pagouni I, Nanos M, Kermanidis K (2023). Readability Classification with Wikipedia Data and All-MiniLM Embeddings. Artificial Intelligence Applications and Innovations. AIAI 2023 IFIP WG 12.5 International Workshops (369-380). Online publication date: 2-Jun-2023
https://doi.org/10.1007/978-3-031-34171-7_30
Bilal D, Zhang Y (2021). Teens’ Conceptual Understanding of Web Search Engines: The Case of Google Search Engine Result Pages (SERPs). Human-Computer Interaction. Design and User Experience Case Studies (253-270). Online publication date: 3-Jul-2021
https://doi.org/10.1007/978-3-030-78468-3_18
Alshomary M, Düsterhus N, Wachsmuth H, Huang J, Chang Y, Cheng X, Kamps J, Murdock V, Wen J, Liu Y (2020). Extractive Snippet Generation for Arguments. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (1969-1972). Online publication date: 25-Jul-2020
https://dl.acm.org/doi/10.1145/3397271.3401186
Choi S, Jung I, Park S, Park S (2019). Abstractive Sentence Compression with Event Attention. Applied Sciences 9:19 (3949). Online publication date: 20-Sep-2019
https://doi.org/10.3390/app9193949
Antunes H, Lopes C (2019). Readability of web content. 2019 14th Iberian Conference on Information Systems and Technologies (CISTI) (1-4). Online publication date: Jun-2019
https://doi.org/10.23919/CISTI.2019.8760889