Pagination: it's what you say, not how long it takes to say it

Authors:
Joshua Hailpern, HP Labs, Palo Alto, CA, USA
Niranjan Damera Venkata, HP Labs, Chennai, Tamil Nadu, India
Marina Danilevsky, University of Illinois, Urbana, IL, USA

DocEng '14: Proceedings of the 2014 ACM symposium on Document engineering, September 2014, Pages 147–156
https://doi.org/10.1145/2644866.2644867

Published: 16 September 2014

ABSTRACT

Pagination, the process of determining where to break an article across pages in a multi-article layout, is a common layout challenge for most commercially printed newspapers and magazines. To date, no one has created an algorithm that determines a minimal pagination break point based on the content of the article. Existing approaches for automatic multi-article layout focus exclusively on maximizing content (number of articles) and optimizing aesthetic presentation (e.g., spacing between articles). However, disregarding the semantic information within the article can lead to overly aggressive cutting, which eliminates key content and potentially confuses the reader, or to an overly generous break point, which leaves in superfluous content and makes automatic layout more difficult. This is one of the remaining challenges on the path from manual layouts to fully automated processes that still ensure article content quality. In this work, we present a new approach to calculating a document's minimal break point for the task of pagination. Our approach uses a statistical language model to predict minimal break points based on the semantic content of an article. We then compare four novel candidate approaches and four baselines (currently in use by layout algorithms). Results from this experiment show that one of our approaches strongly outperforms the baselines and alternatives. Results from a second study suggest that humans are not able to agree on a single "best" break point. Therefore, this work shows that a semantic-based lower-bound break point prediction is necessary for ideal automated document synthesis within a real-world context.
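
The abstract does not spell out how the language model is turned into a break point, so the following is only an illustrative sketch and not the authors' algorithm. It assumes one plausible reading: each paragraph is scored by how surprising it is under a Dirichlet-smoothed unigram model built from the paragraphs before it, and the minimal break point falls after the last paragraph whose score reaches a threshold. The function names, the smoothing weight `mu`, and the `threshold` value are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def novelty_scores(paragraphs, mu=10.0):
    """Return one novelty score per paragraph (hypothetical scoring rule).

    The score is the average negative log-probability of the paragraph's
    tokens under a unigram model of the preceding paragraphs, smoothed
    with Dirichlet prior weight `mu` toward a whole-article background model.
    """
    docs = [tokenize(p) for p in paragraphs]
    background = Counter(tok for doc in docs for tok in doc)
    total = sum(background.values())
    prefix = Counter()   # token counts of the paragraphs seen so far
    prefix_len = 0
    scores = []
    for doc in docs:
        if not doc:
            scores.append(0.0)  # an empty paragraph adds nothing new
            continue
        log_prob = 0.0
        for tok in doc:
            p_background = background[tok] / total
            p = (prefix[tok] + mu * p_background) / (prefix_len + mu)
            log_prob += math.log(p)
        scores.append(-log_prob / len(doc))
        prefix.update(doc)
        prefix_len += len(doc)
    return scores


def minimal_break_point(paragraphs, threshold, mu=10.0):
    """Return how many leading paragraphs to keep before the break.

    Assumption for this sketch: everything after the last paragraph whose
    novelty reaches `threshold` can move to the continuation page. The
    first paragraph is always kept.
    """
    if not paragraphs:
        return 0
    keep = 1
    for i, score in enumerate(novelty_scores(paragraphs, mu)):
        if score >= threshold:
            keep = i + 1
    return keep


article = [
    "Fire destroyed the warehouse downtown.",
    "Police suspect arson in the warehouse fire.",
    "The fire destroyed the warehouse.",
]
print(novelty_scores(article))
print(minimal_break_point(article, threshold=2.0))
```

In this toy article the third paragraph only repeats earlier words, so it scores lower than the second, and the sketch places the break after the second paragraph. A real system would need a background model estimated from a large corpus and a threshold tuned on labeled data.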

Index Terms

Pagination: it's what you say, not how long it takes to say it
1. Applied computing
  1. Computers in other domains
    1. Personal computers and PC applications
    2. Publishing


Published in
DocEng '14: Proceedings of the 2014 ACM symposium on Document engineering
September 2014, 226 pages
ISBN: 9781450329491
DOI: 10.1145/2644866
General Chair: Steven Simske, Hewlett-Packard, Fort Collins, USA
Program Chair: Sebastian Rönnau, Zalando AG, Berlin, Germany
Copyright © 2014 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: cut, novelty, pagination, semantic, slm, truncation
Qualifiers: research-article
Acceptance Rates
DocEng '14 Paper Acceptance Rate: 15 of 41 submissions, 37%
Overall Acceptance Rate: 178 of 537 submissions, 33%