DOI: 10.1145/3613905.3650738
Work in Progress

SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings

Published: 11 May 2024

Abstract

Crafting effective captions for figures is important, as readers depend heavily on captions to grasp a figure’s message. However, despite a well-developed set of AI technologies for figures and captions, these technologies have rarely been tested for their usefulness in aiding caption writing. This paper introduces SciCapenter, an interactive system that brings together cutting-edge AI technologies for scientific figure captions to support caption composition. SciCapenter generates a variety of captions for each figure in a scholarly article and provides scores and a comprehensive checklist that assess caption quality across several critical aspects, such as helpfulness, OCR mention, key takeaways, and reference to visual properties. Users can edit captions directly in SciCapenter, resubmit them for revised evaluations, and iteratively refine them. A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing. Participants’ feedback further offers valuable design insights for future systems that aim to enhance caption writing.
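To make the workflow concrete, here is a minimal sketch, in Python, of the kind of rate-edit-resubmit loop the abstract describes. It is an illustration only, not the authors' implementation: the checklist aspect names are taken from the abstract, while `rate_with_llm`, `CaptionReport`, `assess_caption`, and `refine` are hypothetical stand-ins for whatever models and interfaces SciCapenter actually uses.

```python
# Illustrative sketch of a rate-edit-resubmit loop like the one the
# abstract describes. Aspect names come from the abstract; everything
# else (function names, scoring scale) is a hypothetical stand-in,
# not SciCapenter's actual implementation.
from dataclasses import dataclass

ASPECTS = [
    "helpfulness",
    "OCR mention",
    "key takeaways",
    "visual properties reference",
]

@dataclass
class CaptionReport:
    caption: str
    scores: dict  # aspect name -> score in [0, 1]

def rate_with_llm(caption: str, figure_context: str, aspect: str) -> float:
    """Hypothetical stand-in: a real system would prompt a language model
    to judge `caption` against `figure_context` on one checklist aspect.
    This dummy heuristic only exists so the sketch runs end to end."""
    return min(len(caption) / 200.0, 1.0)

def assess_caption(caption: str, figure_context: str) -> CaptionReport:
    """Score a caption on every checklist aspect, mirroring the 'scores
    and a comprehensive checklist' the abstract mentions."""
    scores = {a: rate_with_llm(caption, figure_context, a) for a in ASPECTS}
    return CaptionReport(caption=caption, scores=scores)

def refine(caption: str, figure_context: str, edit_fn, rounds: int = 3) -> CaptionReport:
    """Edit, resubmit, re-rate: the iterative loop the abstract describes.
    `edit_fn` models the user revising the caption given the report."""
    report = assess_caption(caption, figure_context)
    for _ in range(rounds):
        caption = edit_fn(report)
        report = assess_caption(caption, figure_context)
    return report
```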

Supplemental Material

  • MP4 File: Talk Video
  • Transcript for: Talk Video


Cited By

  • (2024) Grouping Effect for Bar Graph Summarization for People with Visual Impairments. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1–6. https://doi.org/10.1145/3663548.3688534. Online publication date: 27-Oct-2024.
  • (2024) Dash: A Bimodal Data Exploration Tool for Interactive Text and Visualizations. 2024 IEEE Visualization and Visual Analytics (VIS), 256–260. https://doi.org/10.1109/VIS55277.2024.00059. Online publication date: 13-Oct-2024.


Published In

CHI EA '24: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems
May 2024
4761 pages
ISBN:9798400703317
DOI:10.1145/3613905
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 11 May 2024


Qualifiers

  • Work in progress
  • Research
  • Refereed limited

Conference

CHI '24

Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%


