DOI: 10.1145/3613905.3650738
Work in Progress

SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings

Published: 11 May 2024

Abstract

Crafting effective captions for figures is important, as readers depend heavily on captions to grasp a figure’s message. However, despite a well-developed set of AI technologies for figures and captions, these technologies have rarely been tested for their usefulness in aiding caption writing. This paper introduces SciCapenter, an interactive system that brings together cutting-edge AI technologies for scientific figure captions to support caption composition. SciCapenter generates a variety of captions for each figure in a scholarly article and provides scores and a comprehensive checklist that assess caption quality across several critical aspects, such as helpfulness, OCR mention, key takeaways, and reference to visual properties. Users can edit captions directly in SciCapenter, resubmit them for revised evaluations, and iteratively refine them. A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing. Participants’ feedback further offers valuable design insights for future systems that aim to enhance caption writing.
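To make the workflow concrete, here is a minimal sketch, in Python, of the kind of rate-edit-resubmit loop the abstract describes. It is an illustration only, not the authors' implementation: the checklist aspect names are taken from the abstract, while `rate_with_llm`, `CaptionReport`, `assess_caption`, and `refine` are hypothetical stand-ins for whatever models and interfaces SciCapenter actually uses.

```python
# Illustrative sketch of a rate-edit-resubmit loop like the one the
# abstract describes. Aspect names come from the abstract; everything
# else (function names, scoring scale) is a hypothetical stand-in,
# not SciCapenter's actual implementation.
from dataclasses import dataclass

ASPECTS = [
    "helpfulness",
    "OCR mention",
    "key takeaways",
    "visual properties reference",
]

@dataclass
class CaptionReport:
    caption: str
    scores: dict  # aspect name -> score in [0, 1]

def rate_with_llm(caption: str, figure_context: str, aspect: str) -> float:
    """Hypothetical stand-in: a real system would prompt a language model
    to judge `caption` against `figure_context` on one checklist aspect.
    This dummy heuristic only exists so the sketch runs end to end."""
    return min(len(caption) / 200.0, 1.0)

def assess_caption(caption: str, figure_context: str) -> CaptionReport:
    """Score a caption on every checklist aspect, mirroring the 'scores
    and a comprehensive checklist' the abstract mentions."""
    scores = {a: rate_with_llm(caption, figure_context, a) for a in ASPECTS}
    return CaptionReport(caption=caption, scores=scores)

def refine(caption: str, figure_context: str, edit_fn, rounds: int = 3) -> CaptionReport:
    """Edit, resubmit, re-rate: the iterative loop the abstract describes.
    `edit_fn` models the user revising the caption given the report."""
    report = assess_caption(caption, figure_context)
    for _ in range(rounds):
        caption = edit_fn(report)
        report = assess_caption(caption, figure_context)
    return report
```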

Supplemental Material

  • MP4 File: Talk Video
  • Transcript for: Talk Video


Cited By

  • (2024) Grouping Effect for Bar Graph Summarization for People with Visual Impairments. Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1–6. https://doi.org/10.1145/3663548.3688534. Online publication date: 27-Oct-2024.
  • (2024) Dash: A Bimodal Data Exploration Tool for Interactive Text and Visualizations. 2024 IEEE Visualization and Visual Analytics (VIS), 256–260. https://doi.org/10.1109/VIS55277.2024.00059. Online publication date: 13-Oct-2024.


Published In

CHI EA '24: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems
May 2024
4761 pages
ISBN:9798400703317
DOI:10.1145/3613905
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 11 May 2024


Qualifiers

  • Work in progress
  • Research
  • Refereed limited

Conference

CHI '24

Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%


