research-article

Structure-aware Mathematical Expression Recognition with Sequence-Level Modeling

Authors:
Minli Li, South China University of Technology, Guangzhou, China
Peilin Zhao, Tencent AI Lab, Shenzhen, China
Yifan Zhang, National University of Singapore, Singapore, Singapore
Shuaicheng Niu, South China University of Technology, Guangzhou, China
Qingyao Wu, South China University of Technology, Guangzhou, China
Mingkui Tan, South China University of Technology, Guangzhou, China

MM '21: Proceedings of the 29th ACM International Conference on Multimedia, October 2021, Pages 5038–5046
https://doi.org/10.1145/3474085.3475578

Published: 17 October 2021

ABSTRACT

Mathematical expression recognition (MER) aims to convert an image of a mathematical expression into a LaTeX sequence. In practice, MER is challenging because 1) images of mathematical expressions often contain complex structural relationships, e.g., fractions, matrices, and subscripts; and 2) the generated LaTeX sequences can be very complex and must satisfy strict syntax rules. Existing methods, however, often ignore the complex dependencies among image regions, resulting in poor feature representations. In addition, they may fail to capture the rigorous relations among formula symbols because they treat MER as a generic language generation task. To address these issues, we propose a Structure-Aware Sequence-Level (SASL) model for MER. First, to better represent and recognize the visual content of formula images, we propose a structure-aware module that captures the relationships among different symbols. Second, sequence-level modeling helps the model concentrate on generating entire sequences. To make the problem tractable, we cast generation as a Markov decision process (MDP) and seek to learn a LaTeX sequence generation policy. Based on this MDP, we train SASL by maximizing the matching score of each image-sequence pair to obtain the generation policy. Extensive experiments on the IM2LATEX-100K dataset verify the effectiveness and superiority of the proposed method.
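
As a rough illustration of the sequence-level idea in the abstract, the following toy sketch treats token generation as an MDP and applies a REINFORCE-style update driven by a single reward for the whole sequence. It is not the SASL model: the tabular per-position policy, the edit-distance-based `matching_score`, the moving-average baseline, the fixed output length, and all function names are assumptions made for illustration, since the abstract does not specify how the matching score or the policy are defined.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def matching_score(pred, target):
    """Hypothetical sequence-level reward in [0, 1]; 1.0 means an exact match."""
    return 1.0 - edit_distance(pred, target) / max(len(pred), len(target), 1)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, actions, advantage, lr):
    """In-place REINFORCE step: logits += lr * advantage * grad log pi(actions)."""
    for t, action in enumerate(actions):
        probs = softmax(logits[t])
        for v in range(len(probs)):
            # Gradient of log softmax w.r.t. the logits: one-hot(action) - probs.
            grad = (1.0 if v == action else 0.0) - probs[v]
            logits[t][v] += lr * advantage * grad


def train(target, vocab, steps=500, lr=0.5, seed=0):
    """Toy policy: one independent softmax per output position (an assumption)."""
    rng = random.Random(seed)
    logits = [[0.0] * len(vocab) for _ in target]
    baseline = 0.0
    for _ in range(steps):
        # Sample a full sequence first; the reward is only defined for the whole sequence.
        actions = [rng.choices(range(len(vocab)), weights=softmax(row))[0] for row in logits]
        pred = [vocab[a] for a in actions]
        reward = matching_score(pred, target)
        reinforce_update(logits, actions, reward - baseline, lr)
        # Moving-average baseline to reduce the variance of the update.
        baseline = 0.9 * baseline + 0.1 * reward
    return logits


def greedy_decode(logits, vocab):
    """Pick the highest-scoring token at each position."""
    return [vocab[max(range(len(row)), key=row.__getitem__)] for row in logits]
```

Calling `train(["x", "^", "2"], ["x", "^", "2", "+"])` and then `greedy_decode` on the result shows which sequence the toy policy prefers after training. Because sampling is stochastic, the outcome depends on the seed and the number of steps.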

Supplemental Material

MM21-fp2194.mp4 (mp4, 644.2 MB)


Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
    2. Natural language processing
      1. Natural language generation
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Reinforcement learning
        1. Sequential decision making


Published in
MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA
Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, FACEBOOK, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
mathematical expression recognition
sequence-level modeling
structure-aware module
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%
