DOI: 10.1145/3474085.3475578
research-article

Structure-aware Mathematical Expression Recognition with Sequence-Level Modeling

Published: 17 October 2021

ABSTRACT

Mathematical expression recognition (MER) aims to convert an image of a mathematical expression into a LaTeX sequence. In practice, MER is challenging because 1) images of mathematical expressions often contain complex structural relationships, e.g., fractions, matrices, and subscripts; and 2) the generated LaTeX sequences can be very complex and must satisfy strict syntax rules. Existing methods, however, often ignore the complex dependencies among image regions, resulting in poor feature representations. In addition, they may fail to capture the rigorous relations among formula symbols because they treat MER as a generic language generation task. To address these issues, we propose a Structure-Aware Sequence-Level (SASL) model for MER. First, to better represent and recognize the visual content of formula images, we propose a structure-aware module that captures the relationships among different symbols. Meanwhile, sequence-level modeling helps the model concentrate on generating entire sequences. To make the problem feasible, we cast generation as a Markov decision process (MDP) and seek to learn a LaTeX sequence generation policy. Based on the MDP, we train SASL by maximizing the matching score of each image-sequence pair to obtain the generation policy. Extensive experiments on the IM2LATEX-100K dataset verify the effectiveness and superiority of the proposed method.
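
The idea of treating sequence generation as an MDP and optimizing a sequence-level score can be illustrated with a toy sketch. This is not the SASL model: it assumes a hypothetical six-token vocabulary, a tabular softmax policy conditioned only on the output position (no image encoder), a stand-in reward of 1 minus the normalized edit distance to a single reference sequence, and plain REINFORCE with a moving-average baseline. All names and hyperparameters below are illustrative assumptions.

```python
import math
import random

# Hypothetical toy vocabulary and reference sequence (assumptions, not from the paper).
VOCAB = ["\\frac", "{", "}", "a", "b", "<eos>"]
TARGET = ["\\frac", "{", "a", "}", "{", "b", "}", "<eos>"]
MAX_LEN = len(TARGET)


def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def sequence_reward(seq, target=TARGET):
    """Stand-in sequence-level score in [0, 1]: 1 minus normalized edit distance."""
    return 1.0 - edit_distance(seq, target) / max(len(seq), len(target))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_episode(logits, rng):
    """One MDP rollout: the state is the position, the action is the next token index."""
    actions = []
    for t in range(MAX_LEN):
        a = rng.choices(range(len(VOCAB)), weights=softmax(logits[t]))[0]
        actions.append(a)
        if VOCAB[a] == "<eos>":
            break
    return actions


def greedy_decode(logits):
    """Pick the highest-scoring token at each position until <eos> or MAX_LEN."""
    seq = []
    for t in range(MAX_LEN):
        a = max(range(len(VOCAB)), key=lambda k: logits[t][k])
        seq.append(VOCAB[a])
        if VOCAB[a] == "<eos>":
            break
    return seq


def train(steps=3000, lr=0.5, seed=0):
    """REINFORCE: the reward arrives only once the whole sequence is generated."""
    rng = random.Random(seed)
    logits = [[0.0] * len(VOCAB) for _ in range(MAX_LEN)]
    baseline = 0.0
    for _ in range(steps):
        actions = sample_episode(logits, rng)
        reward = sequence_reward([VOCAB[a] for a in actions])
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)  # moving-average baseline
        for t, a in enumerate(actions):
            probs = softmax(logits[t])
            for k in range(len(VOCAB)):
                # Gradient of log pi(a | t) w.r.t. logit k is one_hot(a)[k] - probs[k].
                logits[t][k] += lr * advantage * ((1.0 if k == a else 0.0) - probs[k])
    return logits


if __name__ == "__main__":
    learned = train()
    decoded = greedy_decode(learned)
    print(" ".join(decoded), sequence_reward(decoded))
```

Because the reward is computed on the finished sequence rather than per token, every sampled action in an episode is reinforced or discouraged by the same advantage, which is the sense in which the objective is sequence-level.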

        • Published in

          MM '21: Proceedings of the 29th ACM International Conference on Multimedia
          October 2021
          5796 pages
          ISBN: 9781450386517
          DOI: 10.1145/3474085

          Copyright © 2021 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 995 of 4,171 submissions, 24%
