ABSTRACT
Mathematical expression recognition (MER) aims to convert an image of a mathematical expression into a LaTeX sequence. In practice, MER is challenging because 1) images of mathematical expressions often contain complex structural relationships, e.g., fractions, matrices, and subscripts; and 2) the generated LaTeX sequences can be very complex and must satisfy strict syntax rules. Existing methods, however, often ignore the complex dependencies among image regions, resulting in poor feature representations. In addition, they may fail to capture the rigorous relations among formula symbols because they treat MER as a generic language generation task. To address these issues, we propose a Structure-Aware Sequence-Level (SASL) model for MER. First, to better represent and recognize the visual content of formula images, we propose a structure-aware module that captures the relationships among different symbols. Second, sequence-level modeling helps the model concentrate on generating entire sequences. To make the problem feasible, we cast generation as a Markov decision process (MDP) and seek to learn a LaTeX sequence generation policy. Based on this MDP, we train SASL by maximizing the matching score of each image-sequence pair to obtain the generation policy. Extensive experiments on the IM2LATEX-100K dataset verify the effectiveness and superiority of the proposed method.
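The abstract does not spell out the MDP or the training rule, so the following is only a minimal sketch of the general idea of sequence-level training, under loudly stated assumptions: the policy is a toy table of per-step logits (the real model would condition on the image and the generated prefix), the matching score is a simple position-wise token agreement, and the update is plain REINFORCE with a moving-average baseline. The names `VOCAB`, `TARGET`, `matching_score`, `rollout`, `reinforce_step`, `train`, and `greedy_decode` are all hypothetical and are not taken from the paper.

```python
import math
import random

VOCAB = ["x", "^", "2", "+", "1", "<eos>"]   # toy token set (assumption)
TARGET = ["x", "^", "2", "+", "1", "<eos>"]  # hypothetical ground-truth sequence


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matching_score(pred, target):
    """Sequence-level reward: fraction of positions where tokens agree."""
    hits = sum(p == t for p, t in zip(pred, target))
    return hits / max(len(pred), len(target))


def rollout(logits, rng):
    """Sample one episode; state = step index, action = token index."""
    actions = []
    for step_logits in logits:
        probs = softmax(step_logits)
        actions.append(rng.choices(range(len(VOCAB)), weights=probs)[0])
    return actions


def reinforce_step(logits, rng, lr=0.5, baseline=0.0):
    """One REINFORCE update driven by the terminal, whole-sequence reward."""
    actions = rollout(logits, rng)
    reward = matching_score([VOCAB[a] for a in actions], TARGET)
    advantage = reward - baseline
    for t, a in enumerate(actions):
        probs = softmax(logits[t])
        for k in range(len(VOCAB)):
            # d log pi(a) / d logit_k = 1[k == a] - pi(k)
            grad = (1.0 if k == a else 0.0) - probs[k]
            logits[t][k] += lr * advantage * grad
    return reward


def train(steps=2000, seed=0):
    """Train the toy policy in place and return its logits."""
    rng = random.Random(seed)
    logits = [[0.0] * len(VOCAB) for _ in TARGET]
    baseline = 0.0
    for _ in range(steps):
        reward = reinforce_step(logits, rng, baseline=baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    return logits


def greedy_decode(logits):
    """Pick the highest-scoring token at every step."""
    return [VOCAB[max(range(len(VOCAB)), key=row.__getitem__)] for row in logits]
```

Calling `greedy_decode(train())` decodes the toy policy after training. The point of the sketch is that the reward is computed once per complete sequence rather than per token, which is what distinguishes sequence-level training from token-level likelihood training.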
Index Terms
- Structure-aware Mathematical Expression Recognition with Sequence-Level Modeling