Abstract
In the current era, communication through mobile devices is becoming more personalized with the evolution of touch-based input methods. While writing on touch-responsive devices, searching for emojis that capture the true intent is cumbersome. Existing solutions consider either only text or only stroke-based drawings to predict appropriate emojis; by relying on a single input, they fail to leverage the full context. While the user is writing digitally, it is challenging for a model to identify whether the intention is to write text or to draw an emoji. Moreover, the model's memory footprint and latency play an essential role in providing a seamless writing experience. In this paper, we investigate the effectiveness of combining text and drawing as input to the model. We present SAMNet, a multimodal deep neural network that jointly learns text and image features, where image features are extracted from the stroke-based drawing and text features from the previously written context. We also demonstrate the optimal way to fuse features from both modalities. The paper focuses on improving the user experience and providing low latency on edge devices. We trained our model on a carefully crafted dataset of 63 emoji classes and evaluated its performance, achieving a worst-case on-device inference time of 60 ms and 76.74% top-3 prediction accuracy with a model size of 3.5 MB. Compared with the closest matching application, DigitalInk, SAMNet improves top-3 prediction accuracy by 13.95%.
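The abstract describes fusing text features with image features extracted from a stroke-based drawing, then classifying over 63 emoji classes. The paper's exact architecture is not reproduced here; the sketch below illustrates only the general idea of concatenation-based late fusion followed by a softmax classifier. All dimensions, the `fuse_and_classify` helper, and the random stand-in embeddings are hypothetical, not taken from SAMNet.

```python
import numpy as np

NUM_CLASSES = 63               # emoji classes, as stated in the abstract
TEXT_DIM, IMG_DIM = 128, 256   # hypothetical embedding sizes

def fuse_and_classify(text_emb, img_emb, w, b):
    """Late fusion: concatenate per-modality embeddings, then apply a
    linear classifier with a numerically stable softmax."""
    fused = np.concatenate([text_emb, img_emb])   # shape (TEXT_DIM + IMG_DIM,)
    logits = w @ fused + b                        # shape (NUM_CLASSES,)
    exp = np.exp(logits - logits.max())           # subtract max for stability
    return exp / exp.sum()                        # probabilities summing to 1

rng = np.random.default_rng(0)
# Stand-in embeddings; in a SAMNet-style model these would come from the
# text encoder and the stroke-drawing (image) encoder respectively.
text_emb = rng.standard_normal(TEXT_DIM)
img_emb = rng.standard_normal(IMG_DIM)
w = rng.standard_normal((NUM_CLASSES, TEXT_DIM + IMG_DIM)) * 0.01
b = np.zeros(NUM_CLASSES)

probs = fuse_and_classify(text_emb, img_emb, w, b)
top3 = np.argsort(probs)[-3:][::-1]   # indices of the top-3 predicted emojis
```

In a trained system the top-3 indices would be mapped to emoji candidates shown to the user, which is why the abstract reports top-3 accuracy.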
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Gothe, S.V., Khurana, R., Vachhani, J.R., Rakshith, S., Kashyap, P. (2023). SAMNet: Semantic Aware Multimodal Network for Emoji Drawing Classification. In: Gupta, D., Bhurchandi, K., Murala, S., Raman, B., Kumar, S. (eds) Computer Vision and Image Processing. CVIP 2022. Communications in Computer and Information Science, vol 1777. Springer, Cham. https://doi.org/10.1007/978-3-031-31417-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31416-2
Online ISBN: 978-3-031-31417-9
eBook Packages: Computer Science (R0)