Meet Your Email Sender - Hybrid Approach to Email Signature Extraction

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2022)

Abstract

An email signature is considered essential for effective business email communication. Despite the growth of social media, it remains a powerful tool that serves as a business card in the online world, presenting business information such as name, contact number and address to recipients. Signatures vary widely in structure and content, so extracting them automatically is a considerable challenge. In this paper we present a hybrid approach to automatic signature extraction. The first step is to obtain the original, most recently sent message from the entire email thread, cleaned of all disclaimers and superfluous lines, so that the signature lies at the bottom of the email. We then apply a Support Vector Machine (SVM), a Machine Learning (ML) technique, to classify emails according to whether they contain a signature. To improve the results obtained, we apply a set of sophisticated Information Extraction (IE) rules, and finally we extract the signatures. We trained and tested our technique on a wide range of data: the Forge dataset, Enron together with our own collection of emails, and a large set of emails provided by our native English-speaking friends. We extracted signatures with a precision of 99.62% and a recall of 93.20%.
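
The two-stage idea in the abstract (first decide whether a cleaned message contains a signature, then apply rules to locate it) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the regular expressions, the sign-off list, the `tail` window of 8 lines, and the function names are hypothetical and are not the authors' IE rules. The `heuristic_has_signature` function is only a stand-in for the SVM classifier, which the abstract does not describe in detail.

```python
import re
from typing import Callable, List

# Hypothetical patterns for illustration only; the paper's actual IE rules
# are not given in the abstract.
PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
SIGN_OFF = re.compile(
    r"^(best|kind regards|regards|thanks|thank you|cheers|sincerely)\b",
    re.IGNORECASE,
)


def looks_like_contact_line(line: str) -> bool:
    """True if the line contains a phone number, email address, or URL."""
    return bool(PHONE.search(line) or EMAIL.search(line) or URL.search(line))


def heuristic_has_signature(lines: List[str]) -> bool:
    """Stand-in for the SVM step (assumption): a trivial yes/no decision."""
    return any(
        looks_like_contact_line(line) or SIGN_OFF.match(line.strip())
        for line in lines
    )


def extract_signature(
    message: str,
    has_signature: Callable[[List[str]], bool] = heuristic_has_signature,
    tail: int = 8,
) -> List[str]:
    """Return candidate signature lines from the bottom of a cleaned message."""
    lines = [line.rstrip() for line in message.strip().splitlines()]
    tail_lines = lines[-tail:]

    # Stage 1: classify the message (here a heuristic instead of an SVM).
    if not has_signature(tail_lines):
        return []

    # Stage 2: rules. Start at the last sign-off line if there is one,
    # otherwise at the first line that carries contact details.
    start = None
    for i, line in enumerate(tail_lines):
        if SIGN_OFF.match(line.strip()):
            start = i
    if start is None:
        for i, line in enumerate(tail_lines):
            if looks_like_contact_line(line):
                start = i
                break
    if start is None:
        return []
    return [line for line in tail_lines[start:] if line.strip()]


if __name__ == "__main__":
    sample = (
        "Hi Ana,\n\nThe report is attached.\n\n"
        "Best regards,\nJohn Smith\nAcme Ltd.\n"
        "+1 555 010 2030\njohn.smith@example.com\n"
    )
    for line in extract_signature(sample):
        print(line)
```

Passing the classifier in as a callable keeps the two stages separate, so a trained model could replace the heuristic without touching the rule stage. Rules this simple will misfire on body lines that merely open with a word such as "Thanks", which is one reason a learned classifier in front of the rules is attractive.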

Acknowledgements

The work presented has been supported by the Ministry of Science and Technological Development, Republic of Serbia, through Projects No. 174021 and No. III47003.

Author information

Corresponding author

Correspondence to Jelena Graovac.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Graovac, J., Tomašević, I., Pavlović-Lažetić, G. (2022). Meet Your Email Sender - Hybrid Approach to Email Signature Extraction. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_44

  • DOI: https://doi.org/10.1007/978-3-031-21967-2_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21966-5

  • Online ISBN: 978-3-031-21967-2

  • eBook Packages: Computer Science, Computer Science (R0)
