Abstract
Many machine-generated emails carry important information that the recipient must act on at a scheduled time. It is therefore natural to extract such actionable information automatically and communicate it to users. These emails are generated across many domains, which provide different types of services. However, because such emails carry personal information, it is difficult to obtain a large corpus of labeled data for supervised information extraction methods.
In this paper, we propose a novel method to automatically identify the part of an email that contains actionable information, called the core region of the email, with the aid of a domain dictionary. The domain dictionary is generated from public information about the domain. The core regions are stored as template trees; a template tree is a subtree embedded in the email's HTML DOM tree.
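The idea of a core region can be illustrated with a minimal sketch. It is not the paper's algorithm, which the abstract does not spell out; it assumes that the core region is the deepest DOM subtree whose text still contains every dictionary term found in the email. The names `Node`, `TreeBuilder`, `core_region`, the sample email, and the two-term dictionary are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def full_text(self):
        # Text of this node followed by the text of all descendants.
        return " ".join([self.text] + [c.full_text() for c in self.children])

    def size(self):
        # Number of element nodes in the subtree rooted here.
        return 1 + sum(c.size() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (tags and text only, no attributes)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += " " + data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def matched_terms(node, dictionary):
    text = node.full_text().lower()
    return {term for term in dictionary if term in text}


def core_region(root, dictionary):
    """Assumed heuristic: descend while exactly one child covers all matched terms."""
    target = matched_terms(root, dictionary)
    if not target:
        return None
    node = root
    while True:
        covering = [c for c in node.children if matched_terms(c, dictionary) == target]
        if len(covering) != 1:
            return node
        node = covering[0]


# Hypothetical email and domain dictionary, for illustration only.
EMAIL = """
<html><body>
  <div><img src="logo.png"><p>Thanks for choosing Example Air!</p></div>
  <table>
    <tr><td>Flight</td><td>EX 123</td></tr>
    <tr><td>Departure</td><td>10 May 2018, 09:40</td></tr>
  </table>
  <div><p>Unsubscribe</p><p>Privacy policy</p></div>
</body></html>
"""
DICTIONARY = {"flight", "departure"}

root = parse(EMAIL)
core = core_region(root, DICTIONARY)
print(core.tag, core.size(), root.size())
```

In this toy email the selected subtree is the `table` element, which drops the logo, greeting, and footer. Plain substring matching is a deliberate simplification; a real dictionary lookup would likely need tokenization and handling of term variants.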
Our experiments on real data show that the structure of the core region, which contains all the information of interest, is very simple and that the core region is 85%–98% smaller than the original email. Our experiments also show that template trees are highly repetitive across a diverse set of emails from a given service provider.
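Both measurements can be made concrete with a short sketch. It assumes that size is counted in DOM elements and that two template trees are "the same" when their tag structure matches after text and attributes are ignored; the abstract defines neither, so both are assumptions. The helper names and sample snippets are hypothetical, and the snippets are well-formed XML so that `xml.etree.ElementTree` can parse them, which real email HTML often is not.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def signature(elem):
    """Tag-only serialization of a subtree, ignoring text and attributes."""
    return elem.tag + "(" + ",".join(signature(child) for child in elem) + ")"


def element_count(elem):
    return sum(1 for _ in elem.iter())


def size_reduction(core, email):
    """Fraction by which the core region is smaller than the whole email."""
    return 1 - element_count(core) / element_count(email)


# Hypothetical core regions taken from three emails of one provider.
CORES = [
    "<table><tr><td>Flight</td><td>EX 123</td></tr></table>",
    "<table><tr><td>Flight</td><td>EX 987</td></tr></table>",
    "<ul><li>Order 42</li></ul>",
]
counts = Counter(signature(ET.fromstring(core)) for core in CORES)
for sig, n in counts.most_common():
    print(n, sig)

# Hypothetical full email wrapping the first core region.
email = ET.fromstring(
    "<html><body><div><p>Hello</p></div>" + CORES[0] + "<div><p>Footer</p></div></body></html>"
)
core = email.find(".//table")
print(round(size_reduction(core, email), 2))
```

The first two snippets differ only in their text, so they share one signature; counting signatures in this way is one simple means of observing how often a template tree repeats.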
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Agarwal, M.K., Singh, J. (2018). Template Trees: Extracting Actionable Information from Machine Generated Emails. In: Hartmann, S., Ma, H., Hameurlain, A., Pernul, G., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2018. Lecture Notes in Computer Science, vol 11030. Springer, Cham. https://doi.org/10.1007/978-3-319-98812-2_1
DOI: https://doi.org/10.1007/978-3-319-98812-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98811-5
Online ISBN: 978-3-319-98812-2
eBook Packages: Computer Science (R0)