DOI: 10.1145/2433396.2433447
research-article

Threading machine generated email

Published: 04 February 2013

Abstract

Viewing email messages as parts of a sequence or a thread is a convenient way to quickly understand their context. Current threading techniques rely on purely syntactic methods, matching sender information, subject line, and reply/forward prefixes. As such, they are mostly limited to personal conversations. In contrast, machine-generated email, which according to our experiments accounts for more than 60% of overall email traffic, requires a different kind of threading, one that reflects how a sequence of emails is caused by a few related user actions. For example, purchasing goods from an online store results in a receipt or confirmation message, which may be followed, possibly a few days later, by a shipment notification from an express shipping service. In today's mail systems, these messages are not part of the same thread, although we believe they should be. In this paper, we focus on this type of threading, which we call "causal threading". We demonstrate that, by analyzing recurring patterns over hundreds of millions of mail users, we can infer a causal relation between two such individual messages. In addition, by observing multiple causal relations over common messages, we can generate "causal threads" over a sequence of messages. Our approach has four key stages: (1) identifying messages that are instances of the same email type or "template" (generated by the same machine process on the sender side); (2) building a causal graph, in which nodes correspond to email templates and edges indicate potential causal relations; (3) learning a causal relation prediction function; and (4) automatically "threading" the incoming email stream. We present detailed experimental results obtained by analyzing the inboxes of 12.5 million Yahoo! Mail users who voluntarily opted in to such research. Supervised editorial judgments show that we can identify more than 70% (recall rate) of all "causal threads" at a precision level of 90%. In addition, for a search scenario, we show that we achieve a precision close to 80% at 90% recall. We believe that supporting causal threads in email clients opens new ground for improving both email search and browsing experiences.
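
The abstract gives only an outline of the four stages, so the following is a minimal illustrative sketch in Python rather than the authors' method. It assumes that stage (1) has already assigned a template identifier to every message. It replaces the learned prediction function of stage (3) with a simple confidence threshold: the fraction of users who received template a and then received template b within a time window. The function names, the seven-day window, the thresholds, and the template identifiers are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_causal_graph(inboxes, max_gap=timedelta(days=7), min_users=2):
    """Return {(a, b): confidence}: the fraction of users who received
    template a and later received template b within max_gap.
    Assumption: a stand-in for the paper's causal graph, not its definition."""
    users_with = defaultdict(set)       # template -> users who received it
    users_with_pair = defaultdict(set)  # (a, b) -> users where b followed a
    for user, messages in inboxes.items():
        messages = sorted(messages)     # (timestamp, template_id) tuples
        for i, (t_a, a) in enumerate(messages):
            users_with[a].add(user)
            for t_b, b in messages[i + 1:]:
                if t_b - t_a > max_gap:
                    break               # later messages are even further away
                if a != b:
                    users_with_pair[(a, b)].add(user)
    return {
        pair: len(users) / len(users_with[pair[0]])
        for pair, users in users_with_pair.items()
        if len(users) >= min_users      # drop pairs seen for too few users
    }


def thread_inbox(messages, graph, threshold=0.5, max_gap=timedelta(days=7)):
    """Group one user's messages into threads by attaching each message to
    the most recent earlier message whose template is a likely cause."""
    messages = sorted(messages)
    thread_of = []  # thread index of each message, by position
    threads = []    # each thread is a list of (timestamp, template_id)
    for i, (t_b, b) in enumerate(messages):
        parent = None
        for j in range(i - 1, -1, -1):  # scan backwards in time
            t_a, a = messages[j]
            if t_b - t_a > max_gap:
                break
            if graph.get((a, b), 0.0) >= threshold:
                parent = j
                break
        if parent is None:
            thread_of.append(len(threads))
            threads.append([messages[i]])
        else:
            thread_of.append(thread_of[parent])
            threads[thread_of[parent]].append(messages[i])
    return threads


# Hypothetical toy data: three users, three templates.
day = lambda n: datetime(2013, 1, n)
inboxes = {
    "u1": [(day(1), "store:receipt"), (day(3), "courier:shipped"),
           (day(2), "news:digest")],
    "u2": [(day(5), "store:receipt"), (day(6), "courier:shipped")],
    "u3": [(day(10), "store:receipt"), (day(20), "news:digest")],
}
graph = build_causal_graph(inboxes)
threads = thread_inbox(inboxes["u1"], graph)
```

In this toy data, the only edge that survives the support filter is the one from the receipt template to the shipment template, so the first user's receipt and shipment notification land in one thread while the unrelated digest stays on its own. Counting users rather than raw occurrences is a design choice made for this sketch, so that a single heavy recipient cannot create an edge alone.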



Published In

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining
February 2013
816 pages
ISBN: 9781450318693
DOI: 10.1145/2433396
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. algorithms
  2. email threading
  3. emamodels
  4. frequent sets and patterns
  5. user experience

Qualifiers

  • Research-article

Conference

WSDM 2013

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 10
  • Downloads (last 6 weeks): 0
Reflects downloads up to 20 Jan 2025

Citations

Cited By

  • (2023) Knowledge Engineering from Email Archives. Granular, Fuzzy, and Soft Computing, pp. 469-485. DOI: 10.1007/978-1-0716-2628-3_715. Online publication date: 30-Mar-2023
  • (2022) Search and Discovery in Personal Email Collections. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 1617-1619. DOI: 10.1145/3488560.3501393. Online publication date: 11-Feb-2022
  • (2021) Large-Scale Information Extraction under Privacy-Aware Constraints. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4845-4848. DOI: 10.1145/3459637.3482027. Online publication date: 26-Oct-2021
  • (2021) Knowledge Engineering from Email Archives. Encyclopedia of Complexity and Systems Science, pp. 1-17. DOI: 10.1007/978-3-642-27737-5_715-1. Online publication date: 22-Jul-2021
  • (2020) Email Classification Techniques—A Review. Data Science and Intelligent Applications, pp. 181-189. DOI: 10.1007/978-981-15-4474-3_21. Online publication date: 18-Jun-2020
  • (2020) Generic Key Value Extractions from Emails. Big Data Analytics, pp. 193-208. DOI: 10.1007/978-3-030-66665-1_13. Online publication date: 15-Dec-2020
  • (2019) Online template induction for machine-generated emails. Proceedings of the VLDB Endowment 12:11, pp. 1235-1248. DOI: 10.14778/3342263.3342264. Online publication date: 1-Jul-2019
  • (2019) RiSER: Learning Better Representations for Richly Structured Emails. The World Wide Web Conference, pp. 886-895. DOI: 10.1145/3308558.3313720. Online publication date: 13-May-2019
  • (2019) Large-Scale Information Extraction from Emails with Data Constraints. Big Data Analytics, pp. 124-139. DOI: 10.1007/978-3-030-37188-3_8. Online publication date: 12-Dec-2019
  • (2018) Learning with sparse and biased feedback for personal search. Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 5219-5223. DOI: 10.5555/3304652.3304738. Online publication date: 13-Jul-2018
