DOI: 10.1145/3397271.3401284
short-paper

A Test Collection for Relevance and Sensitivity

Published: 25 July 2020

Abstract

Recent interest in the design of information retrieval systems that can balance an ability to find relevant content with an ability to protect sensitive content creates a need for test collections that are annotated for both relevance and sensitivity. This paper describes the development of such a test collection that is based on the Avocado Research Email Collection. Four people created search topics as a basis for assessing relevance, and two personas describing the sensitivities of representative (but fictional) content creators were created as a basis for assessing sensitivity. These personas were based on interviews with potential donors of historically significant email collections and with archivists who currently manage access to such collections. Two annotators then created relevance and sensitivity judgments for 65 topics, divided approximately equally between the two personas. Annotator agreement statistics indicate fairly good external reliability for both relevance and sensitivity annotations, and a baseline sensitivity classifier trained and evaluated using cross-validation achieved better than 80% $F_1$, suggesting that the resulting collection will likely be useful as a basis for comparing alternative retrieval systems that seek to balance relevance and sensitivity.
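The two evaluation quantities mentioned above, annotator agreement and classifier $F_1$, can be illustrated with a minimal sketch. The abstract does not say which agreement statistic was used, so Cohen's kappa is assumed here purely for illustration. The label lists are invented toy data, not drawn from the collection, and the function names `cohen_kappa` and `f1_score` are hypothetical helpers rather than anything from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: the paper may report a different agreement statistic.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_score(gold, predicted, positive=1):
    """F1 (harmonic mean of precision and recall) for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy sensitivity labels (1 = sensitive, 0 = not sensitive); invented data.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa(annotator_1, annotator_2)
# Treat annotator_1 as gold and annotator_2 as a stand-in "classifier".
f1 = f1_score(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}, F1 = {f1:.2f}")
```

On this toy data the two lists agree on 8 of 10 items, which gives a kappa close to 0.6 and an $F_1$ close to 0.8. In a cross-validation setting such as the one the abstract describes, a score like this would be computed on each held-out fold and then averaged; that loop is omitted here.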

Supplementary Material

MP4 File (3397271.3401284.mp4)




    Published In

    SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2020
    2548 pages
    ISBN:9781450380164
    DOI:10.1145/3397271
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. relevance
    2. sensitivity
    3. test collection

    Qualifiers

    • Short-paper

    Conference

    SIGIR '20
    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 146
    • Downloads (last 6 weeks): 25

    Reflects downloads up to 05 Mar 2025.

    Cited By

    • (2024) Cascading Ranking Pipelines for Sensitivity-Aware Search. Advances in Information Retrieval, 331-333. DOI: 10.1007/978-3-031-56069-9_41. Online publication date: 23-Mar-2024
    • (2023) Building a Multimodal Classifier of Email Behavior: Towards a Social Network Understanding of Organizational Communication. Information 14:12 (661). DOI: 10.3390/info14120661. Online publication date: 14-Dec-2023
    • (2022) Providing More Efficient Access to Government Records: A Use Case Involving Application of Machine Learning to Improve FOIA Review for the Deliberative Process Privilege. Journal on Computing and Cultural Heritage 15:1 (1-19). DOI: 10.1145/3481045. Online publication date: 22-Jan-2022
    • (2022) Comparing Intrinsic and Extrinsic Evaluation of Sensitivity Classification. Advances in Information Retrieval, 215-222. DOI: 10.1007/978-3-030-99739-7_25. Online publication date: 5-Apr-2022
    • (2021) Search with Discretion. Proceedings of the ACM on Human-Computer Interaction 5:CSCW1 (1-20). DOI: 10.1145/3449207. Online publication date: 22-Apr-2021
