Abstract
User-generated text produced through online activities, such as tweets and reviews, is widely used to train machine learning models. However, such text can leak individuals' private attributes. In this paper, we study privacy issues in user-generated text and propose a privacy-preserving text representation learning framework, \({DP}_{BERT}\). The framework uses BERT to extract sentence embeddings and learns a textual representation that (1) is differentially private, protecting against identity leakage (e.g., whether a target instance is in the data), (2) protects against leakage of private-attribute information (e.g., age, gender, location), and (3) maintains high utility of the given text.
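To make the idea of a differentially private representation concrete, the sketch below applies the standard Laplace mechanism to a single embedding vector. It is an illustration under stated assumptions, not the \({DP}_{BERT}\) method itself: the abstract does not specify the noise mechanism, so the clipping step, the function names (`clip_l1`, `laplace_sample`, `privatize_embedding`), the 768-dimensional size, and the random stand-in for a BERT sentence embedding are all hypothetical.

```python
import random


def clip_l1(vector, clip_norm):
    """Rescale the vector so its L1 norm is at most clip_norm."""
    norm = sum(abs(x) for x in vector)
    if norm <= clip_norm:
        return list(vector)
    factor = clip_norm / norm
    return [x * factor for x in vector]


def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatize_embedding(embedding, epsilon, clip_norm=1.0, rng=None):
    """Clip an embedding and add Laplace noise (hypothetical helper).

    After clipping, any two embeddings differ by at most 2 * clip_norm
    in L1 distance, so per-coordinate noise with scale
    2 * clip_norm / epsilon satisfies epsilon-differential privacy
    for the release of this one vector.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    clipped = clip_l1(embedding, clip_norm)
    noise_scale = 2.0 * clip_norm / epsilon
    return [x + laplace_sample(noise_scale, rng) for x in clipped]


# Assumption: a random vector stands in for a BERT sentence embedding.
rng = random.Random(0)
fake_embedding = [rng.gauss(0.0, 1.0) for _ in range(768)]
private_embedding = privatize_embedding(fake_embedding, epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives stronger privacy but larger noise, which is the privacy and utility trade-off that point (3) of the abstract refers to.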
Acknowledgement
This work, in part, is supported by the Saudi Arabian Cultural Mission (SACM) in the United States.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Alnasser, W., Beigi, G., Liu, H. (2021). Privacy Preserving Text Representation Learning Using BERT. In: Thomson, R., Hussain, M.N., Dancy, C., Pyke, A. (eds) Social, Cultural, and Behavioral Modeling. SBP-BRiMS 2021. Lecture Notes in Computer Science, vol 12720. Springer, Cham. https://doi.org/10.1007/978-3-030-80387-2_9
DOI: https://doi.org/10.1007/978-3-030-80387-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80386-5
Online ISBN: 978-3-030-80387-2
eBook Packages: Computer Science, Computer Science (R0)