Privacy–Preserving Text Similarity via Non-Prefix-Free Codes

Külekci, M. Oğuzhan; Habib, Ismail; Aghabaiglou, Amir

doi:10.1007/978-3-030-32047-8_9

Conference paper
First Online: 23 September 2019

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11807)

Abstract

Many methods have been proposed to compute the similarity score \(\alpha \leftarrow S(\mathcal {A},\mathcal {B})\) between two plaintext documents \(\mathcal {A}\) and \(\mathcal {B}\). However, when their contents are confidential, special processing is required to protect privacy. Most of the solutions offered to date are based on homomorphic encryption or secure multi-party computation techniques, whose computational cost inhibits practical use, especially on massive sets. In this study we propose an alternative: encoding the documents with non-prefix-free (NPF) coding before applying the preferred similarity metric S(). NPF coding represents the symbols with variable-length codewords, where the codeword set is generated without the prefix-free restriction. Thus, a codeword may be a prefix of another, and without explicit codeword boundary information, retrieving the original data from the encoded stream becomes hard because non-prefix-free codes are not uniquely decodable. We provide a combinatorial analysis of this hardness, and experimentally compare the similarity scores obtained on NPF-encoded documents with those obtained on the original plaintext versions. We consider the normalized compression distance (NCD) and the Jaccard coefficient (JC) as the similarity metric S(). With \(\mathcal {A^\prime }\) and \(\mathcal {B^\prime }\) denoting the NPF-encoded documents, experiments conducted on the METER corpus revealed that the difference between \(\alpha ^\prime \leftarrow S(\mathcal {A^\prime },\mathcal {B^\prime })\) and \(\alpha \leftarrow S(\mathcal {A},\mathcal {B})\) lies between \(0.5\%\) and \(3\%\) for both NCD and JC.
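The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the codeword assignment (the k-th most frequent symbol receives the k-th shortest bit string 0, 1, 00, 01, ...), the use of one codebook shared by both documents, zlib as the compressor for NCD, and character or bit n-gram sets for JC. The helper names `npf_codebook`, `npf_encode`, `count_parsings`, `ncd` and `jaccard` are hypothetical.

```python
import zlib
from collections import Counter


def npf_codebook(text: str) -> dict[str, str]:
    """Hypothetical NPF codebook: the k-th most frequent symbol gets the k-th
    bit string in the sequence 0, 1, 00, 01, 10, 11, 000, ... (not prefix-free)."""
    ranked = [sym for sym, _ in Counter(text).most_common()]
    # bin(rank + 2) looks like '0b1xyz'; dropping '0b1' leaves the codeword 'xyz'.
    return {sym: bin(rank + 2)[3:] for rank, sym in enumerate(ranked)}


def npf_encode(text: str, book: dict[str, str]) -> str:
    """Concatenate codewords without any boundary markers."""
    return "".join(book[ch] for ch in text)


def count_parsings(bits: str, codewords: set[str]) -> int:
    """Count the ways to split `bits` into codewords (dynamic programming)."""
    ways = [1] + [0] * len(bits)  # ways[i]: number of parsings of bits[:i]
    for end in range(1, len(bits) + 1):
        for cw in codewords:
            if len(cw) <= end and bits[end - len(cw):end] == cw:
                ways[end] += ways[end - len(cw)]
    return ways[-1]


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as the compressor (assumption)."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def jaccard(x: str, y: str, n: int) -> float:
    """Jaccard coefficient over the sets of length-n substrings of x and y."""
    gx = {x[i:i + n] for i in range(len(x) - n + 1)}
    gy = {y[i:i + n] for i in range(len(y) - n + 1)}
    union = gx | gy
    return len(gx & gy) / len(union) if union else 1.0


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the cat sat on the hat"
    book = npf_codebook(a + b)  # shared codebook: an assumption of this sketch
    a_enc, b_enc = npf_encode(a, book), npf_encode(b, book)

    # Similarity on plaintext versus on the NPF-encoded bit strings.
    print("NCD plain:", ncd(a.encode(), b.encode()))
    print("NCD NPF:  ", ncd(a_enc.encode(), b_enc.encode()))
    print("JC plain: ", jaccard(a, b, n=3))
    print("JC NPF:   ", jaccard(a_enc, b_enc, n=8))

    # Without boundaries, even a 3-bit stream is ambiguous: 0|0|1, 00|1, 0|01.
    print(count_parsings("001", {"0", "1", "00", "01"}))  # 3
```

The n-gram lengths (3 characters for plaintext, 8 bits for the encoded stream) are arbitrary choices for the demonstration. Inputs this short will not reproduce the 0.5% to 3% differences reported on the METER corpus.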

This work has been supported by TÜBİTAK under grant number 117E865.

Notes

1. For example, the number of anagrams of the word MISSISSIPPI is \(\frac{11!}{4!1!2!4!}\), as the letters i, m, p, and s appear 4, 1, 2, and 4 times, respectively.
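The count in this note can be checked with a short Python sketch. The function name `count_anagrams` is hypothetical; it evaluates the multinomial coefficient by dividing the factorial of the word length by the factorial of each letter's multiplicity.

```python
from collections import Counter
from math import factorial


def count_anagrams(word: str) -> int:
    """Number of distinct arrangements of the letters of `word`."""
    total = factorial(len(word))
    for multiplicity in Counter(word).values():
        # Each division is exact because partial multinomial quotients are integers.
        total //= factorial(multiplicity)
    return total


print(count_anagrams("MISSISSIPPI"))  # 11! / (4! 1! 2! 4!) = 34650
```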

Author information

Authors and Affiliations

Informatics Institute, Istanbul Technical University, Istanbul, Turkey
M. Oğuzhan Külekci, Ismail Habib & Amir Aghabaiglou

Corresponding author

Correspondence to M. Oğuzhan Külekci.

Editor information

Editors and Affiliations

ISTI-CNR, Pisa, Italy
Giuseppe Amato
ISTI-CNR, Pisa, Italy
Claudio Gennaro
New Jersey Institute of Technology, Newark, NJ, USA
Vincent Oria
University of Novi Sad, Novi Sad, Serbia
Miloš Radovanović

About this paper

Cite this paper

Külekci, M.O., Habib, I., Aghabaiglou, A. (2019). Privacy–Preserving Text Similarity via Non-Prefix-Free Codes. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds) Similarity Search and Applications. SISAP 2019. Lecture Notes in Computer Science, vol 11807. Springer, Cham. https://doi.org/10.1007/978-3-030-32047-8_9

DOI: https://doi.org/10.1007/978-3-030-32047-8_9
Published: 23 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32046-1
Online ISBN: 978-3-030-32047-8
eBook Packages: Computer Science, Computer Science (R0)