Abstract
Many methods have been proposed to compute the similarity score \(\alpha \leftarrow S(\mathcal {A},\mathcal {B})\) between two plain documents \(\mathcal {A}\) and \(\mathcal {B}\). However, when their contents are confidential, special processing is required to protect privacy. Most of the solutions offered to date are based on homomorphic encryption or secure multi-party computation techniques, whose computational cost inhibits practical usage, especially on massive sets. In this study we propose an alternative: encoding the documents with non-prefix-free (NPF) coding before applying the preferred similarity metric S(). NPF coding represents the symbols with variable-length codewords, where the codeword set is generated without the prefix-free restriction. Thus, a codeword may be a prefix of another, and without explicit codeword boundary information, retrieving the original data from the encoded stream becomes hard because non-prefix-free codes are not uniquely decodable. We provide a combinatorial analysis of this hardness, and experimentally compare the similarity scores obtained on NPF-encoded documents with those obtained on the original plain-text versions. We consider normalized compression distance (NCD) and the Jaccard coefficient (JC) as the similarity metric S(). With \(\mathcal {A^\prime }\) and \(\mathcal {B^\prime }\) denoting the NPF-encoded documents, experiments conducted on the METER corpus revealed that the difference between \(\alpha ^\prime \leftarrow S(\mathcal {A^\prime },\mathcal {B^\prime })\) and \(\alpha \leftarrow S(\mathcal {A},\mathcal {B})\) lies between \(0.5\%\) and \(3\%\) for both NCD and JC.
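The idea summarized above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the four-symbol codebook `CODEBOOK`, the use of `zlib` as the compressor inside NCD, and the use of character n-grams for the Jaccard coefficient. The sketch shows an NPF encoding, counts how many ways an encoded bit string can be parsed back into codewords (the source of the decoding ambiguity), and defines the two similarity metrics so they can be applied to plain and encoded strings alike.

```python
import zlib

# Hypothetical non-prefix-free codebook: "0" is a prefix of "01",
# and "1" is a prefix of "10", so decoding without boundaries is ambiguous.
CODEBOOK = {"a": "0", "b": "1", "c": "01", "d": "10"}


def npf_encode(text, codebook=CODEBOOK):
    """Concatenate the codewords of the symbols, dropping all boundaries."""
    return "".join(codebook[ch] for ch in text)


def count_parsings(bits, codebook=CODEBOOK):
    """Count the ways `bits` can be split into codewords (dynamic programming)."""
    words = set(codebook.values())
    ways = [1] + [0] * len(bits)  # ways[i] = number of parsings of bits[:i]
    for end in range(1, len(bits) + 1):
        for w in words:
            if end >= len(w) and bits[end - len(w):end] == w:
                ways[end] += ways[end - len(w)]
    return ways[-1]


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as the (assumed) compressor."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def jaccard(x: str, y: str, n: int = 3) -> float:
    """Jaccard coefficient over the sets of character n-grams (an assumption)."""
    gx = {x[i:i + n] for i in range(len(x) - n + 1)}
    gy = {y[i:i + n] for i in range(len(y) - n + 1)}
    if not gx and not gy:
        return 1.0  # two inputs with no n-grams are treated as identical
    return len(gx & gy) / len(gx | gy)
```

For instance, `npf_encode("abcd")` yields `"010110"`, and `count_parsings("01")` is 2 because the same two bits decode either to `ab` or to `c`. Calling `jaccard` or `ncd` on two plain strings and then on their `npf_encode` outputs gives the kind of side-by-side comparison the abstract describes, although on such a toy codebook the numbers carry no weight.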
This work has been supported by the TÜBİTAK grant number 117E865.
Notes
1. For example, the number of anagrams of the word MISSISSIPPI is \(\frac{11!}{4!\,1!\,2!\,4!} = 34650\), as the letters I, M, P, and S appear 4, 1, 2, and 4 times, respectively.
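The count in this note is a multinomial coefficient, and it can be checked with a short Python sketch. The function name `count_anagrams` is introduced here purely for illustration.

```python
from collections import Counter
from math import factorial


def count_anagrams(word: str) -> int:
    """Number of distinct orderings of the letters of `word`.

    Computes n! / (c_1! * c_2! * ... * c_k!), where c_i is the number of
    occurrences of the i-th distinct letter.
    """
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)  # each step divides exactly
    return total
```

Calling `count_anagrams("MISSISSIPPI")` evaluates \(\frac{11!}{4!\,1!\,2!\,4!}\).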
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Külekci, M.O., Habib, I., Aghabaiglou, A. (2019). Privacy–Preserving Text Similarity via Non-Prefix-Free Codes. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds) Similarity Search and Applications. SISAP 2019. Lecture Notes in Computer Science, vol 11807. Springer, Cham. https://doi.org/10.1007/978-3-030-32047-8_9
DOI: https://doi.org/10.1007/978-3-030-32047-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32046-1
Online ISBN: 978-3-030-32047-8
eBook Packages: Computer Science, Computer Science (R0)