Privacy–Preserving Text Similarity via Non-Prefix-Free Codes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11807)

Abstract

Many methods have been proposed to compute the similarity score \(\alpha \leftarrow S(\mathcal {A},\mathcal {B})\) between two plain documents \(\mathcal {A}\) and \(\mathcal {B}\). However, when their contents are confidential, special processing is required to protect privacy. Most of the solutions offered to date are based on homomorphic encryption or secure multi-party computation techniques, whose computational cost inhibits practical usage, especially on massive sets. In this study, we propose an alternative: encoding the documents with non-prefix-free (NPF) coding before applying the preferred similarity metric S(). NPF coding represents the symbols with variable-length codewords, where the codeword set is generated without the prefix-free restriction. Thus, a codeword may be a prefix of another, and without explicit codeword boundary information, retrieving the original data from the encoded stream becomes hard because non-prefix-free codes are not uniquely decodable. We provide a combinatorial analysis of this hardness and experimentally compare the similarity scores obtained on NPF-encoded documents with those obtained on the original plain-text versions. We consider the normalized compression distance (NCD) and the Jaccard coefficient (JC) as the similarity metric S(). When \(\mathcal {A^\prime }\) and \(\mathcal {B^\prime }\) denote the NPF-encoded documents, experiments conducted on the METER corpus revealed that the difference between \(\alpha ^\prime \leftarrow S(\mathcal {A^\prime },\mathcal {B^\prime })\) and \(\alpha \leftarrow S(\mathcal {A},\mathcal {B})\) lies between \(0.5\%\) and \(3\%\) for both NCD and JC.
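The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four-symbol codebook, the character-level symbols, the use of zlib as the compressor for NCD, and the character n-gram sets for the Jaccard coefficient are all assumptions made for illustration. The sketch encodes a string with a non-prefix-free codebook, counts how many symbol sequences map to the same bit stream once codeword boundaries are dropped, and defines the two similarity measures in their standard forms.

```python
import zlib

# Hypothetical NPF codebook: "0" is a prefix of "01" and "1" is a prefix of "10".
CODEBOOK = {"a": "0", "b": "1", "c": "01", "d": "10"}


def npf_encode(text, codebook=CODEBOOK):
    """Concatenate the codewords without any boundary markers."""
    return "".join(codebook[ch] for ch in text)


def count_parsings(bits, codebook=CODEBOOK):
    """Count the symbol sequences whose encoding equals `bits`."""
    words = set(codebook.values())
    ways = [1] + [0] * len(bits)  # ways[i] = number of parsings of bits[:i]
    for i in range(1, len(bits) + 1):
        for w in words:
            if i >= len(w) and bits[i - len(w):i] == w:
                ways[i] += ways[i - len(w)]
    return ways[len(bits)]


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as an assumed compressor."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def jaccard(x: str, y: str, n: int = 2) -> float:
    """Jaccard coefficient over sets of character n-grams (an assumed tokenization)."""
    gx = {x[i:i + n] for i in range(len(x) - n + 1)}
    gy = {y[i:i + n] for i in range(len(y) - n + 1)}
    union = gx | gy
    return len(gx & gy) / len(union) if union else 1.0


encoded = npf_encode("abcd")
print(encoded)                  # 010110
print(count_parsings(encoded))  # 10
```

Even for this six-bit stream, ten different symbol sequences are consistent with the encoding, which is the ambiguity the abstract relies on. In the setting described above, a metric such as `ncd` or `jaccard` would be applied to the encoded streams of both documents instead of their plain text.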

This work has been supported by the TÜBİTAK grant number 117E865.

Notes

  1. For example, the number of anagrams of the word MISSISSIPPI is \(\frac{11!}{4!1!2!4!}\), as the letters I, M, P, and S appear 4, 1, 2, and 4 times, respectively.
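The count in this note can be checked with a short Python sketch; the function name `anagram_count` is a hypothetical helper introduced here, not something from the paper. It divides the factorial of the word length by the factorial of each letter's multiplicity.

```python
from collections import Counter
from math import factorial


def anagram_count(word: str) -> int:
    """Number of distinct arrangements of the letters of `word`."""
    total = factorial(len(word))
    for multiplicity in Counter(word).values():
        total //= factorial(multiplicity)  # exact at every step
    return total


print(anagram_count("MISSISSIPPI"))  # 34650
```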

Author information

Corresponding author

Correspondence to M. Oğuzhan Külekci.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Külekci, M.O., Habib, I., Aghabaiglou, A. (2019). Privacy–Preserving Text Similarity via Non-Prefix-Free Codes. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds) Similarity Search and Applications. SISAP 2019. Lecture Notes in Computer Science, vol 11807. Springer, Cham. https://doi.org/10.1007/978-3-030-32047-8_9

  • DOI: https://doi.org/10.1007/978-3-030-32047-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32046-1

  • Online ISBN: 978-3-030-32047-8

  • eBook Packages: Computer Science, Computer Science (R0)
