Min-Hash Sketches

Years and Authors of Summarized Original Work

  • 1985; Flajolet, Martin

  • 1997; Broder

  • 1997; Cohen

Problem Definition

MinHash sketches (also known as min-wise sketches) are randomized summary structures of subsets which support set union operations and approximate processing of cardinality and similarity queries.

Set-union support, also called mergeability, means that a sketch of the union of two sets can be computed from the sketches of the two sets alone, without access to the sets themselves. In particular, this applies when the second set is a single element, so a sketch can be maintained incrementally as elements are inserted. The queries supported by MinHash sketches include cardinality (of a subset, estimated from its sketch) and similarity (of two subsets, estimated from their sketches).
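The two properties above, mergeability and approximate cardinality and similarity queries, can be illustrated with a minimal sketch. It assumes the bottom-k flavor of MinHash sketches (keep the k smallest hash values of a set), which is only one of several variants, and it uses SHA-256 as a stand-in for a random hash function. The names `make_sketch`, `merge`, `estimate_cardinality`, and `estimate_jaccard` are hypothetical and chosen for illustration; the estimators shown are standard choices for this variant, not necessarily the ones developed in this entry.

```python
import hashlib


def hash01(x, seed=0):
    """Map an element to a pseudo-random value in [0, 1).

    Assumption: SHA-256 stands in for an idealized random hash function.
    """
    digest = hashlib.sha256(f"{seed}:{x!r}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def make_sketch(elements, k):
    """Bottom-k sketch: the k smallest distinct hash values, sorted."""
    return sorted({hash01(x) for x in elements})[:k]


def merge(sketch_a, sketch_b, k):
    """Sketch of the union, computed from the two sketches alone."""
    return sorted(set(sketch_a) | set(sketch_b))[:k]


def estimate_cardinality(sketch, k):
    """Estimate the number of distinct elements from a bottom-k sketch."""
    if len(sketch) < k:
        # Fewer than k hash values: the sketch holds every element.
        return float(len(sketch))
    # (k - 1) divided by the k-th smallest hash value.
    return (k - 1) / sketch[-1]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity |A & B| / |A | B| from two sketches."""
    union = merge(sketch_a, sketch_b, k)
    if not union:
        return 0.0
    in_a, in_b = set(sketch_a), set(sketch_b)
    # Fraction of the union's smallest hashes that appear in both sketches.
    shared = sum(1 for v in union if v in in_a and v in in_b)
    return shared / len(union)


if __name__ == "__main__":
    k = 256
    a = make_sketch(range(0, 10_000), k)
    b = make_sketch(range(5_000, 15_000), k)
    print("cardinality of A (true 10000):", estimate_cardinality(a, k))
    print("Jaccard of A, B (true 1/3):", estimate_jaccard(a, b, k))
    # Inserting one element is a merge with a single-element sketch.
    a_plus = merge(a, make_sketch(["new element"], k), k)
    print("sketch size after insert:", len(a_plus))
```

Each sketch stores at most k numbers regardless of the size of the underlying set, and the relative error of both estimates shrinks roughly as 1/sqrt(k), so k trades space for accuracy.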

Sketches are useful for massive data analysis. Instead of explicitly maintaining and manipulating very large subsets (or, equivalently, 0/1 vectors), we can maintain the much smaller sketches and still query properties of these subsets.

We denote the universe of elements by U and its...

Recommended Reading

  1. Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequency moments. J Comput Syst Sci 58:137–147

  2. Bar-Yossef Z, Jayram TS, Kumar R, Sivakumar D, Trevisan L (2002) Counting distinct elements in a data stream. In: RANDOM, Cambridge. ACM

  3. Brewer KRW, Early LJ, Joyce SF (1972) Selecting several samples from a single population. Aust J Stat 14(3):231–239

  4. Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences, Salerno. IEEE, pp 21–29

  5. Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (2000) Min-wise independent permutations. J Comput Syst Sci 60(3):630–659

  6. Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55:441–453

  7. Cohen E (2014) All-distances sketches, revisited: HIP estimators for massive graphs analysis. In: PODS, Snowbird. ACM. http://arxiv.org/abs/1306.3284

  8. Cohen E (2014) Estimation for monotone sampling: competitiveness and customization. In: PODC, Paris. ACM. Full version: http://arxiv.org/abs/1212.0243

  9. Cohen E, Kaplan H (2007) Summarizing data using bottom-k sketches. In: PODC, Portland. ACM

  10. Cohen E, Kaplan H (2009) Leveraging discarded samples for tighter estimation of multiple-set aggregates. In: SIGMETRICS, Seattle. ACM

  11. Cohen E, Delling D, Fuchs F, Goldberg A, Goldszmidt M, Werneck R (2013) Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths. In: COSN, Boston. ACM

  12. Cohen E, Delling D, Pajor T, Werneck RF (2014) Sketch-based influence maximization and computation: scaling up with guarantees. In: CIKM. ACM. Full version: http://research.microsoft.com/apps/pubs/?id=226623

  13. Flajolet P, Martin GN (1985) Probabilistic counting algorithms for data base applications. J Comput Syst Sci 31:182–209

  14. Flajolet P, Fusy E, Gandouet O, Meunier F (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Analysis of algorithms (AOFA), Juan-les-Pins

  15. Heule S, Nunkesser M, Hall A (2013) HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: EDBT, Genoa

  16. Indyk P (1999) A small approximately min-wise independent family of hash functions. In: Proceedings of the 10th ACM-SIAM symposium on discrete algorithms, Baltimore. ACM-SIAM

  17. Li P, Church KW, Hastie T (2008) One sketch for all: theory and application of conditional random sampling. In: NIPS, Vancouver

  18. Li P, Owen AB, Zhang CH (2012) One permutation hashing. In: NIPS, Lake Tahoe

  19. Ohlsson E (1998) Sequential poisson sampling. J Off Stat 14(2):149–162

  20. Rosén B (1972) Asymptotic theory for successive sampling with varying probabilities without replacement, I. Ann Math Stat 43(2):373–397. http://www.jstor.org/stable/2239977

  21. Rosén B (1997) Asymptotic theory for order sampling. J Stat Plan Inference 62(2):135–158

Copyright information

© 2016 Springer Science+Business Media New York

About this entry

Cite this entry

Cohen, E. (2016). Min-Hash Sketches. In: Kao, MY. (eds) Encyclopedia of Algorithms. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2864-4_573
