Min-Hash Sketches

Cohen, Edith

doi:10.1007/978-1-4939-2864-4_573

Years and Authors of Summarized Original Work

1985; Flajolet, Martin
1997; Broder
1997; Cohen

Problem Definition

MinHash sketches (also known as min-wise sketches) are randomized summary structures of subsets which support set union operations and approximate processing of cardinality and similarity queries.

Set-union support, also called mergeability, means that a sketch of the union of two sets can be computed from the sketches of the two sets alone. In particular, this applies when the second set is a single element, so a sketch can be maintained as elements are inserted one at a time. The queries supported by MinHash sketches include cardinality (of a subset, from its sketch) and similarity (of two subsets, from their sketches).
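The mergeability property can be illustrated with a minimal Python sketch. It assumes one common flavor of MinHash sketch, the bottom-k sketch, which keeps the k smallest hash values of a set; the preview text above does not specify a flavor. The function names (hash01, bottom_k_sketch, merge, insert) and the use of SHA-256 as a stand-in for a random hash function are illustrative assumptions, not part of the original entry.

```python
import hashlib


def hash01(x):
    """Map an element to a pseudo-random value in [0, 1).

    SHA-256 stands in for a random hash function (an assumption);
    the top 53 bits are used so the result is an exact float below 1.
    """
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def bottom_k_sketch(elements, k):
    """Sketch of a set: its k smallest distinct hash values, sorted."""
    return sorted({hash01(x) for x in elements})[:k]


def merge(sketch_a, sketch_b, k):
    """Sketch of the union, computed from the two sketches alone."""
    return sorted(set(sketch_a) | set(sketch_b))[:k]


def insert(sketch, x, k):
    """Inserting one element is a merge with the sketch of {x}."""
    return merge(sketch, bottom_k_sketch([x], k), k)
```

Under these assumptions, merge(bottom_k_sketch(A, k), bottom_k_sketch(B, k), k) equals bottom_k_sketch(A | B, k) for any sets A and B, because each of the k smallest hash values of the union is also among the k smallest hash values of whichever set contains its element.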

Sketches are useful for massive data analysis. Rather than explicitly maintaining and manipulating very large subsets (or, equivalently, 0/1 vectors), we can maintain their much smaller sketches and still query properties of the subsets.
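As an illustration of querying properties from the sketches rather than from the sets, the following Python sketch estimates cardinality and Jaccard similarity from bottom-k sketches. The bottom-k flavor, the function names, and the SHA-256 stand-in for a random hash function are assumptions made for this example. The estimators shown are standard ones for bottom-k sketches and are not necessarily the ones developed in the full entry.

```python
import hashlib


def hash01(x):
    """Pseudo-random value in [0, 1); SHA-256 is an assumed stand-in."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def bottom_k_sketch(elements, k):
    """The k smallest distinct hash values of the set, sorted."""
    return sorted({hash01(x) for x in elements})[:k]


def estimate_cardinality(sketch, k):
    """Estimate the set size from its bottom-k sketch.

    With fewer than k values the sketch holds one hash per element,
    so the count is exact. Otherwise use (k - 1) / tau, where tau is
    the k-th smallest hash value.
    """
    if len(sketch) < k:
        return float(len(sketch))
    return (k - 1) / sketch[-1]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate |A & B| / |A | B| from the two sketches.

    The k smallest values of the merged sketches form a uniform sample
    of the union; count the fraction that appears in both sketches.
    """
    union_sample = sorted(set(sketch_a) | set(sketch_b))[:k]
    if not union_sample:
        return 0.0  # convention chosen here for two empty sets
    in_both = set(sketch_a) & set(sketch_b)
    return sum(v in in_both for v in union_sample) / len(union_sample)
```

For example, with k = 256, sketches of range(10000) and range(5000, 15000) occupy 256 numbers each, yet estimate_cardinality and estimate_jaccard should return values close to the true 10000 and 1/3. Accuracy improves as k grows, at the cost of larger sketches.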

We denote the universe of elements by U and its...

Recommended Reading

Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequency moments. J Comput Syst Sci 58:137–147
Bar-Yossef Z, Jayram TS, Kumar R, Sivakumar D, Trevisan L (2002) Counting distinct elements in a data stream. In: RANDOM, Cambridge. ACM
Brewer KRW, Early LJ, Joyce SF (1972) Selecting several samples from a single population. Aust J Stat 14(3):231–239
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences, Salerno. IEEE, pp 21–29
Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (2000) Min-wise independent permutations. J Comput Syst Sci 60(3):630–659
Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55:441–453
Cohen E (2014) All-distances sketches, revisited: HIP estimators for massive graphs analysis. In: PODS, Snowbird. ACM. http://arxiv.org/abs/1306.3284
Cohen E (2014) Estimation for monotone sampling: competitiveness and customization. In: PODC, Paris. ACM. Full version: http://arxiv.org/abs/1212.0243
Cohen E, Kaplan H (2007) Summarizing data using bottom-k sketches. In: PODC, Portland. ACM
Cohen E, Kaplan H (2009) Leveraging discarded samples for tighter estimation of multiple-set aggregates. In: SIGMETRICS, Seattle. ACM
Cohen E, Delling D, Fuchs F, Goldberg A, Goldszmidt M, Werneck R (2013) Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths. In: COSN, Boston. ACM
Cohen E, Delling D, Pajor T, Werneck RF (2014) Sketch-based influence maximization and computation: scaling up with guarantees. In: CIKM. ACM. Full version: http://research.microsoft.com/apps/pubs/?id=226623
Flajolet P, Martin GN (1985) Probabilistic counting algorithms for data base applications. J Comput Syst Sci 31:182–209
Flajolet P, Fusy E, Gandouet O, Meunier F (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Analysis of algorithms (AofA), Juan-les-Pins
Heule S, Nunkesser M, Hall A (2013) HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: EDBT, Genoa
Indyk P (1999) A small approximately min-wise independent family of hash functions. In: Proceedings of the 10th ACM-SIAM symposium on discrete algorithms, Baltimore. ACM-SIAM
Li P, Church KW, Hastie T (2008) One sketch for all: theory and application of conditional random sampling. In: NIPS, Vancouver
Li P, Owen AB, Zhang CH (2012) One permutation hashing. In: NIPS, Lake Tahoe
Ohlsson E (1998) Sequential Poisson sampling. J Off Stat 14(2):149–162
Rosén B (1972) Asymptotic theory for successive sampling with varying probabilities without replacement, I. Ann Math Stat 43(2):373–397. http://www.jstor.org/stable/2239977
Rosén B (1997) Asymptotic theory for order sampling. J Stat Plan Inference 62(2):135–158

Author information

Authors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Edith Cohen
Stanford University, Stanford, CA, USA
Edith Cohen

Editor information

Editors and Affiliations

Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, USA
Ming-Yang Kao

Cite this entry

Cohen, E. (2016). Min-Hash Sketches. In: Kao, MY. (eds) Encyclopedia of Algorithms. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2864-4_573

DOI: https://doi.org/10.1007/978-1-4939-2864-4_573
Published: 22 April 2016
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-2863-7
Online ISBN: 978-1-4939-2864-4
eBook Packages: Computer Science; Reference Module Computer Science and Engineering