Years and Authors of Summarized Original Work
- 1985; Flajolet, Martin
- 1997; Broder
- 1997; Cohen
Problem Definition
MinHash sketches (also known as min-wise sketches) are randomized summary structures of subsets that support set-union operations and approximate answers to cardinality and similarity queries.
Set-union support, also called mergeability, means that a sketch of the union of two sets can be computed from the sketches of the two sets alone. In particular, this applies when the second set is a single element, so a sketch can be updated as elements are inserted one at a time. The queries supported by MinHash sketches include cardinality (of a subset, from its sketch) and similarity (of two subsets, from their sketches).
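Mergeability can be illustrated with a minimal sketch in Python. The entry's own definitions are cut off in this excerpt, so the variant below is an assumption: a "k-mins" style sketch that stores, for each of K hash functions, the minimum hash value over the set. The names `K`, `rank`, `sketch`, `add`, and `merge` are hypothetical, and salted BLAKE2 stands in for a family of random hash functions.

```python
import hashlib

K = 64  # number of hash functions, i.e. sketch size (arbitrary illustrative value)

def rank(element, i):
    """Pseudo-random rank in [0, 1) of `element` under the i-th hash function.

    Assumption: salted BLAKE2b over repr(element) approximates a random hash.
    """
    digest = hashlib.blake2b(repr(element).encode(), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    # Keep the top 53 bits so the quotient is an exact float strictly below 1.
    return (int.from_bytes(digest, "little") >> 11) / 2**53

def add(s, x):
    """Sketch of (set summarized by s) united with the single element x."""
    return [min(s[i], rank(x, i)) for i in range(K)]

def sketch(elements):
    """k-mins sketch: for each hash function, the minimum rank over the set."""
    s = [1.0] * K  # sketch of the empty set (1.0 exceeds every possible rank)
    for x in elements:
        s = add(s, x)
    return s

def merge(s1, s2):
    """Sketch of the union, computed from the two sketches only."""
    return [min(a, b) for a, b in zip(s1, s2)]
```

Under these assumptions, `merge(sketch(A), sketch(B))` equals `sketch(A | B)` exactly, because the minimum over a union is the minimum of the two minima; `add` is the special case in which the second set is a single element.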
Sketches are useful for massive data analysis. Instead of explicitly maintaining and manipulating very large subsets (or, equivalently, 0/1 vectors), we can maintain the much smaller sketches and still query properties of these subsets.
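As a rough illustration of querying sketches in place of the sets themselves, the following Python sketch estimates cardinality and Jaccard similarity from fixed-size summaries. It is not taken from the entry: it assumes a k-mins style sketch with exponentially distributed ranks, for which (K - 1) divided by the sum of the K minima is a standard cardinality estimator, and it estimates similarity as the fraction of coordinates on which two sketches agree. All names (`K`, `exp_rank`, `sketch`, `estimate_cardinality`, `estimate_jaccard`) are hypothetical.

```python
import hashlib
import math

K = 256  # sketch size (arbitrary); larger K gives lower estimation error

def exp_rank(element, i):
    """Exp(1)-distributed pseudo-random rank of `element` under hash function i.

    Assumption: salted BLAKE2b over repr(element) approximates a random hash.
    """
    digest = hashlib.blake2b(repr(element).encode(), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    u = (int.from_bytes(digest, "little") >> 11) / 2**53  # uniform in [0, 1)
    return -math.log1p(-u)  # inverse-CDF transform to an exponential variate

def sketch(elements):
    """K minimum ranks of a non-empty, re-iterable collection."""
    return [min(exp_rank(x, i) for x in elements) for i in range(K)]

def estimate_cardinality(s):
    """Each minimum is roughly Exp(n) for a set of size n, so the sum of the
    K minima is roughly Gamma(K, n) and (K - 1) / sum estimates n."""
    return (K - 1) / sum(s)

def estimate_jaccard(s1, s2):
    """Fraction of hash functions whose minimum is attained by a shared element."""
    return sum(a == b for a, b in zip(s1, s2)) / K

A = range(0, 500)
B = range(250, 750)
sa, sb = sketch(A), sketch(B)
print(estimate_cardinality(sa))   # should be near 500
print(estimate_jaccard(sa, sb))   # should be near 250/750
```

Each sketch here holds K numbers regardless of the size of the set it summarizes, which is the space saving the paragraph above refers to. The estimates are approximate; with this estimator the relative error of the cardinality estimate is on the order of 1/sqrt(K).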
We denote the universe of elements by U and its...
Recommended Reading
Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequency moments. J Comput Syst Sci 58:137–147
Bar-Yossef Z, Jayram TS, Kumar R, Sivakumar D, Trevisan L (2002) Counting distinct elements in a data stream. In: RANDOM, Cambridge. ACM
Brewer KRW, Early LJ, Joyce SF (1972) Selecting several samples from a single population. Aust J Stat 14(3):231–239
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences, Salerno. IEEE, pp 21–29
Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (2000) Min-wise independent permutations. J Comput Syst Sci 60(3):630–659
Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55:441–453
Cohen E (2014) All-distances sketches, revisited: HIP estimators for massive graphs analysis. In: PODS, Snowbird. ACM. http://arxiv.org/abs/1306.3284
Cohen E (2014) Estimation for monotone sampling: competitiveness and customization. In: PODC, Paris. ACM. Full version http://arxiv.org/abs/1212.0243
Cohen E, Kaplan H (2007) Summarizing data using bottom-k sketches. In: PODC, Portland. ACM
Cohen E, Kaplan H (2009) Leveraging discarded samples for tighter estimation of multiple-set aggregates. In: SIGMETRICS, Seattle. ACM
Cohen E, Delling D, Fuchs F, Goldberg A, Goldszmidt M, Werneck R (2013) Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths. In: COSN, Boston. ACM
Cohen E, Delling D, Pajor T, Werneck RF (2014) Sketch-based influence maximization and computation: scaling up with guarantees. In: CIKM. ACM. Full version http://research.microsoft.com/apps/pubs/?id=226623
Flajolet P, Martin GN (1985) Probabilistic counting algorithms for data base applications. J Comput Syst Sci 31:182–209
Flajolet P, Fusy E, Gandouet O, Meunier F (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Analysis of algorithms (AOFA), Juan-les-Pins
Heule S, Nunkesser M, Hall A (2013) HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: EDBT, Genoa
Indyk P (1999) A small approximately min-wise independent family of hash functions. In: Proceedings of the 10th ACM-SIAM symposium on discrete algorithms, Baltimore. ACM-SIAM
Li P, Church KW, Hastie T (2008) One sketch for all: theory and application of conditional random sampling. In: NIPS, Vancouver
Li P, Owen AB, Zhang CH (2012) One permutation hashing. In: NIPS, Lake Tahoe
Ohlsson E (1998) Sequential Poisson sampling. J Off Stat 14(2):149–162
Rosén B (1972) Asymptotic theory for successive sampling with varying probabilities without replacement, I. Ann Math Stat 43(2):373–397. http://www.jstor.org/stable/2239977
Rosén B (1997) Asymptotic theory for order sampling. J Stat Plan Inference 62(2):135–158
Copyright information
© 2016 Springer Science+Business Media New York
About this entry
Cite this entry
Cohen, E. (2016). Min-Hash Sketches. In: Kao, MY. (eds) Encyclopedia of Algorithms. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2864-4_573
DOI: https://doi.org/10.1007/978-1-4939-2864-4_573
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-2863-7
Online ISBN: 978-1-4939-2864-4
eBook Packages: Computer Science; Reference Module Computer Science and Engineering