ABSTRACT
Counting the number of distinct items in a dataset is a well-known computational problem with numerous applications. When exact counting is infeasible, one must resort to approximation. One approach is to estimate the number of distinct items from a random sample. This approach is useful, for example, when the dataset is too large to process in full or when only a sample of the data is available; it can also speed up the computation considerably. In statistics, this problem is known as the Unseen Species Problem. In this paper, we propose an estimation method for this problem that is especially suitable when the sample is much smaller than the entire set and each item is repeated relatively few times. Our method is simple compared with known methods, and its estimates are accurate enough to be useful on certain real-life datasets that arise in data mining scenarios. We demonstrate the method on real data, where the task is to estimate the number of duplicate URLs.
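To make the sampling setting concrete, the sketch below estimates a distinct count from a random sample using the classical bias-corrected Chao1 estimator. This is a stand-in chosen for illustration only: the abstract does not describe the proposed method, and Chao1 is not claimed to be it. The function name `chao1_estimate` and the synthetic dataset are assumptions made for this example.

```python
import random
from collections import Counter


def chao1_estimate(sample):
    """Bias-corrected Chao1 estimate of the number of distinct items.

    NOTE: Chao1 is a classical estimator used here as a stand-in;
    it is NOT the method proposed in the paper.
    """
    item_counts = Counter(sample)                  # item -> occurrences in the sample
    freq_of_freq = Counter(item_counts.values())   # k -> number of items seen exactly k times
    observed = len(item_counts)                    # distinct items seen in the sample
    f1 = freq_of_freq.get(1, 0)                    # items seen exactly once
    f2 = freq_of_freq.get(2, 0)                    # items seen exactly twice
    # Observed count plus a correction for items the sample missed.
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical dataset: 20,000 distinct values, each repeated 1 to 3 times.
    data = [v for v in range(20_000) for _ in range(rng.randint(1, 3))]
    sample = rng.sample(data, k=len(data) // 10)   # 10% sample without replacement
    print("distinct in sample:", len(set(sample)))
    print("Chao1 estimate:    ", round(chao1_estimate(sample)))
    print("true distinct:     ", len(set(data)))
```

The estimator uses only how many items appear once and twice in the sample, which is why it is cheap to compute. Chao1 is generally regarded as a lower-bound style estimate, so it may fall well short of the true count when the sample is a small fraction of the data, which is the regime the abstract targets.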
Index Terms
- Estimating the Number of Distinct Items in a Database by Sampling