ABSTRACT
Leveraging the characteristics of the YouTube video id space and exploiting a unique property of the YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors. We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video ids (that is, treating each collection as if it were the "true" collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were roughly 500 million YouTube videos in total as of May 2011. Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, significantly under-estimating the number of videos with very small view counts (<1000); we also shed light on bounds for the total storage YouTube must have and the network capacity needed to deliver YouTube videos.
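To make the idea of prefix-based size estimation concrete, the following is a minimal simulation sketch and not the procedure used in the paper. It assumes a synthetic id space over a 64-character alphabet, replaces the search API with a local lookup table, and uses hypothetical helper names (`make_ids`, `estimate_total`). Each sampled prefix is drawn uniformly at random, the number of ids sharing that prefix is counted, and the mean count is scaled by the number of possible prefixes. Because every id falls under exactly one prefix, the expected count per prefix is the collection size divided by the number of prefixes, so the scaled mean is an unbiased estimate of the collection size under these assumptions.

```python
import random
from collections import Counter

# Assumed 64-character alphabet for the synthetic ids (illustrative only).
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_"
)


def make_ids(n, length, rng):
    """Generate n distinct synthetic ids of the given length (hypothetical helper)."""
    ids = set()
    while len(ids) < n:
        ids.add("".join(rng.choice(ALPHABET) for _ in range(length)))
    return ids


def estimate_total(ids, prefix_len, num_samples, rng):
    """Estimate len(ids) by sampling uniformly random prefixes.

    The Counter below stands in for a prefix query against a search
    service; in this sketch it is only a local lookup table.
    """
    counts = Counter(video_id[:prefix_len] for video_id in ids)
    num_prefixes = len(ALPHABET) ** prefix_len
    total = 0
    for _ in range(num_samples):
        prefix = "".join(rng.choice(ALPHABET) for _ in range(prefix_len))
        total += counts.get(prefix, 0)  # ids that start with this prefix
    # Scale the mean count per sampled prefix by the number of prefixes.
    return num_prefixes * total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    ids = make_ids(20000, 6, rng)
    print(estimate_total(ids, prefix_len=2, num_samples=2000, rng=rng))
```

In this toy setting, increasing `num_samples` narrows the spread of the estimate around the true size, which mirrors the role that variance bounds play in choosing a sample size.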
Index Terms
- Counting YouTube videos via random prefix sampling