DOI: 10.1145/2068816.2068851
research-article

Counting YouTube videos via random prefix sampling

Published: 02 November 2011

ABSTRACT

Leveraging the characteristics of the YouTube video ID space and exploiting a unique property of the YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors. We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video IDs (namely, treating each collection as if it were the "true" collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were roughly 500 million YouTube videos in total as of May 2011. Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, significantly under-estimating the number of videos with very small view counts (<1000); we also shed light on the bounds for the total storage YouTube must have and the network capacity needed to deliver YouTube videos.
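To make the idea of prefix sampling concrete, the following is a minimal sketch under stated assumptions, not the estimator or the query procedure from the paper. It assumes that IDs are drawn from a 64-character alphabet, replaces the live search API with a local prefix index over a synthetic ID collection, and uses the simple estimate "number of possible prefixes times the average number of IDs found per sampled prefix". The names `make_synthetic_ids`, `build_prefix_index`, and `estimate_total` are hypothetical and introduced only for illustration.

```python
import random
from collections import Counter

# Assumption: IDs use a 64-character URL-safe alphabet.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_"
)


def make_synthetic_ids(n, id_len, rng):
    """Build n distinct random IDs; a stand-in for a 'true' collection."""
    ids = set()
    while len(ids) < n:
        ids.add("".join(rng.choice(ALPHABET) for _ in range(id_len)))
    return ids


def build_prefix_index(ids, prefix_len):
    """Count how many IDs start with each prefix.

    This local index plays the role of a prefix query; the real
    search API is not modeled here.
    """
    return Counter(video_id[:prefix_len] for video_id in ids)


def estimate_total(prefix_counts, prefix_len, num_samples, rng):
    """Estimate the collection size from uniformly sampled prefixes.

    Each prefix is drawn uniformly from all len(ALPHABET) ** prefix_len
    possibilities, so the expected count per sampled prefix is
    N / (number of prefixes). Scaling the sample mean by the number of
    prefixes therefore gives an unbiased estimate of N.
    """
    num_prefixes = len(ALPHABET) ** prefix_len
    total_hits = 0
    for _ in range(num_samples):
        prefix = "".join(rng.choice(ALPHABET) for _ in range(prefix_len))
        total_hits += prefix_counts.get(prefix, 0)
    return num_prefixes * total_hits / num_samples


if __name__ == "__main__":
    rng = random.Random(2011)
    ids = make_synthetic_ids(20_000, 6, rng)
    index = build_prefix_index(ids, 2)
    print("true size:", len(ids))
    print("estimate: ", estimate_total(index, 2, 2_000, rng))
```

In this toy setting, raising `num_samples` tightens the estimate around the true size, which mirrors the role the abstract assigns to sample size selection; the variance and confidence bounds derived in the paper are not reproduced here.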

Published in

IMC '11: Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference
November 2011
612 pages
ISBN: 9781450310130
DOI: 10.1145/2068816

Copyright © 2011 ACM


Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 277 of 1,083 submissions, 26%
