Abstract
In this article, we revisit the task of predicting movie box-office revenue using multi-type features. Box-office revenue is affected by numerous factors. Previous work with discriminative models assumes that these factors are independent and identically distributed. The correlations between the factors are rarely considered, which limits the performance of discriminative models on this task. To address these problems, we investigate a novel Gaussian copula regression model. With this model, we do not need to make any prior assumptions about the marginal distributions of the features. In particular, we perform a cumulative probability estimation on each of the smoothed features. The estimation learns the marginal distributions and maps all features into a uniform vector space. Subsequently, we bridge the marginal distributions with a copula function to create their joint distribution and learn the dependency structure between them. Moreover, we propose a computationally efficient approximate algorithm for response variable inference. Experimental results on two movie datasets from the Chinese and U.S. markets show that our approach outperforms strong discriminative regression baselines.
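The pipeline described in the abstract (estimate each marginal, map to a common scale, couple the marginals with a Gaussian copula, then infer the response) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a rank-based empirical CDF in place of the smoothed estimator mentioned above, and it replaces the paper's approximate inference algorithm with the conditional mean in the latent Gaussian space, mapped back through the empirical quantiles of the training response (which yields a conditional median under the fitted copula). The names `empirical_cdf`, `to_normal_scores`, `fit_copula`, and `predict_median` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for quantiles and CDF


def empirical_cdf(train_col, values):
    """Rank-based CDF estimate rescaled to lie strictly inside (0, 1)."""
    sorted_col = np.sort(train_col)
    n = len(sorted_col)
    ranks = np.searchsorted(sorted_col, values, side="right")
    # Clip so the normal quantile below stays finite.
    return np.clip(ranks / (n + 1), 1 / (n + 1), n / (n + 1))


def to_normal_scores(u):
    """Map uniform scores to standard-normal scores (inverse normal CDF)."""
    flat = [_STD_NORMAL.inv_cdf(float(p)) for p in np.ravel(u)]
    return np.array(flat).reshape(np.shape(u))


def fit_copula(X, y):
    """Estimate the Gaussian copula correlation matrix of (features, response)."""
    data = np.column_stack([X, y])
    U = np.column_stack(
        [empirical_cdf(data[:, j], data[:, j]) for j in range(data.shape[1])]
    )
    return np.corrcoef(to_normal_scores(U), rowvar=False)


def predict_median(X_train, y_train, corr, X_new):
    """Assumed inference rule: latent conditional mean, mapped back to y's scale."""
    d = X_train.shape[1]
    U_new = np.column_stack(
        [empirical_cdf(X_train[:, j], X_new[:, j]) for j in range(d)]
    )
    Z_new = to_normal_scores(U_new)
    # E[z_y | z_x] = Sigma_yx Sigma_xx^{-1} z_x for a zero-mean Gaussian.
    weights = np.linalg.solve(corr[:d, :d], corr[:d, d])
    z_y = Z_new @ weights
    u_y = np.array([_STD_NORMAL.cdf(float(z)) for z in z_y])
    # Invert the response marginal with empirical quantiles of the training y.
    return np.quantile(y_train, u_y)


# Synthetic demo data (illustrative only, not from the paper's datasets).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = np.exp(X_demo[:, 0] + 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=300))
corr_demo = fit_copula(X_demo, y_demo)
pred_demo = predict_median(X_demo, y_demo, corr_demo, X_demo[:5])
```

Because every marginal is handled by its own CDF estimate, the sketch makes no distributional assumption about individual features; only the dependence between them is modeled as Gaussian.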
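Abstract (translated from the Chinese)
In this paper, we discuss the task of predicting movie box-office revenue using multiple types of features. Many factors affect a movie's box-office revenue. The discriminative models used in previous work assume that these factors are independent and identically distributed. The correlations among the factors are rarely considered, and this assumption limits the performance of discriminative models on this task. To address these problems, we adopt a novel Gaussian copula regression model. With this model, we need not make any prior assumptions about the marginal distributions of the features. Specifically, we first estimate the cumulative probability distribution of each smoothed feature. Through this estimation we learn the marginal distributions of the features and, at the same time, project the features into the same vector space. We then use a Gaussian copula function to turn these marginal distributions into their joint distribution, while capturing the dependencies among them. In addition, we propose an efficient approximate algorithm for inferring the dependent variable from the joint distribution. Experimental results on two datasets from the U.S. and Chinese movie markets show that our method outperforms discriminative baseline models.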
Acknowledgements
This work was supported by the National Basic Research Program of China (Grant No. 2014CB340503) and the National Natural Science Foundation of China (Grant Nos. 71532004, 61133012, 61472107).
Additional information
Conflict of interest: The authors declare that they have no conflict of interest.
About this article
Cite this article
Duan, J., Ding, X. & Liu, T. A Gaussian copula regression model for movie box-office revenues prediction. Sci. China Inf. Sci. 60, 092103 (2017). https://doi.org/10.1007/s11432-015-0905-6
DOI: https://doi.org/10.1007/s11432-015-0905-6