ABSTRACT
Online display advertisers make extensive use of user segments to cluster users into targetable groups. When a segment is smaller than the size a campaign budget requires, probabilistic modeling is needed to expand it. This process is termed look-alike modeling. Given the multitude of data providers and online data sources, there are thousands of segments for each targetable consumer, extracted from billions of online (and even offline) actions performed by millions of users. Most advertisers, marketers, and publishers have to use large-scale distributed infrastructures to create thousands of user segments on a daily basis. Developing accurate data mining models efficiently within such platforms is a challenging task. The volume and variety of data can be a significant bottleneck for non-disk-resident algorithms, since the time needed to train and score hundreds of segments with millions of targetable users is non-trivial.
In this paper, we present a novel k-means-based distributed in-database algorithm for look-alike modeling, implemented within the nPario database system. We demonstrate the utility of the algorithm: it is accurate, invariant to the size and skew of the targetable audience (very few positive examples), and linearly dependent on the capacity and number of nodes in the distributed environment. To the best of our knowledge, this is the first commercially deployed distributed look-alike modeling implementation to solve this problem. We compare the performance of our algorithm with other distributed and non-distributed look-alike modeling techniques, and report the results over a multi-core environment.
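To make the idea of k-means-based segment expansion concrete, the following is a minimal single-machine sketch in Python. It is not the nPario implementation and is neither distributed nor in-database. The function names, the farthest-first initialization, and the scoring rule (ranking non-seed users by the share of seed users in their cluster) are all assumptions chosen for illustration; the abstract does not specify them.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns (centroids, assignments).

    Initialization is deterministic farthest-first (an assumption made
    so the sketch is reproducible): start from points[0], then repeatedly
    add the point farthest from its nearest existing centroid.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def expand_segment(points, seed_ids, k, target_size):
    """Return indices of non-seed users to add to the segment.

    Hypothetical scoring rule: a user's score is the fraction of seed
    (positive) users in that user's cluster. Ties keep index order.
    """
    _, assignments = kmeans(points, k)
    cluster_sizes = Counter(assignments)
    seed_counts = Counter(assignments[i] for i in seed_ids)
    density = {c: seed_counts[c] / cluster_sizes[c] for c in cluster_sizes}

    candidates = [i for i in range(len(points)) if i not in seed_ids]
    candidates.sort(key=lambda i: density[assignments[i]], reverse=True)
    needed = max(0, target_size - len(seed_ids))
    return candidates[:needed]


# Toy data (made up): two well-separated groups of users in a 2-D feature space.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),          # group A
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),  # group B
]
seed_ids = {0, 1}  # the existing (too small) segment lies in group A

lookalikes = expand_segment(points, seed_ids, k=2, target_size=4)
print(lookalikes)  # prints [2, 3]
```

In this toy run the two added users are the remaining members of group A, the cluster that contains the seed users. A production system along the lines the abstract describes would additionally have to handle heavily skewed positives and run the assignment and update steps across database nodes, which this sketch does not attempt.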
Index Terms
- Audience segment expansion using distributed in-database k-means clustering