Abstract
We present Silverback+, a scalable probabilistic framework for accurate association rule and frequent item-set mining of large-scale social behavioral data. Silverback+ addresses efficient storage utilization and management through (1) a probabilistic columnar infrastructure and (2) Bloom filters and sampling techniques. In addition, probabilistic pruning techniques based on the Apriori method are developed to accelerate the mining of frequent item-sets. The proposed target-driven techniques significantly reduce the number of frequent item-set candidates, as well as the number of repetitive membership checks required, through a novel list intersection algorithm. Extensive experimental evaluations demonstrate the benefits of this context-aware approach, which takes infrastructure limitations into account when applying the corresponding research techniques. Compared with the traditional Hadoop-based approach of improving scalability by simply adding more hosts, Silverback+ exhibits much better runtime performance, with negligible loss of accuracy.
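To make the ingredients named in the abstract concrete, the following is a minimal Python sketch, not the Silverback+ implementation. It assumes a columnar layout in which each item maps to an ascending list of transaction ids, applies the standard Apriori property to prune pair candidates, counts support with a plain two-pointer list intersection (a stand-in for the paper's novel intersection algorithm, which the abstract does not describe), and uses a small Bloom filter for approximate membership checks. All names (`BloomFilter`, `intersect_sorted`, `frequent_pairs`, `bloom_support`) and the parameter values are hypothetical and chosen for illustration only.

```python
import hashlib
from itertools import combinations


class BloomFilter:
    """Fixed-size Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending lists of transaction ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def frequent_pairs(columns, min_support):
    """Return {(x, y): support} for item pairs meeting min_support.

    columns maps each item to its ascending list of transaction ids.
    """
    # Apriori property: a pair can be frequent only if both items are frequent.
    frequent_items = sorted(
        item for item, tids in columns.items() if len(tids) >= min_support
    )
    result = {}
    for x, y in combinations(frequent_items, 2):
        support = len(intersect_sorted(columns[x], columns[y]))
        if support >= min_support:
            result[(x, y)] = support
    return result


def bloom_support(tids_x, bloom_y):
    """Approximate support of {x, y}: never below the exact value."""
    return sum(1 for tid in tids_x if tid in bloom_y)


# Toy data (hypothetical): item -> transaction ids containing that item.
columns = {"a": [1, 2, 3, 5], "b": [2, 3, 5, 8], "c": [1, 8], "d": [3]}
pairs = frequent_pairs(columns, min_support=2)

bloom_b = BloomFilter()
for tid in columns["b"]:
    bloom_b.add(tid)
estimate = bloom_support(columns["a"], bloom_b)
```

In this toy run, item `d` is pruned before any pair is formed, so only three of the six possible pairs are intersected. The Bloom-filter estimate can overcount because of false positives but cannot undercount, which is the kind of accuracy trade-off a probabilistic store accepts in exchange for compact storage.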





Acknowledgments
This work is supported in part by the following grants: NSF awards CCF-1029166, IIS-1343639, CCF-1409601, CNS-0910952 and III 1213038; DOE awards DE-SC0007456 and DE-SC0014330; and ONR grant N00014-14-1-0215.
Cite this article
Xie, Y., Chen, Z., Palsetia, D. et al. SILVERBACK+: scalable association mining via fast list intersection for columnar social data. Knowl Inf Syst 50, 969–997 (2017). https://doi.org/10.1007/s10115-016-0962-8
DOI: https://doi.org/10.1007/s10115-016-0962-8