SILVERBACK+: scalable association mining via fast list intersection for columnar social data

  • Regular Paper
Knowledge and Information Systems

Abstract

We present Silverback+, a scalable probabilistic framework for accurate association rule and frequent item-set mining of large-scale social behavioral data. Silverback+ addresses efficient storage utilization and management through (1) a probabilistic columnar infrastructure and (2) Bloom filters and sampling techniques. In addition, we develop probabilistic pruning techniques based on the Apriori method to accelerate the mining of frequent item-sets. The proposed target-driven techniques significantly reduce the size of the set of frequent item-set candidates and, through a novel list intersection algorithm, the number of repetitive membership checks required. Extensive experimental evaluations demonstrate the benefit of taking infrastructure limitations into account when applying these techniques. Compared with the traditional Hadoop-based approach of improving scalability by simply adding more hosts, Silverback+ exhibits much better runtime performance with negligible loss of accuracy.
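For readers unfamiliar with the building blocks named in the abstract, the following is a minimal Python sketch of three textbook ingredients: a Bloom filter for approximate membership checks, a two-pointer intersection of sorted transaction-ID lists, and Apriori-style candidate pruning. It is not the Silverback+ algorithm, whose details are not given on this page. All names (`BloomFilter`, `intersect_sorted`, `frequent_itemsets`), the filter parameters, and the in-memory "item to sorted transaction IDs" layout are assumptions chosen for illustration.

```python
import hashlib
from itertools import combinations


class BloomFilter:
    """Illustrative Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending lists of transaction IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def frequent_itemsets(tid_lists, min_support):
    """Apriori over a vertical layout: item -> sorted list of transaction IDs."""
    # Level 1: keep single items that occur in at least min_support transactions.
    level = {frozenset([item]): tids
             for item, tids in tid_lists.items() if len(tids) >= min_support}
    result = dict(level)
    while level:
        next_level = {}
        for (s1, t1), (s2, t2) in combinations(level.items(), 2):
            candidate = s1 | s2
            # Only join sets that differ in exactly one item; skip known candidates.
            if len(candidate) != len(s1) + 1 or candidate in next_level:
                continue
            # Apriori pruning: every subset one item smaller must be frequent.
            if any(candidate - {x} not in level for x in candidate):
                continue
            # Support of the union is the size of the ID-list intersection.
            tids = intersect_sorted(t1, t2)
            if len(tids) >= min_support:
                next_level[candidate] = tids
        result.update(next_level)
        level = next_level
    return result
```

As a usage example, calling `frequent_itemsets({"a": [1, 2, 3, 5], "b": [1, 2, 4, 5], "c": [2, 3, 5], "d": [4]}, 2)` drops item `d` at the first level and ends with the three-item set `{a, b, c}` supported by transactions 2 and 5. In this sketch the Bloom filter is a standalone helper: it could stand in for an exact ID list when memory is tight, at the cost of occasional false positives.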

Acknowledgments

This work is supported in part by the following Grants: NSF awards CCF-1029166, IIS-1343639, CCF-1409601, CNS-0910952 and III 1213038; DOE awards DE-SC0007456, DE-SC0014330; ONR Grant N00014-14-1-0215.

Author information

Corresponding author

Correspondence to Yusheng Xie.

About this article

Cite this article

Xie, Y., Chen, Z., Palsetia, D. et al. SILVERBACK+: scalable association mining via fast list intersection for columnar social data. Knowl Inf Syst 50, 969–997 (2017). https://doi.org/10.1007/s10115-016-0962-8

  • DOI: https://doi.org/10.1007/s10115-016-0962-8
