Harnessing Hundreds of Millions of Cases: Case-Based Prediction at Industrial Scale

Jalali, Vahid; Leake, David

doi:10.1007/978-3-030-01081-2_11

Harnessing Hundreds of Millions of Cases: Case-Based Prediction at Industrial Scale

Vahid Jalali¹⁶ &
David Leake¹⁷

Conference paper
First Online: 09 October 2018

1031 Accesses
1 Citations

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11156))

Abstract

Building predictive models is central to many big data applications. However, model building is computationally costly at scale. An appealing alternative is bypassing model building by applying case-based prediction to reason directly from data. However, to our knowledge case-based prediction still has not been applied at true industrial scale. In previous work we introduced a knowledge-light/data intensive approach to case-based prediction, using ensembles of automatically-generated adaptations. We developed foundational scaleup methods, using Locality Sensitive Hashing (LSH) for fast approximate nearest neighbor retrieval of both cases and adaptation rules, and tested them for millions of cases. This paper presents research on extending these methods to address the practical challenges raised by case bases of hundreds of millions of cases for a real world industrial e-commerce application. Handling this application required addressing how to keep LSH practical for skewed data; the resulting efficiency gains in turn enabled applying an adaptation generation strategy that previously was computationally infeasible. Experimental results show that our CBR approach achieves accuracy comparable to or better than state of the art machine learning methods commonly applied, while avoiding their model-building cost. This supports the opportunity to harness CBR for industrial scale prediction.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

1.
https://www.kaggle.com/zurfer/rtb/data.

References

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K. (ed.) ASWC/ISWC -2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52
Chapter Google Scholar
Beaver, I., Dumoulin, J.: Applying mapreduce to learning user preferences in near real-time. In: Delany, S.J., Ontañón, S. (eds.) ICCBR 2013. LNCS, vol. 7969, pp. 15–28. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39056-2_2
Chapter Google Scholar
Bi, Z., Faloutsos, C., Korn, F.: The “DGX” distribution for mining massive, skewed data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2001, pp. 17–26. ACM, New York (2001)
Google Scholar
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (eds.) Proceedings of COMPSTAT 2010, pp. 177–186. Physica-Verlag HD, Heidelberg (2010). https://doi.org/10.1007/978-3-7908-2604-3_16
Chapter Google Scholar
Chawla, N.V.: Data mining for imbalanced datasets: an overview. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 875–886. Springer, Boston (2010). https://doi.org/10.1007/0-387-25465-X_40
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG 2004, pp. 253–262. ACM, New York (2004)
Google Scholar
Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. VLDB 99, 518–529 (1999)
Google Scholar
Hanney, K., Keane, M.T.: Learning adaptation rules from a case-base. In: Smith, I., Faltings, B. (eds.) EWCBR 1996. LNCS, vol. 1168, pp. 179–192. Springer, Heidelberg (1996). https://doi.org/10.1007/BFb0020610
Chapter Google Scholar
Houeland, T.G., Aamodt, A.: The utility problem for lazy learners - towards a non-eager approach. In: Bichindaritz, I., Montani, S. (eds.) ICCBR 2010. LNCS (LNAI), vol. 6176, pp. 141–155. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14274-1_12
Chapter Google Scholar
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC 1998, pp. 604–613. ACM, New York (1998)
Google Scholar
Jalali, V., Leake, D.: CBR meets big data: a case study of large-scale adaptation rule generation. In: Hüllermeier, E., Minor, M. (eds.) ICCBR 2015. LNCS, vol. 9343, pp. 181–196. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24586-7_13
Chapter Google Scholar
Jalali, V., Leake, D.: Scaling up ensemble of adaptations for classification by approximate nearest neighbor retrieval. In: Aha, D.W., Lieber, J. (eds.) ICCBR 2017. LNCS (LNAI), vol. 10339, pp. 154–169. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-61030-6_11
Chapter Google Scholar
Jalali, V., Leake, D., Forouzandehmehr, N.: Ensemble of adaptations for classification: learning adaptation rules for categorical features. In: Goel, A., Díaz-Agudo, M.B., Roth-Berghofer, T. (eds.) ICCBR 2016. LNCS, vol. 9969, pp. 186–202. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47096-2_13
Chapter Google Scholar
Jalali, V., Leake, D.: A context-aware approach to selecting adaptations for case-based reasoning. In: Brézillon, P., Blackburn, P., Dapoigny, R. (eds.) CONTEXT 2013. LNCS, vol. 8175, pp. 101–114. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40972-1_8
Chapter Google Scholar
Jalali, V., Leake, D.: Extending case adaptation with automatically-generated ensembles of adaptation rules. In: Delany, S.J., Ontañón, S. (eds.) ICCBR 2013. LNCS, vol. 7969, pp. 188–202. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39056-2_14
Chapter Google Scholar
Jalali, V., Leake, D.: Adaptation-guided case base maintenance. In: Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, pp. 1875–1881. AAAI Press (2014)
Google Scholar
Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: IEEE International Conference on Computer Vision ICCV (2009)
Google Scholar
Leake, D., Smyth, B., Wilson, D., Yang, Q. (eds.): Maintaining Case-Based Reasoning Systems. Blackwell, Malden (2001). Special issue of Computational Intelligence 17(2) (2001)
Google Scholar
Leetaru, K., Schrodt, P.A.: GDELT: global data on events, location, and tone. ISA Annual Convention (2013)
Google Scholar
Lin, Y.B., Ping, X.O., Ho, T.W., Lai, F.: Processing and analysis of imbalanced liver cancer patient data by case-based reasoning. In: The 7th 2014 Biomedical Engineering International Conference, pp. 1–5, November 2014
Google Scholar
Malof, J., Mazurowski, M., Tourassi, G.: The effect of class imbalance on case selection for case-based classifiers: an empirical study in the context of medical decision support. Neural Netw. 25, 141–145 (2012)
Article Google Scholar
Meng, X., et al.: MLlib: machine learning in apache spark. CoRR abs/1505.06807 (2015)
Google Scholar
Mühleisen, H., Bizer, C.: Web data commons - extracting structured data from two large web corpora. In: Bizer, C., Heath, T., Berners-Lee, T., Hausenblas, M. (eds.) WWW 2012 Workshop on Linked Data on the Web, Lyon, France, 16 April 2012. CEUR Workshop Proceedings, vol. 937. CEUR-WS.org (2012)
Google Scholar
Ontañón, S., Plaza, E.: Collaborative case retention strategies for CBR agents. In: Ashley, K.D., Bridge, D.G. (eds.) ICCBR 2003. LNCS, vol. 2689, pp. 392–406. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-45006-8_31
Chapter MATH Google Scholar
Palmer, C.R., Faloutsos, C.: Density biased sampling: an improved method for data mining and clustering. SIGMOD Rec. 29(2), 82–92 (2000)
Article Google Scholar
Rojas, J.A.R., Kery, M.B., Rosenthal, S., Dey, A.: Sampling techniques to improve big data exploration. In: 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV), pp. 26–35, October 2017
Google Scholar
Salamó, M., López-Sánchez, M.: Adaptive case-based reasoning using retention and forgetting strategies. Knowl. Based Syst. 24(2), 230–247 (2011)
Article Google Scholar
Smyth, B., Cunningham, P.: The utility problem analysed. In: Smith, I., Faltings, B. (eds.) EWCBR 1996. LNCS, vol. 1168, pp. 392–399. Springer, Heidelberg (1996). https://doi.org/10.1007/BFb0020625
Chapter Google Scholar
Smyth, B., Keane, M.: Remembering to forget: a competence-preserving case deletion policy for case-based reasoning systems. In: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 377–382. Morgan Kaufmann, San Mateo (1995)
Google Scholar
Smyt, B., McKenna, E.: Footprint-based retrieval. In: Althoff, K.-D., Bergmann, R., Branting, L.K. (eds.) ICCBR 1999. LNCS, vol. 1650, pp. 343–357. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48508-2_25
Chapter Google Scholar
Upadhyaya, S.R.: Parallel approaches to machine learning a comprehensive survey. J. Parallel Distrib. Comput. 73(3), 284–292 (2013). Models and Algorithms for High-Performance Distributed Data Mining
Google Scholar
Zhu, J., Yang, Q.: Remembering to add: competence-preserving case-addition policies for case base maintenance. In: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp. 234–241. Morgan Kaufmann (1999)
Google Scholar

Download references

Author information

Authors and Affiliations

Walmartlabs, Sunnyvale, CA, 94086, USA
Vahid Jalali
School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA
David Leake

Authors

Vahid Jalali
View author publications
You can also search for this author in PubMed Google Scholar
David Leake
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Vahid Jalali .

Editor information

Editors and Affiliations

Wright State University, Dayton, OH, USA
Michael T. Cox
Mälardalen University, Västeras, Sweden
Peter Funk
Mälardalen University, Västeras, Sweden
Shahina Begum

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Jalali, V., Leake, D. (2018). Harnessing Hundreds of Millions of Cases: Case-Based Prediction at Industrial Scale. In: Cox, M., Funk, P., Begum, S. (eds) Case-Based Reasoning Research and Development. ICCBR 2018. Lecture Notes in Computer Science(), vol 11156. Springer, Cham. https://doi.org/10.1007/978-3-030-01081-2_11

Download citation

DOI: https://doi.org/10.1007/978-3-030-01081-2_11
Published: 09 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01080-5
Online ISBN: 978-3-030-01081-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics