
Efficient Model Store and Reuse in an OLML Database System

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Deep learning has achieved significant improvements on a wide range of machine learning tasks by introducing a broad spectrum of neural network models. Yet these neural network models require a tremendous amount of labeled training data, which is prohibitively expensive to obtain in practice. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in new training tasks to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, so that a group of trained models with high reuse potential for the new task can be selected efficiently. The selected models are then trained iteratively to encourage diversity, and a better training effect is achieved by ensembling them. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse improves the training effect significantly over models trained from scratch when training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies illustrate that the OLML database properly stores trained models and reuses them efficiently in new training tasks.
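To make the reuse pipeline described above concrete, the following is a minimal Python sketch of that flow, not the paper's implementation: each stored model is scored by an assumed combination of domain relatedness and model quality, the top-k candidates are selected, and their predictions are combined by majority vote. All names (StoredModel, reuse_potential, select_models) and the multiplicative scoring form are illustrative assumptions, and the paper's iterative, diversity-encouraging fine-tuning step is omitted.

```python
# Minimal sketch of an AdaReuse-style selection-and-ensemble flow.
# All interfaces and scoring choices here are hypothetical placeholders,
# not the OLML database's actual implementation.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class StoredModel:
    name: str
    # Hypothetical: similarity between the model's source domain and the new task data.
    domain_relatedness: Callable[[Sequence[str]], float]
    # Hypothetical: held-out quality score of the stored model.
    quality: float
    # Label prediction for a single example.
    predict: Callable[[str], int]


def reuse_potential(model: StoredModel, new_texts: Sequence[str]) -> float:
    """Combine domain relatedness and model quality into one score (assumed form)."""
    return model.domain_relatedness(new_texts) * model.quality


def select_models(store: List[StoredModel], new_texts: Sequence[str], k: int = 3) -> List[StoredModel]:
    """Pick the k stored models with the highest estimated reuse potential."""
    return sorted(store, key=lambda m: reuse_potential(m, new_texts), reverse=True)[:k]


def ensemble_predict(models: List[StoredModel], text: str) -> int:
    """Majority vote over the selected models (iterative fine-tuning omitted here)."""
    votes = [m.predict(text) for m in models]
    return max(set(votes), key=votes.count)


if __name__ == "__main__":
    # Toy model store with two hypothetical sentiment classifiers.
    store = [
        StoredModel("news_sentiment", lambda xs: 0.4, 0.90, lambda x: int("good" in x)),
        StoredModel("review_sentiment", lambda xs: 0.8, 0.85, lambda x: int("great" in x or "good" in x)),
    ]
    task_texts = ["the product is great", "not good at all"]
    selected = select_models(store, task_texts, k=2)
    print([m.name for m in selected])
    print(ensemble_predict(selected, "this is a good phone"))
```

In the full system, the SQL-like training query would drive this selection step automatically: the planner estimates reuse potential for the stored models, fine-tunes the selected ones on the new task's small training set, and returns the ensemble as the trained model.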



Author information

Corresponding author: Xiao-Yong Du.

Supplementary Information

ESM 1 (PDF 354 kb)


About this article


Cite this article

Cui, JW., Lu, W., Zhao, X. et al. Efficient Model Store and Reuse in an OLML Database System. J. Comput. Sci. Technol. 36, 792–805 (2021). https://doi.org/10.1007/s11390-021-1353-5

