ABSTRACT
Event-processing systems can support high-quality reactions to events by providing context to the event agents. When this context consists of a large amount of data, it helps to train an analytic model for it. In a continuously running solution, this model must be kept up to date; otherwise, quality degrades. Unfortunately, ripple-through effects make training (whether from scratch or incremental) expensive. This paper tackles the problem of keeping training cost low and model quality high. We propose AQuA, a quality-directed adaptive analytics retraining framework. AQuA incrementally tracks model quality and retrains only when necessary. AQuA can identify both gradual and abrupt model drift. We implement several retraining strategies in AQuA and find that a sliding-window strategy consistently outperforms the rest. AQuA is simple to implement over off-the-shelf big-data platforms. We evaluate AQuA on two real-world datasets and three widely used machine learning algorithms, and show that AQuA effectively balances model quality against training effort.
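As a rough illustration of the quality-directed idea described above, the following minimal Python sketch tracks a running error rate and retrains on a sliding window of recent data only when that rate crosses a threshold. It is not AQuA's implementation: the class name `QualityDirectedRetrainer`, the helper `train_majority`, the 0/1 error measure, and all parameter values are assumptions chosen for the example.

```python
from collections import Counter, deque


def train_majority(samples):
    """Toy trainer (assumption): always predicts the most common label."""
    label = Counter(y for _, y in samples).most_common(1)[0][0]
    return lambda x: label


class QualityDirectedRetrainer:
    """Hypothetical sketch: retrain only when tracked quality degrades."""

    def __init__(self, train_fn, window_size, min_samples, error_threshold):
        self.train_fn = train_fn
        self.window = deque(maxlen=window_size)   # sliding window of (x, y)
        self.errors = deque(maxlen=min_samples)   # recent 0/1 prediction errors
        self.min_samples = min_samples
        self.error_threshold = error_threshold
        self.model = None
        self.retrain_count = 0

    def observe(self, x, y):
        # Incrementally track quality: score the current model on the new sample.
        if self.model is not None:
            self.errors.append(0 if self.model(x) == y else 1)
        self.window.append((x, y))

        if self.model is None:
            # Initial training once enough data has arrived.
            if len(self.window) >= self.min_samples:
                self._retrain()
        elif len(self.errors) == self.min_samples:
            # Retrain only when the recent error rate exceeds the threshold.
            if sum(self.errors) / len(self.errors) > self.error_threshold:
                self._retrain()

    def _retrain(self):
        # Sliding-window strategy: train on the most recent samples only.
        self.model = self.train_fn(list(self.window))
        self.errors.clear()
        self.retrain_count += 1


# Simulated stream with an abrupt drift from label "a" to label "b".
retrainer = QualityDirectedRetrainer(
    train_majority, window_size=30, min_samples=10, error_threshold=0.3
)
stream = [(i, "a") for i in range(50)] + [(i, "b") for i in range(50, 100)]
for x, y in stream:
    retrainer.observe(x, y)

print(retrainer.retrain_count, retrainer.model(None))
```

In this toy run the model is retrained only a handful of times over 100 events, and the retrains cluster around the drift point. A real deployment would replace the majority-vote trainer and the fixed threshold with the quality metric and retraining strategy appropriate to the model at hand.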
Index Terms
- AQuA: adaptive quality analytics