DOI: 10.1145/2933267.2933309
research-article

AQuA: adaptive quality analytics

Published: 13 June 2016

ABSTRACT

Event-processing systems can support high-quality reactions to events by providing context to the event agents. When this context consists of a large amount of data, it helps to train an analytic model for it. In a continuously running solution, this model must be kept up to date; otherwise, quality degrades. Unfortunately, ripple-through effects make training (whether from scratch or incremental) expensive. This paper tackles the problem of keeping training cost low and model quality high. We propose AQuA, a quality-directed adaptive analytics retraining framework. AQuA incrementally tracks model quality and only retrains when necessary. AQuA can identify both gradual and abrupt model drift. We implement several retraining strategies in AQuA, and find that a sliding-window strategy consistently outperforms the rest. AQuA is simple to implement over off-the-shelf big-data platforms. We evaluate AQuA on two real-world datasets and three widely used machine learning algorithms, and show that AQuA effectively balances model quality against training effort.
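The idea of tracking model quality incrementally and retraining over a sliding window only when quality drops can be illustrated with a minimal sketch. This is not the paper's implementation: the class names, the accuracy threshold, the window sizes, and the trivial majority-class learner are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter, deque


class MajorityClassModel:
    """Hypothetical stand-in learner: predicts the most frequent training label."""

    def __init__(self):
        self.label = None

    def fit(self, examples):
        # examples is a list of (features, label) pairs
        counts = Counter(label for _, label in examples)
        self.label = counts.most_common(1)[0][0]

    def predict(self, features):
        return self.label


class QualityDirectedRetrainer:
    """Tracks recent accuracy and retrains on a sliding window when it drops.

    All parameters are illustrative assumptions, not values from the paper.
    """

    def __init__(self, model, window_size=20, quality_window=10, threshold=0.75):
        self.model = model
        self.data = deque(maxlen=window_size)     # sliding window of recent examples
        self.hits = deque(maxlen=quality_window)  # outcomes of recent predictions
        self.threshold = threshold
        self.trained = False
        self.retrain_count = 0

    def quality(self):
        # Fraction of recent predictions that were correct
        return sum(self.hits) / len(self.hits) if self.hits else 1.0

    def observe(self, features, label):
        """Test-then-train on one labeled example; return True if a retrain ran."""
        if self.trained:
            self.hits.append(self.model.predict(features) == label)
        self.data.append((features, label))

        quality_is_low = (
            len(self.hits) == self.hits.maxlen and self.quality() < self.threshold
        )
        if not self.trained or quality_is_low:
            self.model.fit(list(self.data))  # retrain on the sliding window only
            self.trained = True
            self.hits.clear()                # restart quality tracking for the new model
            self.retrain_count += 1
            return True
        return False


if __name__ == "__main__":
    retrainer = QualityDirectedRetrainer(MajorityClassModel())
    # Synthetic stream with an abrupt drift from label "a" to label "b"
    stream = [(i, "a") for i in range(30)] + [(i, "b") for i in range(30)]
    for features, label in stream:
        retrainer.observe(features, label)
    print("retrains:", retrainer.retrain_count)
    print("current prediction:", retrainer.model.predict(None))
```

In this sketch the model is retrained only on the initial example and after the tracked accuracy falls below the threshold, so a stable stream incurs no training cost. Real learners, quality metrics, and drift tests would replace the stand-ins used here.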


      Published in

      DEBS '16: Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems
      June 2016
      456 pages
      ISBN: 9781450340212
      DOI: 10.1145/2933267

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 13 June 2016



      Acceptance Rates

      Overall Acceptance Rate: 130 of 553 submissions, 24%
