research-article

AQuA: adaptive quality analytics

Authors: Wei Zhang (IBM Research), Martin Hirzel (IBM Research), David Grove (IBM Research)

DEBS '16: Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems, June 2016, Pages 169–180. https://doi.org/10.1145/2933267.2933309

Published: 13 June 2016

ABSTRACT

Event-processing systems can support high-quality reactions to events by providing context to the event agents. When this context consists of a large amount of data, it helps to train an analytic model for it. In a continuously running solution, this model must be kept up to date; otherwise, quality degrades. Unfortunately, ripple-through effects make training (whether from scratch or incremental) expensive. This paper tackles the problem of keeping training cost low and model quality high. We propose AQuA, a quality-directed adaptive analytics retraining framework. AQuA incrementally tracks model quality and retrains only when necessary. AQuA can identify both gradual and abrupt model drift. We implement several retraining strategies in AQuA and find that a sliding-window strategy consistently outperforms the rest. AQuA is simple to implement over off-the-shelf big-data platforms. We evaluate AQuA on two real-world datasets and three widely used machine learning algorithms, and show that AQuA effectively balances model quality against training effort.
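The general idea of quality-directed retraining with a sliding window can be illustrated with a small sketch. This is not AQuA's implementation: the class name `QualityDirectedRetrainer`, its parameters (`window_size`, `quality_window`, `threshold`), and the toy majority-class trainer are all hypothetical choices made for illustration. The sketch tracks accuracy over recent predictions and retrains on the most recent examples only when that accuracy falls below a threshold.

```python
from collections import Counter, deque


class QualityDirectedRetrainer:
    """Hypothetical sketch: retrain only when recent quality degrades."""

    def __init__(self, train, window_size=20, quality_window=10, threshold=0.7):
        self.train = train                         # callable: list of (x, y) -> predict function
        self.window = deque(maxlen=window_size)    # sliding window of recent labeled examples
        self.hits = deque(maxlen=quality_window)   # 1/0 outcomes of recent predictions
        self.threshold = threshold
        self.model = None
        self.retrain_count = 0

    def quality(self):
        # Accuracy over the most recent predictions (1.0 if nothing observed yet).
        return sum(self.hits) / len(self.hits) if self.hits else 1.0

    def observe(self, x, y):
        # Score the current model on the new example before learning from it.
        if self.model is not None:
            self.hits.append(1 if self.model(x) == y else 0)
        self.window.append((x, y))

        needs_model = self.model is None
        degraded = (len(self.hits) == self.hits.maxlen
                    and self.quality() < self.threshold)
        if needs_model or degraded:
            # Sliding-window strategy: retrain on the most recent examples only.
            self.model = self.train(list(self.window))
            self.hits.clear()
            self.retrain_count += 1


def train_majority(examples):
    """Toy stand-in for a real learner: always predict the majority label."""
    majority = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: majority


# Synthetic stream with an abrupt drift: the label flips from 0 to 1 halfway.
stream = [(i, 0) for i in range(50)] + [(i, 1) for i in range(50, 100)]

retrainer = QualityDirectedRetrainer(train_majority)
for x, y in stream:
    retrainer.observe(x, y)

print("retrainings:", retrainer.retrain_count)
print("recent quality:", retrainer.quality())
```

In this toy run the model is retrained only around the drift point rather than on every new example, which is the trade-off between training effort and model quality that the abstract describes. A real deployment would replace `train_majority` with an actual learner and choose the window sizes and threshold empirically.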

Index Terms

AQuA: adaptive quality analytics
1. Applied computing
  1. Enterprise computing
    1. Event-driven architectures

Published in
DEBS '16: Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems
June 2016
456 pages
ISBN:9781450340212
DOI:10.1145/2933267
General Chairs: Avigdor Gal (Technion, Israel), Matthias Weidlich (Humboldt-Universität zu Berlin, Germany)
Program Chairs: Vana Kalogeraki (Athens University of Economics and Business, Greece), Nalini Venkatasubramanian (University of California, Irvine)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
context
events
machine learning
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 130 of 553 submissions, 24%