Abstract
Data stream processing is an important function in many online applications, such as network traffic analysis, web applications, and financial data analysis. Computing summaries of a data stream is challenging because streaming data is generally unbounded and cannot be stored permanently or accessed more than once. In this paper, we propose two counter-based hierarchical (CHS) \(\epsilon \)-approximation algorithms that create hierarchical summaries of one-dimensional data. CHS maintains a data structure in which each entry contains an incoming data item and an associated counter that stores its frequency. Since not every item in the stream can be stored, CHS maintains only the frequent items (known as hierarchical heavy hitters) at various levels of the generalization hierarchy, exploiting the natural hierarchy of the data. The algorithm guarantees that each count is accurate to within an \(\epsilon \) bound. Furthermore, using aperiodic (CHS-A) and periodic (CHS-P) compression strategies, the proposed technique offers improved space complexities of \(O(\frac{\eta }{\epsilon })\) and \(O(\frac{\eta }{\epsilon }\log \epsilon N)\), respectively. We provide theoretical proofs for both the space and time requirements of the CHS algorithm. We also compare the proposed algorithm experimentally with existing benchmark techniques. The results show that the proposed algorithm requires fewer updates per data element and uses a moderate, bounded amount of memory. Moreover, precision-recall analysis demonstrates that the CHS algorithm produces higher-quality output than the existing benchmark techniques. For the experimental validation, we use both synthetic data derived from an open-source generator and real benchmark data sets from an international Internet Service Provider.
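The general idea of a counter-based hierarchical summary can be illustrated with a minimal Python sketch. This is not the CHS-A or CHS-P algorithm from the paper, whose compression rules and space bounds the abstract does not spell out. It is a simplified roll-up scheme under stated assumptions: items are dotted strings such as IPv4 addresses, the hierarchy generalizes one label at a time, compression runs every `ceil(1/epsilon)` items, and a counter below `epsilon * n` is merged into its parent. The names `HierarchicalSummary`, `parent`, `depth`, and `heavy_hitters` are hypothetical.

```python
from math import ceil


def parent(prefix):
    """Generalize by one level: 'a.b.c' -> 'a.b', 'a' -> '*', '*' -> None."""
    if prefix == "*":
        return None
    head, sep, _ = prefix.rpartition(".")
    return head if sep else "*"


def depth(prefix):
    """Number of labels in the prefix; the root '*' has depth 0."""
    return 0 if prefix == "*" else prefix.count(".") + 1


class HierarchicalSummary:
    """Illustrative roll-up summary (an assumption, not the paper's CHS)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.period = ceil(1 / epsilon)  # items between compressions (assumed)
        self.n = 0                       # number of items seen so far
        self.counts = {}                 # prefix -> counter

    def update(self, item):
        self.n += 1
        self.counts[item] = self.counts.get(item, 0) + 1
        if self.n % self.period == 0:
            self.compress()

    def compress(self):
        """Merge counters below epsilon * n into their parent, deepest first."""
        threshold = self.epsilon * self.n
        max_depth = max(map(depth, self.counts), default=0)
        for level in range(max_depth, 0, -1):
            for prefix in [p for p in self.counts if depth(p) == level]:
                count = self.counts[prefix]
                if count < threshold:
                    del self.counts[prefix]
                    up = parent(prefix)
                    self.counts[up] = self.counts.get(up, 0) + count

    def heavy_hitters(self, phi):
        """Report prefixes whose count, excluding reported descendants, is >= phi * n."""
        residual = dict(self.counts)
        result = {}
        max_depth = max(map(depth, residual), default=0)
        for level in range(max_depth, -1, -1):
            for prefix in [p for p in residual if depth(p) == level]:
                count = residual.pop(prefix)
                up = parent(prefix)
                if count >= phi * self.n:
                    result[prefix] = count
                elif up is not None:
                    # Not heavy on its own: pass the count to the next level up.
                    residual[up] = residual.get(up, 0) + count
        return result


# Usage: one dominant address, one busy subnet, and a little background traffic.
stream = (["10.0.0.1"] * 12
          + [f"10.0.1.{i}" for i in range(1, 6)]
          + [f"192.168.0.{i}" for i in range(1, 4)])
summary = HierarchicalSummary(epsilon=0.1)
for address in stream:
    summary.update(address)
hhh = summary.heavy_hitters(phi=0.2)
print(hhh)
```

Because an evicted counter is added to its parent instead of being discarded, the counters always sum to the number of items seen. A single address that is frequent on its own survives as a leaf entry, while many rare addresses under one subnet show up only as a counter on the shared prefix.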
Notes
- 1.
- 2. http://www.tcpdump.org/manpages/tcpdump.1.html, Accessed: 23/02/2015.
- 3. https://www.wireshark.org/, Accessed: 23/02/2015.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Shah, Z., Mahmood, A.N., Barlow, M. (2016). Computing Hierarchical Summary of the Data Streams. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_14
DOI: https://doi.org/10.1007/978-3-319-31750-2_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science, Computer Science (R0)