Adaptive watermark generation mechanism based on time series prediction for stream processing

  • Research Article
Frontiers of Computer Science

Abstract

Data stream processing frameworks process stream data based on event time so that requests can be answered in real time. In practice, streaming data usually arrives out of order due to factors such as network delay. Frameworks commonly adopt the watermark mechanism to handle this disorder. A watermark is a special record carrying a timestamp that is inserted into the data stream; it helps the framework decide whether a received record is late and should therefore be discarded. Traditional watermark generation strategies are periodic, so they cannot dynamically adjust the watermark distribution to balance responsiveness and accuracy. This paper proposes an adaptive watermark generation mechanism based on a time series prediction model to address this limitation. The mechanism dynamically adjusts the frequency and timing of watermark distribution using the disordered data ratio and other lateness properties of the data stream, improving system responsiveness while keeping result accuracy acceptable. We implement the proposed mechanism on top of Flink and evaluate it with real-world datasets. The experimental results show that our mechanism outperforms existing watermark distribution strategies in both system responsiveness and result accuracy.
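To make the watermark idea concrete, the following is a minimal sketch of the traditional periodic strategy that the abstract contrasts against, not the adaptive mechanism proposed in the paper. It assumes integer event timestamps, a fixed tolerated delay, and a watermark emitted after every fixed number of events; the names `BoundedDelayWatermarker`, `run`, `max_delay`, and `period` are hypothetical and do not come from the paper or from Flink.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class BoundedDelayWatermarker:
    """Hypothetical helper: tracks event-time progress with a fixed allowed delay."""
    max_delay: int       # assumed tolerance for out-of-order events, in time units
    max_seen: int = -1   # largest event timestamp observed so far
    watermark: int = -1  # last emitted watermark; -1 means none emitted yet

    def observe(self, event_time: int) -> bool:
        """Record an event and report whether it is late w.r.t. the current watermark."""
        is_late = event_time <= self.watermark
        self.max_seen = max(self.max_seen, event_time)
        return is_late

    def emit(self) -> int:
        """Advance the watermark to max_seen - max_delay, never moving it backwards."""
        self.watermark = max(self.watermark, self.max_seen - self.max_delay)
        return self.watermark


def run(event_times: Iterable[int], max_delay: int, period: int) -> Tuple[List[int], List[int]]:
    """Emit a watermark after every `period` events; split events into on-time and late."""
    wm = BoundedDelayWatermarker(max_delay)
    on_time: List[int] = []
    late: List[int] = []
    for count, t in enumerate(event_times, start=1):
        (late if wm.observe(t) else on_time).append(t)
        if count % period == 0:
            wm.emit()
    return on_time, late


if __name__ == "__main__":
    stream = [1, 3, 2, 7, 8, 4, 9, 5]  # made-up out-of-order event timestamps
    for delay in (2, 5):
        kept, dropped = run(stream, max_delay=delay, period=2)
        print(f"max_delay={delay}: on time {kept}, late {dropped}")
```

On this made-up stream, a tolerance of 2 marks the events with timestamps 4 and 5 as late, while a tolerance of 5 accepts every event at the cost of a watermark that trails further behind. That is the responsiveness versus accuracy trade-off that a fixed setting cannot rebalance as the disorder in the stream changes.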

Acknowledgements

We would like to thank the anonymous reviewers for their valuable feedback. This work was supported by the National Key Research and Development Program of China (2020YFB1506703) and the National Natural Science Foundation of China (Grant No. 62072018).

Author information

Corresponding author

Correspondence to Hailong Yang.

Additional information

Yang Song is a master's student in the School of Computer Science and Engineering, Beihang University, China. He is currently working on big data processing optimization. His research interests include distributed tracing systems and data streaming systems.

Yunchun Li received the PhD degree in computer science from Beihang University, China in 2008. He was a visiting scholar at the University of Illinois at Urbana-Champaign (UIUC), USA in 2010. He is now the director of the Network Information Center and a professor in the School of Computer Science and Engineering, Beihang University, China. He is the author of over 60 articles. His research interests include big data, cloud computing, and parallel computing.

Hailong Yang is an associate professor in the School of Computer Science and Engineering, Beihang University, China. He received the PhD degree from the School of Computer Science and Engineering, Beihang University, China in 2014. He has been involved in several scientific projects, such as performance analysis for big data systems and performance optimization for large-scale applications. His research interests include parallel and distributed computing and HPC.

Jun Xu is a senior engineer at the Beijing Simulation Center of the Second Institute of CASIC, China. She received the PhD degree in computer science and technology from Zhejiang University, China in 2011. Her research interest is the modeling and simulation of weapon equipment systems.

Zerong Luan is an undergraduate majoring in biomedical engineering at Beijing University of Technology, China. His research interests are proteome data analysis based on deep learning, medical image processing based on high performance computing, and medical instrument design based on embedded systems and AI chips.

Wei Li received the PhD degree in computer science from Beihang University, China. She is currently an associate professor with the School of Computer Science and Engineering, Beihang University, China. Her current research interests include network measurement, network virtualization, and cloud computing.

About this article

Cite this article

Song, Y., Li, Y., Yang, H. et al. Adaptive watermark generation mechanism based on time series prediction for stream processing. Front. Comput. Sci. 15, 156213 (2021). https://doi.org/10.1007/s11704-020-0206-7

  • DOI: https://doi.org/10.1007/s11704-020-0206-7