DotStar: breaking the scalability and performance barriers in parsing regular expressions

  • Special Issue Paper
Computer Science - Research and Development

Abstract

Regular expressions (regexps) are widely used to parse data and to detect recurring patterns, and are a common choice for defining configurable rules in a variety of systems. Many data-intensive applications rely on regexp matching as the first line of defense for on-line data filtering. Unfortunately, few solutions can keep up with increasing data rates and with the complexity of sets containing hundreds of expressions. In this paper we present DotStar (.*), a complete algorithmic solution and software tool-chain that compiles large sets of regexps into an automaton able to take advantage of the vector/SIMD extensions available on many commodity multi-core processors. DotStar relies on several algorithmic innovations to transform the user-provided regexp set into a sequence of manageable intermediate representations. The resulting automaton is both space- and time-efficient, and searches the input in a single pass without backtracking. The experimental evaluation, performed on a family of state-of-the-art processors, shows that DotStar efficiently handles both the small regexp sets used in protocol parsing and the larger sets designed for Network Intrusion Detection Systems (NIDS), achieving between 1 and 5 Gbit/s per core.
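To make the idea of single-pass matching without backtracking concrete, the following is a minimal sketch and not DotStar's algorithm: it uses the classic Aho-Corasick construction over literal keywords as a stand-in for a full regexp set. The names `build_automaton` and `scan` are hypothetical and chosen for illustration only.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick goto/fail/output tables for literal patterns."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> fallback state on a mismatch
    out = [set()]  # out[state] -> indices of patterns that end in this state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(idx)
    # A breadth-first pass fills in failure links and merges outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def scan(text, patterns, tables):
    """Report (start, pattern) pairs; each input character is read once."""
    goto, fail, out = tables
    state = 0
    matches = []
    for pos, ch in enumerate(text):
        # Follow failure links instead of re-reading earlier input.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in sorted(out[state]):
            matches.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return matches


patterns = ["he", "she", "his", "hers"]
tables = build_automaton(patterns)
matches = scan("ushers", patterns, tables)
```

All patterns are merged into one automaton, so the cost of a scan depends on the input length rather than on the number of patterns. This is the property the abstract claims for regexp sets, although handling general regexps requires considerably more machinery than this keyword-only sketch.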

Author information

Corresponding author

Correspondence to Davide Pasetto.

About this article

Cite this article

Pasetto, D., Petrini, F. & Agarwal, V. DotStar: breaking the scalability and performance barriers in parsing regular expressions. Comput Sci Res Dev 25, 93–104 (2010). https://doi.org/10.1007/s00450-010-0106-4

  • DOI: https://doi.org/10.1007/s00450-010-0106-4
