
Feature-rich Regular Expression Matching Accelerator for Text Analytics

Journal of Signal Processing Systems

Abstract

The volume of textual data available worldwide is increasing every day. Extracting the information hidden in this “Big Data” is a computationally intensive task. A key step of information extraction is the conversion of free text into a structured format, which is typically achieved using regular expressions (regexs) and dictionaries. Unlike network intrusion detection systems, information extraction systems detect and report precisely where the relevant information starts and ends within text documents. To improve precision and to eliminate ambiguity, regex matchers used in information extraction systems must support start and end offset reporting, capturing groups, and specific regex-matching semantics, such as leftmost matching. This work describes a scalable regex-matching accelerator that supports these advanced regex-matching features and can be implemented efficiently in reconfigurable logic. Experiments on proprietary and open-source regex sets comprising hundreds of regexs demonstrate up to a sixfold improvement in the area-delay product with respect to previous work.
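The matching features named in the abstract can be illustrated in software. The following is a minimal sketch using Python's standard `re` module, not the accelerator described in this work. The function names `extract_spans` and `leftmost_span` and the sample inputs are hypothetical and chosen only for illustration. Note that Python's `re` is a backtracking matcher: it reports the leftmost match and, among alternatives at that position, the first one that succeeds, which may differ from the exact semantics implemented in hardware.

```python
import re

def extract_spans(pattern, text):
    """Return (start, end, group_spans) for every non-overlapping match.

    start/end are the offsets of the whole match; group_spans lists the
    (start, end) offsets of each capturing group, in order.
    """
    results = []
    for m in re.finditer(pattern, text):
        group_spans = [m.span(i) for i in range(1, m.re.groups + 1)]
        results.append((m.start(), m.end(), group_spans))
    return results

def leftmost_span(pattern, text):
    """Return the (start, end) offsets of the leftmost match, or None."""
    m = re.search(pattern, text)
    return m.span() if m else None

# Hypothetical input: report where each phone-like token starts and ends,
# plus the offsets of its two capturing groups.
text = "Call 555-1234 or 555-9876 today."
spans = extract_spans(r"(\d{3})-(\d{4})", text)

# Leftmost matching: "b+" could match at offset 2, but the match that
# starts earliest (at offset 1, via the "ab+" alternative) is reported.
left = leftmost_span(r"b+|ab+", "xabb")

for start, end, groups in spans:
    print(start, end, text[start:end], groups)
print(left)
```

An intrusion detection system would typically only need to know that one of these patterns matched; an information extraction system also needs the offsets and group boundaries returned here.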




Acknowledgments

We thank Christoph Hagleitner and Raphael Polig from IBM Research - Zurich, and the anonymous reviewers for their technical comments, and Charlotte Bolliger from IBM Research - Zurich for her language-related corrections and comments.

Author information


Correspondence to Kubilay Atasu.


About this article


Cite this article

Atasu, K. Feature-rich Regular Expression Matching Accelerator for Text Analytics. J Sign Process Syst 85, 355–371 (2016). https://doi.org/10.1007/s11265-015-1052-y


  • DOI: https://doi.org/10.1007/s11265-015-1052-y
