Feature-rich Regular Expression Matching Accelerator for Text Analytics

Atasu, Kubilay

doi:10.1007/s11265-015-1052-y

Published: 09 October 2015

Volume 85, pages 355–371 (2016)

Journal of Signal Processing Systems

Abstract

The volume of textual data accessible on our planet is increasing every day. Extracting information hidden in this “Big Data” is a computationally intensive task. A key step of information extraction is the conversion of free text into a structured format. This step is typically achieved using regular expressions (regexs) and dictionaries. Unlike network intrusion detection systems, information extraction systems detect and report where precisely the specific and relevant information starts and ends within text documents. To improve precision and to eliminate ambiguity, regex matchers used in information extraction systems must support start and end offset position reporting, capturing groups, and specific regex-matching semantics, such as leftmost matching. This work describes a scalable regex-matching accelerator that supports such advanced regex-matching features and can be efficiently implemented in reconfigurable logic. Experiments on proprietary and open source regex sets comprising hundreds of regexs demonstrate an up to sixfold improvement of the area-delay product with respect to previous work.
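The matching features named in the abstract (start and end offset reporting, capturing groups, and leftmost matching) can be illustrated in software. The following is a minimal sketch using Python's standard `re` module, not the accelerator described in the article; the helper name `report_matches`, the sample text, and the patterns are hypothetical and chosen only for illustration.

```python
import re

def report_matches(pattern, text):
    """Return (start, end, captured groups) for each non-overlapping match."""
    return [(m.start(), m.end(), m.groups()) for m in re.finditer(pattern, text)]

# Hypothetical sample text and pattern, used only to illustrate the features.
text = "Call 555-1234 or 555-9876 today."
phone = r"(\d{3})-(\d{4})"  # two capturing groups

# Each result carries the start offset, the end offset, and the group contents.
for start, end, groups in report_matches(phone, text):
    print(start, end, text[start:end], groups)

# Leftmost matching: the match that starts earliest in the input is reported,
# even though the alternative "b" is listed first in the pattern.
leftmost = re.search(r"b|ab", "ab")
print(leftmost.span())
```

Python's `re` engine is a backtracking matcher that picks the first alternative that succeeds at the leftmost starting position. A hardware matcher may resolve ties among matches with the same start differently, so this sketch only shows what the reported information looks like.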

Acknowledgments

We thank Christoph Hagleitner and Raphael Polig from IBM Research - Zurich, and the anonymous reviewers for their technical comments, and Charlotte Bolliger from IBM Research - Zurich for her language-related corrections and comments.

Author information

Authors and Affiliations

IBM Research - Zurich, Säumerstrasse 4, CH-8803, Rüschlikon, Switzerland
Kubilay Atasu

Corresponding author

Correspondence to Kubilay Atasu.

About this article

Cite this article

Atasu, K. Feature-rich Regular Expression Matching Accelerator for Text Analytics. J Sign Process Syst 85, 355–371 (2016). https://doi.org/10.1007/s11265-015-1052-y

Received: 30 November 2014
Revised: 15 July 2015
Accepted: 21 September 2015
Published: 09 October 2015
Issue Date: December 2016
DOI: https://doi.org/10.1007/s11265-015-1052-y
