Abstract
The volume of textual data available worldwide is increasing every day. Extracting the information hidden in this “Big Data” is a computationally intensive task. A key step of information extraction is the conversion of free text into a structured format, which is typically achieved using regular expressions (regexs) and dictionaries. Unlike network intrusion detection systems, information extraction systems detect and report precisely where the relevant information starts and ends within text documents. To improve precision and to eliminate ambiguity, regex matchers used in information extraction systems must support start and end offset reporting, capturing groups, and specific regex-matching semantics, such as leftmost matching. This work describes a scalable regex-matching accelerator that supports these advanced regex-matching features and can be implemented efficiently in reconfigurable logic. Experiments on proprietary and open source regex sets comprising hundreds of regexs demonstrate up to a sixfold improvement in the area-delay product with respect to previous work.
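The three matcher features named in the abstract (start and end offset reporting, capturing groups, and leftmost matching) can be illustrated in software. The following is a minimal sketch that uses Python's standard `re` module as a stand-in. It is not the accelerator described in this work, and the helper name `extract`, the sample text, and the patterns are assumptions made purely for illustration.

```python
import re


def extract(pattern, text):
    """Return (start, end, group_spans) for each non-overlapping match.

    start/end are character offsets into `text`; group_spans holds the
    (start, end) offsets of every capturing group in the pattern.
    """
    results = []
    for m in re.finditer(pattern, text):
        # m.re.groups is the number of capturing groups in the pattern.
        group_spans = [m.span(i) for i in range(1, m.re.groups + 1)]
        results.append((m.start(), m.end(), group_spans))
    return results


# Hypothetical input: two phone-like tokens in free text.
text = "call 555-1234 or 555-9876"
phone_matches = extract(r"(\d{3})-(\d{4})", text)

# Leftmost matching: the alternative "b+" could match starting at offset 1,
# but the match that starts earliest in the text (offset 0) is reported.
leftmost = extract(r"b+|ab+", "abbb")
```

Here `phone_matches` contains one entry per token, each with the overall span and the spans of the two capturing groups, which is the kind of structured output an information extraction pipeline consumes. Note that Python's `re` resolves ties between matches with the same start by alternative order rather than by length, so it only approximates the matching semantics a particular extraction system may require.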
Acknowledgments
We thank Christoph Hagleitner and Raphael Polig from IBM Research - Zurich, and the anonymous reviewers for their technical comments, and Charlotte Bolliger from IBM Research - Zurich for her language-related corrections and comments.
Cite this article
Atasu, K. Feature-rich Regular Expression Matching Accelerator for Text Analytics. J Sign Process Syst 85, 355–371 (2016). https://doi.org/10.1007/s11265-015-1052-y
DOI: https://doi.org/10.1007/s11265-015-1052-y