DOI: 10.1145/1871840.1871848
research-article

Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text

Published: 26 October 2010

Abstract

Regular expressions have been used for information extraction tasks in a variety of domains. The alphabet of a regular expression can be either the relevant tokens corresponding to the entity of interest or individual characters, in which case the alphabet size becomes very large. Noise in unstructured text documents, together with the increased alphabet size, poses a significant challenge both for entity extraction and for algorithmically learning complex regular expressions. In this paper, we present a novel algorithm for regular expression learning that clusters similar matches to obtain the corresponding regular expressions, identifies and eliminates noisy clusters, and finally uses a weighted disjunction of the most promising candidate regular expressions to obtain the final expression. Experimental results show high precision and recall for this final expression, which supports the applicability of our approach to entity extraction tasks of practical importance.
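
The abstract names three stages (cluster similar matches, discard noisy clusters, combine the best candidates in a weighted disjunction) but gives no detail on how each is done. The Python sketch below is only a toy illustration of that outline, not the authors' algorithm. It assumes that grouping examples by a coarse character-class signature can stand in for clustering, that cluster support can stand in for the noise test, and that the support fraction can serve as the weight. The names `generalize`, `learn_regex`, `min_support`, and `top_k` are hypothetical.

```python
import re
from collections import Counter


def char_class(ch):
    """Map a character to a coarse regex class (an assumed generalization)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # keep punctuation and other characters literal


def generalize(example):
    """Build a regex for one example by collapsing runs of the same class."""
    parts = []
    for ch in example:
        cls = char_class(ch)
        if parts and parts[-1] == cls:
            continue  # same class as the previous character: covered by '+'
        parts.append(cls)
    return "".join(p + "+" for p in parts)


def learn_regex(examples, min_support=2, top_k=3):
    """Cluster examples by signature, drop rare clusters, disjoin the rest.

    Returns (pattern, candidates), where candidates is a list of
    (regex, weight) pairs and weight is the fraction of examples covered.
    """
    clusters = Counter(generalize(e) for e in examples)
    total = sum(clusters.values())
    # Assumption: clusters with too few members are treated as noise.
    kept = [(rx, n / total) for rx, n in clusters.most_common()
            if n >= min_support][:top_k]
    if not kept:
        raise ValueError("no cluster reached min_support")
    pattern = "|".join(f"(?:{rx})" for rx, _ in kept)
    return pattern, kept


# Hypothetical sample data: phone-like strings plus one noisy entry.
examples = ["555-1234", "555-9876", "555-0000", "5551234", "5559999", "call me"]
pattern, candidates = learn_regex(examples)
```

With this sample data, the single free-text entry forms a cluster of size one and is discarded, while the two digit-based clusters survive and are joined by alternation. A real system would need a proper similarity measure and a use for the weights beyond ranking; here they only order and truncate the candidate list.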

      Published In

      AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
      October 2010
      96 pages
      ISBN:9781450303767
      DOI:10.1145/1871840

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. clustering in noisy text
      2. regular expression learning
      3. rule-based information extraction

      Qualifiers

      • Research-article

      Conference

      CIKM '10

      Acceptance Rates

      Overall Acceptance Rate 15 of 22 submissions, 68%

      Cited By

      • (2024) Automatic Regular Expression Generation for Extracting Relevant Image Data From Web Pages Using Genetic Algorithms. IEEE Access 12 (90660-90669). DOI: 10.1109/ACCESS.2024.3420734. Online publication date: 2024
      • (2023) Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency. Proceedings of the ACM Web Conference 2023 (3859-3867). DOI: 10.1145/3543507.3583515. Online publication date: 30-Apr-2023
      • (2021) Automatic Search-and-Replace From Examples With Coevolutionary Genetic Programming. IEEE Transactions on Cybernetics 51:5 (2612-2624). DOI: 10.1109/TCYB.2019.2918337. Online publication date: May-2021
      • (2020) Regular Expression Learning from Positive Examples Based on Integer Programming. International Journal of Software Engineering and Knowledge Engineering 30:10 (1443-1479). DOI: 10.1142/S0218194020400203. Online publication date: 9-Nov-2020
      • (2020) CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions. IEEE Access (1-1). DOI: 10.1109/ACCESS.2020.2972205. Online publication date: 2020
      • (2019) Learning Regexes to Extract Router Names from Hostnames. Proceedings of the Internet Measurement Conference (337-350). DOI: 10.1145/3355369.3355589. Online publication date: 21-Oct-2019
      • (2019) Learning Regular Expressions Using XCS-Based Classifier System. International Journal of Pattern Recognition and Artificial Intelligence 34:10 (2051011). DOI: 10.1142/S0218001420510118. Online publication date: 31-Dec-2019
      • (2018) Content and context. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence (5924-5931). DOI: 10.5555/3504035.3504762. Online publication date: 2-Feb-2018
      • (2018) How well are regular expressions tested in the wild? Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (668-678). DOI: 10.1145/3236024.3236072. Online publication date: 26-Oct-2018
      • (2018) Active Learning of Regular Expressions for Entity Extraction. IEEE Transactions on Cybernetics 48:3 (1067-1080). DOI: 10.1109/TCYB.2017.2680466. Online publication date: Mar-2018
