DOI:10.1145/2851613.2851668
research-article

Active learning approaches for learning regular expressions with genetic programming

Published: 04 April 2016

Abstract

We consider the long-standing problem of the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior. We investigate several active learning approaches in which the user annotates only one desired extraction and then merely answers extraction queries generated by the system.
The resulting framework is attractive because the system, not the user, searches the data for the samples best suited to the specific learning task. We tailor our proposals to a state-of-the-art learner based on Genetic Programming and assess them experimentally on a number of challenging tasks of realistic complexity. The results indicate that active learning is a viable framework in this application domain and may significantly reduce the costly annotation effort required.
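The interaction described in the abstract (one annotated extraction, then system-generated extraction queries) can be illustrated with a minimal Python sketch. This is not the authors' method. The abstract does not say which query strategies were evaluated, so the sketch assumes one common choice: ask about the line on which the candidate regular expressions disagree most. The Genetic Programming learner is replaced by a fixed, hypothetical list of candidates, and the user is simulated by a hypothetical `oracle` function. All names and sample strings are illustrative assumptions.

```python
import re

# Hypothetical candidate regexes standing in for the learner's current solutions.
# (In the paper the candidates come from a GP-based learner; here they are fixed.)
CANDIDATES = [r"\d{4}-\d{2}-\d{2}", r"\d+-\d+-\d+", r"[\d-]+"]

# Hypothetical unlabeled lines the system may ask the user about.
POOL = ["no digits here", "due 2025-03-31 or 2025-04-04",
        "call 555-12-34 now", "room 42"]


def extractions(pattern, text):
    """Return the set of non-empty (start, end) spans that pattern extracts."""
    return {m.span() for m in re.finditer(pattern, text) if m.end() > m.start()}


def oracle(text):
    """Simulated user (assumption): the desired extractions are ISO-style dates."""
    return extractions(r"\d{4}-\d{2}-\d{2}", text)


def disagreement(committee, text):
    """Count spans extracted by some, but not all, of the candidates."""
    span_sets = [extractions(p, text) for p in committee]
    return len(set().union(*span_sets) - set.intersection(*span_sets))


def consistent(pattern, labeled):
    """True if pattern extracts exactly the annotated spans on every labeled line."""
    return all(extractions(pattern, text) == spans for text, spans in labeled)


def active_learning(candidates, unlabeled, ask_user, first_example, budget=3):
    """Start from one annotated example, then query the most contested lines."""
    labeled = [first_example]
    committee = [p for p in candidates if consistent(p, labeled)]
    pool = list(unlabeled)
    for _ in range(budget):
        if len(committee) <= 1 or not pool:
            break
        # Query selection: the line with the highest disagreement.
        query = max(pool, key=lambda t: disagreement(committee, t))
        if disagreement(committee, query) == 0:
            break  # no remaining line can separate the candidates
        pool.remove(query)
        labeled.append((query, ask_user(query)))
        committee = [p for p in committee if consistent(p, labeled)]
    return committee, labeled


def demo():
    seed_text = "Published 2016-04-04 in Pisa"
    first = (seed_text, oracle(seed_text))  # the single user annotation
    return active_learning(CANDIDATES, POOL, oracle, first)


if __name__ == "__main__":
    survivors, labeled = demo()
    print("surviving candidates:", survivors)
    print("lines the user had to answer:", [text for text, _ in labeled[1:]])
```

In this toy setting all three candidates agree on the seed example, so the seed alone cannot separate them. A single query on a line where they disagree is enough to discard the two over-general candidates, which is the kind of annotation saving the abstract refers to.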

Published In

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing
April 2016
2360 pages
ISBN:9781450337397
DOI:10.1145/2851613

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. entity extraction
  2. information extraction
  3. machine learning
  4. programming by examples

Conference

SAC 2016: Symposium on Applied Computing
April 4 - 8, 2016
Pisa, Italy

Acceptance Rates

SAC '16 Paper Acceptance Rate 252 of 1,047 submissions, 24%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Cited By

  • (2023) An efficient regular expression inference approach for relevant image extraction. Applied Soft Computing, 135:C. DOI:10.1016/j.asoc.2023.110030. Online publication date: 1-Mar-2023
  • (2022) Automatic generation of regular expressions for the Regex Golf challenge using a local search algorithm. Genetic Programming and Evolvable Machines, 23:1 (105-131). DOI:10.1007/s10710-021-09411-x. Online publication date: 1-Mar-2022
  • (2020) Language Inference with Multi-head Automata through Reinforcement Learning. 2020 International Joint Conference on Neural Networks (IJCNN) (1-8). DOI:10.1109/IJCNN48605.2020.9207156. Online publication date: Jul-2020
  • (2020) An Active Learning Algorithm Based on Shannon Entropy for Constraint-Based Clustering. IEEE Access, 8 (171447-171456). DOI:10.1109/ACCESS.2020.3025036. Online publication date: 2020
  • (2018) Active Learning of Regular Expressions for Entity Extraction. IEEE Transactions on Cybernetics, 48:3 (1067-1080). DOI:10.1109/TCYB.2017.2680466. Online publication date: Mar-2018
  • (2016) Syntactical Similarity Learning by Means of Grammatical Evolution. Parallel Problem Solving from Nature – PPSN XIV (260-269). DOI:10.1007/978-3-319-45823-6_24. Online publication date: 31-Aug-2016
