ABSTRACT
Traditional relation extraction methods require pre-specified relations and relation-specific human-tagged examples. Bootstrapping systems significantly reduce the number of training examples required, but they usually apply heuristic-based methods to combine a set of strict hard rules, which limits their ability to generalize and thus yields low recall. Furthermore, existing bootstrapping methods do not perform open information extraction (Open IE), which can identify various types of relations without requiring pre-specification. In this paper, we propose a statistical extraction framework called Statistical Snowball (StatSnowball), a bootstrapping system that can perform both traditional relation extraction and Open IE.
StatSnowball uses discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights in a maximum likelihood estimation sense. The MLN is a general model and can be configured to perform different levels of relation extraction. In StatSnowball, pattern selection is performed by solving an l1-norm penalized maximum likelihood estimation problem, which enjoys well-founded theory and efficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small but fully labeled data set and large-scale Web data. Empirical results show that, starting from a small number of seeds, StatSnowball achieves significantly higher recall without sacrificing high precision across iterations, and that the joint inference of the MLN can further improve performance. Finally, StatSnowball is efficient, and we have developed a working entity relation search engine called Renlifang based on it.
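To illustrate how an l1-norm penalty can act as a pattern selector, the following is a minimal sketch, not the StatSnowball implementation. It assumes a plain logistic regression model as a stand-in for the MLN, binary features indicating whether a candidate pattern fires on an entity pair, and a simple proximal gradient (soft-thresholding) solver, which the abstract does not name. The function names, the toy data, and the values of `l1`, `step`, and `iters` are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrinks x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def fit_l1_logistic(X, y, l1=0.1, step=0.5, iters=500):
    """Minimize mean negative log-likelihood + l1 * ||w||_1 by proximal gradient.

    X: list of binary feature vectors (one column per candidate pattern).
    y: list of 0/1 labels (1 = the relation holds for this entity pair).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        # Gradient of the mean negative log-likelihood.
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        # Gradient step followed by soft-thresholding (the l1 proximal step).
        w = [soft_threshold(wj - step * gj, step * l1) for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy data: pattern 0 fires exactly on the positive pairs,
# pattern 1 fires equally often on positive and negative pairs.
X = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights = fit_l1_logistic(X, y)
# A pattern is "selected" when its learned weight is nonzero.
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
```

On this toy data the informative pattern keeps a positive weight while the uninformative one is held at exactly zero, which is the sparsity effect that makes an l1 penalty usable for selection. A stronger penalty (larger `l1`) would prune more patterns at the cost of shrinking the surviving weights.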
- StatSnowball: a statistical approach to extracting entity relationships