research-article

StatSnowball: a statistical approach to extracting entity relationships

Authors:
Jun Zhu, Tsinghua University, Beijing, China
Zaiqing Nie, Microsoft Research Asia, Beijing, China
Xiaojiang Liu, University of Science and Technology of China, Hefei, China
Bo Zhang, Tsinghua University, Beijing, China
Ji-Rong Wen, Microsoft Research Asia, Beijing, China

WWW '09: Proceedings of the 18th international conference on World wide web, April 2009, Pages 101–110, https://doi.org/10.1145/1526709.1526724

Published: 20 April 2009


ABSTRACT

Traditional relation extraction methods require pre-specified relations and relation-specific human-tagged examples. Bootstrapping systems significantly reduce the number of training examples required, but they usually rely on heuristics to combine a set of strict hard rules, which limits their ability to generalize and thus yields low recall. Furthermore, existing bootstrapping methods do not perform open information extraction (Open IE), which identifies various types of relations without requiring them to be specified in advance. In this paper, we propose a statistical extraction framework called Statistical Snowball (StatSnowball), a bootstrapping system that can perform both traditional relation extraction and Open IE.

StatSnowball uses discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights through maximum likelihood estimation. The MLN is a general model and can be configured to perform different levels of relation extraction. In StatSnowball, pattern selection is performed by solving an l₁-norm penalized maximum likelihood estimation problem, which enjoys well-founded theory and efficient solvers. We extensively evaluate StatSnowball in different configurations on both a small but fully labeled data set and large-scale Web data. Empirical results show that, starting from a small number of seeds, StatSnowball achieves significantly higher recall without sacrificing high precision across iterations, and that the joint inference of the MLN can improve performance. Finally, StatSnowball is efficient, and we have developed a working entity relation search engine called Renlifang based on it.
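To make the idea of pattern selection through l₁-norm penalized maximum likelihood more concrete, the following is a minimal sketch and not the paper's method: it swaps the MLN for plain logistic regression (a simple log-linear model) and fits it with proximal gradient descent. The toy data, the pattern names, the penalty value, and the function names are all assumptions made for illustration. Each feature stands for an extraction pattern firing on a candidate entity pair; patterns whose learned weight is exactly zero are treated as discarded.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the l1 norm: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(rows, labels, penalty, step=0.5, iterations=2000):
    """Minimise mean negative log-likelihood + penalty * ||w||_1 (no bias term)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(iterations):
        # Gradient of the mean negative log-likelihood at the current weights.
        gradient = [0.0] * n_features
        for features, label in zip(rows, labels):
            score = sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            for j, x in enumerate(features):
                gradient[j] += error * x / len(rows)
        # Gradient step followed by soft-thresholding (the l1 proximal step).
        weights = [
            soft_threshold(w - step * g, step * penalty)
            for w, g in zip(weights, gradient)
        ]
    return weights


# Hypothetical toy data: each row says which patterns fired on a candidate pair.
# pattern_a fires only on true relation instances; pattern_b fires equally on both.
pattern_names = ["pattern_a", "pattern_b"]
rows = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels = [1, 1, 0, 0]

weights = fit_l1_logistic(rows, labels, penalty=0.15)
selected = [name for name, w in zip(pattern_names, weights) if w != 0.0]
print(dict(zip(pattern_names, weights)))
print("selected patterns:", selected)
```

In this toy setting the penalty removes the uninformative pattern outright instead of merely giving it a small weight, which is the property that makes an l₁ penalty usable as a selection mechanism. A real system would use far more patterns and a dedicated solver for l₁-regularized log-linear models.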



Published in
WWW '09: Proceedings of the 18th international conference on World wide web
April 2009
1280 pages
ISBN:9781605584874
DOI:10.1145/1526709
General Chairs: Juan Quemada (DIT-UPM), Gonzalo León (DIT-UPM)
Program Chairs: Yoelle Maarek (Google Inc., Israel), Wolfgang Nejdl (L3S and Hannover University)
Copyright © 2009 IW3C2 org
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Markov logic networks
relationship extraction
statistical models