Abstract
Extracting information from Web pages requires the ability to work at Web scale in terms of the number of documents, the number of domains and domain complexity. Recent approaches have used existing knowledge bases to learn to extract information with promising results. In this paper we propose the use of distant supervision for relation extraction from the Web. Distant supervision is a method which uses background information from the Linking Open Data cloud to automatically label sentences with relations to create training data for relation classifiers. Although the method is promising, existing approaches are still not suitable for Web extraction as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains, as well as extracting relations across sentence boundaries. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. Our experiments show that using a more robust entity recognition approach and expanding the scope of relation extraction results in about 8 times the number of extractions, and that strategically selecting training data can result in an error reduction of about 30%.
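The labelling step behind distant supervision can be illustrated with a minimal sketch. It is not the authors' implementation: the toy knowledge base, the example sentences and the function name `label_sentences` are all hypothetical, and plain substring matching stands in for real entity recognition. Any sentence that mentions both arguments of a known triple is labelled with that triple's relation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (subject, object) -> relation.
# In the paper's setting these triples would come from Linked Open Data.
KB: Dict[Tuple[str, str], str] = {
    ("The Beatles", "Let It Be"): "album",
    ("The Beatles", "Liverpool"): "origin",
}

# Hypothetical example sentences.
SENTENCES: List[str] = [
    "The Beatles released Let It Be in 1970.",
    "The Beatles were formed in Liverpool.",
    "Let It Be is a song The Beatles recorded in 1969.",  # refers to the song, not the album
    "Liverpool is a port city.",
]


def label_sentences(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Label each sentence that mentions both arguments of a KB triple.

    Naive substring matching is an assumption of this sketch; it stands in
    for a proper entity recognition step.
    """
    labelled = []
    for sentence in sentences:
        for (subj, obj), relation in kb.items():
            if subj in sentence and obj in sentence:
                labelled.append((sentence, subj, obj, relation))
    return labelled


training_data = label_sentences(SENTENCES, KB)
for sentence, subj, obj, relation in training_data:
    print(f"{relation}({subj}, {obj}) <- {sentence}")
```

The third sentence is labelled `album` even though it refers to the song of the same name. This is the kind of noise from lexical ambiguity that the abstract describes, and it is why the training data has to be selected with care. The fourth sentence is left unlabelled because it mentions only one argument.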
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Augenstein, I., Maynard, D., Ciravegna, F. (2014). Relation Extraction from the Web Using Distant Supervision. In: Janowicz, K., Schlobach, S., Lambrix, P., Hyvönen, E. (eds) Knowledge Engineering and Knowledge Management. EKAW 2014. Lecture Notes in Computer Science, vol 8876. Springer, Cham. https://doi.org/10.1007/978-3-319-13704-9_3
DOI: https://doi.org/10.1007/978-3-319-13704-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13703-2
Online ISBN: 978-3-319-13704-9
eBook Packages: Computer Science, Computer Science (R0)