ABSTRACT
Web Information Extraction (WIE) systems extract billions of unique facts, but integrating these assertions into a coherent knowledge base and evaluating across different WIE techniques remain challenges. We propose a framework that uses natural language to integrate and evaluate extracted knowledge bases (KBs). In the framework, KBs are integrated by exchanging probability distributions over natural language, and are evaluated by how well their output distributions predict held-out text. We describe the advantages of the approach and detail the remaining research challenges.
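The evaluation idea in the abstract can be illustrated with a toy sketch. This is not the paper's method: it assumes, purely for illustration, that each KB exposes a unigram distribution over words, that integration is a simple linear mixture, and that unseen words receive a small floor probability. The names `heldout_log_likelihood`, `mix`, `kb_a`, and `kb_b` are hypothetical.

```python
import math

def heldout_log_likelihood(dist, tokens, floor=1e-6):
    """Average log-probability per token that `dist` assigns to held-out text.

    `floor` is an assumed smoothing constant for words the KB never predicts.
    """
    return sum(math.log(dist.get(t, floor)) for t in tokens) / len(tokens)

def mix(dists, weights):
    """Integrate several word distributions as a weighted linear mixture (an assumption)."""
    vocab = set().union(*dists)
    return {w: sum(wt * d.get(w, 0.0) for d, wt in zip(dists, weights)) for w in vocab}

# Two hypothetical KBs, each reduced to a distribution over words.
kb_a = {"paris": 0.5, "france": 0.5}
kb_b = {"capital": 0.5, "city": 0.5}
kb_mix = mix([kb_a, kb_b], [0.5, 0.5])

# Held-out text that neither KB covers fully on its own.
heldout = ["paris", "capital"]

score_a = heldout_log_likelihood(kb_a, heldout)
score_b = heldout_log_likelihood(kb_b, heldout)
score_mix = heldout_log_likelihood(kb_mix, heldout)
```

Under these assumptions the integrated distribution scores higher on the held-out text than either KB alone, which is the kind of comparison the framework would use to rank extraction techniques and their combinations.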