DOI: 10.1145/1835804.1835926

BioSnowball: automated population of Wikis

Published: 25 July 2010


Internet users regularly need to find biographies of, and facts about, people of interest. Wikipedia has become the first stop for celebrity biographies and facts. However, Wikipedia can only provide information about celebrities because of its neutral point of view (NPOV) editorial policy. In this paper, we propose BioSnowball, an integrated bootstrapping framework that automatically summarizes the Web to generate Wikipedia-style pages for any person with a modest web presence. In BioSnowball, biography ranking and fact extraction are performed together in a single integrated training and inference process, with Markov Logic Networks (MLNs) as the underlying statistical model. The bootstrapping framework starts with only a small number of seeds and iteratively finds new facts and biographies. Because biography paragraphs on the Web are composed of the most important facts, our joint summarization model can improve the accuracy of both fact extraction and biography ranking compared with the decoupled methods in the literature. Empirical results on both a small labeled data set and a real Web-scale data set show that BioSnowball is effective and that it outperforms the decoupled methods.
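The bootstrapping idea in the abstract (start from a few seeds, then alternate between learning patterns and extracting new facts) can be illustrated with a toy Snowball-style loop. This is only a sketch under strong assumptions: it uses exact string patterns rather than the paper's Markov Logic Network model, it does no biography ranking, and the names `bootstrap`, `pattern_to_regex`, `SENTENCES`, `SEEDS`, and the birth-year example data are hypothetical and not taken from BioSnowball.

```python
import re

# Hypothetical toy corpus and seed fact; not data from the paper.
SENTENCES = [
    "Ada Lovelace was born in 1815.",
    "Alan Turing was born in 1912.",
    "Alan Turing (born 1912) was a mathematician.",
    "Grace Hopper (born 1906) was a mathematician.",
    "The conference was held in 2010.",
]
SEEDS = {("Ada Lovelace", "1815")}

# Assumed shapes: a person is a run of capitalized words, a value is a 4-digit year.
PERSON = r"(?P<person>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"
VALUE = r"(?P<value>\d{4})"


def pattern_to_regex(pattern):
    """Compile a '<P> ... <V>' template into a regex, or return None if unusable."""
    if pattern.count("<P>") != 1 or pattern.count("<V>") != 1:
        return None
    escaped = re.escape(pattern)
    escaped = escaped.replace(re.escape("<P>"), PERSON)
    escaped = escaped.replace(re.escape("<V>"), VALUE)
    return re.compile(escaped)


def bootstrap(sentences, seeds, max_rounds=5):
    """Grow a set of (person, value) facts from a few seed pairs."""
    facts = set(seeds)
    patterns = set()
    for _ in range(max_rounds):
        # Step 1: every sentence that mentions a known fact becomes a pattern.
        for person, value in facts:
            for s in sentences:
                if person in s and value in s:
                    patterns.add(s.replace(person, "<P>").replace(value, "<V>"))
        # Step 2: apply every pattern to every sentence to propose facts.
        found = set()
        for p in patterns:
            regex = pattern_to_regex(p)
            if regex is None:
                continue
            for s in sentences:
                m = regex.fullmatch(s)
                if m:
                    found.add((m.group("person"), m.group("value")))
        # Stop once a round adds nothing new.
        if found <= facts:
            break
        facts |= found
    return facts
```

Calling `bootstrap(SENTENCES, SEEDS)` on this corpus first learns the "was born in" pattern from the seed, uses it to pick up a second person, and then learns a second pattern from a different sentence about that person. A real system would score patterns and facts instead of accepting every exact match, which is where a statistical model such as the paper's comes in.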







    Published In

    KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
    July 2010
    1240 pages
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.



    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. bootstrapping
    2. fact extraction
    3. markov logic networks
    4. summarization


    • Research-article


    KDD '10

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


    Cited By

    • (2023) A Personalized Reinforcement Learning Summarization Service for Learning Structure from Unstructured Data. 2023 IEEE International Conference on Web Services (ICWS), pp. 206-213. DOI: 10.1109/ICWS60048.2023.00040. Online publication date: Jul-2023
    • (2014) Entity profiling with varying source reliabilities. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1146-1155. DOI: 10.1145/2623330.2623685. Online publication date: 24-Aug-2014
    • (2014) Entity-centric summarization. Proceedings of the 23rd International Conference on World Wide Web, pp. 33-38. DOI: 10.1145/2567948.2567959. Online publication date: 7-Apr-2014
    • (2013) Generating text summaries of graph snippets. Proceedings of the 19th International Conference on Management of Data, pp. 121-124. DOI: 10.5555/2694476.2694504. Online publication date: 19-Dec-2013
    • (2013) Co-Occurrence-Based Diffusion for Expert Search on the Web. IEEE Transactions on Knowledge and Data Engineering, 25:5, pp. 1001-1014. DOI: 10.1109/TKDE.2012.49. Online publication date: 1-May-2013
    • (2013) Exploring the effectiveness of linguistic knowledge for biographical relation extraction. Natural Language Engineering, 21:4, pp. 519-551. DOI: 10.1017/S1351324913000314. Online publication date: 18-Oct-2013
    • (2012) Statistical Entity Extraction From the Web. Proceedings of the IEEE, 100:9, pp. 2675-2687. DOI: 10.1109/JPROC.2012.2191369. Online publication date: Sep-2012
    • (2011) Extraction and geographical navigation of important historical events in the web. Proceedings of the 10th international conference on Web and wireless geographical information systems, pp. 21-35. DOI: 10.5555/1966271.1966277. Online publication date: 3-Mar-2011
    • (2011) Evaluating significance of historical entities based on tempo-spatial impacts analysis using Wikipedia link structure. Proceedings of the 22nd ACM conference on Hypertext and hypermedia, pp. 83-92. DOI: 10.1145/1995966.1995980. Online publication date: 6-Jun-2011
    • (2011) Extraction and Geographical Navigation of Important Historical Events in the Web. Web and Wireless Geographical Information Systems, pp. 21-35. DOI: 10.1007/978-3-642-19173-2_4. Online publication date: 2011
