short-paper

Filling the Gaps: Improving Wikipedia Stubs

Authors:
Siddhartha Banerjee, The Pennsylvania State University, State College, PA, USA
Prasenjit Mitra, Qatar Computing Research Institute, Doha, Qatar

DocEng '15: Proceedings of the 2015 ACM Symposium on Document Engineering, September 2015, Pages 117–120
https://doi.org/10.1145/2682571.2797073

Published: 8 September 2015

ABSTRACT

The limited number of contributors on Wikipedia cannot ensure consistent growth and improvement of the online encyclopedia. Because relevant information is scattered across the web, our goal is to automate content generation for Wikipedia. In this work, we propose a technique for improving Wikipedia stubs, which are articles that do not contain comprehensive information. A classifier learns features from existing comprehensive Wikipedia articles and recommends content that can be added to stubs to improve their completeness. We conduct experiments with several classifiers: a Latent Dirichlet Allocation (LDA) based model, a deep learning architecture (a deep belief network), and a TF-IDF based classifier. Our experiments reveal that the LDA-based model outperforms the other models by about 6% in F-score. Our generation approach shows that this technique is capable of generating comprehensive articles: the ROUGE-2 scores of articles generated by our system exceed those of articles generated using the baseline. Content generated by our system has been appended to several stubs and successfully retained on Wikipedia.
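To make the ROUGE-2 comparison mentioned in the abstract concrete, the following is a minimal sketch of a bigram-overlap score. It is an illustration only, not the official ROUGE package and not necessarily the configuration the authors used: it assumes lowercasing and whitespace tokenization, applies no stemming or stopword removal, and handles a single reference text. The function names `bigrams` and `rouge_2` and the example sentences are made up for this sketch.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset of adjacent word pairs (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(candidate, reference):
    """Simplified ROUGE-2: clipped bigram overlap as recall, precision, and F1."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count per bigram (clipping).
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


if __name__ == "__main__":
    # Hypothetical texts, not taken from the paper's data.
    reference = "the play was first performed in london in 1895"
    generated = "the play was performed in london"
    print(rouge_2(generated, reference))
```

In this toy case, four of the eight reference bigrams appear in the generated text, so recall is 0.5. A system whose output shares more bigrams with the human-written article would score higher, which is the sense in which one system can exceed a baseline on ROUGE-2.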

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Natural language generation

Published in
DocEng '15: Proceedings of the 2015 ACM Symposium on Document Engineering
September 2015
248 pages
ISBN: 9781450333078
DOI: 10.1145/2682571
General Chair: Christine Vanoirbeek, EPFL, Switzerland
Program Chair: Pierre Genevès, CNRS, France
Copyright © 2015 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
text summarization
topic modeling
wikipedia generation
Qualifiers
- short-paper
Acceptance Rates
DocEng '15 paper acceptance rate: 11 of 31 submissions, 35%. Overall acceptance rate: 178 of 537 submissions, 33%.