Automatic Assessment of Document Quality in Web Collaborative Digital Libraries

Published: 01 December 2011

Abstract

The old dream of a universal repository containing all of human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct collaborative participation of people. Wikipedia is a great example: an enormous repository of information with free access and open editing, created collaboratively by its community. However, this large amount of information, made available democratically and virtually without any control, raises questions about its quality. In this work, we explore a significant number of quality indicators and study their capability to assess the quality of articles from three Web collaborative digital libraries. Furthermore, we explore machine learning techniques to combine these quality indicators into a single assessment. Through experiments, we show that the most important quality indicators are also the easiest to extract, namely, the textual features related to the structure of the article. Moreover, to the best of our knowledge, this work is the first to present an empirical comparison between Web collaborative digital libraries on the task of assessing article quality.
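
As a rough illustration of the idea in the abstract, the sketch below extracts a few structure-related textual features from an article and combines them into one score. It is not the method evaluated in the article: the feature names, the MediaWiki-style markup patterns, and the hand-set weights are all assumptions made for this example, and the linear combination is only a stand-in for a learned model (the author tags mention SVM), which would fit the weights from articles with known quality labels.

```python
import re


def structure_features(wikitext: str) -> dict:
    """Count a few structural properties of an article in wiki markup.

    The feature set is hypothetical and assumes MediaWiki-style syntax.
    """
    return {
        # Number of word-like tokens in the raw markup.
        "word_count": len(re.findall(r"\w+", wikitext)),
        # Headings written as "== Title ==" on a line of their own.
        "section_count": len(
            re.findall(r"^==+[^=\n]+==+[ \t]*$", wikitext, flags=re.MULTILINE)
        ),
        # Internal links written as [[Target]].
        "link_count": len(re.findall(r"\[\[[^\]]+\]\]", wikitext)),
        # Opening <ref> tags, used here as a proxy for citations.
        "citation_count": len(re.findall(r"<ref[ >]", wikitext)),
    }


def combine(features: dict, weights: dict, bias: float = 0.0) -> float:
    """Linear combination of features into a single quality score.

    A stand-in for a trained regressor; the weights are illustrative.
    """
    return bias + sum(weights[name] * value for name, value in features.items())


if __name__ == "__main__":
    sample = (
        "== History ==\n"
        "The [[Muppets]] began in 1955.<ref>Source</ref>\n"
        "== Cast ==\n"
        "See [[Kermit]]."
    )
    # Hand-set, hypothetical weights; a learned model would estimate these.
    weights = {
        "word_count": 0.0,
        "section_count": 1.0,
        "link_count": 0.5,
        "citation_count": 2.0,
    }
    print(combine(structure_features(sample), weights))
```

Counting features of this kind needs only pattern matching over the article text, which is consistent with the abstract's remark that the structural features are among the easiest indicators to extract.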

Published In

Journal of Data and Information Quality, Volume 2, Issue 3
December 2011
54 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/2063504
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2011
Accepted: 01 July 2011
Revised: 01 April 2011
Received: 01 December 2010
Published in JDIQ Volume 2, Issue 3

Author Tags

  1. Quality assessment
  2. SVM
  3. machine learning
  4. quality features
  5. wiki

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) Timely Quality Problem Resolution in Peer-Production Systems: The Impact of Bots, Policy Citations, and Contributor Experience. Information Systems Research. DOI: 10.1287/isre.2020.0485. Online publication date: 9-Sep-2024
  • (2023) Automatic Quality Assessment of Wikipedia Articles—A Systematic Literature Review. ACM Computing Surveys 56:4 (1-37). DOI: 10.1145/3625286. Online publication date: 10-Nov-2023
  • (2023) CLEFT: Contextualised Unified Learning of User Engagement in Video Lectures With Feedback. IEEE Access 11 (17707-17720). DOI: 10.1109/ACCESS.2023.3245982. Online publication date: 2023
  • (2022) Quality assessment of cross-topic article features based on improved CTS model. 2022 6th International Symposium on Computer Science and Intelligent Control (ISCSIC) (101-106). DOI: 10.1109/ISCSIC57216.2022.00031. Online publication date: Nov-2022
  • (2021) Quality assessment of web-based information on type 2 diabetes. Online Information Review 46:4 (715-732). DOI: 10.1108/OIR-02-2021-0089. Online publication date: 12-Oct-2021
  • (2021) Measuring Quality of Wikipedia Articles by Feature Fusion‐based Stack Learning. Proceedings of the Association for Information Science and Technology 58:1 (206-217). DOI: 10.1002/pra2.449. Online publication date: 13-Oct-2021
  • (2019) Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics. Computers 8:3 (60). DOI: 10.3390/computers8030060. Online publication date: 14-Aug-2019
  • (2019) Quality assessment of Wikipedia content using topic models. Proceedings of the 25th Brazillian Symposium on Multimedia and the Web (249-252). DOI: 10.1145/3323503.3360628. Online publication date: 29-Oct-2019
  • (2019) Financial Regulatory and Risk Management Challenges Stemming from Firm-Specific Digital Misinformation. Journal of Data and Information Quality 11:1 (1-4). DOI: 10.1145/3274655. Online publication date: 4-Jan-2019
  • (2019) Quality assessment of answers with user-identified criteria and data-driven features in social Q&A. Information Processing & Management 56:1 (14-28). DOI: 10.1016/j.ipm.2018.08.007. Online publication date: Jan-2019
