research-article

Automatic Assessment of Document Quality in Web Collaborative Digital Libraries

Authors:

Daniel Hasan Dalip,

Marcos André Gonçalves,

Pável Calado

Journal of Data and Information Quality (JDIQ), Volume 2, Issue 3

Article No.: 14, Pages 1 - 30

https://doi.org/10.1145/2063504.2063507

Published: 01 December 2011

Abstract

The old dream of a universal repository containing all of human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct, collaborative participation of people. Wikipedia is a great example: an enormous repository of information with free access and open editing, created collaboratively by its community. However, this large amount of information, made available democratically and virtually without any control, raises questions about its quality. In this work, we explore a significant number of quality indicators and study their capability to assess the quality of articles from three Web collaborative digital libraries. Furthermore, we explore machine learning techniques to combine these quality indicators into a single assessment. Through experiments, we show that the most important quality indicators are also the easiest to extract, namely, the textual features related to the structure of the article. Moreover, to the best of our knowledge, this work is the first to present an empirical comparison between Web collaborative digital libraries regarding the task of assessing article quality.
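To make the notion of "textual features related to the structure of the article" more concrete, the following is a minimal sketch of how a few such indicators might be counted from MediaWiki-style markup. The function name `structure_features`, the choice of features, and the regular expressions are illustrative assumptions; they are not the feature set or the extraction method used in the article.

```python
import re


def structure_features(wikitext: str) -> dict:
    """Count a few simple structural indicators in MediaWiki-style markup.

    Hypothetical helper for illustration only; the features below are
    assumptions, not the indicators defined in the article.
    """
    lines = wikitext.splitlines()
    # Section headings look like "== Title ==" (two or more '=' on each side).
    headings = [
        ln for ln in lines
        if re.fullmatch(r"\s*={2,}[^=].*?={2,}\s*", ln)
    ]
    # Internal links look like [[Target]] or [[Target|label]].
    internal_links = re.findall(r"\[\[[^\[\]]+\]\]", wikitext)
    # Citations are counted as opening <ref> tags, with or without attributes.
    citations = re.findall(r"<ref[\s>]", wikitext)
    return {
        "characters": len(wikitext),
        "sections": len(headings),
        "internal_links": len(internal_links),
        "citations": len(citations),
    }


sample = (
    "Intro text with a [[Link]].\n"
    "== History ==\n"
    "More text.<ref>Source</ref>\n"
    "=== Early years ===\n"
    "End [[A|b]].\n"
)
features = structure_features(sample)
print(features)
```

A feature dictionary of this kind could then be turned into a numeric vector and passed to a learning method that maps it to a single quality score, which is the combination step the abstract refers to.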



Index Terms

Automatic Assessment of Document Quality in Web Collaborative Digital Libraries
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Published In

Journal of Data and Information Quality Volume 2, Issue 3

December 2011

54 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/2063504

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2011

Accepted: 01 July 2011

Revised: 01 April 2011

Received: 01 December 2010

Published in JDIQ Volume 2, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Conselho Nacional de Desenvolvimento Científico e Tecnológico
