Web article quality ranking based on web community knowledge

Computing

Abstract

Web articles have been recognized as the most popular data source for their convenience and abundance of information. Yet their data quality remains a concern, because most existing quality assessment approaches rely mainly on syntax or lexicon rather than semantics. We propose a Fact-based Quality Assessment (FQA) approach, which captures data quality from content semantics by gleaning Web community knowledge. FQA automatically ranks Web data quality along the three most important quality dimensions: accuracy, completeness and freshness. These semantic dimensions also complement existing work based on syntactic or lexical features. Given a source article, FQA first identifies an alternative context by collecting articles on the same topic. It then extracts dimension baselines for accuracy, completeness and freshness from that alternative context. Finally, it determines data quality by comparing the semantic corpus of the source article with the established dimension baselines. Experiments verify the performance of FQA.
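The three steps named in the abstract can be illustrated with a toy sketch. This is not the authors' FQA implementation: the `Fact` triple representation, the majority-vote baseline, the lag-based freshness formula, and the names `build_baselines` and `score` are all assumptions made here for illustration, and the sample facts are invented toy data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical fact record: a (subject, relation, value) triple."""
    subject: str
    relation: str
    value: str
    year: int  # assumed freshness proxy: year the statement was published


def build_baselines(context):
    """Derive toy baselines from the alternative context (a list of articles).

    Returns the majority value per (subject, relation) and the newest year seen.
    Each article votes at most once per distinct triple; ties are broken arbitrarily.
    """
    votes = {}
    latest_year = 0
    for article in context:
        for s, r, v in {(f.subject, f.relation, f.value) for f in article}:
            votes.setdefault((s, r), Counter())[v] += 1
        latest_year = max([latest_year] + [f.year for f in article])
    consensus = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return consensus, latest_year


def score(source, consensus, latest_year):
    """Compare a source article's facts with the baselines on three dimensions."""
    covered = consensus.keys() & {(f.subject, f.relation) for f in source}
    checked = [f for f in source if (f.subject, f.relation) in consensus]
    agree = sum(f.value == consensus[(f.subject, f.relation)] for f in checked)
    # Accuracy: share of checkable source facts that match the consensus value.
    accuracy = agree / len(checked) if checked else 0.0
    # Completeness: share of baseline facts that the source mentions at all.
    completeness = len(covered) / len(consensus) if consensus else 0.0
    # Freshness: decays with the gap between the context's and the source's newest fact.
    if source:
        lag = max(0, latest_year - max(f.year for f in source))
        freshness = 1.0 / (1 + lag)
    else:
        freshness = 0.0
    return {"accuracy": accuracy, "completeness": completeness, "freshness": freshness}


# Toy alternative context: three articles on the same topic.
context = [
    [Fact("Darwin", "born", "1809", 2010),
     Fact("Darwin", "wrote", "On the Origin of Species", 2012)],
    [Fact("Darwin", "born", "1809", 2011),
     Fact("Darwin", "wrote", "On the Origin of Species", 2011)],
    [Fact("Darwin", "born", "1808", 2009)],
]
source = [Fact("Darwin", "born", "1808", 2010)]

consensus, latest_year = build_baselines(context)
scores = score(source, consensus, latest_year)
print(scores)
```

In this sketch the source article disagrees with the majority on the one fact it states, covers one of the two baseline facts, and lags the newest context statement by two years. Several source articles could then be ranked by sorting on any of the three scores or on a weighted combination, with the weights being a further assumption.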


Notes

  1. http://www.bbc.co.uk/history/historic_figures/darwin_charles.shtml.

  2. http://www.lucidcafe.com/library/96feb/darwin.html.

  3. http://nlp.stanford.edu/software/dcoref.shtml.

  4. http://reverb.cs.washington.edu/.

  5. http://wordnet.princeton.edu/.

  6. http://en.wikipedia.org/wiki/Scientist.

  7. http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Assessment.

Acknowledgments

We sincerely thank Professor Alexandra Poulovassilis from the London Knowledge Lab (LKL) for her valuable suggestions, and the anonymous reviewers for their valuable comments.

Author information

Corresponding author

Correspondence to Jingyu Han.

Additional information

This research is fully supported by the National Natural Science Foundation of China under grant numbers 61003040 and 61100135.

About this article

Cite this article

Han, J., Chen, K. & Wang, J. Web article quality ranking based on web community knowledge. Computing 97, 509–537 (2015). https://doi.org/10.1007/s00607-014-0435-4

  • DOI: https://doi.org/10.1007/s00607-014-0435-4
