Elsevier

Future Generation Computer Systems

Volume 31, February 2014, Pages 120-133
Visualizing large-scale human collaboration in Wikipedia

https://doi.org/10.1016/j.future.2013.04.001

Highlights

  • A novel method for analysis and visualization of large wikis such as Wikipedia.

  • Visualization of a wiki in a form similar to a geographic map.

  • Analyzed and visualized English, German, Chinese, Swedish and Danish Wikipedia.

  • Significant co-author count differences between different language Wikipedias.

  • The visualization is superior to textual data in usability, accuracy, speed and user preference.

Abstract

Volunteer-driven large-scale human-to-human collaboration has become common in the Web 2.0 era. Wikipedia is one of the foremost examples of such large-scale collaboration, involving millions of authors writing millions of articles on a wide range of subjects. The collaboration on some popular articles numbers hundreds or even thousands of co-authors. We have analyzed the co-authoring across entire Wikipedias in different languages and have found it to follow a geometric distribution in all the language editions we studied. In order to better understand the distribution of co-author counts across different topics, we have aggregated content by category and visualized it in a form resembling a geographic map. The visualizations produced show that there are significant differences of co-author counts across different topics in all the Wikipedia language editions we visualized. In this article we describe our analysis and visualization method and present the results of applying our method to the English, German, Chinese, Swedish and Danish Wikipedias. We have evaluated our visualization against textual data and found it to be superior in usability, accuracy, speed and user preference.

Introduction

The emergence of Web 2.0 technologies in recent years has made human-to-human collaboration on unprecedented scales not only possible but a reality. One of the best-known examples of world-wide large-scale collaboration is Wikipedia, “the free encyclopedia that anyone can edit” (Wikipedia’s own slogan) [1]. Wikipedia has great value that has not yet been fully researched. Past research on Wikipedia has focused on both a micro-level (e.g. [2], [3]) and a macro-level of analysis (e.g. [4], [5], [6], [7]). A micro-level analysis typically focuses on a single article, whereas a macro-level analysis studies the wiki as a whole, exploring, among other things, the relationships within and the evolution of the entire content collection. Our research falls in the latter class and aims to obtain an overview of Wikipedia and identify popular topic areas. By applying this to different language Wikipedias we wish to discover differences among those language editions, and by implication differences of interest in those topic areas among the user communities of those language groups. Beyond Wikipedia, our aim is for our methods and tools to be general enough to be applied to other wikis, for example intra-organizational wikis.

The technology underlying Wikipedia is relatively simple: a wiki engine (MediaWiki), implemented in PHP and running on a web server, which most users access through a web browser. Most users rely primarily on three functions: searching for, reading and editing articles. Other functions, used to a much lesser extent by common users, are asynchronous discussion of articles, viewing the revision history of an article, comparing revisions to find out what has changed between them, undoing specific revisions, and a few others. Wikipedia administrators have additional privileges, allowing them to protect articles (making them read-only), move (rename) articles, delete articles entirely, block users, and perform other administrative and maintenance functions.

The Wikipedia user base is large and broad: the English Wikipedia edition alone counted about 17.8 million registered users in November 2012, out of which 132,800 (0.7%) are considered “active” users (meaning that they have performed some action within the past 30 days). A small portion of these registered users are site administrators, under 1500 (about 1% of active users) in the case of the English Wikipedia. An overview of user statistics for a few selected Wikipedia language editions that we have studied is shown in Table 1. We selected these Wikipedia language editions mainly for the practical reason that we understand these languages (which is required for interpreting the visualized result), but also to give us a selection of very large (English), medium-sized (German, Chinese) and small (Swedish, Danish) Wikipedias.

Wikipedia content is user-contributed, meaning that end-users can add to, modify and delete content in Wikipedia articles. They can also write entirely new articles and link these to other articles. To better organize content Wikipedia has a hierarchical category system, and any given article can be marked as belonging to any number of categories. For instance in the English Wikipedia (as of January 2012), article “Wiki” is assigned to category “Wikis” (plus five other categories), which in turn has parent category “World Wide Web” (plus four other parent categories), which in turn has parent category “Digital Media” (plus six other parent categories), and so on. Like articles, categories are user-contributed: users can create new categories, assign categories to parent categories, assign articles to categories, and change existing article-to-category and category-to-category assignments. The result is an organically evolving category system that reflects the current needs of the user-contributor community. One implication of such an open editing process is that the granularity of the category hierarchy may differ between wikis. Table 2 shows the numbers of articles and categories of the five Wikipedia language editions we have analyzed (these counts include all articles and categories, including non-content ones that we later remove). The absolute numbers of articles and categories differ significantly among these language editions, but so does the average number of articles per category (the right-most column in Table 2), which indicates the granularity of the category hierarchy. In four of the five analyzed Wikipedia language editions the number of articles per category ranges between about 4 and 8, but in the German Wikipedia there are on average 17.7 articles per category, suggesting a much coarser category hierarchy granularity. As documented on Wikipedia itself, the German edition of Wikipedia differs from other editions: “Compared to the English Wikipedia, the German edition tends to be more selective in its coverage” and “Categories are usually introduced only for a minimum of ten entries and are not always subdivided even for larger numbers of items,”¹ which explains this difference in the articles per category statistics. In fact, the absolute number of categories in the German Wikipedia is even smaller than that in each of the Chinese and Swedish Wikipedias although the number of articles is significantly larger. Different language communities clearly have different standards as to how fine-grained they believe their category hierarchies should be.
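To make the many-to-many structure of the category system concrete, the following Python sketch stores article-to-category and category-to-parent assignments as plain dictionaries and collects all ancestors of a category. It is an illustration only: the dictionary layout and the function name `ancestors` are assumptions, and only the categories named in the example above are included (the "plus five other" categories and parents are omitted).

```python
from collections import deque

# Hypothetical miniature of the category system described above.
# Only the assignments named in the text are listed; the real
# English Wikipedia has additional categories and parents.
article_categories = {"Wiki": ["Wikis"]}
category_parents = {
    "Wikis": ["World Wide Web"],
    "World Wide Web": ["Digital Media"],
    "Digital Media": [],
}

def ancestors(category, category_parents):
    """Return the set of all direct and indirect parent categories."""
    seen = set()
    queue = deque(category_parents.get(category, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a category may be reachable via several parents
        seen.add(current)
        queue.extend(category_parents.get(current, ()))
    return seen

print(sorted(ancestors("Wikis", category_parents)))
```

Because a category may have several parents, the traversal tracks visited categories so that shared ancestors are reported once and cycles cannot cause an endless loop.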

Wikipedia is not only user-contributed, but as a direct result of its openness the number of contributors that get involved in editing a given article can also be very large. We have analyzed this number of co-authors and for each category calculated the average number of distinct co-authors of all articles assigned to that category. This average count of co-authors per category varies dramatically between categories. For example, in the English Wikipedia there are 15 categories, each of which has an average number of co-authors of over 5000. On the other hand there are over 100,000 categories, each of which has an average number of co-authors of 10 or fewer. The distribution of average number of co-authors per category in the five Wikipedia language editions we analyzed is plotted in Fig. 1. Interestingly, despite all the differences in scale and category hierarchy granularity among the different language Wikipedias, their curves have essentially the same shape. We determined goodness of fit using the Anderson–Darling test and found the data from all five language editions to follow a geometric distribution, with p ranging between 0.03 and 0.05 in the different languages.
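The per-category statistic described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the study: the input layout (one author list per article, one category list per article) and the function name are assumptions, and the goodness-of-fit test is not reproduced here.

```python
from collections import defaultdict

def average_coauthors_per_category(article_authors, article_categories):
    """Return {category: mean number of distinct co-authors of its articles}.

    article_authors:    {article: iterable of author names, one per edit}
    article_categories: {article: iterable of categories it is assigned to}
    Both layouts are assumptions made for this sketch.
    """
    total_authors = defaultdict(int)
    article_count = defaultdict(int)
    for article, authors in article_authors.items():
        distinct = len(set(authors))  # repeated edits by one author count once
        for category in article_categories.get(article, ()):
            total_authors[category] += distinct
            article_count[category] += 1
    return {c: total_authors[c] / article_count[c] for c in total_authors}

# Toy data, invented for illustration.
authors = {"Wiki": ["ann", "bob", "ann", "cy"], "Blog": ["ann"], "HTML": ["bob", "cy"]}
categories = {"Wiki": ["Wikis", "World Wide Web"], "Blog": ["World Wide Web"], "HTML": ["World Wide Web"]}
print(average_coauthors_per_category(authors, categories))
```

An article assigned to several categories contributes its co-author count to each of them, which matches the statement that the average is taken over all articles assigned to a category.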

However, this distribution of the average number of co-authors per category does not reveal where the differences lie: which categories attract the most co-authors to their articles, and which the fewest. This may also differ between Wikipedia language editions. The top-10 lists of categories with the highest co-author counts, shown in Table 3, indicate that politics and religion feature strongly in the English Wikipedia, whereas in the German Wikipedia it is art and society that feature strongly, with some sports and television appearing in both top-10 lists. We also do not know if similar co-author counts cluster together by topic, i.e. whether categories that belong to the same parent category also have similarly high co-author counts. This information is difficult to obtain as topic clusters are hard to determine given the large number of parent categories that a given category may belong to.

We have devised a method for analyzing the category hierarchy to determine which major parent category a given category should belong to. This allows us to aggregate co-author counts from individual categories recursively up to their ancestor until the top of the category hierarchy. Doing so reveals which categories at the highest level are the most collaborative, and which the least. We have then used the output of this analysis to visualize the average numbers of co-authors across the first three levels of the category hierarchy for the entire Wikipedia of a given language. This visualization reveals interesting differences in the distribution of co-authoring over the Wikipedias we studied.
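The recursive aggregation step can be illustrated with a short Python sketch. It assumes that each category has already been assigned a single major parent (the selection of that parent is not shown), and it combines child values as an article-weighted mean, which is an assumption of this sketch rather than a detail confirmed by the text. All names are hypothetical.

```python
def aggregate_up(parent, article_count, avg_coauthors):
    """Aggregate average co-author counts from the leaves up to the root.

    parent:        {category: its single major parent, or None for the root};
                   every parent must itself be a key, and the structure
                   must be a tree (no cycles).
    article_count: {category: number of articles assigned directly to it}
    avg_coauthors: {category: average co-author count of those articles}
    Returns {category: article-weighted average over its whole subtree}.
    """
    total = {c: avg_coauthors.get(c, 0.0) * article_count.get(c, 0) for c in parent}
    count = {c: article_count.get(c, 0) for c in parent}

    def depth(category):
        d = 0
        while parent[category] is not None:
            category = parent[category]
            d += 1
        return d

    # Visit the deepest categories first so each subtree is complete
    # before it is added to its parent.
    for category in sorted(parent, key=depth, reverse=True):
        p = parent[category]
        if p is not None:
            total[p] += total[category]
            count[p] += count[category]
    return {c: (total[c] / count[c] if count[c] else 0.0) for c in parent}

# Toy hierarchy, invented for illustration.
tree = {"Root": None, "Science": "Root", "Physics": "Science", "Arts": "Root"}
n_articles = {"Root": 0, "Science": 1, "Physics": 3, "Arts": 2}
averages = {"Science": 10.0, "Physics": 20.0, "Arts": 4.0}
print(aggregate_up(tree, n_articles, averages))
```

Comparing the aggregated values of the categories directly below the root then indicates which top-level topic areas are the most and least collaborative in this toy data.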

The remainder of this article is organized as follows: the next section briefly presents related work. Section 3 then introduces our analysis and visualization method. Section 4 presents the results of applying our method to actual Wikipedia data, followed by Section 5 which evaluates our visualization. In Section 6 we discuss applications of our visualization method, and in Section 7 we draw conclusions.

Related work

Wikis in general, and Wikipedia in particular, have experienced dramatic growth over the past decade both in size and value. Consequently they have become the focus of research by numerous researchers in different fields worldwide. This section gives a brief overview of pertinent research.

Analysis and visualization method

Our method consists of two parts: an analysis part and a visualization part. The analysis part pre-processes the Wikipedia data in preparation for the visualization part, which transforms the data into a graphical form. The pre-processing immediately precedes the visualization; thus if the category data is changed in any way, running the pre-processing and visualization anew will produce a visualization that reflects this change.

The visualization

Visualizations

We have applied our analysis and visualization method to the English, German, Chinese, Swedish and Danish Wikipedia editions. In all cases we aggregated categories and visualized the first three levels of categories below the semantic root node, using average co-author count for region colouring. The colour scale was the same in all cases, with six colour ranges, the lower limit of each range being four times that of the previous range (i.e. limits at 1, 4, 16, 64, etc.). Fig. 8 shows the English
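The colour scale described in this section can be sketched as a simple binning function in Python. The first four lower limits (1, 4, 16, 64) come from the text; continuing the pattern to 256 and 1024 for the remaining two of the six ranges, and placing values below 1 in the first range, are assumptions of this sketch.

```python
import bisect

# Lower limits of the six colour ranges. The values 256 and 1024 extend
# the stated pattern and are an assumption, not taken from the source.
LOWER_LIMITS = [1, 4, 16, 64, 256, 1024]

def colour_range(avg_coauthors):
    """Return the index (0 to 5) of the colour range for a co-author average."""
    index = bisect.bisect_right(LOWER_LIMITS, avg_coauthors) - 1
    return max(index, 0)  # values below the first limit fall in range 0

for value in (1, 3.9, 4, 63, 64, 5000):
    print(value, colour_range(value))
```

A quadrupling scale of this kind spreads a heavily skewed distribution of co-author counts over a small number of colours, so that both sparsely and heavily co-authored regions remain distinguishable.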

Evaluation

To assess our visualization we conducted a user evaluation, measuring usability, accuracy, speed, and preference. As no other visualizations of co-authorship exist, we compared our visualization against textual tables of summary data on co-authorship that we extracted from Wikipedia. This data consisted of the first three levels of categories in the Simple English Wikipedia together with an average co-author count for each (a 20-page PDF document) and a table of

Applications

In this section we briefly discuss other potential applications of our visualization. In previous work we have used our visualization method to represent a different attribute, namely article count [32]. Region colouring in that visualization indicated how large a given category is in terms of the number of articles. Visualization of other attributes is likewise possible, such as number of revisions, article age, recent edit activity, etc. Doing so can be of use to various stakeholders:

Conclusions

Many of today’s internet-scale applications produce large amounts of data, and this is expected to increase even more in the years to come. These “big data” applications produce terabytes, exabytes and more of data [33]. Making effective use of this data becomes increasingly difficult because of its sheer volume. Even a single website such as Wikipedia stores terabytes of data. In this article we presented a novel method for analyzing large amounts of wiki data and visualizing it in a form

References (34)

  • A.G. West et al., What Wikipedia deletes: characterizing dangerous collaborative content
  • J. Yu et al., Ontology evaluation using Wikipedia categories for browsing
  • T. Zesch, I. Gurevych, Analysis of the Wikipedia category graph for NLP applications, in: Proceedings of the...
  • I.S. Dhillon et al., Concept decompositions for large sparse text data using clustering, Machine Learning (2001)
  • Y.H. Li et al., Classification of text documents, The Computer Journal (1998)
  • G. Salton et al., Introduction to Modern Information Retrieval (1986)
  • J. Szymański, Mining relations between Wikipedia categories
    Robert P. Biuk-Aghai is an Assistant Professor of Computing Sciences at the University of Macau. He holds a Ph.D. degree in Computing Sciences from the University of Technology, Sydney, and an M.Sc. degree in Information Systems from the London School of Economics. Dr. Biuk-Aghai’s research interests include collaboration systems, information visualization, and mobile GIS.

    Cheong-Iao Pang is a Ph.D. candidate in the Department of Computing and Information Systems of the University of Melbourne. He obtained an M.Sc. degree from the University of Macau. He is interested in understanding why people look for information, and how to improve this search process with information visualization and interactive software.

    Yain-Whar Si is an Assistant Professor at the University of Macau. He holds a Ph.D. degree in Information Technology from the Queensland University of Technology, Brisbane, and an M.Sc. degree in Software Engineering from the University of Macau. His research interests are in the areas of business process management and decision support systems.
