Visualizing large-scale human collaboration in Wikipedia

doi:10.1016/j.future.2013.04.001

Future Generation Computer Systems

Volume 31, February 2014, Pages 120-133

Highlights

• A novel method for analysis and visualization of large wikis such as Wikipedia.
• Visualization of a wiki in a form similar to a geographic map.
• Analysis and visualization of the English, German, Chinese, Swedish and Danish Wikipedias.
• Significant differences in co-author counts between different language Wikipedias.
• The visualization is superior to textual data in usability, accuracy, speed and user preference.

Abstract

Volunteer-driven large-scale human-to-human collaboration has become common in the Web 2.0 era. Wikipedia is one of the foremost examples of such large-scale collaboration, involving millions of authors writing millions of articles on a wide range of subjects. The collaboration on some popular articles numbers hundreds or even thousands of co-authors. We have analyzed the co-authoring across entire Wikipedias in different languages and have found it to follow a geometric distribution in all the language editions we studied. In order to better understand the distribution of co-author counts across different topics, we have aggregated content by category and visualized it in a form resembling a geographic map. The visualizations produced show that there are significant differences of co-author counts across different topics in all the Wikipedia language editions we visualized. In this article we describe our analysis and visualization method and present the results of applying our method to the English, German, Chinese, Swedish and Danish Wikipedias. We have evaluated our visualization against textual data and found it to be superior in usability, accuracy, speed and user preference.

Introduction

The emergence of Web 2.0 technologies in recent years has made human-to-human collaboration on unprecedented scales not only possible but a reality. One of the best-known examples of world-wide large-scale collaboration is Wikipedia, “the free encyclopedia that anyone can edit” (Wikipedia’s own slogan) [1]. Wikipedia has great value that has not yet been fully researched. Past research on Wikipedia has been conducted at both a micro-level (e.g. [2], [3]) and a macro-level of analysis (e.g. [4], [5], [6], [7]). A micro-level analysis typically focuses on a single article, whereas a macro-level analysis studies the wiki as a whole, exploring, among other things, relationships within and the evolution of the entire content collection. Our research falls in the latter class and aims to obtain an overview of Wikipedia and identify popular topic areas. By applying this to different language Wikipedias we wish to discover differences among those language editions, and by implication differences of interest in those topic areas among the user communities of those language groups. Nevertheless, we intend our methods and tools to be general enough to be applied to other wikis besides Wikipedia, for example intra-organizational wikis.

The technology underlying Wikipedia is relatively simple: a wiki engine (MediaWiki), implemented in PHP, runs on a web server that most users access through a web browser, primarily making use of three main functions: searching for, reading and editing articles. Other functions, used to a much lesser extent by common users, are asynchronous discussion of articles, viewing the revision history of an article, comparing revisions to find out what has changed between them, undoing specific revisions, and a few others. Wikipedia administrators have additional privileges, allowing them to protect articles (making them read-only), move (rename) articles, delete articles entirely, block users, and perform other administrative and maintenance functions.

The Wikipedia user base is large and broad: the English Wikipedia edition alone counted about 17.8 million registered users in November 2012, out of which 132,800 (0.7%) are considered “active” users (meaning that they have performed some action within the past 30 days). A small portion of these registered users are site administrators, under 1500 (about 1% of active users) in the case of the English Wikipedia. An overview of user statistics for a few selected Wikipedia language editions that we have studied is shown in Table 1. We selected these Wikipedia language editions mainly for the practical reason that we understand these languages (which is required for interpreting the visualized result), but also to give us a selection of very large (English), medium-sized (German, Chinese) and small (Swedish, Danish) Wikipedias.

Wikipedia content is user-contributed, meaning that end-users can add to, modify and delete content in Wikipedia articles. They can also write entirely new articles and link these to other articles. To better organize content, Wikipedia has a hierarchical category system, and any given article can be marked as belonging to any number of categories. For instance, in the English Wikipedia (as of January 2012), the article “Wiki” is assigned to the category “Wikis” (plus five other categories), which in turn has the parent category “World Wide Web” (plus four other parent categories), which in turn has the parent category “Digital Media” (plus six other parent categories), and so on. Like articles, categories are user-contributed: users can create new categories, assign categories to parent categories, assign articles to categories, and change existing article-to-category and category-to-category assignments. The result is an organically evolving category system that reflects the current needs of the user-contributor community. One implication of such an open editing process is that the granularity of the category hierarchy may differ between wikis. Table 2 shows the numbers of articles and categories of the five Wikipedia language editions we have analyzed (these counts include all articles and categories, including non-content ones that we later remove). The absolute numbers of articles and categories differ significantly between these language editions, and so does the average number of articles per category (the right-most column in Table 2), which indicates the granularity of the category hierarchy. In four of the five analyzed Wikipedia language editions the number of articles per category ranges between about 4 and 8, but in the German Wikipedia there are on average 17.7 articles per category, suggesting a much coarser category hierarchy.
As documented on Wikipedia itself, the German edition of Wikipedia differs from other editions: “Compared to the English Wikipedia, the German edition tends to be more selective in its coverage” and “Categories are usually introduced only for a minimum of ten entries and are not always subdivided even for larger numbers of items,”¹ which explains this difference in the articles per category statistics. In fact, the absolute number of categories in the German Wikipedia is even smaller than that in each of the Chinese and Swedish Wikipedias although the number of articles is significantly larger. Different language communities clearly have different standards as to how fine-grained they believe their category hierarchies should be.

Wikipedia is not only user-contributed, but as a direct result of its openness the number of contributors that get involved in editing a given article can also be very large. We have analyzed this number of co-authors and for each category calculated the average number of distinct co-authors of all articles assigned to that category. This average count of co-authors per category varies dramatically between categories. For example, in the English Wikipedia there are 15 categories, each of which has an average number of co-authors of over 5000. On the other hand there are over 100,000 categories, each of which has an average number of co-authors of 10 or fewer. The distribution of average number of co-authors per category in the five Wikipedia language editions we analyzed is plotted in Fig. 1. Interestingly, despite all the differences in scale and category hierarchy granularity among the different language Wikipedias, their curves have essentially the same shape. We determined goodness of fit using the Anderson–Darling test and found the data from all five language editions to follow a geometric distribution, with $p$ ranging between 0.03 and 0.05 in the different languages.
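The per-category averages described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the input dictionaries, the toy data and the function names are hypothetical. The second function shows the standard maximum-likelihood estimate of the parameter of a geometric distribution on {1, 2, ...}, which is 1 divided by the sample mean; rounding the averages to integers and the choice of support are assumptions made here, and the sketch does not reproduce the Anderson–Darling test.

```python
from collections import defaultdict
from statistics import mean


def average_coauthors_per_category(article_authors, article_categories):
    """Return {category: mean number of distinct co-authors of its articles}.

    article_authors:    {article: iterable of author names, one per edit}
    article_categories: {article: iterable of categories it is assigned to}
    Both structures are hypothetical inputs chosen for this sketch.
    """
    counts_by_category = defaultdict(list)
    for article, categories in article_categories.items():
        # Count distinct authors only: repeated edits by one author count once.
        n_authors = len(set(article_authors.get(article, ())))
        for category in categories:
            counts_by_category[category].append(n_authors)
    return {cat: mean(counts) for cat, counts in counts_by_category.items()}


def geometric_mle(values):
    """MLE of p for a geometric distribution on {1, 2, ...}: 1 / sample mean.

    Assumption: averages are rounded to integers and clamped to at least 1.
    """
    rounded = [max(1, round(v)) for v in values]
    return 1.0 / mean(rounded)


# Hypothetical toy data, not taken from any Wikipedia edition.
article_authors = {
    "Wiki": ["ann", "bob", "ann", "cy"],  # 3 distinct authors
    "Blog": ["bob"],                      # 1 distinct author
    "HTML": ["ann", "dee"],               # 2 distinct authors
}
article_categories = {
    "Wiki": ["Wikis", "World Wide Web"],
    "Blog": ["World Wide Web"],
    "HTML": ["World Wide Web"],
}

averages = average_coauthors_per_category(article_authors, article_categories)
p_hat = geometric_mle(averages.values())
```

Under this parameterization the mean of the distribution is 1/p, so a parameter between 0.03 and 0.05 would correspond to a mean of roughly 20 to 33 co-authors per category.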

However, this distribution of the average number of co-authors per category does not reveal where the differences lie: which categories attract the most co-authors to their articles, and which the fewest. This may also differ between Wikipedia language editions. The top-10 lists of categories with the highest co-author counts, shown in Table 3, indicate that politics and religion feature strongly in the English Wikipedia, whereas art and society feature strongly in the German Wikipedia, with some sports and television categories appearing in both top-10 lists. We also do not know whether similar co-author counts cluster together by topic, i.e. whether categories that belong to the same parent category also have similarly high co-author counts. This information is difficult to obtain, as topic clusters are hard to determine given the large number of parent categories that a given category may belong to.

We have devised a method for analyzing the category hierarchy to determine which major parent category a given category should belong to. This allows us to aggregate co-author counts from individual categories recursively up through their ancestors to the top of the category hierarchy. Doing so reveals which categories at the highest level are the most collaborative, and which the least. We have then used the output of this analysis to visualize the average numbers of co-authors across the first three levels of the category hierarchy for the entire Wikipedia of a given language. This visualization reveals interesting differences in the distribution of co-authoring over the Wikipedias we studied.
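The recursive aggregation step can be illustrated with a short Python sketch. It assumes that each category has already been assigned a single major parent, so that the hierarchy is a tree, and that aggregation is an article-weighted average; neither the data layout nor the weighting is specified in this text, so both are assumptions, and all names below are hypothetical.

```python
def aggregate_coauthors(children, own_stats, root):
    """Aggregate co-author statistics from each category up to its ancestors.

    children:  {category: list of child categories} (a tree, by assumption)
    own_stats: {category: (number of articles, sum of their co-author counts)}
               for articles assigned directly to that category
    Returns {category: (articles in subtree, average co-authors in subtree)}.
    """
    totals = {}

    def visit(category):
        n_articles, coauthor_sum = own_stats.get(category, (0, 0))
        for child in children.get(category, ()):
            child_n, child_sum = visit(child)
            n_articles += child_n
            coauthor_sum += child_sum
        totals[category] = (n_articles, coauthor_sum)
        return n_articles, coauthor_sum

    visit(root)
    # Convert sums to averages, guarding against empty subtrees.
    return {
        category: (n, total / n if n else 0.0)
        for category, (n, total) in totals.items()
    }


# Hypothetical toy hierarchy and counts.
children = {"Root": ["Science", "Arts"], "Science": ["Physics"]}
own_stats = {"Science": (2, 20), "Physics": (3, 90), "Arts": (5, 50)}

aggregated = aggregate_coauthors(children, own_stats, "Root")
```

In this simplified form an article assigned to several categories would be counted once per category, and a very deep hierarchy could exceed Python's default recursion limit; an iterative post-order traversal would avoid the latter.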

The remainder of this article is organized as follows. The next section briefly presents related work. Section 3 introduces our analysis and visualization method. Section 4 presents the results of applying our method to actual Wikipedia data, and Section 5 evaluates our visualization. Section 6 discusses applications of our visualization method, and Section 7 draws conclusions.

Section snippets

Related work

Wikis in general, and Wikipedia in particular, have experienced dramatic growth over the past decade both in size and value. Consequently they have become the focus of research by numerous researchers in different fields worldwide. This section gives a brief overview of pertinent research.

Analysis and visualization method

Our analysis and visualization method consists of two parts: one for analysis and one for visualization of Wikipedia data. The analysis part pre-processes the data in preparation for the visualization part, which transforms the data into a graphical form. The pre-processing immediately precedes the visualization; thus if the category data is changed in any way, running the pre-processing and visualization process anew will produce a visualization that reflects this change.

The visualization

Visualizations

We have applied our analysis and visualization method to the English, German, Chinese, Swedish and Danish Wikipedia editions. In all cases we aggregated categories and visualized the first three levels of categories below the semantic root node, using average co-author count for region colouring. The colour scale was the same in all cases, with six colour ranges, the lower limit of each range being four times that of the previous range (i.e. limits at 1, 4, 16, 64, etc.). Fig. 8 shows the English
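The colour scale described above, with six ranges whose lower limits quadruple from one range to the next, can be sketched as a small Python function. The function name and parameters are hypothetical; treating values below 1 as belonging to the lowest range and leaving the highest range open-ended are assumptions, as the text gives only the first few limits.

```python
def colour_range(avg_coauthors, n_ranges=6, base=4):
    """Return the index (0 to n_ranges - 1) of the colour range for a value.

    Lower limits are 1, 4, 16, 64, 256, 1024 for the defaults.
    Assumptions: values below 1 fall into range 0, and the top range
    has no upper limit.
    """
    index = 0
    next_limit = base  # lower limit of range 1
    while index < n_ranges - 1 and avg_coauthors >= next_limit:
        index += 1
        next_limit *= base
    return index


# Example: map a few average co-author counts to colour range indices.
examples = {value: colour_range(value) for value in (1, 3.9, 4, 16, 64, 256, 1024, 5000)}
```

Integer multiplication is used instead of a logarithm so that values lying exactly on a limit, such as 64, are not misplaced by floating-point rounding.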

Evaluation

In order to assess our visualization we conducted a usability evaluation, focusing mainly on usability, accuracy, speed, and preference. As no other visualizations of co-authorship exist, we compared our visualization against textual tables of summary data on co-authorship that we extracted from Wikipedia. This data consisted of the first three levels of categories in the Simple English Wikipedia together with an average co-author count for each (a 20-page PDF document) and a table of

Applications

In this section we briefly discuss other potential applications of our visualization. In previous work we used our visualization method to represent a different attribute, namely article count [32]. Region colouring in that visualization indicated how large a given category is in terms of the number of articles. Visualization of other attributes is likewise possible, such as number of revisions, article age, recent edit activity, etc. Doing so can be of use to various stakeholders:

Conclusions

Many of today’s internet-scale applications produce large amounts of data, and this is expected to increase even more in the years to come. These “big data” applications produce terabytes, exabytes and more of data [33]. Making effective use of this data becomes increasingly difficult because of its sheer volume. Even a single website such as Wikipedia stores terabytes of data. In this article we presented a novel method for analyzing large amounts of wiki data and visualizing it in a form

Robert P. Biuk-Aghai is an Assistant Professor of Computing Sciences at the University of Macau. He holds a Ph.D. degree in Computing Sciences from the University of Technology, Sydney, and an M.Sc. degree in Information Systems from the London School of Economics. Dr. Biuk-Aghai’s research interests include collaboration systems, information visualization, and mobile GIS.

References (34)

T. Kamada et al., An algorithm for drawing general undirected graphs, Information Processing Letters (1989)
S. dos Santos et al., Gaining understanding of multivariate and multidimensional data through visualization, Computers & Graphics (2004)
M.J. McQuaid et al., Multidimensional scaling for group memory visualization, Decision Support Systems (1999)
R.L. Grossman et al., Compute and storage clouds using wide area high performance networks, Future Generation Computer Systems (2009)
D. O’Leary, Wikis: ‘from each according to his knowledge’, Computer (2008)
F.B. Viégas et al., Studying cooperation and conflict between authors with history flow visualizations
B. Suh et al., Lifting the veil: improving accountability and social transparency in Wikipedia with WikiDashboard
T. Holloway et al., Analyzing and visualizing the semantic coverage of Wikipedia and its authors, Complexity (2006)
F. Ortega et al., Quantitative analysis of the Wikipedia community of users
P.K.-F. Fong et al., What did they do? Deriving high-level edit histories in Wikis
A.G. West et al., What Wikipedia deletes: characterizing dangerous collaborative content
J. Yu et al., Ontology evaluation using Wikipedia categories for browsing
T. Zesch, I. Gurevych, Analysis of the Wikipedia category graph for NLP applications, in: Proceedings of the...
I.S. Dhillon et al., Concept decompositions for large sparse text data using clustering, Machine Learning (2001)
Y.H. Li et al., Classification of text documents, The Computer Journal (1998)
G. Salton et al., Introduction to Modern Information Retrieval (1986)
J. Szymański, Mining relations between Wikipedia categories

Cheong-Iao Pang is a Ph.D. candidate in the Department of Computing and Information Systems of the University of Melbourne. He obtained an M.Sc. degree from the University of Macau. He is interested in understanding why people look for information, and how to improve this search process with information visualization and interactive software.

Yain-Whar Si is an Assistant Professor at the University of Macau. He holds a Ph.D. degree in Information Technology from the Queensland University of Technology, Brisbane, and an M.Sc. degree in Software Engineering from the University of Macau. His research interests are in the areas of business process management and decision support systems.
