
1 Introduction

The companies that run the large online platforms serving media content of various kinds to users have long recognized that some content, especially, but not exclusively, so-called user-generated content (UGC), may be deemed inappropriate. Most large platforms therefore offer mechanisms for flagging or reporting such content. This applies, for example, to Google, which not only runs the most widely used search engine in the western hemisphere but also owns the world’s largest and most widely used video platform, YouTube [19], and to Facebook, the world’s most widely used social network and also the owner of the photo sharing platform Instagram. The platform companies have also been alerted to the fact that automatic distribution of content can lead to juxtapositions that customers find highly problematic, mainly in the area of advertising: a number of companies have, for example, discontinued advertising campaigns run on platforms after it was discovered that their ads ran next to, or in front of, extremist or otherwise undesirable content [9, 11, 12, 21, 22]. This paper, however, is concerned with a different kind of contextually inappropriate content: the kind caused by the serial, automatically generated video viewing recommendations on YouTube. As developers working on the platform have publicly explained, the main goal of the automatic curation system that suggests a video to watch right after a user has finished his or her current video is to maximize watch time [15], i.e. the total number of minutes and seconds, as well as the number of further videos, a given user will keep watching.

To further this goal, YouTube introduced a feature called “autoplay” in 2015. With autoplay, the next “suggested video” in the queue produced by the platform’s algorithmic curation system for each user in each session starts automatically unless the user actively intervenes by clicking pause or closing the browser tab or app. The watch time optimization goal can produce collateral damage: it sometimes leads to the promotion of extremist, misleading or otherwise problematic content to a wider audience [19]. YouTube has even been called “The great radicalizer” because of this mechanism [20]. YouTube’s parent company Google has acknowledged that the systems in place can lead to undesirable content being promoted: “When our services are used to propagate deceptive or misleading information, our mission is undermined.” [6] The company has vowed to apply strategies to counter “disinformation and misinformation” across “Google Search, Google News, YouTube, and our advertising products”.

The focus of this paper, however, is a concept more abstract and also broader than disinformation and misinformation. We are addressing a category of content suggested by such systems that we call “contextually inappropriate”. By this we mean content that is not necessarily problematic in and of itself, but in the context in which it is recommended (although disinformation and misinformation videos can generally be considered inappropriate in any context).

The goal of this paper is to develop a computational understanding of the mechanisms underlying such inappropriate content recommendations and thus potentially offer approaches to remedy some of the problems caused by this kind of content (see next section). We will first try to develop a clearer definition of different kinds of contextually inappropriate content, citing recent examples from the literature as well as journalistic publications on this issue. We will then proceed to develop a conceptual model of the underlying mechanisms (Sect. 3) that we can simulate, by first analysing the recommender system as described in [4] and other sources. After discussing the role of the user in Sect. 4, we describe our simulation model and provide first results in Sects. 5 and 6.

Our working hypothesis (as expressed in the title) is that even if a recommendation system makes only a few erroneous suggestions that may be inappropriate in a specific context, the user interaction easily propels these to be played out more and more often. It appears to be hard to prevent this kind of “unwanted malicious interaction” between user and recommendation system.

2 Contextually Inappropriate Content

We define contextually inappropriate content as content that violates the assumptions, intentions and goals of the viewer and/or the uploader of a specific video in the context of the current viewing session. To put it differently: contextually inappropriate content is content recommended by an automatic curation system that reaches an audience it is not intended for, or an audience that might be shocked, disturbed, misled or otherwise harmfully impacted by said content. Contextually inappropriate content is problematic for two main reasons. First, the content itself might, in a different context or targeted to a different audience, be entirely harmless and unremarkable. This means that human content raters who assess the appropriateness of content for the platforms have no choice but to leave the content online; flagging systems and the like thus may not work in these cases. Second, contextually inappropriate content creates a situation where the potential harm a certain piece of content does may be confined to a few isolated viewing sessions. It is thus hard to detect and even harder to counter. Both of these reasons are exacerbated by the fact that YouTube is used by many children [13], who might be particularly vulnerable to the kinds of inappropriate content recommended to them. Traditional systems for preventing minors from watching harmful content might be circumvented by this process, as the second example listed below will show. To illustrate the concept of contextually inappropriate content, here are three examples:

  • In 2016, former YouTube developer Guillaume Chaslot systematically explored automatically generated recommendations for the search terms “Trump” and “Clinton” in the run-up to the presidential election in the USA. He found that both search terms produced sequences of recommendations that skewed heavily towards “Trump-leaning” clips. He noted that “a large proportion of these recommendations were divisive and fake news”. Chaslot also reported that “a ‘Clinton’ search on the eve of the election led to mostly anti-Clinton videos” [3].

  • In 2017, an independent writer [2] and subsequently several news outlets [10, 16] reported that the YouTube recommendation system led underage users towards videos that consisted of “parodies” of well-known cartoon shows for children like “Peppa Pig” or “Paw Patrol”. These “parody” videos contained, to quote the British “Guardian”, “well-known cartoon characters in violent or lewd situations and other clips with disturbing imagery that are occasionally - in a nice postmodern touch - set to nursery rhymes”. Children following the flow of recommendations ended up watching these videos, some reacting with visible distress or fear.

  • In 2019, three researchers originally interested in political content on YouTube in Brazil came across a network of “channels that were sexually suggestive” [7]. When examining those channels more closely, they found a number of sexually suggestive videos featuring “underage women” or adults posing in children’s clothing. Examining the suggestions for such videos in turn led them to “channels featuring videos of small children”, some in swimwear, some doing gymnastics; “the common theme was that the children were only lightly dressed”. According to the “New York Times” [5], the newspaper the researchers had co-operated with, some of these videos accumulated hundreds of thousands of views within days, from viewers who had obviously been led there “through a progression of recommendations”. When the researchers published their own results as a working paper, they stressed that they had specifically decided against a peer-reviewed publication, reasoning that neither “the children in the videos” nor “the families that had uploaded some of the videos” could “have possibly waited one year for YouTube to change their algorithm”.

All of these are examples of what we have defined as contextually inappropriate content. Each single video might be entirely legal, harmless and unremarkable, be it one promoting a presidential candidate, a drastic cartoon parody or a clip of children in swimming gear. But the context of the respective viewing session or the audience reached makes them inappropriate, either for the audience or for the uploaders, for example the parents of the children concerned in the third example. This points to two of the core problems associated with contextually inappropriate content promoted by automated recommender systems: what emerges as problematic is hard to predict, but, once uncovered, it often seems to warrant immediate intervention. What is hard to gauge is the role that the behavior of individual users plays in interaction with the recommender systems. Certain very active subgroups of users rallying around certain types of content, be they pro-Trump videos, drastic cartoon parodies or clips of lightly dressed children, seem to contribute outsized signals that influence the recommendations generated by the system if not explicitly toned down. These recommendations in turn make the content available to more users. Some of these users might keep watching for entirely different reasons, be it curiosity, shock, alarm or any number of other motivations.

The aim of this paper is to uncover, in principle, the mechanisms that lead to the recommendation of contextually inappropriate content in automated recommender systems and to discuss possible measures to mitigate this problem.

In the next section (Sect. 3), we first investigate whether the YouTube recommender system works in a way that “drives” users to more extreme content. Based on the available material (the most recent description is the work of Covington, Adams, and Sargin [4]), we conclude that this is not the case.

Is the problem described above thus the users’ fault? The contributions of the users are much harder to judge, starting from the fact that we do not have a good description of the mechanisms inside the user that would enable us to deduce whether the user unintentionally ignites the process of degeneration. Users are of course not a homogeneous group, but rather a very diverse and very large population. Even a single user can interact with the system in very different ways depending on the environment or situation.

Our main hypothesis is that the injection of contextually inappropriate content described above mainly stems from classification errors the recommender system makes, which are in turn amplified by more explorative users.

3 The Role of the Recommender System

Whereas we try to take into account what we know of the inner workings of the YouTube recommender system, we have to be aware that this is an evolving system that has not remained frozen at its 2016 development stage [4]. In any case, we are far from knowing the details of this system well enough to recreate it: the paper leaves out many details that would be necessary to do so. On top of that, it could only be replicated with the full dataset of some point in time, which we do not possess. Even if we did, we would probably be unable to process it due to its size. What we can do, however, is roughly describe how the system works and then try to generalize it to a much simplified form that can at least be simulated in a more qualitative style (what happens when this factor changes?).

The YouTube recommender system that is described in [4] consists of two main parts:

  • a collaborative filtering system that generates a set of suggestions on the order of several hundred videos, and

  • a personalized ranking system that selects the best fitting content from these candidates, which is then offered to the user.

Both parts rely on deep learning. In the following, we will focus on the first system only, because it requires much less user data and is partly oriented towards serving groups of users well. It takes into account the last video views and searches of a single user, but also age, gender, and geographic location. From [4] we can obtain a good overview of the general mechanisms of the recommender system, but a number of important concrete details remain in the dark. This, however, is not surprising, as a) the exact function of the system is valuable intellectual property of YouTube, and b) describing it exactly would require a lot of concrete data that would not fit into a conference paper.

There are a number of attempts to understand the system beyond the sketch in [4], most likely with the aim of being able to explain how it behaves in the many reported cases where its behavior was unexpected. We will not dive very deep but attempt to obtain a rough picture of the mechanics of the collaborative filtering (suggestion generation) system.

Roughly, the recommendation system learns what to recommend to whom by merging several layers of embeddings. An embedding is a way to map rather sparse, very high dimensional data into a space of much lower, constant dimension. This is necessary because, technically, dealing with variable dimensions and with sparse data are both very difficult. Also, the underlying architecture of artificial neural networks (ANN) requires a constant number of input and output dimensions.

We can thus say that the embedding works as a means of compression. The input data is not fully known, but Google states that a fixed but large vocabulary is used to describe the videos. We presume that this works like a tagging process where for each video a variable, possibly high, number of tags is generated such that each video is described by a bag-of-words. A second source of information depends on the users and takes the users’ last searches, views, and also group-related information such as geographic location, gender, age, etc. into account (see [4]).
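To make this presumed tagging and compression step concrete, here is a minimal Python sketch. It reflects our reading of the description above; the vocabulary size, embedding dimension, function names and the averaging scheme are illustrative assumptions, not YouTube’s actual implementation.

```python
import numpy as np

# Sketch: each video is a sparse bag-of-words over a fixed vocabulary and is
# compressed into a dense, fixed-size vector by averaging learned per-tag embeddings.
VOCAB_SIZE = 10_000   # toy stand-in for the "fixed but large vocabulary"
EMB_DIM = 256         # presumed embedding dimension, following [4]

rng = np.random.default_rng(0)
tag_emb = rng.normal(size=(VOCAB_SIZE, EMB_DIM))   # stand-in for learned tag embeddings

def video_embedding(tag_ids: list[int]) -> np.ndarray:
    """Compress a variable-length, sparse tag list into one dense vector."""
    return tag_emb[tag_ids].mean(axis=0)

v = video_embedding([17, 421, 9_000])   # hypothetical tag ids describing one video
```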

For both of these data sets (and possibly more), separate embeddings are learned, and these are later combined by averaging. During learning, this effectively means that videos that are often viewed in succession are placed “closer” to each other. At serving time, a much faster process is needed, and here a compromise between accuracy and performance has to be achieved. Whereas in the learning phase a high dimensional but fixed space (according to [4] we presume that it most likely has 256 dimensions) is filled with video entries, at lookup time an approximate nearest neighbor scheme is used to find the videos that are closest to any given video. This yields the set of hundreds of candidate videos. In the second step, the recommender system, by taking even more information about individual users into account (e.g. user language, time since last watch, previous interactions), brings this number down to around some dozens of videos which are then presented to the user.
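Continuing the sketch above, candidate generation can be illustrated as averaging the embeddings of a user’s recent watches and looking up the closest videos. We use an exact nearest-neighbor search as a stand-in for the approximate scheme; corpus size, history handling and names are again our own assumptions.

```python
# Candidate generation sketch, reusing rng, VOCAB_SIZE and video_embedding from above.
N_VIDEOS = 10_000
video_tags = [rng.integers(VOCAB_SIZE, size=rng.integers(3, 15)).tolist()
              for _ in range(N_VIDEOS)]                  # random toy tag lists
video_emb = np.stack([video_embedding(t) for t in video_tags])

def user_embedding(watch_history: list[int]) -> np.ndarray:
    """Average the embeddings of the recently watched videos (fixed-size window per [4])."""
    return video_emb[watch_history[-50:]].mean(axis=0)

def candidate_generation(watch_history: list[int], k: int = 200) -> np.ndarray:
    """Return the k videos whose embeddings are closest to the user embedding."""
    dists = np.linalg.norm(video_emb - user_embedding(watch_history), axis=1)
    return np.argsort(dists)[:k]

candidates = candidate_generation([3, 17, 4711])
```

The cap on the watch history in user_embedding anticipates the fixed-size interaction window described next.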

One obvious way to trick the system into giving more weight to the actions of some users than others is actually disabled by design: regardless of the activity, only a fixed number of user interactions is considered for the training process (the 50 last videos, the 50 last searches). This means that particularly active users might still have comparatively more influence on the recommendations, but this influence is capped.

What we do not know, however, is how often the system is retrained or updated. Any conceivable influence of user behavior on the recommendation system can only be realized when the user data is fed into the training, and in theory it is easy to imagine some sort of feedback loop: unwanted recommendations are erroneously played out to the users, and if these accept the recommendations, this amplifies the video-to-video relation of two videos that appear to be “similar” but might not be in reality. Generally, more frequent retraining iterations speed up the process of single videos getting more popular or of specific sequences of videos being stored in the system. Thus we find conflicting objectives here: updating the system very often is usually desired, as it lets wanted videos (viral “hits”) spread quickly, but it also lets unwanted videos or combinations spread just as quickly (timeliness vs. error propagation). A slower update process (fewer retrainings) would of course slow down how fast both wanted and unwanted content combinations spread.

4 The Role of the User

Kahn [8] presents an overview of YouTube users’ motivations, based on McQuail’s [14] Uses and Gratifications framework for media choice. Kahn distinguishes between five different motives: Seeking Information, Giving Information, Self-Status Seeking, Social Interaction and Relaxing Entertainment. Most of these factors, however, do not really bear on the question we are dealing with here: people who watch videos presented to them by YouTube’s recommender system are most likely looking for, to use Kahn’s terminology, “relaxing entertainment” or possibly “seeking information”. To quote: “Seeking information and relaxing entertainment motives are factors that were highly significant in explaining a user’s behavior of viewing videos.” The possible user actions considered by Kahn and others in the field, like commenting, “liking” or “disliking” videos, uploading videos or sharing videos, can largely be discounted for the purpose of this paper. One interesting result from Kahn’s study, however, points to the kind of user situation we are most likely dealing with here: according to Kahn, users who seek entertainment on YouTube are likely to “like”, “dislike” (thumbs down button) and share videos, but not more likely to comment on or upload videos. In other words: actions that can be performed with one click while watching videos at the same time are likely for people who use YouTube for entertainment purposes, while actions that would require interrupting the watching session are not. There are a number of data points indicating that recommendations generated by YouTube’s recommender system account for a high percentage of views. A YouTube representative claimed in early 2018 that 70% of watch time is due to algorithmic recommendations [18]. Also, according to a Pew study, 8 in 10 Americans watch videos recommended by YouTube at least occasionally [17].

What is the user actually doing (in a process-flow sense) when consuming videos on YouTube? This is going to be important in order to be able to simulate user behavior as well. We assume that a YouTube session usually starts with some kind of search, such that the first video that is seen does not depend on the recommender system but is a user choice (from a more general viewpoint, the user starts with a random video). When the first video finishes, the user is provided with a list of ranked recommendations. If autoplay is activated and the user does not interact, the next video is simply the highest ranked video of these recommendations. If autoplay is disabled or disregarded, the user may accept one of the videos from the presented recommendation list; some of these may be contextually appropriate, but some may actually be contextually inappropriate. In order to keep watching, the user has to actively choose one of them. We presume that if this is not the case, either the YouTube session stops or another “random” video is chosen (from the system viewpoint it may appear random; from the user viewpoint the video is probably chosen on the basis of criteria we do not know, possibly based on another manual search).
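The following sketch summarizes this assumed decision flow for a single watch step. The function and parameter names are ours (they roughly correspond to the probabilities \(p_{it}\), \(p_{ia}\) and \(p_{in}\) used in Sect. 6) and do not describe any real YouTube interface.

```python
import random

def next_video(recommendations, is_appropriate, p_autoplay, p_acc_app, p_acc_inapp,
               random_video, rng=random):
    """One watch step of the assumed user model (all names are illustrative).

    recommendations: ranked list of suggested video ids
    is_appropriate:  function telling whether a suggestion fits the current context
    p_autoplay:      probability that the top suggestion simply starts playing
    p_acc_app, p_acc_inapp: probabilities of actively accepting an (in)appropriate suggestion
    random_video:    callable returning a "random" (search-initiated) video
    """
    # Autoplay: the highest ranked suggestion starts without any user interaction.
    if rng.random() < p_autoplay:
        return recommendations[0]
    # Otherwise the user scans the list and may accept one of the suggestions.
    for video in recommendations:
        p = p_acc_app if is_appropriate(video) else p_acc_inapp
        if rng.random() < p:
            return video
    # Nothing accepted: the session ends or a new search ("random" video) starts.
    return random_video()
```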

5 Simulation

Our simulation system attempts to replicate the most important mechanisms as given in the description of the original YouTube recommendation system [4]. However, we are well aware that:

  • the system has probably already changed at least in many details since the publication of that paper, and

  • due to the complexity of the deep learning system employed and the lack of exact data, a very close reproduction is not possible.

The basis of the simulation system is a coordinate system we model after the presumed latent space of the original system. By latent space we mean the hyperspace (presumably of 256 dimensions) used by the recommender system for embedding all YouTube videos; the positions of the videos in that space are later used for looking up similar videos (via approximate nearest neighbor search).

We thus use a 256-dimensional unit hypercube (coordinate values between 0 and 1) in which each video is placed at a “real” position according to its properties. By real position we mean that if the tagging process for each video and its mapping to the 256-dimensional space were error-free (also taking into account the positions of similar videos), this is where it would be positioned. Next to this real position, we determine an “apparent” position. This is determined by taking the real position and adding some noise. This noise reflects the errors that have been made in tagging and placing. Whereas it is difficult to know what the error distributions are like in reality, we simply assume that the tagging process has an error of about 10%. The error is modeled by resetting every coordinate of the apparent position, with a 10% chance, to a random value. Note that the exact size of this error is not really important; it will just accelerate or decelerate the spread of classification errors (and thus of possibly contextually inappropriate content).
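As a minimal illustration of this placement step (our own sketch, with the 10% per-coordinate error taken from the description above), the real and apparent positions can be generated as follows:

```python
import numpy as np

DIM = 256           # presumed dimensionality of the latent space
N_VIDEOS = 10_000   # corpus size used in our experiments (see Sect. 6)
P_TAG_ERROR = 0.1   # assumed per-coordinate tagging/placement error rate

rng = np.random.default_rng(42)

# "Real" positions: where an error-free tagging/placement process would put each video.
real_pos = rng.random((N_VIDEOS, DIM))

# "Apparent" positions: with probability P_TAG_ERROR, a coordinate is replaced by a
# random value, modeling tagging and placement errors.
error_mask = rng.random((N_VIDEOS, DIM)) < P_TAG_ERROR
apparent_pos = np.where(error_mask, rng.random((N_VIDEOS, DIM)), real_pos)
```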

This placing of the videos in a high dimensional space simulates learning a video embedding in such a space, i.e. the training of the neural network without taking user information into account. How can we now integrate recent searches and views by the users? In the original system, the user information seems to be employed for learning another embedding layer; the layers are then connected via averaging. That means (and this is a functional building block of the system) that user histories and the sequential video pairs therein can drive videos closer together. We model this by means of a simpler mechanism:

We collect all pairs of videos that are viewed one after the other by any user and apply a weak force between them in the apparent space. The force is toned down so that the two sources of information (original position, all viewed pairs including this video) are approximately on the same scale. In order to compute concrete force values, we have to decide what an average distance between two videos is. In our case that is the expected distance between two random points in the unit hypercube \([0,1]^{256}\). Computing this exactly results in a complicated multiple integral that does not seem to be analytically known for such high dimensions. However, there is an approximation for lower and upper bounds on this value depending on n (the number of dimensions) [1], see (1), which is fairly easy to compute.

$$\begin{aligned} \frac{1}{3}n^{1/2} \le \varDelta (n) \le (\frac{1}{6}n)^{1/2} \sqrt{\frac{1}{3}[1 + 2(1 - \frac{3}{5n})^{1/2}] } \end{aligned}$$
(1)

It is known that, especially for lower values of n, the true value of \(\varDelta(n)\) is much closer to the upper bound than to the lower bound. We thus simply use the upper bound (right side) as a stand-in and name it \(\hat{\varDelta}(n)\). We divide this estimated value by the overall count \(v_i\) of a specific video in all user histories, multiplied by the number of users U. The idea behind this is that the maximum applied force between two respective videos should at least not be larger than the approximated distance of two random videos. As we will see later, the resulting force is probably still too strong, so we should see this approach as reflecting a trend rather than as a concrete value we can build on.

This way, we can now compute the force we apply between the positions of video i and video j as in (2). The 2 in the denominator stems from the fact that almost all videos in history lists (except the first and the last) have both a predecessor and a successor (so each video has two neighbors).

$$\begin{aligned} f_{i,j} = \frac{\hat{\varDelta }(n)}{2 \cdot v_i \cdot U} \end{aligned}$$
(2)
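A small sketch of how \(\hat{\varDelta}(n)\) and the force from (1) and (2) can be computed, together with a Monte Carlo sanity check of the distance estimate (the check itself is not part of the simulation):

```python
import numpy as np

def delta_hat(n: int) -> float:
    """Upper bound from (1), our stand-in for the expected distance
    between two random points in the unit hypercube [0, 1]^n."""
    return np.sqrt(n / 6.0) * np.sqrt((1.0 + 2.0 * np.sqrt(1.0 - 3.0 / (5.0 * n))) / 3.0)

def force(n: int, v_i: int, n_users: int) -> float:
    """Force magnitude f_{i,j} from (2): the random-pair distance estimate,
    scaled down by how often video i occurs in user histories and by the user count."""
    return delta_hat(n) / (2.0 * v_i * n_users)

# Monte Carlo check of the distance estimate for n = 256:
check_rng = np.random.default_rng(1)
a, b = check_rng.random((20_000, 256)), check_rng.random((20_000, 256))
print(np.linalg.norm(a - b, axis=1).mean(), delta_hat(256))  # both roughly 6.53
```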

The combined amount of force (regardless of direction) on one video thus cannot be larger than the approximate distance of two random videos, and even this applies only if all of the single constituents point in the same direction. The real forces are probably much smaller, as we can expect many movements to partly or fully cancel each other out (dragging a video in lots of different directions may actually result in a very similar final position). It is clear that this model is not particularly accurate and we will not be able to make good quantitative predictions with it, but it is definitely closely related to the main mechanics of the original system and should thus be sufficient to ask what-if questions. Of course, we apply the forces only to the apparent position of each video, not to its real position. Presuming that the relation of two videos (contextually matching or mismatching) is stable even if some users see both videos in a row, we also keep the real position fixed (the real tagging of each video should not depend on user view choices but the other way around).

Returning to our original goal, we now need to define how we want to recognize that a video suggestion is inappropriate, given the context of the most recently watched video(s). This is a difficult question to answer, so we simply rely on a rough estimate: if the real positions of two videos are farther apart than the distance of two random videos (which corresponds to \(\varDelta(n)\)), we assume that they are not contextually related. If they are closer together, they may or may not be contextually related, but we assume that they are, at least to a certain extent. Note that we operate in a very high dimensional space (256 dimensions), which means that certain notions of distance and neighborhood are much more fragile than in our usual 3D world.
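In code, this check is a one-liner on the real positions, reusing real_pos, DIM and delta_hat from the sketches above:

```python
def contextually_appropriate(i: int, j: int) -> bool:
    """Treat a suggestion j as contextually appropriate in the context of video i
    iff their *real* positions are closer than the random-pair distance estimate."""
    return float(np.linalg.norm(real_pos[i] - real_pos[j])) <= delta_hat(DIM)
```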

The overall course of the simulation is described in pseudocode in Algorithm 1; the user behavior for choosing \(2 \cdot h\) videos (in order to fill the user history with fresh content) is provided in pseudocode in Algorithm 2. Note that we did not reflect the erroneous nature of the approximate nearest neighbor lookup that is used in the original recommendation algorithm as described in [4]. Here, we simply presume that nearest neighbor searches are accurate, assuming that the possible effect of errors here is rather small.
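The following sketch outlines how the pieces above fit together into one training and recommendation epoch. It is our own reconstruction of the procedure described in the text (and in Algorithms 1 and 2), reusing next_video, real/apparent positions, delta_hat and contextually_appropriate from the earlier sketches; helper names and details such as the exact counting of \(v_i\) are illustrative assumptions.

```python
N_USERS = 1_000  # see Sect. 6: 1,000 users and 10,000 videos
HIST = 50        # history length per user, following [4]

def recommend(video: int, k: int = 20) -> list[int]:
    """k nearest neighbors of `video` in the apparent space
    (exact NN standing in for the approximate lookup of the real system)."""
    d = np.linalg.norm(apparent_pos - apparent_pos[video], axis=1)
    return [int(v) for v in np.argsort(d)[1:k + 1]]

def run_epoch(p_autoplay: float, p_acc_app: float, p_acc_inapp: float) -> None:
    """One epoch: simulate all user sessions, then pull the apparent positions
    of subsequently viewed videos closer together ("retraining")."""
    sim_rng = np.random.default_rng()
    pairs, watched = [], []
    for _ in range(N_USERS):
        session = [int(sim_rng.integers(N_VIDEOS))]  # session starts with a "random" (searched) video
        for _ in range(2 * HIST - 1):                # fill the history with fresh content
            cur = session[-1]
            nxt = next_video(recommend(cur),
                             lambda j: contextually_appropriate(cur, j),
                             p_autoplay, p_acc_app, p_acc_inapp,
                             random_video=lambda: int(sim_rng.integers(N_VIDEOS)))
            pairs.append((cur, nxt))
            session.append(nxt)
        watched += session
    counts = np.bincount(watched, minlength=N_VIDEOS)     # v_i: occurrences in all histories
    for i, j in pairs:
        f = delta_hat(DIM) / (2.0 * counts[i] * N_USERS)  # force magnitude from (2)
        step = apparent_pos[j] - apparent_pos[i]
        norm = np.linalg.norm(step)
        if norm > 0:
            apparent_pos[i] += f * step / norm
            apparent_pos[j] -= f * step / norm
```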


6 Experimental Analysis

In the simulation system described above, there are several parameters that can be set. However, space is limited and simulation time is considerable (around 70 min for 20 repetitions). We thus decided to choose a parameter set that is a compromise between runtime and memory use on the one hand and hopefully realistic values on the other. We set the number of videos in our simulation to 10,000 and the number of users to 1,000. It is clear that these values are much lower than in the real-world case (YouTube), but otherwise the computation times would explode due to the necessary distance computations. Likewise, we also model the users as a homogeneous group, with the same probabilities to accept recommendations or switch on autoplay. This is of course a very strong simplification. The simulation system could accommodate different probabilities for every user, but apart from the difficulty of obtaining such detailed realistic data, we do not use this feature for now, as our first experiments are meant to focus on the big picture only.

The experimental results thus only provide a rough, rather qualitative picture of the underlying effects. We also acknowledge that the applied forces that reflect the acceptance of recommendations may still be too strong and may need to be reduced further to provide a realistic impression.
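Under the assumptions of the sketches in Sect. 5, the parameter settings examined in the following subsections would map onto our simulation roughly as follows (the illustrative names p_autoplay, p_acc_app and p_acc_inapp correspond to \(p_{it}\), \(p_{ia}\) and \(p_{in}\)):

```python
# Hypothetical driver for the experiments of Sects. 6.1-6.4; in a full run,
# real/apparent positions would be re-initialized for every setting and each of
# the 20 repetitions, and the per-epoch rates would be recorded and averaged.
settings = {
    "function test (6.1)":        (0.0, 0.0, 0.0),
    "autoplay test (6.2)":        (0.5, 0.0, 0.0),
    "appropriate accept (6.3)":   (0.0, 0.5, 0.0),
    "inappropriate accept (6.3)": (0.0, 0.0, 0.5),
    "realistic setting (6.4)":    (0.1, 0.5, 0.1),
}
for name, (p_it, p_ia, p_in) in settings.items():
    for epoch in range(10):   # 10 training and recommendation epochs
        run_epoch(p_it, p_ia, p_in)
```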

6.1 Function Test

At first, we perform a functional test, meaning that with autoplay turned off for all users (\(p_{it} = 0\)) and acceptance rates for appropriate and inappropriate recommendations set to zero (\(p_{ia}=0\), \(p_{in}=0\)), the video corpus should stay relatively stable. In this case, the users completely ignore the recommendation system and only choose videos they search for, which is modeled as the users always choosing random videos. Figure 1 displays the rates for appropriate and inappropriate videos on the left side, and the real and apparent average video distances on the right side, over 10 epochs of training and recommendation. We can easily see that the rates for appropriate and inappropriate video suggestions stay the same over the whole simulation, and that the apparent distances slightly decrease. Standard deviations (we perform 20 runs and average all numbers) are near zero for all measured values.

This result means that the simulation system performs as expected in this case, which is no surprise as all forces that are applied have random directions.

Fig. 1. Left: rates for appropriate suggested and accepted, and inappropriate suggested and accepted videos for autoplay = 0, appropriate acceptance = 0, inappropriate acceptance = 0. Right: real and apparent average distances between videos. Both over 10 training and recommendation epochs.

6.2 Autoplay Test

Leaving all other parameters the same as before, we now test what happens if we enable autoplay with a considerable probability of 0.5. This means that in 50% of the watches, the next recommended video is accepted without consideration of whether it may be appropriate or inappropriate. Figure 2 shows the averages of the obtained results (here, the standard deviations are a bit higher but still at a very low level of around 1–2%). We see that the number of appropriate suggestions slowly decreases over the epochs, and the number of inappropriate videos that are suggested and also accepted rises steadily. The average distances of all videos show behavior comparable to the case where nothing is accepted at all.

Fig. 2. Left: rates for appropriate suggested and accepted, and inappropriate suggested and accepted videos for autoplay = 0.5, appropriate acceptance = 0, inappropriate acceptance = 0. Right: real and apparent average distances between videos. Both over 10 training and recommendation epochs.

6.3 Appropriate/Inappropriate Accept Test

What happens now if autoplay is switched off and either only 50% of the appropriate videos but no inappropriate videos are accepted, or vice versa? Figures 3 and 4 show the results of our simulations, again averaged over 20 runs and with quite low standard deviations of around 1% to 2%. Whereas the rate of accepted inappropriate videos stays at 0 in the first case and rises slowly in the second case, the overall impression is comparable. The only remarkable difference is that in the second case the number of inappropriate suggestions also rises more quickly than in the first case. Surprisingly, this means that the interaction with the recommendation system itself already leads to a higher rate of suggested inappropriate videos, even though in the first case these are never accepted. It seems not to matter whether the accepted videos are appropriate or not; what matters is that they have been recommended at all. It seems that the applied forces make the suggestion of inappropriate videos more likely.

Fig. 3. Left: rates for appropriate suggested and accepted, and inappropriate suggested and accepted videos for autoplay = 0.0, appropriate acceptance = 0.5, inappropriate acceptance = 0. Right: real and apparent average distances between videos. Both over 10 training and recommendation epochs.

Fig. 4. Left: rates for appropriate suggested and accepted, and inappropriate suggested and accepted videos for autoplay = 0.0, appropriate acceptance = 0.0, inappropriate acceptance = 0.5. Right: real and apparent average distances between videos. Both over 10 training and recommendation epochs.

6.4 Realistic Setting with Low Autoplay?

For a hopefully realistic situation, we set the autoplay probability to 0.1, the acceptance probability for appropriate videos to 0.5, and the acceptance probability for inappropriate videos to 0.1. Figure 5 provides the response of the simulation system. As expected, inappropriate video recommendations are now accepted more often, but still much less frequently than in the 50% autoplay case. Compared to Fig. 4, the overall rate of accepted inappropriate videos is roughly twice as high (7 to 8%) after 10 epochs. This is at first surprising, but taking the autoplay setting into account, it seems that the autoplay feature, even when switched on with a much lower probability, dominates the handling of inappropriate videos (more are accepted and, after a while, more are also suggested).

Fig. 5. Left: rates for appropriate suggested and accepted, and inappropriate suggested and accepted videos for autoplay = 0.1, appropriate acceptance = 0.5, inappropriate acceptance = 0.1. Right: real and apparent average distances between videos. Both over 10 training and recommendation epochs.

7 Possible Mitigations

Judging from the results of our simulation, the simplest way to mitigate the bulk of the problem of contextually inappropriate video suggestions would be to eliminate the autoplay option from YouTube. The feature seems to produce an inherent and cumulative error that makes contextually inappropriate video suggestions more likely the longer a viewing session lasts. Maximising watch time across videos is, however, the stated optimization goal of the YouTube development team. If the number given by a YouTube executive in 2018 (see above) is correct, algorithmically generated suggestions are responsible for 70% of watch time on YouTube. It thus seems unlikely that the company would be willing to accept this measure. And since the recommender system itself has no way of “knowing” whether certain users receive contextually inappropriate videos and watch them anyway, more finely tuned mitigation efforts on the supply side do not readily suggest themselves.

One way of approaching the problem would be to improve the quantity and quality of the feedback users can give. This should be complemented, and this is crucial, by effective measures on the side of the company to act on such feedback. Parents who find that their children’s videos might have become a desirable kind of content for pedophiles should not have to rely on scientists or major news publications to be heard. A potential regulatory measure would be to limit the total time or the number of successive videos that the autoplay function is allowed to deliver. There could, for example, be a mandatory cap on viewing sessions, so that a new chain of recommendation and autoplay would have to be initiated with another “random” video, i.e. one generated by a user-initiated search or some other means. The cumulative effect of inappropriate videos entering the recommendation stream would thus also be capped.

A first, quick measure would be to alert users, including uploaders and passive YouTube users, to the existence of the problem. Parents, for example, might not even realize which mechanisms lie behind their own children’s YouTube consumption. They might also be unaware of the danger of receiving contextually inappropriate content suggestions generated this way. Similarly, uploaders might not be aware that their videos might end up in front of an audience they were not intended for. Alerting users to these facts might be a start.

8 Conclusion

We have been looking at the YouTube recommendation system and its effects from various angles. We started from reports of surprised or annoyed users who have been confronted with strange recommendations, as well as of uploaders shocked by view counts that pointed to an audience they never intended to reach. We then looked at the technical side of how the algorithms in the recommendation system work. Finally, we tried to simulate the most important mechanisms in a “what-if” fashion. From the reports of users and researchers we know that the recommendation system does not always work as expected but sometimes suggests videos that must be assumed to be contextually inappropriate. In Sect. 2, we have attempted to characterize what that actually means, as there seems to be no useful definition available so far.

From the technical viewpoint, it seems that contextually inappropriate recommendations are collateral damage rather than an intended aim of the recommendation system. YouTube’s aim of maximizing watch time indirectly enables pandering to all possible user motivations as long as they help increase watch time, from joy to disgust. It also appears to be very difficult to get rid of such unwanted recommendations, due to a) the huge amount of data, which enforces compromises between performance and accuracy, and b) the necessity to incorporate user video watch information into the system, which will impact the system’s behavior (as is the nature of collaborative filtering systems) in unforeseen ways.

We do not want to exaggerate the relevance and representativeness of our simulation results, as there are a lot of shortcomings we have to accept (too small, too simple, etc.). However, it is interesting to see that even if inappropriate videos are not accepted at all by the users and autoplay is switched off, the number of inappropriate suggestions rises over time. Most problematic in this respect, however, seems to be a high autoplay rate, because it may lead to situations in which users are not even aware that inappropriate videos are currently playing, e.g. because they are away from the screen for a few minutes or YouTube is silently continuing to play videos in a background browser tab. Some YouTube users report that they have switched off the autoplay feature in order to obtain more control over what is played and when. The available data, however, point to a high impact of the recommendation system on what videos users actually consume on YouTube. This corresponds to the available information on the importance of the recommender system for total watch time as reported by YouTube executives.

It seems clear that more research is needed in this direction. The number of scientific teams that can combine the journalistic, communication science, and computer science perspectives at the same time seems too small to sufficiently research the effects of such hugely important socio-technical systems as social media.