1 Introduction and Related Work

Classifying a page as interesting or not is not a challenge for a user who is scrolling through a social network. The main issue is rather the overload of pages they have to look through before they find what they want. Hence, the advancement of recommender systems that help users find communities of interest is an ongoing process characterized by a variety of approaches. The focus of these approaches is usually the user. As [13] puts it, user modelling, which generally deals with the behavior and actions of a user in a computer system, includes inferring interests from them (interest discovery [Footnote 1]). From this perspective, one user can exhibit a variety of interests, and the task of modelling is to infer them. In this paradigm, the main marker of interests is linguistic data (user-generated content). Hence, interests are mined as tags [11, 18, 32, 36], keywords [4, 35, 37], named entities [3, 28, 33], user-classified interests from profiles [17, 24], and topics [2, 19, 20, 39] in microblogs [3, 29, 38], most commonly derived with the help of the LDA and LSA algorithms [7, 34] [Footnote 2]. Other approaches, e.g. social network analysis, employ non-linguistic information such as friends, followers [12], contacts [31], clicks [1, 4], likes [8], reposts, retweets, and social recommendations [9, 10, 16]. Some projects unite users into clusters that can be represented with a graph model [23, 40]. In all approaches, the main goal is to facilitate the search functions of social networks through more effective recommendation.

As for the algorithms of interest classification, their choice depends on the model. Where machine classification is possible, according to [25], the following classifiers are traditionally used: Decision Trees, Nearest Neighbors, Naive Bayes, and linear algorithms with separating hyperplanes (variations of the commonly known Support Vector Machines, or SVM). [6] use Nearest Neighbors and Naive Bayes to suggest NLP-based recommendation of “news of interest”. However, none of the works we know of focuses on community pages that attract users with similar interests. As we demonstrate below, such pages provide valuable information on existing user clusters and user interests.

In the present research, we would like to shift the focus from modelling a single user’s list of interests to modelling a social network community that a user might like, and we will do it based on a linguistic model. We assume (and discuss further) that one main interest is what attracts a user to a page if they start to follow it [Footnote 3].

Our solution presumes that we already know a page that a user likes, or that we have a set of pages that a user’s friends like; we will call such pages model pages. A recommender system can find more pages that are similar to the model ones with the help of text similarity algorithms [Footnote 4]. We can also view this task as a text classification problem, usually solved with such algorithms as Decision Trees, k-Nearest Neighbors, Naive Bayes, etc. Additionally, classification presupposes that pages followed by users with a common interest belong to a certain class, especially from the sociological and linguistic points of view.

2 Interest Classification from the Sociological Perspective

Although interests are personal, in communities they have to be shared (sociologists call this phenomenon “contagion” [30]). In social networks, interest sharing produces linguistic content that makes online communities a valuable object of research.

Although there is no universal definition of social groups, many authors, among them [5, 14, 21], agree that a social group is a collection of individuals interacting in a certain way on the basis of shared expectations of each member of the group in relation to others. A social group can be viewed as an abstract whole that has certain features distinguishing it from others. For example, football fans as a social group are known around the world for their typical behavior: attending football matches, collecting sports memorabilia, and quite often violating public order. Accordingly, adherence of an individual to the social group shows in speech. An individual who claims to belong to a social group calls himself or herself by a special name (a football fan of some team, a hoolie), mentions attributes of the group (a team’s name and players, leagues, places, sports memorabilia), performs activities typical of all members of the group and reports on them (attending matches, play-offs). When representatives of a social group interact in social networks, linguistic data serve as a means of identification and role assignment. Hence, network pages of social groups can be viewed as representatives of a class, and we can use linguistic data such as keywords, topics, named entities, and terminology to differentiate these groups automatically.

At the same time, what hinders classification is that groups can have points of intersection (for example, both football and hockey matches happen at stadiums, teams participate in leagues, etc.). Even names of teams and players can be the same. In such cases, fans often invent nicknames (using flag colors or mascots) to differentiate between them. Hence, linguistic content marks difference between unrelated social groups and simultaneously shows relation between allied groups.

Previously, we stated that there is one main interest that attracts users to a page. We will call it the Major Interest (MaI). The MaI is bound to the social group that joins for interaction on a social network page. If the people interacting do not belong to the same social group, they express different interests, and the MaI becomes unclear.

To study the phenomenon of MaI, we conducted a survey of the Russian social network Vkontakte (vk.com). We had to work with the Russian language because we were able to find enough experts only among Russian speakers. Vkontakte was created in 2006 by Pavel Durov, who currently develops Telegram. The network was chosen as one of the largest sources of linguistic content in Russian. In the experiment described in [22], we asked ten experts (certified and currently employed as linguists, sociologists, marketing specialists) to give their opinion on what social group manifests itself in a dialogue taken from a social network page. We instructed the experts to determine whether the authors in the sample dialogue belong to the same social group and, if so, to explain why they think so. The experts were not prompted by multiple-choice answers. Three dialogues were marked correctly and unanimously as belonging to football fans, historical reenactors, and vegetarians. Two dialogues (fans of rock music and “bros”) got a 50% agreement. The control sample, where people did not express adherence to one social group [Footnote 5], got a 90% agreement that there is no social group and that these people do not share any interests.

After the experiment we conducted automatic classification of social network pages by the three MaIs (football, rock music, vegetarianism) across networks and languages. For each MaI in the three sets (English Twitter, Russian Twitter, Russian Vkontakte), we prepared 30 text samples downloaded from social network pages. We used several classifiers (SVM, Neural Networks, Naive Bayes, Logistic Regression, Decision Trees, and k-Nearest Neighbors) to predict the three MaIs in each set. Logistic Regression proved to be the best-performing algorithm when operating on vector representations of the 1,000 most frequent words (1 denoting the presence and 0 the absence of a word in a text). Table 1 illustrates the result of classification; the score given is the average F1-score of five tests performed with Monte-Carlo cross-validation.
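The binary representation of the most frequent words can be illustrated with a minimal sketch. It is not the code used in the experiment: the function names `top_words` and `presence_vector` and the toy texts are hypothetical, tokenization is assumed to have been done already, and 1 is used to mark a word that occurs in the text (which of the two values marks presence is only a labelling convention).

```python
from collections import Counter

def top_words(texts, n=1000):
    """Return the n most frequent words across all tokenized texts."""
    counts = Counter(word for text in texts for word in text)
    return [word for word, _ in counts.most_common(n)]

def presence_vector(text, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text)
    return [1 if word in words else 0 for word in vocabulary]

# Toy, already tokenized texts (hypothetical data, for illustration only).
texts = [
    "goal match goal stadium".split(),
    "guitar concert guitar album".split(),
    "tofu salad recipe tofu".split(),
]
vocabulary = top_words(texts, n=5)
vectors = [presence_vector(text, vocabulary) for text in texts]
```

Vectors of this kind could then be passed to any of the classifiers listed above; word counts are deliberately discarded.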

Table 1. Interclass classification of pages with supervised machine learning classifiers: F1-score. F - football, R - rock music, V - vegetarianism, T - Twitter, Vk - Vkontakte, En - English, Ru - Russian.

Generally, this experiment showed the effectiveness of the Bernoulli model of feature representation, i.e. word frequencies are not as important as the presence or absence of words. We also found that a MaI that is difficult for human experts is not necessarily difficult for automatic classification. For example, rock music (50% expert agreement) and vegetarianism (unanimous agreement) were classified similarly well.

We tend to think that MaIs are more like umbrella terms for a variety of topics discussed by communities (for example, the MaI “football” encompasses matches, players, stadiums, events, ticket sales, memorabilia). On the one hand, MaIs can be generalized into types of social groups: football fans are a type of sports fans, rock music fans are a type of music fans. Within the type, the variety of topics is quite similar (as in the case of hockey and football fans). On the other hand, MaIs can break into specific representatives, for example, rock music fans can be Metallica fans, Slipknot fans, etc.; football fans can be fans of Manchester United, Spartak, etc. The type determines the stable part of the user-generated content that relates some social groups, and representatives of a MaI are in charge of the entropy content that differentiates them from other representatives [Footnote 6].

3 Retrieving Texts with a Certain MaI from a Large Collection

In the present research, we will describe an algorithm that is quite efficient when searching for pages with the same MaI in a collection much larger than the number of pages to be retrieved. We designed it on the grounds of interviews with the experts evaluating the texts in the experiment described above.

Every text \({T_i}\) in the test set is weighed on the basis of one or two model texts united into one \({T_m}\) in the training set to state its similarity to the model in every given class \({C_j}\) (each class corresponds to one MaI). The weights are evaluated by the Relevance Function. The result is a list of texts that are considered to represent the same MaI. The classes are three MaIs from the experiment: football, vegetarianism, and historical reenactment.

A Model Text \({T_m}\) is a text chosen as a standard representative of a class. Ideally, it contains as many characteristic features of the class as possible [Footnote 7]. The Relevance Function extracts these features for every class. Then, in every class, the Distribution Function weighs all the texts in the test set and ranks them, choosing the top-ranked ones as representatives of the class. Thus, a text can occur in more than one class.

3.1 Data Selection

We conducted our retrieval experiment on a corpus of texts downloaded from Vkontakte. For the present analysis, we automatically searched through 20,000 Vkontakte open access pages using the Vkontakte API. 4,460 pages turned out to contain user-generated content ranging in size from 1 to 100,523 words. We asked a panel of three experts (certified linguists and sociologists) to manually search through them to find texts of football fans, historical reenactors, and vegetarians. In the final set of texts, the three MaIs were represented by a different number of items. Next, we asked the experts to find more pages (using recommended links, user reposts and Vkontakte search) to create a set of 30 texts in each class. We also removed all texts belonging to the three MaIs and the texts with the lowest number of words from the initial corpus. All in all, our corpus contains 4,000 unclassified items (“Miscellaneous”) and 30 texts belonging to each of the three MaIs (90 texts in total). We consider the ratio between the class “Miscellaneous” and each of the other classes to be large-scale because the joint probability of retrieving a succession of 30 items of one class from 4,030 is very low: \(\frac{30}{4030}\times \frac{29}{4029}\times \ldots \times \frac{1}{4001}\,\approx \,2.05\times 10^{-76}\).
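The joint probability above can be checked with exact rational arithmetic. The sketch below is only an illustration; the function name `succession_probability` is hypothetical. The product equals the reciprocal of the binomial coefficient \(\binom{4030}{30}\), which gives a convenient cross-check.

```python
from fractions import Fraction
from math import comb

def succession_probability(class_size, corpus_size):
    """Probability of drawing all items of one class in a row, without replacement."""
    p = Fraction(1)
    for k in range(class_size):
        # k-th draw: (class items left) / (corpus items left)
        p *= Fraction(class_size - k, corpus_size - k)
    return p

p = succession_probability(30, 4030)
print(float(p))                      # numeric value of the product
print(p == Fraction(1, comb(4030, 30)))  # cross-check against the binomial coefficient
```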

Every text in the corpus of 4,090 was preprocessed to extract the following four parameter features:

  1. Key-words. Key-words are selected from the normalized list of words of \({T_m}\) based on differences in their frequency. In a list of words ranked by their frequency, a key-word is a word with a frequency that differs by more than one from the word with the next lower rank (e.g. 4, 7, 11 is a good list of frequencies with large enough steps; 1, 2, 3 is not). This method excludes all n legomena (hapax, dis, tris, etc.) to single out the most characteristic set of keywords. The normalized list of keywords has stop-words excluded. For short texts the result is a list of 1–2 words, and up to 20–30 for long texts.

  2. Stems. Stems are selected from the vocabulary after stemming words with the Porter stemmer. Interestingly, preprocessing the vocabulary with a morphological analyser lowered the performance. Therefore, no preprocessing except stemming was employed. In the resulting list of stemmed words, every stem that is found more than three times is added to the list of stems. This procedure is based on the expert opinion that social groups not only use some words frequently, but develop a whole vocabulary with derivatives of these words: vegetables - vegan, vegetarian, vegetarianism, lacto-vegetarian, ovo-vegetarian, etc.

  3. Uniques. Lists of stemmed words that were collected in the stemming procedure (without frequencies) are compared to each other in all pairs of classes, and stems that are found only within one class are added to the list of uniques. These words are a kind of terminological dictionary that describes a group’s uniqueness. In the interviews, the experts also stated that groups use unique words that are understandable only by the representatives of this group or have a special value within this group. But tests showed that these lists are formed not only from some inner vocabulary, but also from common-knowledge words describing group activities.

  4. Named entities. Named entities are a natural part of a social group vocabulary, as the group shares its impression of people, places, etc. Also, names of a group’s leaders unite it. To extract named entities from social network posts and comments, we wrote a simple heuristic NER-parser. We take only named entities with a frequency of more than three.
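Two of the four feature types, key-words and uniques, can be sketched in a few lines. This is a hedged illustration rather than the original implementation: the function names `select_keywords` and `select_uniques` are hypothetical, stemming and stop-word removal are assumed to have been done beforehand, and the lowest-ranked word is assumed never to qualify as a key-word because it has no lower-ranked neighbour to compare with.

```python
from collections import Counter

def select_keywords(words, stop_words=frozenset()):
    """Keep words whose frequency exceeds the next lower-ranked frequency by more than one."""
    counts = Counter(w for w in words if w not in stop_words)
    ranked = counts.most_common()  # (word, frequency) pairs, highest frequency first
    keywords = []
    for (word, freq), (_, next_freq) in zip(ranked, ranked[1:]):
        if freq - next_freq > 1:
            keywords.append(word)
    return keywords

def select_uniques(stems_by_class):
    """For each class, keep the stems that occur in no other class."""
    uniques = {}
    for name, stems in stems_by_class.items():
        others = set().union(*(s for n, s in stems_by_class.items() if n != name))
        uniques[name] = set(stems) - others
    return uniques

# Toy data (hypothetical): frequencies 11, 7, 4, 1 have steps larger than one.
words = ["match"] * 11 + ["goal"] * 7 + ["team"] * 4 + ["rain"]
keywords = select_keywords(words)

stems_by_class = {
    "football": {"goal", "leagu", "stadium"},
    "hockey": {"puck", "leagu", "stadium"},
}
uniques = select_uniques(stems_by_class)
```

In the toy data, the stems shared by football and hockey drop out of both lists of uniques, which mirrors the intersection of allied groups discussed in Sect. 2.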

3.2 Relevance Function

The Relevance Function creates a list of features for each class. The number of feature types can vary during optimization. Frequencies are not needed for the further analysis. In the tested version, we cut down the Model Texts so that they would produce about 1,000 features in total. Empirically, this method proved to be the most effective.

The four lengths of the feature arrays form a vector (\({v_1 , v_2 , v_3 , v_4}\)) in 4-dimensional space, which serves as the basis for a right rectangular prism (a hyperrectangle, or a box). The volume of the box \({P_m}\) (the Model Box) is the model volume; it can neither be exceeded nor be equal to 0. To avoid a zero volume, Laplace smoothing with \({\alpha = 1}\) is applied to every component of the vector:

$$\begin{aligned} \varTheta _i = v_i + \alpha \end{aligned}$$
(1)

Once the classifier parameters are found, the system proceeds to the analysis of the test set. Every text \({T_i}\) in the test set is analyzed in the same way as the Model Text, except for uniques: a list of stems is used instead. Within each class, the algorithm searches for every element of the training text arrays among the elements of \({T_i}\) and adds smoothing:

$$\begin{aligned} f({x_k}, {T_i})= \begin{cases} 1\ (\text {true}), &amp; \text {if } x_k \in {T_i} \\ 0\ (\text {false}), &amp; \text {if } x_k \notin {T_i} \end{cases} +\alpha \end{aligned}$$
(2)

The result of evaluation is a set of vectors \({\varTheta _{li}}\) for each text. Now we compare volumes of “boxes” made with these vectors, the volume being considered as the main definitive factor in similarity analysis:

$$\begin{aligned} {V_{P_i}}=\prod _{l=1}^{4} \varTheta _{li} \end{aligned}$$
(3)

For each text in the test set, as many box volumes are calculated as there are classes. After that, within each class, the texts are sorted in decreasing order of these volumes. The bigger the volume is, the more likely it is that the text belongs to this class. Hence, the texts at the beginning of the list are supposedly relevant. However, we would want to establish a borderline after which we are not likely to meet relevant texts anymore.
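The weighting and ranking steps can be sketched as follows. The sketch rests on an assumption that the text does not spell out: each component \(\varTheta _{li}\) is taken to be the number of model features of type \(l\) found in the text \(T_i\), plus the smoothing term \(\alpha = 1\). The names `box_volume` and `rank_texts` and the toy features are hypothetical.

```python
from math import prod

ALPHA = 1  # Laplace smoothing, as in Eq. (1)

def box_volume(model_features, text_features, alpha=ALPHA):
    """Product over feature types of (number of matched model features + alpha)."""
    thetas = []
    for feature_type, model_items in model_features.items():
        text_items = text_features.get(feature_type, set())
        matches = sum(1 for x in model_items if x in text_items)
        thetas.append(matches + alpha)
    return prod(thetas)

def rank_texts(model_features, texts):
    """Return (text id, volume) pairs sorted by decreasing box volume."""
    volumes = {tid: box_volume(model_features, feats) for tid, feats in texts.items()}
    return sorted(volumes.items(), key=lambda item: item[1], reverse=True)

# Toy features (hypothetical). For test texts, the list of stems stands in for uniques.
model = {
    "keywords": {"match", "goal"},
    "stems": {"footbal", "play"},
    "uniques": {"offsid"},
    "entities": {"Spartak"},
}
texts = {
    "fan_page": {
        "keywords": {"match", "goal", "ticket"},
        "stems": {"footbal", "play", "offsid"},
        "uniques": {"footbal", "play", "offsid"},
        "entities": {"Spartak"},
    },
    "cooking_page": {
        "keywords": {"recipe"},
        "stems": {"cook", "play"},
        "uniques": {"cook", "play"},
        "entities": set(),
    },
}
ranking = rank_texts(model, texts)
```

Because of the smoothing term, a text that shares no features with the model still gets a volume of 1 rather than 0, so a single empty feature type does not cancel the contribution of the others.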

3.3 Distribution Function

The Distribution Function states which texts are relevant for the query based on their weight distribution. Note that attribution of a text to more than one class is possible.

Let us first consider weighting a list of texts based on two model texts from the class “football fans” with the help of the Relevance Function. Figure 1 demonstrates a list of 4,030 text weights (“box volumes”) sorted in the decreasing order.

Fig. 1. Box volumes of 4,030 texts evaluated for the MaI “football”.

The sorted weights form an exponential-like curve. The few texts in the left part of it have very high results (these are mainly texts of football communities) compared to the long “tail” on the right. The tail commences after a very steep passage between relevant and non-relevant texts. Hence, the point that separates relevant texts from irrelevant ones (the break point) should be somewhere on this steep part of the curve. To calculate it, we will analyze the difference between weights by the slope of a characteristic line connecting each point \({(x_i; y_i)}\) and the X-axis at \({(x_i+1;0)}\).

To compare slopes of BC and DE, let us rearrange the diagram so that every segment starts at the point \({(x_0; y_0)}\) and goes to \({(x_n; y_j)}\). See Fig. 2, on the left.

The slope \({a \in [0; +\infty )}\) is calculated at the point \({(x_1; y_1)}\), where \({y_i = a \cdot {x_1} + b}\). As the segment begins at 0, \({b = 0}\). We calculate \({x_1}\) as the arithmetic mean of the text weights:

$$\begin{aligned} x_1= \frac{\sum {} {V_{P_i}}}{N} \end{aligned}$$
(4)

So:

$$\begin{aligned} y_i=a_i \cdot x_1 \Longrightarrow a_i= \frac{y_i}{x_1} = \frac{y_i N}{\sum {} {V_{P_i}}} \end{aligned}$$
(5)

Empirically, we found that the best results are obtained when texts with \({a > 7.01}\) are treated as relevant. Table 2 demonstrates relevant results of the mentioned calculations for the class “football fans”.
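The break-point rule can be sketched under the reading that \(y_i\) is the box volume of text \(i\) and \(x_1\) is the mean box volume over all \(N\) texts, so that the slope is the ratio of a text's volume to the mean. The function names and the toy volumes below are hypothetical; only the threshold 7.01 comes from the text.

```python
def slopes(volumes):
    """Slope a_i = y_i / x_1, where x_1 is the arithmetic mean of all box volumes."""
    mean_volume = sum(volumes) / len(volumes)
    return [v / mean_volume for v in volumes]

def relevant_texts(volumes, threshold=7.01):
    """Indices of the texts whose slope exceeds the threshold."""
    return [i for i, a in enumerate(slopes(volumes)) if a > threshold]

# Toy volumes (hypothetical): two high-scoring texts and a long tail of low ones.
volumes = [900, 640, 12, 8, 6, 4, 4, 2, 2, 2] + [1] * 90
relevant = relevant_texts(volumes)
```

Since the slope is relative to the mean volume, the same threshold can be applied to collections with very different absolute volumes, although the text notes that the best threshold may still differ between classes.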

3.4 Tests

To test the efficiency of our algorithm, we tried several existing implementations of supervised learning algorithms from the “Scikit-learn” package [26] with different optimization parameters: SVM, Neural Networks, Naive Bayes, Logistic Regression, Decision Trees, and k-Nearest Neighbors. The training set included two Model Texts in each of the three classes; the training set for the “Miscellaneous” class was formed from the four Model Texts belonging to the two other classes. For example, for the class of “football fans”, two Model Texts go to the training set as class representatives, and the four Model Texts of historical reenactors and vegetarians form the training set for the class “Miscellaneous” [Footnote 8]. The test set contained 30 texts of the studied class (e.g. football fans), 30 texts of each of the two other classes from the training set (e.g. historical reenactors and vegetarians) and 4,000 texts of the class “Miscellaneous” (i.e. not belonging to any of the three). The only algorithm providing a comparable result in such conditions was SVM (with the linear kernel, C = 5), as Table 3 demonstrates.

Table 2. Results of the distribution function in the class of football fans.

Fig. 2. The slope of the characteristic line.

Table 3. Retrieval of the three MaIs from a collection of 4,090 texts.

It is of interest that in all three classes the F-scores of our algorithm were very close in value. “Vegetarianism” appears to be the most well-balanced class, with the three measures varying within a range of 0.02. The results would be better if the value of the slope at the break point were optimized for every particular class. But that is the drawback of having just one Model Text without a large set of labeled data. How the break point moves in different classes and with sets of different sizes is an issue yet to be studied.

4 Conclusion

In the present article, we attempted to describe a new approach to the classification of social network pages by the interests of users. We suggested that retrieval of pages of interest should be based on one or two Model Texts rather than on a large collection. Even a classifier such as SVM, which is typically used with large datasets, gives a reasonably good (better than chance) classification result with only six texts in the training set and 4,090 texts in the test set. However, we suggested our own supervised learning algorithm that outperforms SVM in the same conditions. The algorithm can be applied in a recommender system for recommendation of pages of interest based on a page that a user already follows.

In a way, our algorithm can be viewed as a simplified and more intuitive and expertise-based version of SVM, designed for a particular task. It also separates vectors in a hyperspace but in a “fuzzy” way so that one text can be attributed to several classes. However, with the lack of a large set of labeled data for training we cannot be sure that the break point is always the same. In a real life situation, a user can be offered the whole rated list of pages starting with the top results until they stop scrolling for further pages.

As for further research, we are planning to modify our algorithm for tasks like learning individual user interests and their specification, i.e. when a major interest can be specified into smaller ones which attract subgroups of users. For example, vegetarians call themselves “vegans”, “rawatarians”, “fruitarians”; football fans support one particular football team; historical reenactors deal with particular periods of time and certain cultures. Finally, we think that detecting a social group automatically when nothing is known about it yet (unsupervised learning of interests) is the most challenging task.