Elsevier

Information Systems

Volume 34, Issue 6, September 2009, Pages 511-535
Exploitation of semantic relationships and hierarchical data structures to support a user in his annotation and browsing activities in folksonomies

https://doi.org/10.1016/j.is.2009.02.004

Abstract

In this paper we present a new approach to supporting users in annotating and browsing resources referred to by a folksonomy. Our approach is characterized by the following novelties: (i) it proposes a probabilistic technique to quickly and accurately determine the similarity and the generalization degrees of two tags; (ii) it proposes two hierarchical structures and two related algorithms to arrange groups of semantically related tags in a hierarchy; this allows users to visualize tags of interest according to the desired semantic granularity and, then, helps them to find the tags that best express their information needs. In this paper we first illustrate the technical characteristics of our approach; then we describe various experiments that test its performance; finally, we compare it with other related approaches already proposed in the literature.

Introduction

The term folksonomy is currently used to indicate a support data structure that allows human users to classify and categorize various kinds of resources (e.g., URLs, photos, videos, scientific papers, and so on) by means of plain keywords, also known as tags [6], [20], [31]. A folksonomy consists of a set of URIs (used to identify the resources it refers to), a set of tags (used to label these resources) and a set of users (who produce and label these resources) [26].

The number of folksonomies on the Web has been rapidly increasing since 2004. In the meantime, the number of resources referred to by each folksonomy, as well as the number of users who exploit folksonomies, has also been growing rapidly [6]. Popular examples of folksonomies are Flickr [3] (which allows users to annotate their photos), del.icio.us [2] (which allows users to store and share their Web bookmarks), and Bibsonomy [1] (which allows users to share bibliographic data on scientific papers).

Indeed, folksonomies are becoming increasingly popular not only on the Web but also in large organizations and businesses (see, for instance, [15], [33]).

Some of the main reasons underlying the pervasive diffusion of folksonomies are the following:

  • In traditional knowledge management systems, classification activity is performed by a human expert or a pool of human experts. As the size and the variety of available information increase, both the costs and the time required to carry out classification in this way increase too.

    Specifically, we can observe that the rate at which new resources are referred to by a folksonomy is very high; for instance, the authors of [25] report that, in June 2007, del.icio.us users posted about 120,000 URLs per day; the same study estimated that the number of URLs stored in del.icio.us in 2007 was about 115 million. Clearly, at this rate, the amount of time needed by human experts to view, analyse and catalogue these URLs would be huge. Moreover, the pool of human experts would have to be huge and, consequently, the classification costs would be prohibitive.

    In addition, resources referred to by a folksonomy encompass a large number of disparate domains: for instance, we observed that, in del.icio.us, 524,109 URLs were labelled with the tag “Environment” whereas 617,927 were labelled with the tag “Database”; as a consequence, in this system, there are very important and frequent topics (which, therefore, cannot be neglected) that are, at the same time, very different and possibly related to each other. A human expert (or a pool of human experts) would need to master a very large vocabulary spanning many specialized domains. Moreover, since new topics rapidly emerge, the experts would be required to swiftly acquire new knowledge.

    An analogous problem arises for large-scale businesses; in this context, too, the number and the variety of available resources are high; therefore, performing and maintaining a classification of available resources is a very complex task which cannot be delegated to a human expert (or a pool of human experts).

    In both contexts described above, a totally different form of coordination appears necessary, in which each user classifies his resources and users’ classifications are made available to other users on the basis of their needs/requirements.

  • Many authors [6], [25] conjecture that the usage of folksonomies (in particular, those regarding Web bookmarks) can enhance the performance of traditional Web search engines. As an example, the authors of [6] observe that the set of tags used to label a Web page can be more effective than a classical information retrieval technique (like TF/IDF) to summarize its content. As a consequence, tags can play the role of metadata and can be effectively used to compute the degree of match between a query and a Web page.

  • Folksonomies are useful to highlight hidden social ties among users. For instance, the authors of [22] propose an approach to organize tags in a hierarchical structure and to associate a specific topic with each tag. After this, given two users u1 and u2 and a topic T, this approach considers the set I(u1,T) (resp., I(u2,T)), consisting of the set of resources that u1 (resp., u2) labelled with the tags associated with T, and computes the degree of overlap between I(u1,T) and I(u2,T); if this degree is higher than a specific threshold it is possible to conclude that there exists a social tie between u1 and u2. Therefore, the exploitation of folksonomies provides a notion of users’ social tie more effective than that generally proposed in commercial systems; in fact, these systems assume that there exists a social tie between two users only if this fact is explicitly claimed by them; as a consequence, with these systems, there would not exist a social tie between two users who do not know each other.

  • In large businesses folksonomies are useful to identify communities of users with shared (or complementary) interests as well as experts on a certain topic. For instance, in the social bookmark system DOGEAR [33] a firm expert can insert a tag t and can identify all URLs tagged with t; after this, he can retrieve the list of all firm experts who exploited t and can contact them to receive help or additional material on a subject related to t. Moreover, he can select one (or more) of the firm experts and can browse his (their) lists of bookmarks to access new resources, to strengthen his skills or acquire new ones.
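The social-tie check attributed to [22] in the third item above can be sketched as follows. This is only an illustrative reading of the description: the Jaccard ratio is assumed as the "degree of overlap", the threshold value is arbitrary, and all function names are hypothetical.

```python
def resources_on_topic(assignments, user, topic_tags):
    """I(u, T): resources that `user` labelled with at least one tag of topic T.

    `assignments` is an iterable of (user, resource, tag) triples.
    """
    return {r for (u, r, t) in assignments if u == user and t in topic_tags}


def overlap(a, b):
    """Jaccard overlap of two sets (an assumed measure); 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def has_social_tie(assignments, u1, u2, topic_tags, threshold=0.5):
    """True if the overlap between I(u1, T) and I(u2, T) exceeds the threshold."""
    i1 = resources_on_topic(assignments, u1, topic_tags)
    i2 = resources_on_topic(assignments, u2, topic_tags)
    return overlap(i1, i2) > threshold
```

Under these assumptions, two users who never declared any relationship are still linked as soon as they label largely the same resources on a topic.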

However, as clearly pointed out in [20], [41], despite these advantages, folksonomies suffer from some quite crippling disadvantages, which can be summarized as: (i) ambiguity, (ii) usage of synonymous tags, and (iii) discrepancy on granularity of tags. In order to concretely illustrate these disadvantages we consider a real-life folksonomy dealing with Databases and Information Systems; it will be the reference folksonomy throughout the paper. It covers a wide spectrum of topics, like the design/implementation of a Database/Information System, the usage of Information Systems on the Web, the usage of object-oriented programming languages in Databases/Information Systems, and so on.

Ambiguity refers to the fact that some terms may have multiple meanings. The basic example of ambiguity is homonymy; we say that two terms are homonymous if they have the same name but different meanings. For instance, in our reference folksonomy, the tag “Generalization” could be exploited to label both slides about the inheritance among classes in an object-oriented programming language and slides about the generalization relationship in E/R diagrams. As a consequence, the answer to a query consisting of the term “Generalization” performed by a user interested in E/R diagrams would also include a reference to the slides about the inheritance among classes in an object-oriented language.

Usage of synonymous tags in a folksonomy means that different users could exploit different (yet synonymous) tags to label/query the same type of resources (or, even, the same resource). For instance, in our reference folksonomy, some resources about an E/R modelling tool could be labelled with the tag “Data Modelling” by some users, whereas other resources about the same tool could be labelled with the tag “Database Design” by other users. As a consequence, a user who ignores the tag “Database Design” and submits a query containing only the tag “Data Modelling” would not receive some relevant resources as answer to his query. Generally speaking, a user could receive a complete set of answers to his query only if his vocabulary is rich enough to allow him to encompass a tag with the whole set of its synonyms. This hypothesis is clearly unrealistic because users’ vocabulary is often quite limited.

Discrepancy on granularity of tags could arise because a resource could be reasonably described by various tags, ranging from terms having a broad meaning to terms characterized by a narrow meaning. Therefore, some users, according to their expertise level and cultural background, may prefer to exploit generic tags, whereas other users could be driven to exploit specific tags. For instance, consider a tutorial about the JOIN clause in SQL and assume that it is labelled only by the tag “Join”. An expert user could submit a query containing very specific tags like “Outer Join” or “Left Join” and, then, he would not receive the tutorial about the join operator, even though it may be relevant to his goals. By contrast, a novice user could submit a query containing a generic tag like “Select Clause” and, therefore, he would not receive the tutorial either.

In order to better face the three problems mentioned above, a tool capable of parsing the set of tags a user is inserting (for either cataloguing a resource in a folksonomy or submitting a query over it), in such a way as to interactively suggest new related tags, would be extremely useful. As a matter of fact, these new tags could help him to more properly label the resource he is registering or to more precisely specify the query he is submitting. More concretely, suggested tags would be able to:

  • Disambiguate the meaning of a tag. For instance, with regard to the previous example about ambiguity, if a user is inserting the tag “Generalization”, the tool could suggest tags like “Class Diagram” and “E/R Diagram”; then, the user can examine these tags and can enrich his query by selecting those that best specify his needs.

  • Extend the vocabulary of a user. For instance, with reference to the example about the usage of synonymous tags, a user who is inserting the tag “Data Modelling” could receive the tag “Database Design” to complete his query.

  • Enable users to opt for the subjectively “right” level of granularity. For instance, with reference to the example about discrepancy on granularity of tags, the system can suggest a set of tags having a broader or a narrower meaning than that characterizing the tags specified by a user. As an example, if a user is inserting the tag “SQL”, the system can suggest a set of more specific tags, like “Join”, and a set of more generic tags, like “RDBMS”. The user can select, among these tags, those that best specify the desired level of granularity.

Clearly, the number of suggested tags should be limited in such a way that the time and the effort required of the user to evaluate proposed tags are reasonable.

From the previous discussion it emerges that the vaguer a user's knowledge of a domain is, the higher the benefit he would gain from such a tool. In fact, if a user has a vague knowledge of a domain, it might happen that: (i) he ignores possible multiple meanings of a tag; (ii) he has a limited vocabulary (and, then, he would be incapable of recognizing terms having a similar meaning), and (iii) he is not aware of the granularity of terms used to label a resource (and, then, he could use tags with a wrong level of granularity).

This paper provides a contribution in this setting; in fact, it proposes an approach to supporting social annotations and browsing activities in folksonomies.

Our approach operates as follows. Assume that a user wants to query a folksonomy or to label a resource he is registering in it. Initially he starts by specifying a set TSetInput of tags; this set could be incomplete and/or imprecise. Our approach first derives a set NeighTSetInput of tags semantically related to those of TSetInput and capable of making the query or the set of labels more complete and precise. After this, it visualizes this supplementary set of tags as a hierarchy and allows the user to interactively “drill down” into it or “roll up” through it in such a way as to easily find the desired level of granularity and, then, the most adequate tags.

From the previous description we can observe that our approach can be exploited in two different scenarios. In the first one a user submits a query generally consisting of a small set of tags representing his needs. In this activity he must focus mainly on understanding his needs and on thinking of the most suitable tags to express them. In the second scenario a user labels his resources and he must try to perform this task in such a way that all possible interested users can quickly and easily retrieve them. As a consequence, he must focus mainly on finding the largest set of tags that can represent the semantics of the resources he is labelling in such a way that, even if the users who query the folksonomy adopt different tags to express the same concept, the system can successfully answer them.

Even if these two scenarios are very different from the user standpoint, they do not require different algorithms to manage them. In fact, in both scenarios, it is necessary to derive groups of tags related to those already inserted by the user for his query submission or resource annotation tasks and, then, to arrange the derived tags in suitable hierarchies. For these reasons, and since the focus of this paper is on algorithms rather than on user interfaces, in the following we shall cope with the problems of resource annotation and query submission in a unitary fashion.

In order to construct NeighTSetInput (i.e., the set of tags semantically related to those initially specified by the user) our approach defines and uses a suitable semantic distance function which receives two tags and returns a value in the real interval [0,1]; the lower this value is, the more related the two tags are. This function relies on the conjecture that the co-occurrence of two tags is a useful sign (even if it is not an absolute one) of their semantic similarity. This conjecture is largely accepted in the literature; various papers have supplied justifications for it and, as an example, some very detailed experiments providing an empirical confirmation of it are reported in [7], [31].
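A minimal sketch of such a distance is given below. The paper states later that the distance is computed through the Jaccard coefficient of the sets of resources labelled by the two tags; mapping that coefficient to a distance as 1 - J is an assumption made here for illustration, as are the function names and the dictionary layout.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def semantic_distance(resources_by_tag, t1, t2):
    """Distance in [0, 1]: 0 for tags labelling the same resources, 1 for no co-occurrence.

    `resources_by_tag` maps each tag to the set of resources it labels.
    The form 1 - Jaccard is an assumption, not the paper's stated formula.
    """
    return 1.0 - jaccard(resources_by_tag[t1], resources_by_tag[t2])


# Hypothetical toy data: two tags that often label the same resources, one that never does.
resources_by_tag = {
    "Data Modelling": {"r1", "r2", "r3"},
    "Database Design": {"r2", "r3", "r4"},
    "CSS": {"r9"},
}
print(semantic_distance(resources_by_tag, "Data Modelling", "Database Design"))
print(semantic_distance(resources_by_tag, "Data Modelling", "CSS"))
```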

In order to organize the tags of NeighTSetInput into a hierarchy, our approach defines two different functions and, starting from them, implements two algorithms.

The former function is called generalization degree; it receives a pair of tags ti and tj and returns a real number belonging to the interval [0,1]; this number indicates how much more general the meaning of ti is than the meaning of tj. For instance, consider two tags t1=“Join” and t2=“Outer Join” of our reference folksonomy; in this case, the generalization degree of t1 vs. t2 is close to 1 because t1 has a more general meaning than t2.

The latter function is called semantic granularity; it receives a tag and returns a real number belonging to the interval [0,1]; the more general the meaning of a term is, the higher its semantic granularity will be. For instance, in our reference folksonomy, the semantic granularity of the tag “Database” is close to 1 because this term has a broad meaning in it; on the contrary, the tag “Left Join” has a low semantic granularity because this term has a narrow meaning in it.

Observe that the generalization degree is a relative measure; in fact, it can specify that the tag “Join” is more general than the tag “Outer Join”, but it cannot specify if “Join” has a broad or a narrow meaning in the reference folksonomy taken as a whole. On the contrary, the semantic granularity is a global measure because it is computed by taking the whole folksonomy into account; in fact, for instance, it can specify that the two tags “SQL” and “E/R model” have broad meanings in the reference folksonomy; however, it cannot specify if one of them is more general than the other or if no generalization relationship exists between them.

The first algorithm is called MST-based; it initially constructs a suitable weighted and directed graph (called generalization graph) whose nodes represent the tags of NeighTSetInput and whose arcs denote the generalization degree among tags. After this, it derives the hierarchy by computing the maximum spanning tree associated with the constructed generalization graph.
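The idea of extracting a tree from the generalization graph can be illustrated with a short sketch. An exact maximum spanning tree of a directed graph (an optimum branching) requires a dedicated algorithm; the code below swaps in a simpler greedy, Prim-style growth from a chosen root, which always yields a rooted tree but is not guaranteed to have maximum total weight. The function name, the choice of root and the degree values are hypothetical.

```python
def greedy_hierarchy(tags, gen_degree, root):
    """Grow a rooted tree over `tags`, returning a child -> parent mapping.

    `gen_degree[(ti, tj)]` is the generalization degree of ti vs. tj
    (missing pairs count as 0.0). At each step the tag outside the tree
    that is reached by the heaviest arc leaving the tree is attached.
    This is a greedy simplification, not an exact optimum branching.
    """
    parent = {root: None}
    while len(parent) < len(tags):
        best = None  # (weight, parent_tag, child_tag)
        for u in parent:
            for v in tags:
                if v in parent:
                    continue
                w = gen_degree.get((u, v), 0.0)
                if best is None or w > best[0]:
                    best = (w, u, v)
        _, u, v = best
        parent[v] = u
    return parent


# Hypothetical generalization degrees for four tags of the reference folksonomy.
tags = ["Database", "SQL", "Join", "Outer Join"]
gen_degree = {
    ("Database", "SQL"): 0.9,
    ("Database", "Join"): 0.5,
    ("SQL", "Join"): 0.8,
    ("SQL", "Outer Join"): 0.4,
    ("Join", "Outer Join"): 0.95,
}
print(greedy_hierarchy(tags, gen_degree, root="Database"))
```

With these values each tag ends up attached to its closest generalization, so a user can walk from “Database” down to “Outer Join” one level at a time.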

The second algorithm is called Concentric; it associates each tag of NeighTSetInput with a coefficient stating its semantic granularity. Tag semantic granularities can be used to build a suitable data structure; this can be graphically depicted as a set of concentric circles such that the innermost ones are associated with the most specific tags, whereas the outermost ones are associated with the most general tags.
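The concentric structure can be sketched as a simple banding of tags by semantic granularity. The granularity values are taken as given inputs; splitting [0,1] into equal-width bands, the number of circles and the function name are assumptions made for illustration. Following the description above, index 0 denotes the innermost circle, which holds the most specific tags.

```python
def concentric_circles(granularity, num_circles):
    """Group tags into `num_circles` circles by semantic granularity in [0, 1].

    circles[0] is the innermost circle (lowest granularity, most specific tags);
    circles[-1] is the outermost one (highest granularity, most general tags).
    Equal-width bands are an assumed design choice.
    """
    circles = [[] for _ in range(num_circles)]
    for tag, g in granularity.items():
        index = min(int(g * num_circles), num_circles - 1)  # g == 1.0 goes to the last band
        circles[index].append(tag)
    return circles


# Hypothetical granularity values for tags of the reference folksonomy.
granularity = {"Database": 0.95, "SQL": 0.7, "Join": 0.4, "Left Join": 0.1}
print(concentric_circles(granularity, num_circles=3))
```

A user browsing the circle that contains a given tag sees tags of comparable granularity, and moving to an adjacent circle changes the level of detail by one band.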

Observe that the information contents implied by the two hierarchies constructed by the MST-based and the Concentric algorithms are different. In fact, the hierarchy constructed by the MST-based algorithm is a tree and, therefore, it emphasizes the generalization relationships among tags. By contrast, the hierarchy returned by the Concentric algorithm puts in each circle tags having very close semantic granularities. This implies that, in the hierarchy produced by the MST-based algorithm, a tag ti (e.g., “Join”) is linked to a tag tj (e.g., “Outer Join”) if they refer to the same topic and ti is more general than tj; as a consequence, a user can locate a tag ti in the hierarchy and can explore it to find tags more general or more specific than ti. In the hierarchy returned by Concentric, instead, a tag ti is located in an inner (resp., outer) circle if it has a high (resp., low) semantic granularity; as a consequence, a user can locate a tag ti and can explore the corresponding circle (resp., move to an inner one, move to an outer one) if he desires tags with the same (resp., a higher, a lower) semantic granularity. The previous observation points out that the two algorithms are orthogonal; as a consequence, in the prototype implementing our approach, a user can choose to exploit the former, the latter, or both of them.

The general features of our approach and its capability of suggesting tags make it capable of successfully facing the problems of ambiguity, usage of synonymous tags and discrepancy on granularity of tags introduced above. Specifically:

  • The first step of our approach analyses all tags of TSetInput and, without any user intervention, explores the whole space of available tags to construct the neighbourhood NeighTSetInput of TSetInput, i.e., the set of tags semantically close to those of TSetInput. NeighTSetInput is computed by simultaneously considering all tags of TSetInput, rather than each tag separately from the other ones. This simple form of “context” definition allows our approach to disambiguate the meaning of terms specified by the user and, then, to avoid possible homonyms. As a consequence, our approach is capable of solving the “ambiguity” problem mentioned above.

    We illustrate this capability through some examples regarding our reference folksonomy. As a first example, consider the case in which TSetInput= {“XML”, “Database”}. If each tag of TSetInput were considered separately from the other ones, then tags like “CSS” (related to “XML” but not to “Database”) or “Relational Algebra” (related to “Database” but not to “XML”) would be included in NeighTSetInput; the user would receive some resources, like tutorials on CSS or lecture notes on Relational Algebra, that are unlikely to be relevant to him. By contrast, if all tags are jointly examined, the user would receive resources concerning both “Database” and “XML”, e.g., tutorials on XQuery or lecture notes on XML databases.

    As a second example, assume that the user adds the tag “SQL” to the previous TSetInput; the joint examination of all these tags allows our system to refine the set of resources presented to the user in the previous example; specifically, it can suggest resources like a tutorial on the mapping of XML databases onto relational ones.

    As a third example, assume that the user adds the tag “XQuery” to the set TSetInput of the previous example; in this case, the joint examination of all these tags allows our system to filter out some of the resources proposed to the user in the previous example and to retrieve only resources very tailored to his needs (e.g., a tutorial on how some features of XQuery, like the FLWR expressions, are implemented in relational databases).

  • Given a set TSetInput of tags, the exploitation of the semantic distance function allows NeighTSetInput to contain tags synonymous with those of TSetInput. These tags are proposed by our system and validated by the user. In this way our system is capable of coping with the “usage of synonymous terms” problem introduced above.

  • During the second step, our approach organizes the tags of NeighTSetInput into a hierarchy; each level of this hierarchy stores concepts at a certain granularity level. As a consequence, when a user specifies his tags, he can choose the hierarchy level that he considers the most adequate to his knowledge background, needs and desires. In this way, our approach is capable of facing the “discrepancy on granularity” problem described above.

In addition to the three problems mentioned above, our approach aims to face two further problems, which arise when the folksonomies to be handled are very large and change rapidly over time (think, for instance, of Flickr or del.icio.us). These problems, and the corresponding solutions proposed by our approach, are as follows:

  • The number of resources referred to by a folksonomy is often huge and rapidly varies over time. As will be clear in the following, this problem could negatively influence the computation of tag semantic distances and, ultimately, the construction of both NeighTSetInput and the associated hierarchy. As specified in the following sections, in our approach the computation of the semantic distance between two tags is performed by computing the Jaccard coefficient of the sets of resources labelled by them. In order to face the problem of the size and the rapid variation of the resources referred to by a folksonomy, we have decided to apply a heuristic for the estimation of the Jaccard coefficient. In the literature, the authors of [13] proposed a Monte Carlo technique for carrying out this estimation. In our opinion this technique is perfectly tailored to our application context. As a consequence, we decided to apply this heuristic in our system without modifying it.

  • The number of tags present in a folksonomy can be very large and can rapidly vary over time. As will be clear in the following, the construction of NeighTSetInput requires the computation of the semantic distances associated with all possible pairs of tags; clearly, this task is very time consuming. In order to face this problem, our approach associates a suitable data structure, called neighbourhood list, with each tag. This choice is in line with analogous choices adopted in the past by approaches that had to solve the same problem, even if in different application contexts. The neighbourhood list of a tag stores the set of those tags semantically closest to it. The presence of this data structure allows our approach to examine, for each tag, only those tags belonging to its neighbourhood list, instead of all available tags; this allows a considerable reduction of its execution time.
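To give a flavour of how a randomized estimate of the Jaccard coefficient can replace the exact computation, the sketch below uses min-wise hashing, a well-known randomized estimator. It is not claimed to be the technique of [13]; the hash construction, the number of hash functions and all names are assumptions made for illustration.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of `item` under a given integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(resources, num_hashes=128):
    """Signature of a non-empty set: the minimum hash value under each seeded hash."""
    return [min(_hash(seed, r) for r in resources) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard coefficient."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical resource sets for two tags; their exact Jaccard coefficient is 200/400 = 0.5.
tag_a = {f"url{i}" for i in range(0, 300)}
tag_b = {f"url{i}" for i in range(100, 400)}
print(estimate_jaccard(minhash_signature(tag_a), minhash_signature(tag_b)))
```

Signatures have a fixed length regardless of how many resources a tag labels, so comparing two tags costs the same for small and very large resource sets, at the price of an approximation error that shrinks as the number of hash functions grows.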

The plan of this paper is as follows: in Section 2 we provide some preliminary concepts largely used throughout this paper. In Section 3 we describe the first phase of our approach, i.e., the construction of NeighTSetInput. Section 4 illustrates the second phase, i.e., the construction of hierarchies. Experiments carried out to evaluate the performance of our approach are reported in Section 5. A comparison between our approach and other related ones previously proposed in the literature can be found in Section 6. Finally, in Section 7, we draw our conclusions.

Section snippets

Basic definitions

In this section we illustrate some preliminary concepts that will be extensively exploited in the next sections of this paper. The first concept is that of folksonomy [26].

Definition 2.1

Let $USet=\{u_1,\ldots,u_p\}$ be a set of users, let $RSet=\{r_1,\ldots,r_m\}$ be a set of resource URIs and let $TSet=\{t_1,\ldots,t_n\}$ be a set of tags. A folksonomy $F$ is a tuple $F=\langle USet,RSet,TSet,ASet\rangle$, where $ASet\subseteq USet\times RSet\times TSet$ is a ternary relationship called tag assignment set.
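As an illustration, the tuple of Definition 2.1 can be represented by a small data structure. The class, field and method names below are hypothetical and only mirror the four components of the definition.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class TagAssignment(NamedTuple):
    """One element of ASet: a (user, resource URI, tag) triple."""
    user: str
    resource: str
    tag: str


@dataclass
class Folksonomy:
    """F = <USet, RSet, TSet, ASet>, with ASet a subset of USet x RSet x TSet."""
    users: set = field(default_factory=set)
    resources: set = field(default_factory=set)
    tags: set = field(default_factory=set)
    assignments: set = field(default_factory=set)

    def assign(self, user, resource, tag):
        """Record that `user` labelled `resource` with `tag`, keeping the four sets consistent."""
        self.users.add(user)
        self.resources.add(resource)
        self.tags.add(tag)
        self.assignments.add(TagAssignment(user, resource, tag))

    def resources_of(self, tag):
        """Resources labelled with `tag` by at least one user."""
        return {a.resource for a in self.assignments if a.tag == tag}


f = Folksonomy()
f.assign("u1", "http://example.org/sql-tutorial", "Join")
f.assign("u2", "http://example.org/sql-tutorial", "SQL")
print(f.resources_of("Join"))
```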

In this definition we do not make any hypothesis about the nature of

Phase 1: neighbourhood computation

In this section we illustrate Phase 1 of our approach, i.e., that phase which receives a set TSetInput of tags specified by a user and returns a set NeighTSetInput of semantically related tags.
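A rough sketch of this phase is given below, under the assumption that each tag already has a precomputed neighbourhood list and that the input tags are examined jointly by intersecting their lists (one possible reading of an AND-like strategy). The function name and the toy lists are hypothetical.

```python
def neighbourhood(input_tags, neighbourhood_lists):
    """Tags that appear in the neighbourhood list of every input tag.

    `neighbourhood_lists` maps a tag to the collection of tags semantically
    closest to it. Intersecting the lists is an assumed strategy; the input
    tags themselves are excluded from the result.
    """
    input_tags = list(input_tags)
    if not input_tags:
        return set()
    common = set(neighbourhood_lists.get(input_tags[0], ()))
    for tag in input_tags[1:]:
        common &= set(neighbourhood_lists.get(tag, ()))
    return common - set(input_tags)


# Hypothetical neighbourhood lists echoing the "XML" / "Database" example of the Introduction.
neighbourhood_lists = {
    "XML": {"CSS", "XQuery", "XML Database"},
    "Database": {"Relational Algebra", "XQuery", "XML Database"},
}
print(neighbourhood({"XML", "Database"}, neighbourhood_lists))
```

With these lists, “CSS” and “Relational Algebra” are dropped because each is close to only one of the two input tags, while “XQuery” and “XML Database” are retained.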

The problem of deriving tags semantically related to other ones has been studied in the past and proposed solutions are generally based on data clustering techniques (see, for instance, [7], [10], [31]). The computational cost associated with a graph clustering algorithm is $O(n^3)$, where n is the number of

Phase 2: hierarchy construction

In this section we illustrate Phase 2 of our approach; it receives the set NeighTSetInput of tags returned at the end of Phase 1 and organizes them in a hierarchical fashion.

In order to carry out all tasks of this phase we have defined two orthogonal algorithms, called Maximum Spanning Tree based (hereafter, MST-based) and Concentric, each characterized by some specific features.

Experiments

In order to evaluate our approach we built a prototype in Java and MySQL. We carried out all experiments on a personal computer equipped with a 3.4 GHz CPU and 1 GB of RAM. In Figs. 2 and 3 we report some screenshots of our prototype. Specifically, Fig. 2 shows the original set of tags specified by the user (TSetInput) along with the “expanded” set of tags (NeighTSetInput) returned by our system when the AND-like strategy is applied. In Fig. 3 we report the tag hierarchy obtained after the

Comparison with approaches based on the exploitation of external data sources

Some authors have suggested exploiting external data sources in the detection of groups of semantically related tags and in the construction of a hierarchy. Exploited data sources range from simple thesauri (e.g., WordNet [34]) to complex ontologies.

Authors in [28] propose an approach to constructing a tree of tags in a folksonomy. This approach iteratively picks an input tag and creates a chain going from the unique root of the noun hierarchy of WordNet to the examined tag. At the end of this

Conclusions

In this paper we have presented a new approach to supporting users in performing social annotations and browsing activities in folksonomies. Our approach receives a set TSetInput of tags specified by a user. It first constructs a set NeighTSetInput of tags semantically related to those specified in TSetInput. After this, it organizes tags of NeighTSetInput in a hierarchy in such a way as to allow a user to visualize the tags of his interest according to the desired semantic granularity as well as

Acknowledgements

The authors thank Giuseppe Barillà for his contribution to the implementation of the proposed approach.

The authors also thank the anonymous Referees whose precious suggestions allowed them to greatly improve the quality of this paper.

References (44)

  • G. Begelman, P. Keller, F. Smadja. Automated tag clustering: improving search and exploration in the tag space, in:...
  • A. Broder

    On the resemblance and containment of documents

  • C.H. Brooks et al.

    Improved annotation of the blogosphere via autotagging and hierarchical clustering

  • C. Cattuto et al.

    Network properties of folksonomies

    Artificial Intelligence Communications

    (2007)
  • T.H. Cormen et al.

    Introduction to Algorithms

    (2001)
  • B. Cripe. Folksonomy, keywords, and tags: Social and democratic user interaction in enterprise content management,...
  • L. Ding et al.

    Swoogle: a search and metadata engine for the semantic web

  • P. Drineas et al.

    Clustering in large graphs and matrices

  • J. Edmonds

    Optimum branchings

    Journal of Research of the National Bureau of Standards

    (1967)
  • D. Eppstein et al.

    Fast approximation of centrality

  • S.A. Golder et al.

    Usage patterns of collaborative tagging systems

    Journal of Information Science

    (2006)
  • S. Gollapudi et al.

    Using bloom filters to speed up HITS-like ranking algorithms
