1 Introduction

The past decade has witnessed the massive growth of the social web, the continued impact and expansion of the world wide web, and the increasing importance and synergy of content modalities such as text, images, videos, opinions, and other data. There are currently about 200 active social networks that attract hundreds of millions of visitors each month. Online visitors spend considerable amounts of time on social network platforms, where they constantly contribute, consume, and implicitly evaluate content. The Facebook community alone, with over 1.2 billion members, shares around 30 billion pieces of content every month [15]. The knowledge contained in these massive data networks is unprecedented and, when harvested, can be made useful for many applications. Although research has started to automatically mine information from these rich sources, the problem of knowledge extraction from multimedia content remains difficult. The main challenges are the heterogeneity of the data, the scalability of the processing methods, and the reliability of their predictions.

In order to address these challenges in the social web domain, recent research has exploited semantics in multimodal information retrieval, and especially in image retrieval [11]. However, the focus has resided on image processing; so far, the text similarity methods used for multimodal retrieval have been fairly mainstream [22]. In this work, we focus on semantic-based keyword search while specifically considering the optimization of processing time, thus making our approach manageable in an information system.

This paper makes two contributions. As the first contribution, we explore the effect of semantic similarity and optimization methods in text-based image retrieval in social media by applying Word2Vec [16] and Random Indexing (RI) [21]. This represents one possible form of a semantic concept index. We particularly focus on the optimization of these algorithms so that they scale to real-world collection sizes, enabling more effective semantic-based keyword search on the (social) web. With an execution time roughly 40 times slower than standard TF-IDF in Solr, especially for longer documents, it is clear that optimization is paramount if semantic search is to become applicable and useful. We apply and evaluate two optimization techniques that contribute to this goal.

The second contribution is an architecture and test-bed for integrating and evaluating algorithms and methods for semantic indexing and keyword search. It is designed as a combined faceted index for multimodal content collections, such as MediaEval Diverse Images [9, 10]. The framework is based on a flexible document model and incorporates concepts as an extension toward more generalized forms of information search that exceed the classic bag-of-words approach. The interlinked nature of these parts has the benefit of being flexible with respect to many kinds of multimodal and also multilingual documents. Each facet can be transformed into a semantic representation based on a dynamic set of algorithms. The index itself is implemented efficiently using flexible facet indices and a document index that can be combined and used based on the data at hand. The first contribution additionally serves as an application use-case for this architecture.

The following section describes the related work in the domains of faceted, multimodal and semantic indexing and search; in particular, we cover concept-based information retrieval. We describe our indexing architecture together with an application example of a semantic index in Sect. 3. Focusing on questions of optimization, we explain two methods, followed by a discussion and comparison, in Sect. 4. We summarize our findings in Sect. 5 and subsequently elaborate on a range of future plans.

2 Related Work

While different modalities often occur together in the same document (scientific paper, website, blog, etc.), search through these modalities is usually done for each modality in isolation. It is well known that combining information from multiple modalities assists in retrieval tasks. For instance, the results of the ImageCLEF campaign’s photographic retrieval task have shown that combining image and text information results in better retrieval than text alone [17]. There are two fundamental approaches to fusing information from multiple modalities: early fusion and late fusion [7].

Late fusion is widely used, as it avoids working in a single fused feature space and instead fuses results by reordering them based on the scores of the individual systems. Clinchant et al. [3] propose and test a number of late fusion approaches involving the sum or product combination of weighted scores from text and image retrieval systems. Difficulties arise from

  • weights that must be fixed in advance or that need to be learned from difficult-to-obtain training data,

  • modality weights that might be query dependent, and

  • weights that are sensitive to the IR system performance for the various modalities [7].

Separate queries are needed for each modality: to find a picture of a cat in a database of annotated images, for example, one would need to provide both a picture of a cat and text about the cat. There are ways of getting around this limitation, such as choosing the images of the top returned text documents as seeds for an image search [7], but these are generally ad hoc.
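As an illustration of this family of methods, the following is a minimal sketch of a weighted linear (CombSUM-style) late-fusion step over per-document scores from a text and an image retrieval system. It is our own simplification, not the exact scheme of [3], and the fixed weight w_text is exactly the kind of parameter the list above identifies as problematic.

```python
def late_fusion(text_scores, image_scores, w_text=0.7):
    """Weighted-sum late fusion of two score dictionaries (doc id -> score).

    Assumes both score sets are already normalized to a comparable range;
    w_text is a hypothetical, manually fixed modality weight.
    """
    doc_ids = set(text_scores) | set(image_scores)
    fused = {d: w_text * text_scores.get(d, 0.0)
                + (1.0 - w_text) * image_scores.get(d, 0.0)
             for d in doc_ids}
    # Final ranking: documents ordered by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)
```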

With early fusion, a query does not have to contain elements from all modalities in the dataset. To continue the previous example, pictures of a cat could be found with text input only. Early fusion suffers from the problem that text tends to sparsely inhabit a large feature space, while non-text features have denser distributions in a smaller feature space. It is, however, possible to represent images sparsely in higher-dimensional feature spaces through the use of bags of ‘visual words’ [4], obtained by clustering local image features. The simplest approach to early fusion is to concatenate the feature vectors of the different modalities. However, concatenated feature vectors become less distinctive due to the curse of dimensionality [7], making this approach rather ineffective. A solution proposed by Magalhães and Rüger [14] is to transform the feature vectors, reducing the dimension of the text feature vectors and increasing the dimension of the image feature vectors using the minimum description length (MDL) principle.
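For contrast, the simplest early-fusion variant mentioned above amounts to concatenating the per-modality feature vectors before indexing or learning; a minimal sketch, assuming the features are already extracted as NumPy arrays:

```python
import numpy as np

def early_fusion(text_features: np.ndarray, image_features: np.ndarray) -> np.ndarray:
    """Naive early fusion: concatenate sparse, high-dimensional text features
    with dense, low-dimensional image features into one vector. The result
    inherits the dimensionality problems discussed above."""
    return np.concatenate([text_features, image_features])
```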

Textual features have been used in many multimodal retrieval systems. For instance, Eskevich et al. [8] recently considered a wide range of text retrieval methods in the context of multimodal search for medical data, while Sabetghadam et al. [20] used text features in a graph-based model to retrieve images from Wikipedia. However, these works do not particularly exploit text semantics.

In the text retrieval community, work on text semantics started with Latent Semantic Analysis/Indexing (LSA/LSI) [6], the pioneering approach that initiated a new trend of going beyond surface text analysis. LSA has also been used for image retrieval [18], but the method’s practicality is limited by efficiency and scalability issues caused by the high-dimensional matrices it operates on. Explicit Semantic Analysis (ESA) is one of the early alternatives aimed at reducing the computational load [12]. However, unlike LSA, ESA relies on a pre-existing set of concepts, which may not always be available. Random Indexing (RI) [21] is another alternative to LSA/LSI that creates context vectors based on the occurrence of word contexts. It has the benefit of being incremental and of operating with significantly fewer resources, while producing inductive results similar to LSA/LSI and not relying on any pre-existing knowledge. Word2Vec [16] further expands this approach while being highly incremental and scalable. When trained on large datasets, it also captures many linguistic subtleties (e.g., the relation between Italy and Rome being similar to that between France and Paris), which allows basic arithmetic operations within the model. This, in principle, allows exploiting the implicit knowledge within corpora. All of these methods represent words in vector spaces.
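To make the mentioned vector arithmetic concrete, the following sketch queries a trained Word2Vec model for the Italy/Rome vs. France/Paris analogy; it assumes the gensim library and a hypothetical pre-trained vector file, neither of which is part of the experiments reported below.

```python
from gensim.models import KeyedVectors

# Hypothetical file with pre-trained word vectors in the word2vec binary format.
vectors = KeyedVectors.load_word2vec_format("wiki-vectors.bin", binary=True)

# vec("Paris") - vec("France") + vec("Italy") should land close to vec("Rome"),
# illustrating the linguistic regularities captured by the embedding space.
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))
```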

To compare text semantic approaches, Baroni et al. [2] systematically evaluate a set of models with varying parameter settings across a wide range of lexical semantics tasks. They observe an overall better performance of state-of-the-art context-predicting models (e.g., Word2Vec) compared to classic methods (e.g., LSA).

Also addressing text semantics, Liu et al. [13] introduced the Histogram for Textual Concepts (HTC) method to map tags to a concept dictionary. However, the method is reminiscent of the ESA approach described above, and it was never evaluated for the purpose of text-based image retrieval.

3 Concept-Based Multimedia Retrieval

In this section, we first explain the architecture of our system for semantic indexing and keyword search. Thereafter, based on this architecture, we present an application use-case on the MediaEval Retrieving Diverse Social Images task [9, 10] using textual concepts. Our concept-based approach shows a significant improvement for keyword search on the test collection in the social media domain.

3.1 Framework for Extended Multimedia Retrieval

We introduce a framework for multimodal concept- and facet-based information retrieval and, in the scope of this paper, focus on the indexing component, particularly its semantic indexing features. The interaction between the components of the indexing framework is depicted in Fig. 1. These components represent the conceptual building blocks of the indexing architecture as part of the general framework. The figure presents the document model, the concept model and the indexing model with its individual document facets, such as text-, tag-, and image-typed content. It additionally depicts the information flow between these parts in a simplified form.

Fig. 1. Interaction between document model, concepts and (semantic) concept index

The document model defines a document that functions as the basic unit of content and that is composed of facets. A facet is either a text, a tag, or an image. This allows many content structures to be created and organized, such as Wikipedia pages, scientific articles, websites, or blogs, which often consist of such text, tag and image facets in various combinations. This structure also covers all unimodal variants, such as pure picture collections, since each document may contain any facet type in any order.

The concept model defines the structure of concepts. All concepts share a common identifier (usually a URI) that uniquely represents and differentiates them. A concept describes exactly one of the three facet types, expressed as a type: it can either be a text concept, a tag concept or a visual concept. Furthermore, a concept has a probability of being true, which allows a learning algorithm to store its confidence.
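The following simplified Python sketch illustrates the document and concept models described above; the class and field names are our own and only approximate the actual implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class FacetType(Enum):
    TEXT = "text"
    TAG = "tag"
    IMAGE = "image"

@dataclass
class Facet:
    """One facet of a document: a text block, a tag, or an image reference."""
    facet_type: FacetType
    content: str              # raw text, tag string, or image path/URI

@dataclass
class Document:
    """Basic content unit; may contain any number of facets in any order."""
    doc_id: str
    facets: List[Facet] = field(default_factory=list)

@dataclass
class Concept:
    """Concept with a unique identifier (usually a URI), a facet type, and the
    confidence (probability of being true) stored by the learning algorithm."""
    uri: str
    facet_type: FacetType
    probability: float
```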

The indexing model is managed by the IndexManager, which controls the creation process of all indices based on the configuration of the entire system. Facets are processed into respective indices that are all variations of a general FacetIndexer. TextFacets are indexed in a TextFacetIndex and TagFacets in a TagFacetIndex, both based on Lucene, which stores them as separate inverted index file structures optimized for their purpose. ImageFacets are transformed into an ImageFacetIndex that is processed with Lire, a Lucene derivative specialized in visual features. The indexing architecture therefore has three types of facet indexers, one per facet type, but maintains an arbitrary number of instances of each of them based on the structure of the content collection that is indexed. The DocumentIndex is a data structure, implemented as a database, that connects all facets to make them accessible and usable for applications.
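Continuing the sketch above, the dispatch performed by the IndexManager can be pictured as follows; the indexer classes are simple stand-ins for the Lucene- and Lire-backed TextFacetIndex, TagFacetIndex and ImageFacetIndex.

```python
class FacetIndexer:
    """Base class: collects facet contents for one facet index instance."""
    def __init__(self) -> None:
        self.entries = []

    def add(self, doc_id: str, facet: Facet) -> None:
        self.entries.append((doc_id, facet.content))

class TextFacetIndexer(FacetIndexer): pass    # would wrap a Lucene inverted index
class TagFacetIndexer(FacetIndexer): pass     # would wrap a Lucene inverted index
class ImageFacetIndexer(FacetIndexer): pass   # would wrap a Lire visual-feature index

class IndexManager:
    """Creates one indexer per facet type and routes facets accordingly."""
    def __init__(self) -> None:
        self.facet_indexers = {
            FacetType.TEXT: TextFacetIndexer(),
            FacetType.TAG: TagFacetIndexer(),
            FacetType.IMAGE: ImageFacetIndexer(),
        }
        self.document_index = {}   # doc_id -> Document, stands in for the database

    def index(self, doc: Document) -> None:
        self.document_index[doc.doc_id] = doc
        for facet in doc.facets:
            self.facet_indexers[facet.facet_type].add(doc.doc_id, facet)
```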

The concept model provides a definition of concepts for the framework. Concepts are processed into a ConceptIndex that is separate from the DocumentIndex and the FacetIndices. This concept model is used to translate facets into concepts. The ConceptIndex merges text and visual concepts into a common concept index space. In the next section, we demonstrate a first step in this direction by applying it solely to text concepts, represented as an index of word vectors. Future work will expand on this by mapping concepts into an inverted index using Lucene, covering text, tag and visual concepts and representing them in a single index space.

In the following, we describe an application of semantic indexing in a social media domain. We specifically evaluate the effect of semantic-based retrieval on the textual features of multimodal documents.

3.2 Application of Concept-Based Retrieval

We explore the effect of semantic similarity and optimization methods in text-based image retrieval in social media by applying Word2Vec and Random Indexing. This represents one possible scenario for a semantic concept index as shown in Fig. 1 and also examines the effectiveness of concept-based retrieval in this domain.

The evaluation was conducted using Flickr data, in particular within the framework of the MediaEval Retrieving Diverse Social Images Task 2013/2014 [9, 10]. The task addresses result relevance and diversification in social image retrieval. We merged the datasets of 2013 (Div400) [10] and 2014 (Div150Cred) [9] and denote the combination as MediaEval. It consists of about 110k photos of 600 world landmark locations (e.g., museums, monuments, churches, etc.). The data provided for each landmark location include a ranked list of photos together with their representative texts (title, description, and tags), Flickr’s metadata, a Wikipedia article of the location and a user tagging credibility estimation (only for the 2014 edition). The name of each landmark location (e.g., Eiffel Tower) is used as the query for retrieving its related documents. For semantic text similarity, we focus on the relevance of the representative text of the photos, containing title, description, and tags. We removed HTML tags and decompounded the terms using a dictionary obtained from the whole corpus.

We used the English Wikipedia text corpus to train our word representation vectors based on Word2Vec and Random Indexing, each with 200 and 600 dimensions. We trained the Word2Vec word representations using the Word2Vec toolkit, applying the CBOW approach of Mikolov et al. [16] with a context window of 5 words and subsampling at \(t=10^{-5}\). The Random Indexing word representations were trained using the Semantic Vectors package. We used the default parameter settings of the package, which considers the whole document as the context window. In both Word2Vec and Random Indexing, we considered words with a frequency of less than five as noise and filtered them out.
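A rough gensim equivalent of this training configuration is sketched below; the original experiments used the Word2Vec C toolkit and the Semantic Vectors package, so this is an approximation for illustration only.

```python
from gensim.models import Word2Vec

# `sentences` is assumed to be an iterable of tokenized English Wikipedia
# sentences (e.g., built with gensim.corpora.WikiCorpus); omitted here.
model = Word2Vec(
    sentences=sentences,
    vector_size=200,   # a second model was trained with 600 dimensions
    window=5,          # context window of 5 words
    sg=0,              # CBOW architecture
    sample=1e-5,       # subsampling threshold t
    min_count=5,       # drop words occurring fewer than five times
)
model.wv.save("w2v-wiki-200.kv")   # hypothetical output file name
```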

To measure the semantic-based text-to-text similarity, we applied an approach denoted SimGreedy [19]. The approach calculates the relatedness of document A to document B based on SimGreedy(A,B), defined as follows:

$$\begin{aligned} SimGreedy(A,B) = \frac{\sum _{t \in A} idf(t) \cdot maxSim(t,B)}{\sum _{t \in A} idf(t)} \end{aligned} \qquad (1)$$

where t represents a term of document A and idf(t) is the inverse document frequency of the term t. The function maxSim separately calculates the cosine similarity of the term t to each word in document B and returns the highest value. In this method, each word in the source document is aligned to the word in the target document to which it has the highest semantic similarity. Then, the results are aggregated based on the weight of each word to obtain the document-to-document similarity. SimGreedy is defined as the average of SimGreedy(A,B) and SimGreedy(B,A). With n and m denoting the number of words in documents A and B respectively, the complexity of SimGreedy is of order \(n \cdot m\).
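A direct NumPy sketch of Eq. (1) and its symmetric average is given below; the word vectors and idf values are assumed to be given, and the function names are ours.

```python
import numpy as np

def max_sim(term_vec: np.ndarray, doc_matrix: np.ndarray) -> float:
    """Highest cosine similarity between one term vector and the rows of
    doc_matrix (the word vectors of the target document)."""
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(term_vec)
    return float(np.max((doc_matrix @ term_vec) / np.maximum(norms, 1e-12)))

def sim_greedy_directed(A, B, vectors, idf):
    """SimGreedy(A, B) as in Eq. (1).

    A, B    : lists of terms; terms without a known vector are skipped.
    vectors : dict term -> word vector (np.ndarray).
    idf     : dict term -> inverse document frequency.
    """
    B_matrix = np.array([vectors[t] for t in B if t in vectors])
    if len(B_matrix) == 0:
        return 0.0
    num = den = 0.0
    for t in A:
        if t in vectors and t in idf:
            num += idf[t] * max_sim(vectors[t], B_matrix)
            den += idf[t]
    return num / den if den else 0.0

def sim_greedy(A, B, vectors, idf):
    """Symmetric SimGreedy: average of both directions."""
    return 0.5 * (sim_greedy_directed(A, B, vectors, idf)
                  + sim_greedy_directed(B, A, vectors, idf))
```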

As evaluation metric we used precision at a cutoff of 20 documents (P@20), which was also used in the official runs. A standard Solr index served as the baseline. Statistically significant differences at \(p=0.05\) or lower against the baseline (denoted by †) were calculated using Fisher’s two-sided paired randomization test. This test assesses the significance of the difference between two paired sets of results by repeatedly flipping the sign of the per-query differences at random and comparing the observed mean difference against the distribution obtained from these permutations.
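For reference, a minimal sign-flipping version of this test is sketched below; it is our own illustration, not the exact code used to compute the reported significance values.

```python
import numpy as np

def paired_randomization_test(scores_a, scores_b, rounds=100_000, seed=0):
    """Two-sided paired randomization test on per-query scores (e.g., P@20).

    Under the null hypothesis the sign of each per-query difference is
    arbitrary, so the observed mean difference is compared against the
    distribution obtained by random sign flips.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(rounds, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    return float((permuted >= observed).mean())   # estimated two-sided p-value
```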

The results of evaluating the SimGreedy algorithm with different word representations are shown in Table 1. We observe that SimGreedy as a semantic-based similarity method outperforms the simple content-based approach. In addition, neither the word representation method nor the number of dimensions has a significant effect on the results of the SimGreedy method.

Table 1. MediaEval Retrieving Diverse Social Images Task 2013/2014 [9, 10]. Models trained on Wikipedia using Random Indexing (RI) and Word2Vec (W2V). The sign † denotes a statistically significant difference

In order to compare our results with those of the systems participating in the task, we repeated the experiment on the 2014 test dataset. As shown in Table 2, using SimGreedy and Word2Vec we achieved a state-of-the-art result of 0.842 for P@20 among 41 runs, including even those that used image features but no external resources.

Table 2. MediaEval Retrieving Diverse Social Images Task 2014 results using query expansion. Models are trained on the Wikipedia corpus with 200 and 600 dimensions. Our semantic-based approach uses only the textual features. Best indicates the best-performing system in the 2014 task for the different runs

Given these results, in the next section we focus on optimizing the runtime performance of the algorithms to match the practical requirements of real-world applications.

4 Optimizing Semantic Text Similarity

Although SimGreedy performs better than the content-based approach, the time complexity discussed above leads to a much longer execution time. We observed that SimGreedy is approximately 40 times slower than Solr: processing the queries generally takes about 110 to 130 min with SimGreedy, compared to about 3 min with Solr. The method becomes especially inefficient as documents grow longer. In the following, we therefore apply two optimization techniques to SimGreedy in order to achieve a better execution time without degrading its effectiveness.

4.1 Two-Phase Process

In the first approach, we turn the procedure into a two-phase process [5]. To do so, we choose an alternative method with a considerably shorter execution time than SimGreedy, such as Solr. We then apply the faster algorithm to obtain a first ranking of the results, and afterwards the top n percent of the results is re-ranked by applying SimGreedy. The SimGreedy algorithm therefore operates only on a portion of the data that has already been filtered by the first (faster) method.

Assuming that the alternative algorithm has an execution time of t and is k times faster than SimGreedy, this approach takes \(t+t\cdot k\cdot n/100\), where n is the percentage of the selected data. It is therefore \(k/(1+k\cdot n/100)\) times faster than running the SimGreedy algorithm standalone. While this improves the execution time, the choice of the parameter n can reduce the effectiveness of the SimGreedy method. Finding the optimal n such that performance remains statistically indistinguishable from the non-optimized SimGreedy is the central problem of this method.
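A schematic version of the two-phase procedure and of the resulting speed-up estimate is given below; fast_score and slow_score stand in for Solr's ranking function and SimGreedy, and the concrete values in the usage example anticipate those reported later in this section.

```python
def two_phase_rank(query, documents, fast_score, slow_score, n_percent):
    """Rank all documents with the fast scorer, then re-rank the top n percent
    of that ranking with the slow (semantic) scorer."""
    ranked = sorted(documents, key=lambda d: fast_score(query, d), reverse=True)
    cutoff = max(1, round(len(ranked) * n_percent / 100))
    head = sorted(ranked[:cutoff], key=lambda d: slow_score(query, d), reverse=True)
    return head + ranked[cutoff:]          # re-ranked head, untouched tail

def speedup(k: float, n_percent: float) -> float:
    """Speed-up factor k / (1 + k * n / 100) over the slow method alone."""
    return k / (1.0 + k * n_percent / 100.0)

# With SimGreedy roughly 40x slower than Solr and n = 49, the expected
# speed-up is about a factor of two.
print(round(speedup(40, 49), 2))   # -> 1.94
```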

Table 3. Execution time in minutes of the standard, Two-Phase, and Approximate Nearest Neighbor (ANN) approaches of SimGreedy. Models are trained on the Wikipedia corpus with 200 dimensions. There is no statistically significant difference between the achieved results of the evaluation metric (P@20).
Fig. 2. Average performance of the Two-Phase approach with the best value at around 49%

To apply this technique to the MediaEval collection, we selected Solr as the first phase. SimGreedy as the second phase uses vector representations trained on Wikipedia with the Word2Vec and Random Indexing methods, both with 200 dimensions. For all integer values of n from 1 to 100, we found extremely similar behaviour for the two methods, summarized in Fig. 2. To find the best value of n as the cutting point, we identified the lowest cut-off whose precision is not significantly different (using Fisher’s two-sided paired randomization test with \(p=0.05\)) from the best one (i.e., when n is 100 percent). This corresponds to \(n=49\). Given that the second phase (SimGreedy) is about 40 times slower than the first (Solr), this approach improves the execution time by almost a factor of two (48 percent) while the retrieval performance remains the same.

4.2 Approximate Nearest Neighbor

This technique exploits the advantages of Approximate Nearest Neighbor (ANN) methods [1]. Like exact nearest neighbor search, ANN methods attempt to find the closest neighbors in a vector space. In contrast to exact search, however, ANN approaches approximate the closest neighbors using pre-built data structures, achieving significantly better search times. Using these methods, we can adapt the maxSim function of SimGreedy to a nearest neighbor search that returns the closest term vector to a given term. In this approach, we therefore first create an optimized nearest neighbor data structure (indexing process) for each document and then use it to find the most similar terms.

The overhead of creating the semantic indices depends on several factors, such as the vector dimension, the number of terms in a document, and the selected data structure. While this extra time adds to the overall execution time, the approach is especially effective when the indices are used frequently by many queries.

We apply this technique to MediaEval by first creating an ANN data structure—denoted as semantic index—for each document using the scikit-learn library. Due to the high dimension of the vectors (\(>30\)), we choose the Ball-Tree data structure with a leaf size of 30. The Ball-Tree data structure recursively divides the data into hyper-spheres. Each hyper-sphere is defined by a centroid C and a radius r such that at most a leaf-size number of points is enclosed. With this data structure, a single distance calculation between a test point and the centroid is sufficient to determine a lower and an upper bound on the distance to all points within the hyper-sphere. Afterwards, we use the semantic indices to compute the SimGreedy algorithm. We run the experiment using vector representations with 200 dimensions for both the Word2Vec and Random Indexing methods trained on Wikipedia.
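A sketch of the per-document semantic index and the corresponding maxSim lookup using scikit-learn's BallTree (leaf size 30, as above) is shown below. BallTree does not support cosine distance, so the sketch assumes L2-normalized word vectors, for which the Euclidean nearest neighbour is also the term with the highest cosine similarity.

```python
import numpy as np
from sklearn.neighbors import BallTree

def build_semantic_index(doc_matrix: np.ndarray) -> BallTree:
    """Per-document index over its word vectors (rows), L2-normalized first."""
    normalized = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return BallTree(normalized, leaf_size=30)

def max_sim_indexed(term_vec: np.ndarray, index: BallTree) -> float:
    """maxSim via the Ball-Tree index: cosine similarity to the nearest term."""
    q = term_vec / np.linalg.norm(term_vec)
    dist, _ = index.query(q.reshape(1, -1), k=1)   # Euclidean distance d
    return float(1.0 - dist[0, 0] ** 2 / 2.0)      # for unit vectors: cos = 1 - d^2/2
```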

Table 3 shows the results compared with the original SimGreedy as well as the Two-Phase algorithm. The I/O time consists of reading the documents, fetching the corresponding vector representations of the words and writing the final results; it is common to all approaches. Although the ANN approach has the overhead of the indexing time, its query time is significantly lower than that of the original SimGreedy and also of the Two-Phase approach. We therefore see an improvement of approximately a factor of two in the overall execution time compared to the original SimGreedy method. Despite the time optimization, there is no significant difference between the evaluation results of the methods.

It should also be noted that, since in the MediaEval task each topic has its own set of documents, the semantic index of each document is used only once, by its own topic. We therefore expect a larger difference between the overall execution times when the indexed documents are shared by all topics, as is the normal case in many information retrieval tasks.

5 Conclusions and Future Work

We explored the effect of textual semantics and optimization methods in the social media domain as an example of a semantic index. We ran experiments on the MediaEval Retrieving Diverse Social Images Task 2013/2014 using Word2Vec and Random Indexing vector representations. Besides achieving state-of-the-art results, we showed that SimGreedy—a semantic-based similarity method—outperforms a term-frequency-based baseline using Solr. We then focused on two optimization techniques: the Two-Phase and the Approximate Nearest Neighbor (ANN) approach. Both methods halved the processing time of the SimGreedy method while keeping precision within the bounds of a statistically insignificant difference.

Although both techniques optimize the processing time to a similar extent, they show different characteristics in practice. While the Two-Phase approach needs prior knowledge of the performance of the involved search methods to set its parameter, the ANN method can easily be applied to new domains with no need for parameter tuning. In addition, in the ANN approach, despite the overhead of creating the semantic-based data structures, the query time is significantly shorter, which is a great benefit in real-time use-cases.

In future work, we will exploit the semantics of different facets (e.g., text, image, etc.) by first indexing and then combining them in the scoring process of our multimodal information retrieval platform. The concept index is obtained differently for text and images: for image facets, it represents the probability of a visual concept that has been learned from an image (e.g., by a visual classifier); for text facets, it represents the probability of a term being conceptually similar to its context (e.g., a document or a window of terms). Despite the effectiveness of SimGreedy as an approach for semantic similarity, for each term in the source document it only finds the most similar term in the destination document and ignores all other terms with lower similarity values. We therefore want to study new, alternative similarity measures that match terms with groups of related terms.