Elsevier

Expert Systems with Applications

Volume 40, Issue 14, 15 October 2013, Pages 5448-5455

MTCIR: A multi-term tag cloud information retrieval system

https://doi.org/10.1016/j.eswa.2013.04.010

Highlights

  • We built the tag cloud over an attribute of the whole database.

  • The tag cloud uses multi-term tags, which provide context.

  • The tag cloud is supported by an underlying mathematical definition.

  • The tag cloud generation process has been clearly defined.

  • The tag cloud improves semantics by keeping multi-term expressions together.

Abstract

Processing and accessing database resources available on the Internet is sometimes complex, especially when textual content is involved. A new user may need a general description of the contents available in the database in order to determine whether the information is useful for his/her search needs. In this paper we present MTCIR, a system that summarizes the content of a database and provides the user with simple interfaces to access the information. The system uses a visual interface based on multi-term tag clouds, which presents the content of the database and can be used as assistance in the search process. The novelty of this approach is the underlying structure, which provides the text with certain semantics and is able to retrieve the most relevant information. We test our proposal on four datasets and discuss the tag clouds obtained and the metrics computed for each of them.

Introduction

In this paper we focus on the problem of information access in databases, particularly information access to text data. Textual attributes in databases contain useful information that is not structured, and hence is sometimes not properly processed. This lack of structure is a drawback for users, who can only perform syntactic queries that ignore the semantics associated with terms in the text. This leads to the retrieval of imprecise or erroneous information (Campaña, Martín-Bautista, Medina, & Vila, 2009).

Databases accessible through the Internet contain a lot of information that is updated frequently. It is difficult for users to know the current content of a particular database, or even to know how to formulate an appropriate query. A typical scenario involves a user who can recognize the query he wants to perform but is not able to express it by himself. In this scenario it is important to provide the user with a set of query suggestions that can help in the search process. Currently, tagging systems solve this problem by categorizing the information resources using tags organized as tag clouds. These tag clouds can be used to retrieve the categorized information at a later time (Hsieh, Stu, Chen, & Chou, 2009).

In this context, tag clouds can be seen as tools appropriate for the task of searching, exploring and representing the content of a database. Tag clouds capture the essential information through the representation of the most relevant tags (Rivadeneira, Gruen, Muller, & Millen, 2007). In addition, tag clouds assist users whose search terms are not clearly defined but who can recognize them from a set of possible queries represented by the tags (Hassan Montero, Herrero-Solana, & Guerrero-Bote, 2010).

Usually database users are experts who have a certain degree of knowledge of the query language and the database schema. With the databases open to the Internet, the typology of users is more diverse. In order to assist these new users, the information must be visualized in a friendly way. It is important to represent the content using a visual schema that allows the exploration and querying of the data. Tag clouds can be useful for this purpose, because they are tools that can be used easily by users with no experience with search systems (Leone, Geel, Müller, & Norrie, 2011).

However, for systems not based on tagging, the components of the tag cloud are not tags assigned by users, but terms extracted from the text in the database. These terms are extracted using a criterion that retrieves the most representative information from the text.

In order to overcome the drawbacks previously cited, we propose a method to create a multi-term tag cloud from text attributes in a database. This multi-term tag cloud is supported by an underlying mathematical structure with information extracted from the textual attributes. The formal definition of these structures, called AP-Seqs, is presented in Torres-Parejo et al. (2012, 2013). The AP-Seqs are ordered sets derived from AP-Sets (Marín et al., 2006; Martín-Bautista et al., 2008), which are sets of frequent co-occurring terms.
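
As a rough illustration of the two notions mentioned above, the following sketch counts term sets that co-occur in several texts and, separately, ordered runs of adjacent terms. It is not the formal AP-Set or AP-Seq construction, which is given in the cited works; the function names, the whitespace tokenization, the support threshold and the use of adjacency to capture order are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(documents, min_support=2, max_size=2):
    """Count unordered term sets that co-occur in at least `min_support` texts.

    Illustrative stand-in for "sets of frequent co-occurring terms";
    not the formal AP-Set definition from the cited papers.
    """
    counts = Counter()
    for text in documents:
        terms = sorted(set(text.lower().split()))
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                counts[frozenset(combo)] += 1
    return {terms: n for terms, n in counts.items() if n >= min_support}


def frequent_sequences(documents, min_support=2, length=2):
    """Count ordered runs of adjacent terms, once per text.

    Assumption: order is approximated by adjacency in the text.
    """
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        runs = {tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)}
        counts.update(runs)
    return {seq: n for seq, n in counts.items() if n >= min_support}


docs = [
    "information retrieval system",
    "tag cloud information retrieval",
    "retrieval of tag cloud data",
]
term_sets = frequent_term_sets(docs)
term_seqs = frequent_sequences(docs)
```

In this toy corpus the unordered set {tag, cloud} and the ordered pair (tag, cloud) are both frequent, while the set {cloud, retrieval} is frequent without ever forming an adjacent run, which hints at why an ordered structure can be more useful for building readable multi-term tags.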

The multi-term tag cloud helps to solve the problems present in classic mono-term tag clouds. Mono-term tag clouds lack semantics and have neither an underlying mathematical model nor a standard generation procedure. This type of tag cloud also fails to represent the content of the information appropriately, as concepts represented by multiple terms are ignored or treated incorrectly.

Here we present a system that extracts semantic information from non-structured texts in databases using mathematical intermediate forms that make it possible to detect compound terms. The output of the system is a multi-term tag cloud that represents the content of the text attributes processed. This tag cloud can be used to perform searches.
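
To make the idea of searching through a tag concrete, the sketch below runs a substring query for a whole multi-term tag over a text column. It only illustrates the general mechanism and is not how MTCIR itself retrieves results; the in-memory SQLite database, the `docs` table, its `body` column and the `search_by_tag` helper are hypothetical.

```python
import sqlite3


def search_by_tag(conn, tag):
    """Return ids of rows whose text attribute contains the whole tag.

    Hypothetical schema: table `docs` with columns `id` and `body`.
    The tag is matched as one phrase, so its terms stay together.
    LIKE wildcards inside the tag are not escaped in this sketch.
    """
    pattern = f"%{tag.lower()}%"
    rows = conn.execute(
        "SELECT id FROM docs WHERE lower(body) LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (id, body) VALUES (?, ?)",
    [
        (1, "A tag cloud summarizes the content of a database"),
        (2, "Cloud computing and tag management"),
        (3, "Searching with a multi-term tag cloud"),
    ],
)
hits = search_by_tag(conn, "tag cloud")
```

Here the multi-term tag "tag cloud" selects rows 1 and 3 only, whereas the mono-term "cloud" would also return row 2, which is about an unrelated concept.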

The paper is organized as follows. Section 2 summarizes current work on tag cloud representation related to our approach. Section 3 describes the methodology followed for the tag cloud generation. Section 4 defines the architecture for the tag cloud generation system. Section 5 provides some examples and discusses the results obtained. Finally, Section 6 presents conclusions and future work.

Section snippets

Related work

In Kuo, Hentrich, Good, and Wilkinson (2007), an application using word clouds to summarize the information retrieved from a biomedical database was presented. This application provides answers with tag clouds extracted from the summaries retrieved by the queries. The tags in this tag cloud are mono-terms selected using frequency as the criterion.

A similar idea was later presented in Koutrika, Zadeh, and Garcia-Molina (2009). In this work the results of a query are summarized using a tag cloud. The

Tag cloud generation methodology

In this section, we propose a general methodology for the generation of tag clouds from texts. The steps that make up the methodology can be implemented using different tools and external resources. The particular implementation of some details may vary depending on the tools selected, but the general structure serves as a reference.

The proposed methodology is composed of several sequential processing stages.

The proposed stages are:

  • Syntactic preprocessing: This stage deals with data cleaning from

General architecture of the multi-term tag cloud generation system

The application presented generates tag clouds from text attributes in a database. The generated tag cloud is used to query the content of the text attribute. Not all the information contained in the text attribute can be presented to the user. The tag cloud summarizes the most relevant and frequent terms in the texts. In order to select the appropriate terms to show in the tag cloud, we must process the data and present it in an appropriate way.
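
One simple way to picture the selection and presentation step is to keep only the most frequent tags and map their counts to display sizes. The sketch below does this under stated assumptions: the `tag_weights` helper, the logarithmic scaling, the `top_k` cut-off and the size range are illustrative choices, not the selection criteria or rendering rules used by MTCIR.

```python
import math
from collections import Counter


def tag_weights(tag_counts, top_k=5, min_size=10, max_size=32):
    """Keep the `top_k` most frequent tags and map counts to font sizes.

    Assumptions: counts are positive integers, and sizes grow with the
    logarithm of the count so that a few very frequent tags do not
    dwarf the rest.
    """
    top = Counter(tag_counts).most_common(top_k)
    if not top:
        return {}
    logs = [math.log(count) for _, count in top]
    low, high = min(logs), max(logs)
    span = high - low
    sizes = {}
    for (tag, _), value in zip(top, logs):
        # When all counts are equal, every tag gets the largest size.
        ratio = (value - low) / span if span else 1.0
        sizes[tag] = round(min_size + ratio * (max_size - min_size))
    return sizes


counts = {"tag cloud": 8, "information retrieval": 4, "database": 2, "query": 1}
sizes = tag_weights(counts, top_k=3)
```

With `top_k=3`, the least frequent tag ("query") is dropped, the most frequent tag receives the largest size and the least frequent of the kept tags receives the smallest.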

Following the tag cloud generation methodology

Using MTCIR – some cases

In this section we present some examples of tag clouds obtained using our approach. We also present some results that support the quality of the tag clouds generated.

The results presented in Torres-Parejo et al. (2013, in press) show that coverage, precision, recall and F-measure for tag clouds generated using this approach are similar to the values computed for tag clouds created by human experts.
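
For readers unfamiliar with these measures, the following sketch computes precision, recall and F-measure using their standard set-based definitions, treating an expert-built tag cloud as the reference. The exact way the cited work applies the metrics to tag clouds may differ, and coverage is left out because its definition is not given here; the `tag_cloud_scores` helper and the example tags are hypothetical.

```python
def tag_cloud_scores(generated, reference):
    """Standard set-based precision, recall and F-measure.

    `generated` holds the tags produced automatically and `reference`
    the tags chosen by a human expert (an assumption for this example).
    """
    generated, reference = set(generated), set(reference)
    shared = generated & reference
    precision = len(shared) / len(generated) if generated else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


generated_tags = {"tag cloud", "information retrieval", "database", "query"}
expert_tags = {"tag cloud", "information retrieval", "text mining"}
precision, recall, f_measure = tag_cloud_scores(generated_tags, expert_tags)
```

In this example two of the four generated tags also appear in the expert cloud, so precision is 2/4 and recall is 2/3; the F-measure is their harmonic mean, 4/7.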

In this paper we show examples of different tag clouds generated using text on different topics.

Conclusions and future work

In this paper we have presented the MTCIR system for multi-term tag cloud generation from text attributes in databases. The tag clouds generated are used to visualize a summary of the textual content of the database, and can be used to retrieve information from it. The system is built following a general methodology to extract tag clouds from unstructured texts using text mining techniques and external knowledge resources. The methodology involves several steps, syntactic preprocessing,

Acknowledgment

This work has been partially supported by the “Consejería de Economía, Innovación y Ciencia de Andalucía” (Spain) under research projects P07-TIC-02786, P10-TIC-6109 and P11-TIC-7460.

References (33)

  • Cunningham, H. (2002). GATE, a general architecture for text engineering. Computers and the Humanities.
  • Giménez, J., & Márquez, L. (2004). Svmtool: A general pos tagger generator based on support vector machines. In...
  • Hassan Montero, Y., et al. (2010). Usabilidad de los tag-clouds: Estudio mediante eye-tracking. Scire: Representación y organización del conocimiento.
  • Knautz, K., et al. Tag clusters as information retrieval interfaces.
  • Koutrika, G., et al. Data clouds: Summarizing keyword search results over structured data.
  • Kuo, B., et al. Tag clouds for summarizing web search results.
