MTCIR: A multi-term tag cloud information retrieval system
Introduction
In this paper we focus on the problem of information access in databases, particularly access to text data. Textual attributes in databases contain useful information that is unstructured and is therefore sometimes not properly processed. This lack of structure is a drawback for users, who can only perform syntactic queries that ignore the semantics associated with the terms in the text. This leads to the retrieval of imprecise or erroneous information (Campaña, Martín-Bautista, Medina, & Vila, 2009).
Databases accessible through the Internet contain a large amount of information that is updated frequently. It is difficult for users to know the current content of a particular database, or even to know how to formulate an appropriate query. In a typical scenario, users can recognize the query they want to perform but are not able to formulate it by themselves. In this scenario it is important to provide the user with a set of query suggestions that can help in the search process. Tagging systems currently address this problem by categorizing information resources with tags organized as tag clouds. These tag clouds can be used to retrieve the categorized information at a later time (Hsieh, Stu, Chen, & Chou, 2009).
In this context, tag clouds can be seen as appropriate tools for searching, exploring and representing the content of a database. Tag clouds capture the essential information by representing the most relevant tags (Rivadeneira, Gruen, Muller, & Millen, 2007). In addition, tag clouds assist users who cannot define their search terms clearly but can recognize them among a set of possible queries represented by the tags (Hassan Montero, Herrero-Solana, & Guerrero-Bote, 2010).
Database users have usually been experts with a certain degree of knowledge of the query language and the database schema. With databases open to the Internet, the range of users is more diverse. In order to assist these new users, the information must be visualized in a friendly way. It is important to represent the content using a visual schema that allows the exploration and querying of the data. Tag clouds can be useful for this purpose, because they can be used easily by users with no experience with search systems (Leone, Geel, Müller, & Norrie, 2011).
However, for systems not based on tagging, the components of the tag cloud are not tags assigned by users, but terms extracted from the text in the database. These terms are extracted using a criterion designed to retrieve the most representative information from the text.
In order to overcome the drawbacks previously cited, we propose a method to create a multi-term tag cloud from text attributes in a database. This multi-term tag cloud is supported by an underlying mathematical structure built from information extracted from the textual attributes. The formal definition of these structures, called AP-Seqs, is presented in Torres-Parejo et al. (2012, 2013). AP-Seqs are ordered sets derived from AP-Sets (Marín et al., 2006; Martín-Bautista et al., 2008), which are sets of frequently co-occurring terms.
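To illustrate the idea of frequently co-occurring term sets, the following Python sketch finds, level by level, the sets of terms that appear together in at least `min_support` texts. It is a simplified Apriori-style search, not the formal AP-Set or AP-Seq construction defined in the cited works. The function name `frequent_term_sets`, the whitespace tokenization, the parameters `min_support` and `max_size`, and the sample texts are all assumptions made for this example. Term order, which AP-Seqs capture, is not modeled here.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(documents, min_support=2, max_size=3):
    """Return {frozenset(terms): support} for term sets that co-occur
    in at least `min_support` documents (hypothetical helper)."""
    # Each document is reduced to its set of lowercase terms.
    doc_terms = [set(doc.lower().split()) for doc in documents]

    # Level 1: single terms that reach the support threshold.
    counts = Counter(term for terms in doc_terms for term in terms)
    current = {frozenset([t]): c for t, c in counts.items() if c >= min_support}

    frequent = {}
    size = 1
    while current:
        frequent.update(current)
        if size == max_size:
            break
        size += 1
        # Candidates: unions of two frequent sets that have exactly `size` terms.
        keys = list(current)
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # Prune candidates that have an infrequent subset of size - 1.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        # Count the support of each surviving candidate.
        current = {}
        for candidate in candidates:
            support = sum(1 for terms in doc_terms if candidate <= terms)
            if support >= min_support:
                current[candidate] = support
    return frequent


# Illustrative texts, invented for this sketch.
example_docs = [
    "information retrieval system",
    "tag cloud information retrieval",
    "tag cloud visualization",
]
example_sets = frequent_term_sets(example_docs, min_support=2)
```

In this small example the pairs {information, retrieval} and {tag, cloud} are kept because each appears in two texts, while pairs such as {information, tag} are discarded.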
The multi-term tag cloud helps to solve the problems present in classic mono-term tag clouds. Mono-term tag clouds lack semantics and have neither an underlying mathematical model nor a standard generation procedure. This type of tag cloud also fails to represent the content of the information appropriately, as concepts represented by multiple terms are ignored or treated incorrectly.
Here we present a system that extracts semantic information from unstructured texts in databases using mathematical intermediate forms that allow the detection of compound terms. The output of the system is a multi-term tag cloud that represents the content of the processed text attributes. This tag cloud can be used to perform searches.
The paper is organized as follows. Section 2 summarizes current work on tag cloud representation related to our approach. Section 3 describes the methodology followed for the tag cloud generation. Section 4 defines the architecture for the tag cloud generation system. Section 5 provides some examples and discusses the results obtained. Finally, Section 6 presents conclusions and future work.
Related work
Kuo, Hentrich, Good, and Wilkinson (2007) presented an application that uses word clouds to summarize the information retrieved from a biomedical database. This application provides answers with tag clouds extracted from the summaries retrieved by the queries. The tags in this tag cloud are mono-terms selected using frequency as the criterion.
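As a point of comparison, frequency-based selection of mono-term tags can be sketched in a few lines of Python. This is only an illustration of the general technique and not the implementation used in the cited application. The stopword list, the tokenization rule, the function name `mono_term_tags` and the sample texts are assumptions made for this example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to"}


def mono_term_tags(texts, top_n=5):
    """Return the `top_n` most frequent single terms with their counts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)


# Illustrative texts, invented for this sketch.
sample_texts = [
    "The gene regulates the protein",
    "Protein binding of the gene",
    "Gene expression",
]
top_tags = mono_term_tags(sample_texts, top_n=2)
```

Because every term is counted on its own, a compound concept such as "gene expression" can only appear in such a cloud as two unrelated tags.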
A similar idea was later presented in Koutrika, Zadeh, and Garcia-Molina (2009). In this work the results of a query are summarized using a tag cloud. The
Tag cloud generation methodology
In this section, we propose a general methodology for the generation of tag clouds from texts. The steps that make up the methodology can be implemented using different tools and external resources. The particular implementation of some details may vary depending on the tools selected, but the general structure serves as a reference.
The proposed methodology is composed of several sequential processing stages:
- Syntactic preprocessing: This stage deals with data cleaning from
General architecture of the multi-term tag cloud generation system
The application presented generates tag clouds from text attributes in a database. The generated tag cloud is used to query the content of the text attribute. Not all the information contained in the text attribute can be shown to the user, so the tag cloud summarizes the most relevant and frequent terms in the texts. In order to select the appropriate terms to show in the tag cloud, we must process the data and present it in an appropriate way.
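One simple way to picture querying a text attribute with a multi-term tag is to return the rows in which the terms of the tag appear contiguously and in the same order. The Python sketch below follows that assumption; it is not taken from the MTCIR implementation, and the function names `tokenize` and `rows_matching_tag`, the row identifiers and the sample rows are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def rows_matching_tag(rows, tag):
    """Return the ids of rows whose text contains the terms of `tag`
    contiguously and in order. `rows` maps a row id to its text."""
    tag_tokens = tokenize(tag)
    if not tag_tokens:
        return []
    n = len(tag_tokens)
    matches = []
    for row_id, text in rows.items():
        tokens = tokenize(text)
        # Slide a window of n tokens over the row and compare it to the tag.
        if any(tokens[i:i + n] == tag_tokens for i in range(len(tokens) - n + 1)):
            matches.append(row_id)
    return matches


# Illustrative rows, invented for this sketch.
sample_rows = {
    1: "A tag cloud for information retrieval.",
    2: "Cloud storage and tag management",
    3: "Tag clouds summarize text",
}
matching_ids = rows_matching_tag(sample_rows, "tag cloud")
```

Row 2 contains both terms but not as a compound, so an exact ordered match excludes it. Row 3 is also excluded because no stemming is applied in this sketch.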
Following the tag cloud generation methodology
Using MTCIR – some cases
In this section we present some examples of tag clouds obtained using our approach. We also present some results that support the quality of the tag clouds generated.
The results presented in Torres-Parejo et al. (2013, in press) show that the coverage, precision, recall and F-measure of tag clouds generated using this approach are similar to the values computed for tag clouds created by human experts.
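For readers unfamiliar with these measures, the sketch below computes precision, recall and the balanced F-measure of a generated tag set against a reference tag set, using their standard set-based definitions. The exact evaluation protocol, including how coverage is defined, is given in the cited works and is not reproduced here. The function name `precision_recall_f1` and the sample tags are assumptions made for this example.

```python
def precision_recall_f1(generated, reference):
    """Compare a generated tag set with a reference (e.g. expert) tag set.

    precision = shared tags / generated tags
    recall    = shared tags / reference tags
    F-measure = harmonic mean of precision and recall (balanced F1)
    """
    generated, reference = set(generated), set(reference)
    hits = len(generated & reference)
    precision = hits / len(generated) if generated else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative tag sets, invented for this sketch.
generated_tags = {"tag cloud", "information retrieval", "database", "text mining"}
expert_tags = {"tag cloud", "information retrieval", "query"}
p, r, f = precision_recall_f1(generated_tags, expert_tags)
```

Here two of the four generated tags are in the reference set, giving a precision of 1/2, a recall of 2/3 and an F-measure of 4/7.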
In this paper we show examples of different tag clouds generated using text on different topics.
Conclusions and future work
In this paper we have presented the MTCIR system for multi-term tag cloud generation from text attributes in databases. The generated tag clouds are used to visualize a summary of the textual content of the database, and can be used to retrieve information from it. The system is built following a general methodology to extract tag clouds from unstructured texts using text mining techniques and external knowledge resources. The methodology involves several steps: syntactic preprocessing,
Acknowledgment
This work has been partially supported by the “Consejería de Economía, Innovación, y Ciencia de Andalucía” (Spain) under research projects P07-TIC-02786, P10-TIC-6109 and P11-TIC-7460.
References (33)
- et al. An automatic system for identifying authorities in digital libraries. Expert Systems with Applications (2013)
- et al. A collaborative desktop tagging system for group knowledge management based on concept space. Expert Systems with Applications (2009)
- Knowledge distribution via shared context between blog-based knowledge management systems: A case study of collaborative tagging. Expert Systems with Applications (2009)
- et al. Reorganizing clouds: A study on tag clustering and evaluation. Expert Systems with Applications (2012)
- (2007)
- Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th...
- et al. An adapted Lesk algorithm for word sense disambiguation using WordNet
- Brants, T. (2000). TnT: A statistical part-of-speech tagger. In Proceedings of the 6th conference on applied natural...
- et al. Semantic enrichment of database textual attributes. Flexible Query Answering Systems (2009)
- et al. Semantic processing of database textual attributes using Wikipedia. Flexible Query Answering Systems (2011)
- GATE, a general architecture for text engineering. Computers and the Humanities
- Usabilidad de los tag-clouds: Estudio mediante eye-tracking. Scire: Representación y organización del conocimiento
- Tag clusters as information retrieval interfaces
- Data clouds: Summarizing keyword search results over structured data
- Tag clouds for summarizing web search results