Semantic clustering of XML documents

Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic information with semantic information drawn from a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.
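
The idea of representing XML documents as transactional data can be illustrated with a rough sketch. The code below is not the authors' tree tuple model and does not use a lexical knowledge base. It assumes a much simpler scheme in which each item is a (tag path, term) pair, and it uses Jaccard similarity as a stand-in for the paper's semantic similarity measure. The names `xml_to_items` and `jaccard` and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_to_items(xml_text):
    """Flatten an XML document into a set of (tag path, term) items.

    Assumption: a simplified stand-in for the paper's tree-tuple-based
    transactions; no sense disambiguation of tag names or terms is done.
    """
    root = ET.fromstring(xml_text)
    items = set()

    def walk(node, path):
        # Structure feature: the path of tag names from the root.
        path = path + "/" + node.tag
        # Content feature: lowercased whitespace-separated terms.
        text = (node.text or "").strip().lower()
        for term in text.split():
            items.add((path, term))
        for child in node:
            walk(child, path)

    walk(root, "")
    return items


def jaccard(a, b):
    """Jaccard similarity of two item sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sample documents.
doc_a = "<paper><title>XML clustering</title><author>Greco</author></paper>"
doc_b = "<paper><title>XML mining</title><author>Greco</author></paper>"

items_a = xml_to_items(doc_a)
items_b = xml_to_items(doc_b)
print(jaccard(items_a, items_b))  # 2 shared items out of 4 distinct -> 0.5
```

Because each item pairs a path with a term, two documents score as similar only when they share content in the same structural position, which is the intuition behind combining structure and content features. A transactional clustering algorithm would then group documents by such similarities.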

Online appendix to semantic clustering of XML documents on article 3.


Aris Gkoulalas-Divanis

With the advent of Extensible Markup Language (XML) and its wide adoption in applications, data extraction from semi-structured documents to facilitate data analysis has become an attractive research direction. The existence of structure in documents provides the means for designing sophisticated approaches for data management and knowledge discovery. These approaches take into consideration both content and structure semantics. In this paper, Tagarelli and Greco propose a framework, along with algorithms to cluster semantically related semi-structured documents, based on commonalities in their structure and content. First, they apply structure analysis to the XML documents to remove the ambiguity in the different tag names and allow the selection of the most appropriate sense for each tag name. Following this, they analyze the documents based on their content similarity, using techniques that consider both syntactic and semantic term relevance. An important characteristic of the proposed approach is the use of a novel representation scheme for mapping XML document trees into transactions consisting of items that carry both structure and content characteristics. The authors employ a transactional clustering algorithm that quantifies similarity by taking into consideration the semantics of the data. Subsequently, the identified clusters of transactions derive a classification of the XML documents for the end user. The authors demonstrate the effectiveness of the proposed approach through experiments on real-world data that test it against state-of-the-art algorithms for clustering XML documents. Overall, this is interesting work. The paper is well structured, motivated, and presented, and the experimental results look promising. For these reasons, researchers in the field will benefit from reading it.

Online Computing Reviews Service

