Extracting information and knowledge from the continuously increasing amount of available enterprise data has become a complex and daunting task. The science of data mining, developed to assist users with this task, is expanding rapidly and comprises many interdisciplinary research activities. Research and Trends in Data Mining Technologies and Applications focuses on the integration of data warehousing and data mining, with emphasis on applicability to real-world problems. It provides useful and timely references for researchers and designers interested in studying and applying new algorithms, mining methods and processes. The book consists of 12 chapters contributed by authors and editorial board members of the International Journal of Data Warehousing and Mining in collaboration with other researchers. The chapters are divided into four sections that exemplify the issues, challenges and new developments in this field.

This volume is the first in the series “Advanced Topics in Data Warehousing and Data Mining”, and David Taniar successfully creates an audience for this volume and a following for the subsequent volumes. To achieve that, some sections cover a wide variety of topics (Sect. I), while other sections are more focused and cover closely related themes (Sect. III). Some chapters are geared towards practitioners (Chaps. I, VII), while others require a stronger mathematical background (Chaps. II, X). Some chapters are introductory (Chap. IX) while others are advanced (Chap. VIII). However, because this is the first volume in a series, it could have been greatly enhanced by an introductory section focusing on the history of the field, abstracting major achievements and outlining current and future trends. Although the chapters are richly illustrated by numerous graphs, they should have been presented in color for greater clarity and impact.

The first section explores three disparate topics: Web log analysis, computing dense cubes, and constructing condensed models of datasets. The three chapters in this section offer novel approaches to solving theoretical and practical issues. Chapter I examines click fact and session fact schemas for knowledge extraction. A hybrid model combining the click fact schema and Hypertext Probabilistic Grammar is explained and utilized in experiments. Chapter II offers a comprehensive approach to the issue of efficient computing of the aggregation of data over arbitrary combinations of dimensions. A new dynamic data structure called Restricted Sparse Statistics Tree (RSST) with a creative cube evolution algorithm is proposed and favorably compared for scalability, the speed of execution and I/O efficiency with other popular cube computation algorithms such as BUC (Bottom-Up Computation), Dwarf and QCT (Quotient Cube Tree). Chapter III searches for similarities of substructures of condensed models of datasets from different sources. When similarities are found, the protocol of one model is applied to the other model. A two-step solution is suggested, namely the construction of a condensed model of the dataset and identification of the similarities between the condensed models. After the construction of the algorithms on sporadic datasets, the framework is applied to different datasets: basketball player statistics in the NBA, a schematically different dataset pertaining to breast cancer, and a set of time series microarray datasets.
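To make the cube computation problem of Chapter II concrete, the following minimal sketch aggregates a measure over every combination of dimensions by brute force. It is only an illustration of the problem, not the RSST structure or any of the algorithms compared in the chapter; the function name `full_cube` and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def full_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (all 2^n group-bys)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                # The group-by key is the tuple of values on the chosen dimensions.
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube


# Hypothetical sample data, invented for illustration only.
rows = [
    {"region": "east", "product": "A", "sales": 10},
    {"region": "east", "product": "B", "sales": 5},
    {"region": "west", "product": "A", "sales": 7},
]
cube = full_cube(rows, ["region", "product"], "sales")
```

Because this naive version rescans the data once per group-by, its cost grows exponentially with the number of dimensions, which is the scalability problem that specialized cube algorithms aim to avoid.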

The second section covers three related topics: pattern comparisons, frequent patterns, and vertical mining patterns. Chapter IV surveys the research performed in the last decade in the area of data mining pattern comparison, concentrating on three popular pattern types: (a) frequent itemsets and association rules, (b) clusters and clustering, and (c) decision trees. In addition to surveying individual approaches to comparing specific pattern types, the chapter also examines four general approaches (frameworks) for pattern comparison. The chapter concludes with an extensive references section. Chapter V proposes an alternative method for extracting frequent patterns by using a Self-Organizing Map (SOM), a type of unsupervised learning neural network that uses competitive principles. Whereas the traditional approach is parameterized by the support threshold, the SOM approach is parameterized by the size of the output dimension. The chapter presents several case studies that analyze the relationships between this approach and the conventional association rule framework and validate the correctness of the proposed method. Chapter VI focuses on a significant issue in association rule mining, memory management, and proposes a new algorithm that uses a vertical dataset layout to compress the datasets. The evaluation study compares this dif-bits based algorithm with well-known compression techniques such as Tid and Bit vector. The results show clear advantages of the proposed dif-bits algorithm regardless of dataset characteristics.
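As background for Chapter VI, the sketch below shows what a vertical dataset layout looks like: each item maps to the list of transaction ids (tids) that contain it, and the support of an itemset is the size of the intersection of those lists. The gap encoding shown is a generic way to shrink tid lists and is an assumption for illustration; it is not the chapter's dif-bits algorithm, whose details are not given here. All function names and the sample transactions are hypothetical.

```python
def to_vertical(transactions):
    """Convert a horizontal layout (tid -> items) to a vertical one (item -> tids)."""
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, []).append(tid)
    return tidlists


def support(tidlists, itemset):
    """Support of an itemset = number of tids shared by all of its items."""
    shared = set.intersection(*(set(tidlists[item]) for item in itemset))
    return len(shared)


def gap_encode(tids):
    """Store each tid as the difference from the previous one (first gap from 0)."""
    return [t - p for p, t in zip([0] + tids[:-1], tids)]


# Hypothetical sample transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
vertical = to_vertical(transactions)
bread_milk_support = support(vertical, ["bread", "milk"])
milk_gaps = gap_encode(vertical["milk"])
```

Small gaps need fewer bits than full transaction ids, which is the general intuition behind difference-based compression of vertical layouts.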

The third section focuses on data mining in bioinformatics, specifically examining two protein-related issues: prediction of protein function and protein interaction. Both are complex topics of great interest to scientists and researchers in computational and investigational bioinformatics. Chapter VII presents a tutorial on hierarchical classification techniques and their application to the prediction of protein function, where hierarchical classification problems are often found. The tutorial concludes with an in-depth review of hierarchical classification problems in protein functional classification and an extensive reference section. Chapter VIII presents a comprehensive evaluation of the topological structure of protein–protein interaction (PPI) networks across different species. The evaluation was accomplished by mining and analyzing graphs of publicly available datasets. The PPI network is modeled as a simple graph, and a proprietary CommBuilder algorithm is then used to analyze its topology. The analysis demonstrated some inadequacies of the power law model, contradicting published results.
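The kind of topological analysis described for Chapter VIII typically starts from the degree distribution of the network. The sketch below builds a simple graph (no self-loops, no duplicate edges) from an edge list and counts how many nodes have each degree. It is a generic illustration and not the CommBuilder algorithm; the function name and the protein identifiers are hypothetical.

```python
from collections import Counter


def degree_distribution(edges):
    """Return a Counter mapping degree -> number of nodes with that degree."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # a simple graph has no self-loops
        # Sets discard duplicate and reversed copies of the same edge.
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    return Counter(len(nbrs) for nbrs in neighbours.values())


# Hypothetical interactions, invented for illustration only.
edges = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P1", "P4"),
    ("P2", "P3"),
    ("P3", "P1"),  # reversed duplicate of ("P1", "P3"); counted once
]
distribution = degree_distribution(edges)
```

Under a power law model, the fraction of nodes with degree k falls off as k raised to a negative exponent, so the distribution would appear roughly linear on a log-log plot; comparing the observed counts against that shape is one way to assess how well the model fits.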

The fourth section, the greatest strength of this volume, presents four chapters describing advances in data mining techniques and their use for a wide variety of applications. Chapter IX introduces optimization-based data mining methods built on multiple criteria programming (MCP). After explaining the fundamental mathematical concepts and theoretical notions, it presents three classification models (linear, quadratic and fuzzy-linear). Each model is then used to score an actual case study: (a) credit card ranking assessments, (b) evaluation of HIV-associated dementia and (c) network intrusion detection. Chapter X summarizes a creative approach to classifying decisions based on linguistic rules decoded from support vector machines (SVM). SVM classifiers, including the rule extraction procedure for a two-class dataset, are explained. Extensive experimental results from several studies are charted using large datasets (Mushroom, Wine and Breast Cancer). These results illustrate how the rule-extraction algorithm works and how it could be used to extract rules from datasets with discrete or continuous attributes. Chapter XI draws on mathematics, physics and artificial intelligence and reports a novel graph-based data mining approach built on new developments in these fields. To demonstrate the application of these developments, a case study based on a variation of a classic graph-theoretic problem (Heaviest k-Subgraph Problem) is presented. The chapter argues that graph representation is the most expressive way to describe the complex properties of real-world problems, and therefore has the greatest potential for knowledge discovery. Chapter XII examines the challenges of Web services and suggests an original data mining approach to solve them. The Web services environment and the issues of cost planning and service discovery are described, followed by suggestions (based on predictive, association and clustering mining) on how to apply data mining in Web services. The chapter is complemented with an overview of existing work at the intersection of data mining and Web services, and some direction on how to apply data mining to the semantic web and ontologies, which represent the next generation of web architecture.

To conclude, this volume, written by a notable collection of international researchers and scientists, sets forth a sound presentation of the state of the art in the field of data mining. The breadth of topics, the detailed coverage of fundamental concepts and the references to other research make it a definitive collection and an indispensable source of current knowledge and application trends in the field. It is well organized, clearly written and easy to read. The field is important, the developments are significant and the book fulfills an essential need for information. Researchers and practitioners in the field of data mining will be looking forward to reading subsequent volumes in this series.