Mining consumer product data via latent semantic indexing

doi:10.1016/S1088-467X(99)00029-3

Intelligent Data Analysis

Volume 3, Issue 5, November 1999, Pages 377-398

https://doi.org/10.1016/S1088-467X(99)00029-3

Abstract

One important focus of data mining research is the development of algorithms for extracting valuable information from large databases in order to facilitate business decisions. This study explores a new technique for data mining: latent semantic indexing (LSI). LSI is an efficient information retrieval method for textual documents. By determining the singular value decomposition (SVD) of a large sparse term-by-document matrix, LSI constructs an approximate vector space model which represents important associative relationships between terms and documents that are not evident in individual documents. This paper explores the applicability of the LSI model to numerical databases, namely consumer product data. By properly choosing attributes of data records as terms or documents, a term-by-document frequency matrix is built, from which a distribution-based indexing scheme is employed to construct a correlated distribution matrix (CDM). An LSI-like vector space model is then used to detect useful or hidden patterns in the numerical data. The extracted information can then be validated using statistical hypothesis testing or resampling. LSI is an automatic yet intelligent indexing method. Its application to numerical data introduces a promising way to discover knowledge in important commercial application areas such as retail and consumer banking.

Introduction

Large amounts of data have been collected in daily operations of organizations due to inexpensive storage and high computing power, but many companies have been unable to extract useful information from the data and utilize the information to benefit their business. Data mining involves the application of algorithms for extracting valid and useful information from large databases in order to make critical business decisions. The fact that data is being accumulated at a faster rate than it can be analyzed creates a significant demand for efficient data mining systems [7], [11]. Techniques used for data mining include decision trees and rule induction [22], association rules [1], nonlinear regression and classification [5], [13], genetic algorithms [19], [23], and neural networks [6], [14]. This study explores a new technique, latent semantic indexing (LSI), for data mining.

LSI is an efficient information retrieval technique which has been commonly used for textual documents [4], [8]. Traditional lexical-matching methods try to match words of queries with words of documents, which may fail to retrieve related documents or may return unrelated documents to users. This kind of failure to retrieve relevant documents or the retrieval of irrelevant documents is called the word-matching problem. LSI addresses the word-matching problem through the use of statistically derived conceptual indices instead of individual words [4]. Using the singular value decomposition (SVD) [15] of a large sparse term-by-document matrix, LSI constructs a conceptual vector space in which each term or document is represented as a vector in the space. The positioning of term and document vectors within the vector space reveals the underlying semantic structure of association between terms and documents in the data.
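As a minimal sketch of the idea (not the implementation used in this study), the following NumPy example builds a small term-by-document matrix, truncates its SVD to k = 2 dimensions, and folds a one-word query into the resulting space. The vocabulary, the five documents, the counts, and the helper names lsi_fit, fold_in and cosine are all invented for illustration; the matrix is deliberately split into two unrelated topics so that the effect is easy to see.

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents).
# All terms and counts are invented for illustration.
terms = ["house", "home", "ownership", "web", "page"]
# d0: "house ownership"       d1: "home ownership"
# d2: "house home ownership"  d3, d4: "web page"
A = np.array([
    [1, 0, 1, 0, 0],   # house
    [0, 1, 1, 0, 0],   # home
    [1, 1, 1, 0, 0],   # ownership
    [0, 0, 0, 1, 1],   # web
    [0, 0, 0, 1, 1],   # page
], dtype=float)


def lsi_fit(A, k):
    """Return the rank-k SVD factors (U_k, s_k, Vt_k) of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def fold_in(q, U_k, s_k):
    """Project a term-space vector q into the k-dimensional space."""
    return (q @ U_k) / s_k


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


U_k, s_k, Vt_k = lsi_fit(A, k=2)
docs_k = Vt_k.T                      # one row of coordinates per document

# Query consisting of the single word "home".
q = np.zeros(len(terms))
q[terms.index("home")] = 1.0

lexical = q @ A                      # raw word overlap with each document
q_k = fold_in(q, U_k, s_k)
lsi = np.array([cosine(q_k, d) for d in docs_k])

for j in range(A.shape[1]):
    print(f"d{j}: lexical overlap = {lexical[j]:.0f}, LSI cosine = {lsi[j]:.3f}")
```

In this toy collection the query "home" shares no word with the first document ("house ownership"), so lexical matching gives an overlap of zero, yet the two point in the same direction in the reduced space because "house" and "home" both co-occur with "ownership". Real collections are far less cleanly separated, and the choice of k then matters.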

This paper explores the applicability of the LSI vector-space model to numerical databases. By properly choosing attributes of data records as terms or documents, a term-by-document frequency matrix is built. A distribution-based indexing scheme is then employed to construct a correlated distribution matrix (CDM) which reflects relationships between attributes of data records. An LSI-like vector space model is then generated from the CDM, so that the encoding of attributes in the space can be analyzed for the detection of useful or hidden patterns. The extracted information can then be validated using statistical hypothesis testing or resampling.
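One possible way to set up the first of these steps is sketched below, assuming each data record carries two categorical attributes, one treated as the "document" and the other as the "term". The records, the attribute values, and the function name frequency_matrix are hypothetical. The distribution-based indexing that produces the CDM is not specified in this section, so the sketch applies the SVD directly to the raw frequency matrix as a stand-in.

```python
import numpy as np

# Hypothetical purchase records: (household income band, product brand).
# The income band plays the role of "document", the brand that of "term".
records = [
    ("low", "brand_a"), ("low", "brand_a"), ("low", "brand_b"),
    ("mid", "brand_b"), ("mid", "brand_c"),
    ("high", "brand_c"), ("high", "brand_c"), ("high", "brand_a"),
]


def frequency_matrix(records):
    """Count how often each term value occurs with each document value.

    Returns the term-by-document count matrix together with the sorted
    term labels (rows) and document labels (columns).
    """
    doc_attr, term_attr = zip(*records)
    docs, doc_idx = np.unique(doc_attr, return_inverse=True)
    terms, term_idx = np.unique(term_attr, return_inverse=True)
    F = np.zeros((len(terms), len(docs)))
    np.add.at(F, (term_idx, doc_idx), 1.0)   # accumulate repeated pairs
    return F, terms.tolist(), docs.tolist()


F, terms, docs = frequency_matrix(records)

# Stand-in for the CDM step: decompose the raw counts directly.
k = 2
U, s, Vt = np.linalg.svd(F, full_matrices=False)
term_coords = U[:, :k] * s[:k]        # one row per term value
doc_coords = Vt[:k, :].T * s[:k]      # one row per document value

print(terms, docs)
print(F)
```

Which attribute is treated as the term and which as the document is a modelling choice; swapping the two simply transposes the matrix.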

Applications for data mining extract information from the data to make important business decisions, predict business trends, and develop new products. A common application for data mining is to analyze customer purchases (see Fig. 1) to discover patterns among existing customer preferences and then use those patterns for forecasting sales and optimizing marketing strategies. In this work, the LSI model is presented as an automated yet scalable (i.e., practical for large data collections) approach to extract underlying patterns from consumer product data.

The remaining sections of this paper outline the development and application of LSI to numerical databases. Section 2 is a brief overview of the LSI vector space model. Section 3 illustrates how an LSI-like model for numerical databases can be designed and implemented for mining a consumer product database. Finally, a summary and discussion of future work are provided in Section 4.

Section snippets

Latent semantic indexing

The word-matching problem mentioned in Section 1 results from multiple words having the same meaning (synonymy) and many words having more than one meaning (polysemy). For example, suppose a text collection contains documents on house ownership and web home pages, with some documents using only the word house, some using only the word home, and some using both words. For a query on home ownership, traditional lexical-matching methods fail to retrieve documents using the word house

Data mining with LSI

Although many data mining systems are derived from machine learning and neural networks, information retrieval techniques based on conceptual searching algorithms are also evolving. The conceptual vector space model used by LSI attempts to position (or cluster) similar objects in the vector space so that objects related to a given query (but perhaps not containing the exact same terminology) can be retrieved. The success of LSI for textual documents inspires its application to numerical

Summary and future work

In addition to the efficient information retrieval from textual documents, LSI can also be applied efficiently to numerical databases for data mining. The LSI conceptual vector space model represents similar objects in such a way that they can be retrieved even though the objects may not share common attribute values. By projecting user queries into the vector space and matching nearby attributes or categories, underlying patterns can be extracted from large databases. Further, the extracted

Acknowledgements

Special thanks to Professor Mark M. Miller (Department of Journalism, University of Tennessee, Knoxville) and his former Ph.D. student, Dr. Connie Milbourne, for their help in acquiring the A.C. Nielsen scanner data for this research.

References (24)

R. Agrawal et al., Fast discovery of association rules
M.W. Berry, Large scale singular value computations, International Journal of Supercomputer Applications (1996)
M.W. Berry, Z. Drmač, E.R. Jessup, Matrices, vector spaces, and information retrieval, SIAM Review 41 (2) (1999)...
M.W. Berry et al., Using linear algebra for intelligent information retrieval, SIAM Review (1995)
L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA,...
B. Cheng et al., Neural networks – a review from a statistical perspective, Statistical Science (1994)
K.M. Decker, S. Focardi, Technology overview: a report on data mining, Technical report CSCS TR-95-02, Centrum voor...
S. Deerwester et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science (1990)
S.T. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments, & Computers (1991)
B. Efron et al., A leisurely look at the bootstrap, the jackknife, and cross-validation, The American Statistician (1983)
U.M. Fayyad et al., Advances in Knowledge Discovery and Data Mining (1996)
W. Frakes et al., Information Retrieval: Data Structures and Algorithms (1992)

☆ This research was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by Lockheed Martin Energy Research Corp. for the US Department of Energy under Contract No. DE-AC05-96OR22464.
