Mining consumer product data via latent semantic indexing

https://doi.org/10.1016/S1088-467X(99)00029-3

Abstract

One important focus of data mining research is the development of algorithms for extracting valuable information from large databases in order to facilitate business decisions. This study explores a new technique for data mining: latent semantic indexing (LSI). LSI is an efficient information retrieval method for textual documents. By computing the singular value decomposition (SVD) of a large sparse term-by-document matrix, LSI constructs an approximate vector space model that represents important associative relationships between terms and documents that are not evident in individual documents. This paper explores the applicability of the LSI model to numerical databases, namely consumer product data. By properly choosing attributes of data records as terms or documents, a term-by-document frequency matrix is built, from which a distribution-based indexing scheme is employed to construct a correlated distribution matrix (CDM). An LSI-like vector space model is then used to detect useful or hidden patterns in the numerical data. The extracted information can then be validated using statistical hypothesis testing or resampling. LSI is an automatic yet intelligent indexing method. Its application to numerical data introduces a promising way to discover knowledge in important commercial application areas such as retail and consumer banking.

Introduction

Large amounts of data have been collected in daily operations of organizations due to inexpensive storage and high computing power, but many companies have been unable to extract useful information from the data and utilize the information to benefit their business. Data mining involves the application of algorithms for extracting valid and useful information from large databases in order to make critical business decisions. The fact that data is being accumulated at a faster rate than it can be analyzed creates a significant demand for efficient data mining systems [7], [11]. Techniques used for data mining include decision trees and rule induction [22], association rules [1], nonlinear regression and classification [5], [13], genetic algorithms [19], [23], and neural networks [6], [14]. This study explores a new technique, latent semantic indexing (LSI), for data mining.

LSI is an efficient information retrieval technique that has been commonly used for textual documents [4], [8]. Traditional lexical-matching methods try to match the words of a query with the words of documents, and so may fail to retrieve related documents or may return unrelated documents to users. This failure, whether it misses relevant documents or retrieves irrelevant ones, is called the word-matching problem. LSI addresses the word-matching problem by using statistically derived conceptual indices instead of individual words [4]. Using the singular value decomposition (SVD) [15] of a large sparse term-by-document matrix, LSI constructs a conceptual vector space in which each term or document is represented as a vector. The positioning of term and document vectors within this space reveals the underlying semantic structure of association between terms and documents in the data.
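As a rough illustration of the construction just described, the following Python sketch (using NumPy) computes a truncated SVD of a small term-by-document matrix and reads off term and document vectors. The toy vocabulary, the counts, the choice k = 2, the scaling of coordinates by the singular values, and the helper names lsi_embed and cosine are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def lsi_embed(A, k):
    """Rank-k term and document coordinates from the SVD of A (terms x documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]      # one row per term
    doc_vecs = Vt[:k, :].T * s[:k]    # one row per document
    return term_vecs, doc_vecs

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical term-by-document counts (rows: terms, columns: documents).
terms = ["house", "home", "ownership", "web", "page"]
A = np.array([
    [1, 0, 1, 0, 0],  # house
    [0, 1, 1, 1, 1],  # home
    [1, 1, 1, 0, 0],  # ownership
    [0, 0, 0, 1, 1],  # web
    [0, 0, 0, 1, 1],  # page
], dtype=float)

term_vecs, doc_vecs = lsi_embed(A, k=2)

# Compare two terms in the reduced space.
print(cosine(term_vecs[terms.index("house")], term_vecs[terms.index("home")]))
```

Because the rows for web and page are identical in this toy matrix, their term vectors coincide in the reduced space. Comparing other pairs of rows with cosine gives a feel for how association between terms is encoded by position.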

This paper explores the applicability of the LSI vector-space model to numerical databases. By properly choosing attributes of data records as terms or documents, a term-by-document frequency matrix is built. A distribution-based indexing scheme is then employed to construct a correlated distribution matrix (CDM), which reflects relationships between attributes of data records. An LSI-like vector space model is then generated from the CDM so that the encoding of attributes in the space can be analyzed to detect useful or hidden patterns. The extracted information can then be validated using statistical hypothesis testing or resampling.
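A minimal sketch of the first step, building a term-by-document frequency matrix from data records, might look as follows. The record fields (income, product), the sample records, and the helper frequency_matrix are hypothetical. The distribution-based indexing scheme that turns such a matrix into a CDM is not specified in this text and is deliberately not implemented here.

```python
from collections import Counter

def frequency_matrix(records, term_attr, doc_attr):
    """Count how often each term_attr value co-occurs with each doc_attr value."""
    counts = Counter((r[term_attr], r[doc_attr]) for r in records)
    terms = sorted({r[term_attr] for r in records})
    docs = sorted({r[doc_attr] for r in records})
    # Rows are "terms", columns are "documents"; missing pairs count as 0.
    matrix = [[counts[(t, d)] for d in docs] for t in terms]
    return terms, docs, matrix

# Hypothetical purchase records: one attribute plays the role of terms,
# another the role of documents.
records = [
    {"income": "low", "product": "cereal"},
    {"income": "low", "product": "cereal"},
    {"income": "high", "product": "coffee"},
    {"income": "low", "product": "coffee"},
]

terms, docs, matrix = frequency_matrix(records, term_attr="product", doc_attr="income")
```

Swapping term_attr and doc_attr transposes the matrix, which is one way to read the phrase "choosing attributes of data records as terms or documents".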

Applications for data mining extract information from the data to make important business decisions, predict business trends, and develop new products. A common application for data mining is to analyze customer purchases (see Fig. 1) to discover patterns among existing customer preferences and then use those patterns for forecasting sales and optimizing marketing strategies. In this work, the LSI model is presented as an automated yet scalable (i.e., practical for large data collections) approach to extract underlying patterns from consumer product data.

The remaining sections of this paper outline the development and application of LSI to numerical databases. Section 2 is a brief overview of the LSI vector space model. Section 3 illustrates how an LSI-like model for numerical databases can be designed and implemented for mining a consumer product database. Finally, a summary and discussion of future work are provided in Section 4.

Latent semantic indexing

The word-matching problem mentioned in Section 1 results from multiple words having the same meaning (synonymy) and from many words having more than one meaning (polysemy). For example, suppose a text collection contains documents on house ownership and web home pages, with some documents using the word house only, some using the word home only, and some using both words. For a query on home ownership, traditional lexical-matching methods fail to retrieve documents using the word house

Data mining with LSI

Although many data mining systems are derived from machine learning and neural networks, information retrieval techniques based on conceptual searching algorithms are also evolving. The conceptual vector space model used by LSI attempts to position (or cluster) similar objects in the vector space so that objects related to a given query (but perhaps not containing the exact same terminology) can be retrieved. The success of LSI for textual documents inspires its application to numerical

Summary and future work

In addition to the efficient information retrieval from textual documents, LSI can also be applied efficiently to numerical databases for data mining. The LSI conceptual vector space model represents similar objects in such a way that they can be retrieved even though the objects may not share common attribute values. By projecting user queries into the vector space and matching nearby attributes or categories, underlying patterns can be extracted from large databases. Further, the extracted

Acknowledgements

Special thanks to Professor Mark M. Miller (Department of Journalism, University of Tennessee, Knoxville) and his former Ph.D. student, Dr. Connie Milbourne, for their help in acquiring the A.C. Nielsen scanner data for this research.

References (24)

  • R. Agrawal et al., Fast discovery of association rules
  • M.W. Berry, Large scale singular value computations, International Journal of Supercomputer Applications (1996)
  • M.W. Berry, Z. Drmač, E.R. Jessup, Matrices, vector spaces, and information retrieval, SIAM Review 41 (2) (1999)...
  • M.W. Berry et al., Using linear algebra for intelligent information retrieval, SIAM Review (1995)
  • L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA,...
  • B. Cheng et al., Neural networks – a review from a statistical perspective, Statistical Science (1994)
  • K.M. Decker, S. Focardi, Technology overview: a report on data mining, Technical report CSCS TR-95-02, Centrum voor...
  • S. Deerwester et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science (1990)
  • S.T. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments, & Computers (1991)
  • B. Efron et al., A leisurely look at the bootstrap, the jackknife, and cross-validation, The American Statistician (1983)
  • U.M. Fayyad et al., Advances in Knowledge Discovery and Data Mining (1996)
  • W. Frakes et al., Information Retrieval: Data Structures and Algorithms (1992)

This research was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by Lockheed Martin Energy Research Corp. for the US Department of Energy under Contract No. DE-AC05-96OR22464.
