Mining consumer product data via latent semantic indexing☆
Introduction
Large amounts of data have been collected in daily operations of organizations due to inexpensive storage and high computing power, but many companies have been unable to extract useful information from the data and utilize the information to benefit their business. Data mining involves the application of algorithms for extracting valid and useful information from large databases in order to make critical business decisions. The fact that data is being accumulated at a faster rate than it can be analyzed creates a significant demand for efficient data mining systems [7], [11]. Techniques used for data mining include decision trees and rule induction [22], association rules [1], nonlinear regression and classification [5], [13], genetic algorithms [19], [23], and neural networks [6], [14]. This study explores a new technique, latent semantic indexing (LSI), for data mining.
LSI is an efficient information retrieval technique which has been commonly used for textual documents [4], [8]. Traditional lexical-matching methods try to match words of queries with words of documents, which may fail to retrieve related documents or may return unrelated documents to users. This kind of failure to retrieve relevant documents or the retrieval of irrelevant documents is called the word-matching problem. LSI addresses the word-matching problem through the use of statistically derived conceptual indices instead of individual words [4]. Using the singular value decomposition (SVD) [15] of a large sparse term-by-document matrix, LSI constructs a conceptual vector space in which each term or document is represented as a vector in the space. The positioning of term and document vectors within the vector space reveals the underlying semantic structure of association between terms and documents in the data.
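The construction described above can be illustrated with a minimal sketch, assuming NumPy and a hypothetical five-term, five-document toy corpus that is not taken from the paper. The helper names (`lsi_fit`, `fold_in_query`, `cosine`) and the choice of k = 2 are illustrative assumptions; the query projection follows the common "folding-in" convention, which may differ in detail from the authors' implementation.

```python
import numpy as np

# Hypothetical toy term-by-document matrix (rows = terms, columns = documents).
# d0: house, ownership   d1: home, ownership   d2: house, home, ownership
# d3: web, page          d4: web, page
terms = ["house", "home", "ownership", "web", "page"]
A = np.array([
    [1, 0, 1, 0, 0],  # house
    [0, 1, 1, 0, 0],  # home
    [1, 1, 1, 0, 0],  # ownership
    [0, 0, 0, 1, 1],  # web
    [0, 0, 0, 1, 1],  # page
], dtype=float)

def lsi_fit(A, k):
    """Rank-k truncated SVD, so that A is approximated by U_k diag(s_k) V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T  # rows of V_k are document vectors

def fold_in_query(q, U_k, s_k):
    """Project a term-space query into the k-dimensional concept space."""
    return (q @ U_k) / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

U_k, s_k, V_k = lsi_fit(A, k=2)

# Query containing only the word "home".
q = np.array([0, 1, 0, 0, 0], dtype=float)
q_hat = fold_in_query(q, U_k, s_k)

lexical = q @ A  # number of query words matched literally in each document
lsi = np.array([cosine(q_hat, V_k[j]) for j in range(A.shape[1])])

print("lexical matches:", lexical)
print("LSI cosines:   ", np.round(lsi, 3))
```

In this toy corpus, document d0 never uses the word home, so its lexical match count is zero, yet its cosine with the projected query is close to 1 because house and home share the same context (ownership). The web documents remain unrelated to the query.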
This paper explores the applicability of the LSI vector-space model to numerical databases. By properly choosing attributes of data records as terms or documents, a term-by-document frequency matrix is built. A distribution-based indexing scheme is then employed to construct a correlated distribution matrix (CDM), which reflects relationships between attributes of data records. An LSI-like vector space model is thus generated so that the encoding of attributes in the space can be analyzed to detect useful or hidden patterns. The extracted information can then be validated using statistical hypothesis testing or resampling.
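As a rough illustration of the first step only, the sketch below builds a frequency matrix from a handful of hypothetical purchase records, treating one attribute as the "terms" and another as the "documents". The records, the attribute names (`product`, `income`), and the helper `frequency_matrix` are assumptions made for this example. The distribution-based indexing scheme that produces the CDM is not specified in this section, so it is not attempted here.

```python
from collections import Counter

# Hypothetical purchase records; attribute names and values are illustrative.
records = [
    {"income": "low", "product": "cereal"},
    {"income": "low", "product": "cereal"},
    {"income": "low", "product": "soda"},
    {"income": "high", "product": "coffee"},
    {"income": "high", "product": "cereal"},
    {"income": "high", "product": "coffee"},
]

def frequency_matrix(records, term_attr, doc_attr):
    """Count how often each value of term_attr co-occurs with each value of doc_attr."""
    terms = sorted({r[term_attr] for r in records})
    docs = sorted({r[doc_attr] for r in records})
    counts = Counter((r[term_attr], r[doc_attr]) for r in records)
    matrix = [[counts[(t, d)] for d in docs] for t in terms]
    return terms, docs, matrix

# Products play the role of terms, income brackets the role of documents.
terms, docs, matrix = frequency_matrix(records, "product", "income")

print("columns:", docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

Swapping the two attribute arguments transposes the roles of terms and documents, which is one way to read "choosing attributes of data records as terms or documents".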
Applications for data mining extract information from the data to make important business decisions, predict business trends, and develop new products. A common application for data mining is to analyze customer purchases (see Fig. 1) to discover patterns among existing customer preferences and then use those patterns for forecasting sales and optimizing marketing strategies. In this work, the LSI model is presented as an automated yet scalable (i.e., practical for large data collections) approach to extract underlying patterns from consumer product data.
The remaining sections of this paper outline the development and application of LSI to numerical databases. Section 2 is a brief overview of the LSI vector space model. Section 3 illustrates how an LSI-like model for numerical databases can be designed and implemented for mining a consumer product database. Finally, a summary and discussion of future work are provided in Section 4.
Section snippets
Latent semantic indexing
The word-matching problem mentioned in Section 1 arises because multiple words can have the same meaning (synonymy) and many words have more than one meaning (polysemy). For example, suppose a text collection contains documents on house ownership and web home pages, with some documents using the word house only, some using the word home only, and some using both words. For a query on home ownership, traditional lexical-matching methods fail to retrieve documents using the word house
Data mining with LSI
Although many data mining systems are derived from machine learning and neural networks, information retrieval techniques based on conceptual searching algorithms are also evolving. The conceptual vector space model used by LSI attempts to position (or cluster) similar objects in the vector space so that objects related to a given query (but perhaps not containing the exact same terminology) can be retrieved. The success of LSI for textual documents inspires its application to numerical
Summary and future work
In addition to efficient information retrieval from textual documents, LSI can also be applied efficiently to numerical databases for data mining. The LSI conceptual vector space model represents similar objects in such a way that they can be retrieved even though the objects may not share common attribute values. By projecting user queries into the vector space and matching nearby attributes or categories, underlying patterns can be extracted from large databases. Further, the extracted
Acknowledgements
Special thanks to Professor Mark M. Miller (Department of Journalism, University of Tennessee, Knoxville) and his former Ph.D. student, Dr. Connie Milbourne, for their help in acquiring the A.C. Nielsen scanner data for this research.
References (24)
- et al.
Fast discovery of association rules
Large scale singular value computations
International Journal of Supercomputer Applications
(1996)
- M.W. Berry, Z. Drmač, E.R. Jessup, Matrices, vector spaces, and information retrieval, SIAM Review 41 (2) (1999)...
- et al.
Using linear algebra for intelligent information retrieval
SIAM Review
(1995)
- L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA,...
- et al.
Neural networks – a review from a statistical perspective
Statistical Science
(1994)
- K.M. Decker, S. Focardi, Technology overview: a report on data mining, Technical report CSCS TR-95-02, Centrum voor...
- et al.
Indexing by latent semantic analysis
Journal of the American Society for Information Science
(1990)
Improving the retrieval of information from external sources
Behavior Research Methods Instruments & Computers
(1991)
- et al.
A leisurely look at the bootstrap, the jackknife, and cross-validation
The American Statistician
(1983)
Advances in Knowledge Discovery and Data Mining
Information Retrieval: Data Structures and Algorithms
Cited by (29)
Identification of interdisciplinary ideas
2016, Information Processing and Management
Citation Excerpt: It is based on eigenvector techniques from algebra. Dependencies among terms are calculated to group semantically related terms (Jiang, Berry, Donato, Ostrouchov, & Grady, 1999). These groups are named concepts and they represent semantic clusters.
Idea mining for web-based weak signal detection
2015, Futures
Citation Excerpt: Thorleuchter and Van den Poel (2013a) use semantic clustering for weak signal analysis to consider differences in authors’ writing styles and contexts. Semantic approaches calculate dependencies among terms e.g. by using eigenvector techniques from algebra to group semantically related terms (clusters) (Jiang, Berry, Donato, Ostrouchov, & Grady, 1999). Each group consists of terms that occur together in several documents but it also consists of terms that might occur together in these documents.
Semantic weak signal tracing
2014, Expert Systems with Applications
Citation Excerpt: Semantic approaches (e.g. LSI) are in contrast to knowledge structure based approaches. They consider term dependencies and use eigenvector techniques from algebra (Jiang, Berry, Donato, Ostrouchov, & Grady, 1999) to discover classes (semantic textual patterns) from a document collection. The semantic textual patterns contain terms that occur together in parts of the documents but also terms that might occur in the document parts.
Semantic compared cross impact analysis
2014, Expert Systems with Applications
Quantitative cross impact analysis with latent semantic indexing
2014, Expert Systems with Applications
Protecting research and technology from espionage
2013, Expert Systems with Applications
☆ This research was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by Lockheed Martin Energy Research Corp. for the US Department of Energy under Contract No. DE-AC05-96OR22464.