Abstract
One of the most difficult issues in text mining is the high dimensionality caused by a large number of features (keywords). Although various multivariate analysis techniques, such as PCA and SVD (called LSI in information retrieval), have been developed to address this curse of dimensionality, they are computationally costly. This paper investigates a regression-based reconstruction method that enables parallelization of PCA/SVD by decomposing a document-term matrix into a set of sub-matrices that share overlapping terms and then re-assembling the results using a regression technique. To evaluate the method, we use two text datasets from the UCI Machine Learning Repository, “Bag of Words” and “Reuter 50 50”. Cosine similarity is used to measure the closeness between two documents, and accuracy is measured as rank-order mismatch. The results show that matrix decomposition and re-assembly can preserve the quality of the relation/representation.
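The abstract does not spell out the algorithm, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that the folds are contiguous blocks of term columns, that each fold's rank-k term embeddings are aligned to the first fold's frame by ordinary least squares on the shared terms, and that rank-order mismatch is the fraction of ranking positions that differ. All function names and parameters are hypothetical. The alignment is exact only when every fold spans the same k-dimensional document subspace and each overlap contains at least k terms; on real, noisy data it is an approximation.

```python
import numpy as np

def overlapping_folds(n_terms, n_folds, overlap):
    """Contiguous blocks of term (column) indices; every block after the
    first also re-uses the last `overlap` columns of the previous block."""
    bounds = np.linspace(0, n_terms, n_folds + 1, dtype=int)
    return [np.arange(max(bounds[i] - overlap, 0), bounds[i + 1])
            for i in range(n_folds)]

def fold_svd(A_fold, k):
    """Rank-k SVD of one sub-matrix. Returns the document basis U (docs x k)
    and term embeddings T = V * s (terms x k), so A_fold ~= U @ T.T."""
    U, s, Vt = np.linalg.svd(A_fold, full_matrices=False)
    return U[:, :k], Vt[:k].T * s[:k]

def reassemble(A, folds, k):
    """Assumed re-assembly step: map each fold's term embeddings into the
    frame of fold 0 with a least-squares fit on the overlapping terms."""
    # The per-fold SVDs are independent, so this list could be built in parallel.
    parts = [fold_svd(A[:, idx], k) for idx in folds]
    U_ref, T_prev = parts[0]
    T_all = np.zeros((A.shape[1], k))
    T_all[folds[0]] = T_prev
    for i in range(1, len(folds)):
        T_i = parts[i][1]
        # Positions of the shared terms inside the previous and current fold.
        _, pos_prev, pos_i = np.intersect1d(folds[i - 1], folds[i],
                                            return_indices=True)
        # Solve T_i[shared] @ W ~= T_prev[shared] for the k x k map W.
        W, *_ = np.linalg.lstsq(T_i[pos_i], T_prev[pos_prev], rcond=None)
        T_prev = T_i @ W
        T_all[folds[i]] = T_prev
    return U_ref, T_all

def doc_coordinates(U_ref, T_all):
    """k-dimensional document coordinates from the re-assembled factors,
    using a small SVD of the (terms x k) embedding matrix."""
    _, s, Qt = np.linalg.svd(T_all, full_matrices=False)
    return (U_ref @ Qt.T) * s

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Xn = X / norms
    return Xn @ Xn.T

def rank_order_mismatch(S_a, S_b):
    """Assumed definition: fraction of positions that differ when every
    document ranks all documents by similarity under S_a and under S_b."""
    order_a = np.argsort(-S_a, axis=1)
    order_b = np.argsort(-S_b, axis=1)
    return float(np.mean(order_a != order_b))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 3
    # Synthetic non-negative document-term matrix of exact rank k (toy data).
    A = rng.random((12, k)) @ rng.random((k, 20))
    folds = overlapping_folds(A.shape[1], n_folds=3, overlap=4)
    U_ref, T_all = reassemble(A, folds, k)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    S_full = cosine_similarity(U[:, :k] * s[:k])
    S_fold = cosine_similarity(doc_coordinates(U_ref, T_all))
    print("rank-order mismatch:", rank_order_mismatch(S_full, S_fold))
```

In this sketch the document basis comes from the first fold only, which keeps the example short; a fuller version would probably combine the document bases of all folds before comparing against the full SVD.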
Acknowledgement
This work is financially funded and supported by Sirindhorn International Institute of Technology, Thammasat University and Burapha University.
© 2017 Springer International Publishing AG
Cite this paper
Buatoom, U., Theeramunkong, T., Kongprawechnon, W. (2017). A Regression-Based SVD Parallelization Using Overlapping Folds for Textual Data. In: Numao, M., Theeramunkong, T., Supnithi, T., Ketcham, M., Hnoohom, N., Pramkeaw, P. (eds) Trends in Artificial Intelligence: PRICAI 2016 Workshops. PRICAI 2016. Lecture Notes in Computer Science, vol 10004. Springer, Cham. https://doi.org/10.1007/978-3-319-60675-0_3
DOI: https://doi.org/10.1007/978-3-319-60675-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60674-3
Online ISBN: 978-3-319-60675-0
eBook Packages: Computer Science, Computer Science (R0)