Abstract
Text classification can greatly improve the performance of information retrieval and information filtering, but the high dimensionality of documents hampers the application of most classification approaches. This paper proposes a Difference-Similitude Matrix (DSM) based method to address the problem. The method represents a pre-classified collection as an item-document matrix, in which documents in the same category are described by their similarities and documents in different categories by their differences. Using the DSM reduction algorithm, which is simpler and more efficient than rough set reduction, we reduce the dimensionality of the document space and generate rules for text classification.
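The abstract does not spell out how the matrix is built, so the following is only a minimal sketch under stated assumptions: documents are reduced to binary term presence, and each pair of documents gets one entry holding either the terms on which they agree (same category) or the terms on which they disagree (different categories). The toy documents and the names `dsm_entry` and `build_dsm` are hypothetical, and the DSM reduction step itself is not shown.

```python
from itertools import combinations

# Toy pre-classified collection: (set of terms present, category label).
# Documents, terms, and labels are invented for illustration only.
docs = [
    ({"goal", "match", "team"}, "sports"),
    ({"goal", "team", "coach"}, "sports"),
    ({"stock", "market", "team"}, "finance"),
]

# Attribute set: every term that occurs in at least one document.
vocabulary = sorted(set().union(*(terms for terms, _ in docs)))


def dsm_entry(doc_a, doc_b, vocabulary):
    """Return (kind, terms) for one document pair.

    Assumption: same category -> terms on which both documents agree
    (similarity); different categories -> terms on which they disagree
    (difference).
    """
    terms_a, label_a = doc_a
    terms_b, label_b = doc_b
    agree = {t for t in vocabulary if (t in terms_a) == (t in terms_b)}
    if label_a == label_b:
        return "similarity", agree
    return "difference", set(vocabulary) - agree


def build_dsm(docs, vocabulary):
    """Build the pairwise matrix as a dict keyed by (i, j) with i < j."""
    return {
        (i, j): dsm_entry(docs[i], docs[j], vocabulary)
        for i, j in combinations(range(len(docs)), 2)
    }


dsm = build_dsm(docs, vocabulary)
for pair, (kind, terms) in dsm.items():
    print(pair, kind, sorted(terms))
```

In this toy collection the term `team` appears in every document, so it never shows up in a difference entry. That is the intuition behind using such a matrix for dimensionality reduction: attributes that never separate categories are candidates for removal.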
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, X., Wu, M., Xia, D., Yan, P. (2005). Difference-Similitude Matrix in Text Classification. In: Wang, L., Jin, Y. (eds) Fuzzy Systems and Knowledge Discovery. FSKD 2005. Lecture Notes in Computer Science, vol 3614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11540007_3
DOI: https://doi.org/10.1007/11540007_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28331-7
Online ISBN: 978-3-540-31828-6
eBook Packages: Computer Science, Computer Science (R0)