Abstract
A graph-based approach to document classification is described in this paper. The graph representation allows for a much more expressive document encoding than the more standard bag of words/phrases approach, and consequently gives improved classification accuracy. Document sets are represented as graph sets, to which a weighted graph mining algorithm is applied to extract frequent subgraphs; these are then further processed to produce feature vectors (one per document) for classification. Weighted subgraph mining is used to ensure classification effectiveness and computational efficiency: only the most significant subgraphs are extracted. The approach is validated and evaluated using several popular classification algorithms together with a real-world textual data set. The results demonstrate that the approach can outperform existing text classification algorithms on some data sets. As the size of the data set increases, further processing of the extracted frequent features becomes essential.
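The pipeline outlined in the abstract (documents to graphs, weighted frequent-pattern extraction, one feature vector per document) can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it assumes each document graph is just the set of directed edges between adjacent words, it mines only single-edge subgraphs, and the weighted-support definition and the names `doc_to_edges`, `frequent_edges` and `feature_vectors` are hypothetical choices made for illustration.

```python
from collections import Counter


def doc_to_edges(text):
    """Encode a document as a set of directed edges between adjacent words.

    Assumption: a word-adjacency graph stands in for the richer graph
    encodings a real system would use.
    """
    words = text.lower().split()
    return {(a, b) for a, b in zip(words, words[1:])}


def frequent_edges(graphs, weights, min_support):
    """Return the edges whose weighted support reaches min_support.

    Assumption: weighted support = (sum of weights of the graphs that
    contain the edge) / (sum of all graph weights). Only single-edge
    subgraphs are considered in this sketch.
    """
    total = sum(weights)
    support = Counter()
    for edges, weight in zip(graphs, weights):
        for edge in edges:
            support[edge] += weight
    return sorted(e for e, s in support.items() if s / total >= min_support)


def feature_vectors(graphs, features):
    """Build one binary vector per document: 1 if the feature edge occurs."""
    return [[1 if f in edges else 0 for f in features] for edges in graphs]


docs = [
    "graph mining finds patterns",
    "graph mining is useful",
    "text mining finds patterns",
]
graphs = [doc_to_edges(d) for d in docs]
weights = [1.0, 1.0, 1.0]  # hypothetical per-document weights

features = frequent_edges(graphs, weights, min_support=0.6)
vectors = feature_vectors(graphs, features)

for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

The resulting vectors could be passed to any standard classifier. A full frequent subgraph miner would grow multi-edge patterns from these single edges, which is where the pruning effect of a weighted support threshold matters most.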
Copyright information
© 2010 Springer-Verlag London
About this paper
Cite this paper
Jiang, C., Coenen, F., Sanderson, R., Zito, M. (2010). Text Classification using Graph Mining-based Feature Extraction. In: Bramer, M., Ellis, R., Petridis, M. (eds) Research and Development in Intelligent Systems XXVI. Springer, London. https://doi.org/10.1007/978-1-84882-983-1_2
DOI: https://doi.org/10.1007/978-1-84882-983-1_2
Publisher Name: Springer, London
Print ISBN: 978-1-84882-982-4
Online ISBN: 978-1-84882-983-1
eBook Packages: Computer Science (R0)