Abstract
In this paper, we propose a novel approach to text pattern recognition based on a weighted directed multigraph. Traditional text classification models rely on keyword frequency alone; we instead build a weighted directed multigraph in which the distances between keywords serve as the arc weights. On top of this model we develop a keyword-frequency-distance-based algorithm that uses both the frequency of the keywords and their ordering. We apply the method to two case studies: detecting plagiarized papers and detecting fraudulent emails written by the same person. In these case studies, the new method performs much better than traditional methods.
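The abstract does not spell out how the graph is constructed, so the following is only a minimal sketch of one plausible reading: every keyword is a vertex, and each ordered pair of keyword occurrences contributes one arc whose weight is the number of token positions between them. The function name `build_keyword_multigraph`, the `max_distance` cutoff, and the use of token positions as the distance measure are all assumptions made for illustration and are not taken from the chapter.

```python
from collections import defaultdict


def build_keyword_multigraph(tokens, keywords, max_distance=None):
    """Return {(source, target): [distance, ...]} over keyword occurrences.

    Hypothetical construction: an arc runs from an earlier keyword
    occurrence to a later one, weighted by their distance in tokens.
    Parallel arcs (the "multi" in multigraph) are kept as a list.
    """
    # Positions of keyword occurrences, in text order.
    occurrences = [(i, t) for i, t in enumerate(tokens) if t in keywords]
    arcs = defaultdict(list)
    for a, (i, src) in enumerate(occurrences):
        for j, dst in occurrences[a + 1:]:
            distance = j - i
            if max_distance is not None and distance > max_distance:
                break  # later occurrences are even farther away
            arcs[(src, dst)].append(distance)
    return dict(arcs)


tokens = "send money to the bank then the bank will send money".split()
graph = build_keyword_multigraph(tokens, {"send", "money", "bank"}, max_distance=5)
```

In this sketch the number of parallel arcs between two keywords reflects how often they co-occur, the arc direction reflects their order, and the weights record how far apart they appear, which is the kind of frequency-plus-ordering information the abstract describes.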
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Wu, Q., Fuller, E., Zhang, CQ. (2010). Graph Model for Pattern Recognition in Text. In: Ting, IH., Wu, HJ., Ho, TH. (eds) Mining and Analyzing Social Networks. Studies in Computational Intelligence, vol 288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13422-7_1
DOI: https://doi.org/10.1007/978-3-642-13422-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13421-0
Online ISBN: 978-3-642-13422-7