Abstract
The main aim of this chapter is to study the effects of structural representations of text documents when applying a connectionist approach to modelling the domain. While text documents are often processed as unstructured data, we show in this chapter that the performance and problem-solving capability of machine learning methods can be enhanced through the use of suitable structural representations of text documents. We also show that extracting structure from text documents does not require knowledge of the underlying semantic relationships among the words used in the text. This chapter describes an extension of the bag of words approach. Incorporating the “relatedness” of word tokens, as they are used in the context of a document, yields a structural representation of text documents which is richer in information than the bag of words approach alone. An application to very large datasets, for a classification problem and a regression problem, shows that our approach scales very well. The classification problem is tackled by the latest in a series of techniques which apply the idea of the self-organizing map to graph domains. It is shown that, when the relatedness information expressed by the Concept Link Graph is incorporated, the resulting clusters are tighter than those obtained by a self-organizing map using a bag of words representation alone. The regression problem is to rank a text corpus. Here the idea is to include content information in the ranking of documents and to compare the resulting rankings with those obtained using PageRank. In this case the results are inconclusive, possibly because the Concept Link Graph representations were truncated. It is conjectured that the ranking of documents will be sped up if we include the Concept Link Graph representation of all documents together with their hyperlinked structure.
The methods described in this chapter are capable of solving real-world and data mining problems.
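To make the contrast between a bag of words and a structural representation concrete, the following is a minimal sketch only. The abstract does not specify how the Concept Link Graph is constructed, so this example substitutes a simple sliding-window co-occurrence graph as a stand-in for "relatedness"; the function names `bag_of_words` and `cooccurrence_graph`, the `window` parameter, and the sample sentence are all hypothetical and not taken from the chapter.

```python
from collections import Counter


def bag_of_words(tokens):
    """Plain bag of words: token counts only, no structure."""
    return Counter(tokens)


def cooccurrence_graph(tokens, window=3):
    """Undirected weighted graph over word tokens.

    ASSUMPTION: two tokens are 'related' if they appear within the same
    sliding window. This is an illustrative stand-in, not the Concept
    Link Graph construction used in the chapter.
    """
    edges = Counter()
    for i in range(len(tokens)):
        span = tokens[i:i + window]
        head = span[0]
        for other in span[1:]:
            if other != head:
                # Sort the pair so (a, b) and (b, a) share one edge.
                edges[tuple(sorted((head, other)))] += 1
    return edges


# Hypothetical sample text, already tokenised and lower-cased.
tokens = "graph neural networks learn graph structure".split()
bow = bag_of_words(tokens)
graph = cooccurrence_graph(tokens, window=3)
```

In this sketch `bow` records only that "graph" occurs twice, whereas `graph` additionally records which tokens occur near each other and how often, which is the kind of extra information a structural representation adds on top of word counts.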
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Tsoi, A.C., Hagenbuchner, M., Kc, M., Zhang, S. (2013). Learning Structural Representations of Text Documents in Large Document Collections. In: Bianchini, M., Maggini, M., Jain, L. (eds) Handbook on Neural Information Processing. Intelligent Systems Reference Library, vol 49. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36657-4_14
DOI: https://doi.org/10.1007/978-3-642-36657-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36656-7
Online ISBN: 978-3-642-36657-4
eBook Packages: Engineering, Engineering (R0)