Learning Structural Representations of Text Documents in Large Document Collections

Tsoi, Ah Chung; Hagenbuchner, Markus; Kc, Milly; Zhang, ShuJia

doi:10.1007/978-3-642-36657-4_14

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 49)

Abstract

The main aim of this chapter is to study the effects of structural representations of text documents when a connectionist approach is applied to modelling the domain. While text documents are often processed as unstructured data, we show that the performance and problem-solving capability of machine learning methods can be enhanced through the use of suitable structural representations of text documents. We also show that extracting structure from text documents does not require knowledge of the underlying semantic relationships among the words used in the text. The chapter describes an extension of the bag of words approach: incorporating the “relatedness” of word tokens as they are used in the context of a document yields a structural representation that is richer in information than the bag of words approach alone. Applications to very large datasets, one a classification problem and one a regression problem, show that the approach scales very well. The classification problem is tackled with the latest in a series of techniques that apply the idea of the self-organizing map to graph domains. With the relatedness information incorporated, as expressed by the Concept Link Graph, the resulting clusters are tighter than those obtained by a self-organizing map using a bag of words representation alone. The regression problem is to rank a text corpus. Here the idea is to include content information in the ranking of documents and to compare the resulting rankings with those obtained using PageRank. The results are inconclusive, possibly because the Concept Link Graph representations were truncated. It is conjectured that the ranking of documents will be sped up if the Concept Link Graph representations of all documents are included together with their hyperlinked structure.
The methods described in this chapter are capable of solving real-world and data mining problems.
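
The contrast between a bag of words and a representation that also records how word tokens relate within a document can be illustrated with a small sketch. The code below is not the Concept Link Graph construction used in the chapter, which is not specified in this abstract. It assumes a much simpler stand-in: two distinct tokens are linked whenever they occur within a fixed window of positions. The function names (`tokenize`, `bag_of_words`, `cooccurrence_graph`) and the `window` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    stripped = (word.strip(".,;:!?()\"'").lower() for word in text.split())
    return [token for token in stripped if token]


def bag_of_words(tokens):
    """Token frequencies only; word order is discarded."""
    return Counter(tokens)


def cooccurrence_graph(tokens, window=2):
    """Undirected weighted graph stored as {(token_a, token_b): weight}.

    Two distinct tokens are linked when they occur within `window`
    consecutive positions; the weight counts how often that happens.
    This is an illustrative stand-in, not the Concept Link Graph itself.
    """
    edges = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + window]:
            if left != right:
                edges[tuple(sorted((left, right)))] += 1
    return edges


# Two toy documents with identical word counts but different word order.
doc_a = tokenize("text graph ranking")
doc_b = tokenize("graph text ranking")

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
same_graph = cooccurrence_graph(doc_a) == cooccurrence_graph(doc_b)
print(same_bag, same_graph)
```

In this toy case the two documents have the same bag of words but different graphs, which is the sense in which a relatedness-based structure can carry more information than word counts alone. No semantic resource is consulted; the edges come only from token positions.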

Author information

Authors and Affiliations

Macau University of Science and Technology, Guangzhou, Macau SAR, China
Ah Chung Tsoi
University of Wollongong, Wollongong, Australia
Markus Hagenbuchner, Milly Kc & ShuJia Zhang

Corresponding author

Correspondence to Ah Chung Tsoi.

Editor information

Editors and Affiliations

Dipto. Ingegneria dell'Informazione, Università degli Studi di Siena, Via Roma 56, Siena, 53100, Italy
Monica Bianchini
Fac. Ingegneria, Dipto. Ingegneria dell'Informazione, Università Siena, Via Roma 56, Siena, 53100, Italy
Marco Maggini
University of Canberra, School of Electrical and Information, Adjunct Professor, Mawson Lakes Campus, ACT, 2601, Australia
Lakhmi C. Jain

About this chapter

Cite this chapter

Tsoi, A.C., Hagenbuchner, M., Kc, M., Zhang, S. (2013). Learning Structural Representations of Text Documents in Large Document Collections. In: Bianchini, M., Maggini, M., Jain, L. (eds) Handbook on Neural Information Processing. Intelligent Systems Reference Library, vol 49. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36657-4_14

DOI: https://doi.org/10.1007/978-3-642-36657-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36656-7
Online ISBN: 978-3-642-36657-4
eBook Packages: Engineering, Engineering (R0)