DoSO: a document self-organizer

Journal of Intelligent Information Systems

Abstract

In this paper, we propose the Document Self Organizer (DoSO), an extension of the classic Self Organizing Map (SOM) model that deals more efficiently with the document clustering task. We start from a document representation model that we developed previously to overcome some of the shortcomings of the Bag-of-Words (BOW) model; it is based on important “concepts” identified by exploiting Wikipedia knowledge. We demonstrate how SOM’s performance can be boosted by using the most important concepts of the document collection to explicitly initialize the neurons. We also show how a hierarchical approach can be used within the SOM model, and how this leads to a more comprehensive final clustering with hierarchical descriptive labels attached to neurons and clusters. Experiments show that the proposed model yields promising results in terms of both extrinsic and SOM evaluation measures.
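
The idea of seeding SOM neurons from the most important concepts can be illustrated with a small sketch. It is not the authors' algorithm, whose details are not given in the abstract. It assumes that each document is already a vector of concept weights, that "most important" means the largest total weight over the collection, and that each neuron starts as a unit vector on one such concept. The function names (top_concepts, init_neurons, bmu, train) and all parameter values are hypothetical.

```python
import math

def top_concepts(docs, k):
    """Rank concept indices by total weight over the collection; keep the top k."""
    dim = len(docs[0])
    totals = [sum(doc[j] for doc in docs) for j in range(dim)]
    # Ties are broken by the lower concept index so the result is deterministic.
    return sorted(range(dim), key=lambda j: (-totals[j], j))[:k]

def init_neurons(docs, rows, cols):
    """Assumed seeding scheme: one unit vector per neuron, on a top concept."""
    dim = len(docs[0])
    seeds = top_concepts(docs, rows * cols)
    neurons = []
    for i in range(rows * cols):
        w = [0.0] * dim
        w[seeds[i % len(seeds)]] = 1.0  # reuse seeds if the map has more neurons
        neurons.append(w)
    return neurons

def bmu(neurons, doc):
    """Index of the best-matching unit (smallest squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(w, doc)) for w in neurons]
    return dists.index(min(dists))

def train(docs, rows, cols, epochs=20, lr0=0.5, sigma0=0.5):
    """Standard online SOM training on a rows x cols grid, from seeded neurons."""
    neurons = init_neurons(docs, rows, cols)
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.1)  # shrinking neighbourhood width
        for doc in docs:
            b_row, b_col = divmod(bmu(neurons, doc), cols)
            for i, w in enumerate(neurons):
                r, c = divmod(i, cols)
                grid_d2 = (r - b_row) ** 2 + (c - b_col) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))  # Gaussian neighbourhood
                for j in range(len(w)):
                    w[j] += lr * h * (doc[j] - w[j])
    return neurons

# Toy collection: four documents over four concepts (made-up weights).
docs = [
    [1.0, 0.0, 0.1, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.1, 0.9, 0.0, 0.0],
]
neurons = train(docs, rows=1, cols=2)
clusters = [bmu(neurons, d) for d in docs]
```

In this toy run the two neurons start on concepts 0 and 1, so the first two documents and the last two documents map to different neurons from the first step. A randomly initialized map would have to discover that split during training.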

Notes

  1. See http://kdd.ics.uci.edu/.

  2. See http://icame.uib.no/.

Author information

Corresponding author

Correspondence to Gerasimos Spanakis.

About this article

Cite this article

Spanakis, G., Siolas, G. & Stafylopatis, A. DoSO: a document self-organizer. J Intell Inf Syst 39, 577–610 (2012). https://doi.org/10.1007/s10844-012-0204-9

  • DOI: https://doi.org/10.1007/s10844-012-0204-9
