DOI: 10.1145/3444757.3485078
research-article

Unsupervised Topical Organization of Documents using Corpus-based Text Analysis

Published: 09 November 2021

ABSTRACT

This study aims to automate the topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project investigating efficient unsupervised learning techniques that automatically extract relevant classes and their keyword descriptions from a set of United Nations (UN) documents, and that use those descriptions to produce reference corpora for classifying future UN documents. We assume that the reference classes are not known in advance, and thus propose an unsupervised clustering approach which accepts as input a collection of unstructured text documents and produces as output groups of similar documents describing similar topics. The input document feature vectors are augmented with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then run through a hierarchical clustering process to identify groups of similar documents, which serve as candidates for topical organization and keyword extraction. Experiments on a manually labelled dataset of documents classified against the UN's Sustainable Development Goals (SDGs) confirm the quality and potential of the approach.
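
The pipeline outlined in the abstract (term vectors, augmentation with co-occurrence-based relatedness scores, then hierarchical clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the Jaccard-style relatedness score standing in for a distributional thesaurus, the mixing weight `alpha`, the average-linkage criterion, the similarity `threshold`, and the toy documents.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def relatedness(vectors):
    """Document-level co-occurrence scores in [0, 1] (Jaccard over documents).

    Assumption: a simple stand-in for a distributional thesaurus.
    """
    df, co = Counter(), Counter()
    for v in vectors:
        terms = sorted(v)
        df.update(terms)                    # document frequency per term
        co.update(combinations(terms, 2))   # documents containing both terms
    rel = {}
    for (a, b), n in co.items():
        score = n / (df[a] + df[b] - n)
        rel.setdefault(a, {})[b] = score
        rel.setdefault(b, {})[a] = score
    return rel


def augment(vector, rel, alpha=0.5):
    """Add weight to terms related to those already present in the document."""
    out = {t: float(w) for t, w in vector.items()}
    for term, weight in vector.items():
        for other, score in rel.get(term, {}).items():
            out[other] = out.get(other, 0.0) + alpha * score * weight
    return out


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster(vectors, threshold=0.3):
    """Average-linkage agglomerative clustering over document indices.

    Merging stops once the closest pair of clusters falls below `threshold`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Toy corpus (hypothetical): two documents on water, two on energy.
docs = [
    "clean water sanitation access",
    "water sanitation hygiene access",
    "renewable energy solar power",
    "solar power energy grid",
]
tf = [Counter(tokenize(d)) for d in docs]
rel = relatedness(tf)
augmented = [augment(v, rel) for v in tf]
groups = cluster(augmented)
print(groups)
```

In this sketch the augmentation step lets a document gain weight on terms it does not contain but that co-occur with its own terms elsewhere in the corpus, which is the intuition behind using relatedness scores before clustering. The most heavily weighted terms in each resulting group could then serve as candidate topic keywords.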

Published in

MEDES '21: Proceedings of the 13th International Conference on Management of Digital EcoSystems
November 2021
181 pages
ISBN: 9781450383141
DOI: 10.1145/3444757

Copyright © 2021 ACM

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 267 of 682 submissions, 39%
