ABSTRACT
This study aims to automate the topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project investigating efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions from a set of United Nations (UN) documents, and to use these descriptions to produce reference corpora against which future UN documents can be classified. We assume that the reference classes are not known in advance, and therefore propose an unsupervised clustering approach that accepts a collection of unstructured text documents as input and produces groups of similar documents describing similar topics as output. The input document feature vectors are augmented with term co-occurrence and relatedness scores obtained from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then run through a hierarchical clustering process to identify groups of similar documents, which serve as candidates for topical organization and keyword extraction. Experiments on a manually labelled dataset of documents classified against the UN's Sustainable Development Goals (SDGs) confirm the quality and potential of the approach.
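The pipeline outlined in the abstract (feature vectors, thesaurus-based augmentation, hierarchical clustering) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy `THESAURUS` scores, the raw term-frequency weighting, the additive augmentation rule, cosine similarity, average linkage, and the `0.3` stopping threshold. The names `tf_vector`, `augment`, `cosine`, and `cluster` are hypothetical helpers.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: term -> {related term: relatedness score}.
# In the described approach these scores would come from a distributional
# thesaurus built on the corpus; the values here are made up.
THESAURUS = {
    "poverty": {"income": 0.6},
    "income": {"poverty": 0.6},
    "ocean": {"marine": 0.7},
    "marine": {"ocean": 0.7},
}

DOCS = [
    "poverty reduction",
    "income inequality",
    "ocean pollution",
    "marine conservation",
]


def tf_vector(text):
    """Raw term-frequency vector of a lower-cased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def augment(vec, thesaurus):
    """Assumed augmentation rule: each term adds tf * score to its related terms."""
    out = dict(vec)
    for term, tf in vec.items():
        for related, score in thesaurus.get(term, {}).items():
            out[related] = out.get(related, 0.0) + tf * score
    return out


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold):
    """Average-linkage agglomerative clustering over document indices.

    Repeatedly merges the most similar pair of clusters and stops once the
    best available similarity drops below `threshold`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def link(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = link(clusters[x], clusters[y])
                if score > best:
                    best, pair = score, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


if __name__ == "__main__":
    plain = [tf_vector(d) for d in DOCS]
    augmented = [augment(v, THESAURUS) for v in plain]
    print("without thesaurus:", cluster(plain, 0.3))
    print("with thesaurus:   ", cluster(augmented, 0.3))
```

In this toy corpus no two documents share a term, so the plain vectors never merge; once the assumed relatedness scores are added, the two poverty-related and the two ocean-related documents become similar enough to be grouped.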
Index Terms
- Unsupervised Topical Organization of Documents using Corpus-based Text Analysis