
Unsupervised Topical Organization of Documents using Corpus-based Text Analysis

Authors:
Sarkis Sarkissian

School of Engineering, Dept. of Elec. and Compt. Eng., Lebanese American University, Byblos, Lebanon

Joe Tekli

School of Engineering, Dept. of Elec. and Compt. Eng., Lebanese American University, Byblos, Lebanon


MEDES '21: Proceedings of the 13th International Conference on Management of Digital EcoSystems, November 2021, Pages 87–94, https://doi.org/10.1145/3444757.3485078

Published: 09 November 2021

ABSTRACT

This study aims to automate the topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project that investigates efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions from a set of United Nations (UN) documents, and to use these to produce reference corpora for classifying future UN documents. We assume that the reference classes are unknown in advance, and thus propose an unsupervised clustering approach that accepts as input a collection of unstructured text documents and produces as output groups of similar documents describing similar topics. The input document feature vectors are augmented with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then run through a hierarchical clustering process to identify groups of similar documents, which serve as candidates for topical organization and keyword extraction. Experiments on a manually labelled dataset of documents classified against the UN's Sustainable Development Goals (SDGs) confirm the quality and potential of the approach.
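The pipeline outlined in the abstract (feature vectors, augmentation with relatedness scores, hierarchical clustering) can be illustrated with a minimal sketch. The abstract does not give the weighting formula, the relatedness measure, or the linkage criterion, so everything below is an assumption made for illustration: smoothed TF-IDF weights, a Dice coefficient over document-level co-occurrence as a toy stand-in for the distributional thesaurus, a hypothetical mixing parameter `alpha`, and average-linkage agglomerative clustering on cosine similarity. The four toy documents are invented and are not from the paper's dataset.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Smoothed IDF (assumed variant) keeps every weight strictly positive.
        vectors.append([tf[t] / len(d) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t in vocab])
    return vocab, vectors

def relatedness(docs):
    """Toy stand-in for a distributional thesaurus (assumption): Dice
    coefficient over document-level term co-occurrence counts."""
    df = Counter(t for d in docs for t in set(d))
    co = Counter()
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            co[(a, b)] += 1
    rel = {}
    for (a, b), c in co.items():
        rel[(a, b)] = rel[(b, a)] = 2 * c / (df[a] + df[b])
    return rel

def augment(vector, vocab, rel, alpha=0.5):
    """Add alpha-weighted mass from related terms to each term weight.
    alpha is a hypothetical mixing parameter, not taken from the paper."""
    out = []
    for i, t in enumerate(vocab):
        extra = sum(rel.get((t, u), 0.0) * vector[j]
                    for j, u in enumerate(vocab) if j != i)
        out.append(vector[i] + alpha * extra)
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerative(vectors, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            sims = [cosine(vectors[i], vectors[j])
                    for i in clusters[x] for j in clusters[y]]
            avg = sum(sims) / len(sims)
            if best is None or avg > best[0]:
                best = (avg, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return [sorted(c) for c in clusters]

# Invented toy corpus: two documents per topic.
docs = [
    "poverty income hunger food".split(),
    "hunger food nutrition poverty".split(),
    "climate emissions energy carbon".split(),
    "energy carbon climate renewable".split(),
]
vocab, vecs = tfidf_vectors(docs)
rel = relatedness(docs)
augmented = [augment(v, vocab, rel) for v in vecs]
clusters = agglomerative(augmented, k=2)
print(clusters)  # prints [[0, 1], [2, 3]]
```

In this sketch the augmentation step lets a document gain weight on terms it does not contain but that co-occur with its own terms elsewhere in the corpus, which is one plausible reading of "augmented with term co-occurrence and relatedness scores". A real system would likely replace the quadratic-time loops with a library implementation and build the thesaurus from a much larger corpus.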

Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation
    2. Retrieval tasks and goals
      1. Clustering and classification
      2. Information extraction

Published in
MEDES '21: Proceedings of the 13th International Conference on Management of Digital EcoSystems
November 2021
181 pages
ISBN: 9781450383141
DOI: 10.1145/3444757
Conference Chairs: Richard Chbeir, Yannis Manolopoulos, Ladjel Bellatreche, Djamal Benslimane
Program Chairs: Mirjana Ivanovic, Zakaria Maamar
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Augmented TF-IDF
Corpus statistics
Distributional thesaurus
Document clustering
Keyword extraction
Topical organization
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 267 of 682 submissions, 39%