Abstract:
Grouping large amounts of data is critical for various tasks, including the identification of content on a specific topic of interest (such as terrorism-related content) within a collection of material gathered from online sources. Existing approaches typically extract relevant features using topic distributions and/or embedding methods, and subsequently apply clustering techniques in the derived representation space. In this work, we present a comparative study using Latent Dirichlet Allocation (LDA), Paragraph-Vector Distributed Bag-of-Words (PV-DBOW), and Paragraph-Vector Distributed Memory (PV-DM) models as representation methods, in conjunction with five traditional clustering algorithms, namely k-means, spherical k-means, possibilistic fuzzy c-means, agglomerative clustering, and non-negative matrix factorization (NMF), on two publicly available datasets and one proprietary dataset. The resulting fifteen combinations are assessed against the available ground truth using external clustering validity measures, such as Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI). Our results indicate that PV-DBOW generally leads to better clustering performance across all datasets.
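
To make the evaluation step concrete, the following is a minimal sketch of how the Adjusted Rand Index can be computed from a ground-truth labelling and a predicted clustering, using the standard pair-counting formulation. It is an illustration only and not the authors' implementation; the function name `adjusted_rand_index` and the toy label lists are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labellings (hypothetical helper)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions agree trivially.
        return 1.0

    # Contingency-table cells and their row/column marginals.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs grouped together in both, or in each, labelling.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)       # upper bound on agreement

    if max_index == expected:
        # Both labellings are one single cluster, or both are all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels, assumed purely for illustration.
truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

score_same = adjusted_rand_index(truth, same_up_to_renaming)
score_crossed = adjusted_rand_index(truth, crossed)
```

Because the measure depends only on which items share a cluster, renaming the cluster labels leaves the score unchanged, which is why it suits comparing clustering output against ground-truth classes.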
Date of Conference: 26-27 November 2019
Date Added to IEEE Xplore: 05 June 2020