ABSTRACT
This work treats document clustering as a record linkage problem. Documents are described by their named entities and frequent terms, represented with several vector-based and graph-based methods, and grouped by k-means clustering under different similarity measures. The JedAI Record Linkage toolkit handles most of the record linkage pipeline tasks (i.e., preprocessing, scalable feature representation, blocking, and clustering), and the OpenCalais platform performs entity extraction. The resulting clusters are evaluated with multiple clustering quality metrics. The experiments show very good clustering results and significant speedups in the clustering process, which indicates that both the record linkage formulation and the JedAI toolkit are suitable for improving the scalability of large-scale document clustering tasks.
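The record linkage formulation can be illustrated with a minimal Python sketch. It is not the paper's pipeline: it uses neither JedAI nor OpenCalais, it replaces entity extraction with a naive tokenizer, and it replaces k-means with connected components over thresholded Jaccard similarity. All function names, the sample texts, and the 0.3 threshold are assumptions made for illustration. What it does show is why blocking helps: only documents that share at least one term are ever compared.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Naive stand-in for named-entity and frequent-term extraction."""
    tokens = (tok.strip(".,;:!?").lower() for tok in text.split())
    return {tok for tok in tokens if len(tok) > 3}


def token_blocking(docs):
    """Build one block per term, holding the ids of documents that contain it."""
    blocks = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            blocks[term].add(doc_id)
    # A block with a single document produces no comparisons, so drop it.
    return {term: ids for term, ids in blocks.items() if len(ids) > 1}


def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.3):
    """Return (clusters, number of candidate pairs compared)."""
    docs = {i: tokenize(t) for i, t in enumerate(texts)}

    # Candidate pairs are restricted to documents that co-occur in a block.
    candidates = set()
    for ids in token_blocking(docs).values():
        candidates.update(combinations(sorted(ids), 2))

    # Union-find merges documents whose similarity reaches the threshold.
    parent = list(range(len(texts)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidates:
        if jaccard(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in range(len(texts)):
        groups[find(i)].append(i)
    return sorted(groups.values()), len(candidates)


# Hypothetical sample documents: two news events, two articles each.
texts = [
    "Earthquake strikes Athens, Greece overnight",
    "Athens earthquake: Greece assesses damage",
    "Champions League final ends with penalties",
    "Penalties decide Champions League final",
]
clusters, compared = cluster(texts)
print(clusters, compared)
```

On these four sample texts, blocking reduces the six possible pairwise comparisons to two, and each event ends up in its own cluster.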
Index Terms
- Document clustering as a record linkage problem