short-paper

Document clustering as a record linkage problem

Authors:
Nikiforos Pittaras, Institute of Informatics and Telecommunications, N.C.S.R. Demokritos, Greece
George Giannakopoulos, Institute of Informatics and Telecommunications, N.C.S.R. Demokritos, Greece
Leonidas Tsekouras, Institute of Informatics and Telecommunications, N.C.S.R. Demokritos, Greece
Iraklis Varlamis, Department of Informatics and Telematics, Harokopio University of Athens, Greece

DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, August 2018, Article No.: 39, Pages 1–4, https://doi.org/10.1145/3209280.3229109

Published: 28 August 2018

ABSTRACT

This work examines document clustering as a record linkage problem, focusing on named entities and frequent terms. It uses several vector- and graph-based document representation methods and k-means clustering with different similarity measures. The JedAI Record Linkage toolkit is employed for most of the record linkage pipeline tasks (i.e., preprocessing, scalable feature representation, blocking and clustering), and the OpenCalais platform is used for entity extraction. The resulting clusters are evaluated with multiple clustering quality metrics. The experiments show very good clustering results and significant speedups in the clustering process, which indicates that both the record linkage formulation and the JedAI toolkit are suitable for improving the scalability of large-scale document clustering tasks.
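
To make the record linkage formulation concrete, the following is a minimal sketch in plain Python, not the authors' pipeline and not the JedAI API. It assumes each document has already been reduced to a set of terms (a stand-in for the named entities and frequent terms mentioned above), uses token blocking to restrict comparisons to documents that share a term, scores candidate pairs with Jaccard similarity, and groups linked documents by connected components instead of the k-means clustering used in the paper. All function names, the sample documents, and the 0.5 threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(docs):
    """Build one block per term: term -> ids of the documents containing it."""
    blocks = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            blocks[term].add(doc_id)
    return blocks


def candidate_pairs(blocks):
    """Collect document pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_and_cluster(docs, threshold=0.5):
    """Link candidate pairs at or above the threshold; return connected components."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidate_pairs(token_blocking(docs)):
        if jaccard(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for doc_id in docs:
        clusters[find(doc_id)].add(doc_id)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical documents, each represented by its extracted terms.
docs = {
    "d1": {"athens", "greece", "election"},
    "d2": {"athens", "greece", "election", "vote"},
    "d3": {"nasa", "mars", "rover"},
    "d4": {"nasa", "mars", "launch"},
}
clusters = link_and_cluster(docs)
```

In this sketch the speedup comes from blocking: only pairs that share a block are scored, so the two unrelated topic groups are never compared against each other.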

Index Terms

Document clustering as a record linkage problem
1. Information systems
  1. Information retrieval
    1. Document representation
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in

DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018
August 2018
311 pages
ISBN: 9781450357692
DOI: 10.1145/3209280

Copyright © 2018 ACM
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 August 2018
Author Tags
Clustering
Entity Resolution
Record Linkage
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%
