DOI: 10.1145/3209280.3229109
short-paper

Document clustering as a record linkage problem

Published: 28 August 2018

ABSTRACT

This work examines document clustering as a record linkage problem. It focuses on named entities and frequent terms, and uses several vector-based and graph-based document representation methods together with k-means clustering under different similarity measures. The JedAI record linkage toolkit is employed for most of the record linkage pipeline tasks (preprocessing, scalable feature representation, blocking, and clustering), and the OpenCalais platform is used for entity extraction. The resulting clusters are evaluated with multiple clustering quality metrics. The experiments show very good clustering results and significant speedups in the clustering process, which indicates that both the record linkage formulation and the JedAI toolkit are suitable for improving the scalability of large-scale document clustering tasks.
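To illustrate why a record linkage formulation can speed up clustering, the following minimal Python sketch applies token blocking so that only documents sharing at least one term are compared. It is not the paper's pipeline and does not use the JedAI or OpenCalais APIs: the function names, the whitespace tokenizer, the threshold, and the toy documents are all assumptions made for illustration, and connected components over thresholded cosine similarity stand in for the k-means step described in the abstract.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def token_blocking(vectors, stopwords=frozenset()):
    """Return candidate pairs: documents that share at least one non-stopword token."""
    blocks = defaultdict(set)
    for doc_id, vec in vectors.items():
        for token in vec:
            if token not in stopwords:
                blocks[token].add(doc_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def cluster(docs, threshold=0.3, stopwords=frozenset()):
    """Link co-blocked documents whose similarity meets the threshold.

    Returns the connected components and the number of pairs compared.
    """
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    candidates = token_blocking(vectors, stopwords)
    for a, b in candidates:
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for doc_id in docs:
        groups[find(doc_id)].add(doc_id)
    return sorted(sorted(g) for g in groups.values()), len(candidates)


# Hypothetical toy corpus, not data from the paper.
docs = {
    "d1": "athens election results announced",
    "d2": "election results in athens disputed",
    "d3": "football cup final tonight",
    "d4": "cup final ends in penalty shootout",
}
clusters, compared = cluster(docs, stopwords=frozenset({"in"}))
print(clusters, compared)
```

On this toy corpus, blocking reduces the comparisons from the six possible document pairs to two candidate pairs. The savings on real collections depend on how selective the blocking keys are, which is one reason to block on named entities and frequent terms rather than on every token.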

Published in

DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018
August 2018
311 pages
ISBN: 9781450357692
DOI: 10.1145/3209280

Copyright © 2018 ACM

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• short-paper
• Research
• Refereed limited

Acceptance Rates

Overall acceptance rate: 178 of 537 submissions, 33%
