
An effective weighted rule-based method for entity resolution

Distributed and Parallel Databases

Abstract

Entity resolution is an important data-cleaning task that detects records belonging to the same entity. It is critical for digital libraries, where different entities may share the same name and no identifying key is available. Conventional methods use similarity measures and clustering techniques to find the records of a specific entity. Because these methods perform poorly, recent methods instead build rules on record attributes whose values are distinct to an entity. However, they use an inadequate set of attributes and ignore common and empty attribute values, which degrades the quality of entity resolution. In this paper, we define a multi-attributes weighted rule system (MAWR) that examines all values of the records' attributes in order to represent the difficult record-entity mapping. We then propose a rule generation algorithm based on this system, and an entity resolution algorithm (MAWR-ER) that uses the generated rules to identify entities. We verify our method on real data, and the experimental results demonstrate its effectiveness and efficiency.


Notes

  1. http://dblp.uni-trier.de.

  2. https://www.aminer.cn/disambiguation.


Acknowledgements

This paper was partially supported by NSFC Grant U1509216, the National Key Research and Development Program of China 2016YFB1000703, NSFC Grants 61472099 and 61602129, National Sci-Tech Support Plan 2015BAH10F01, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province LC2016026. The authors would like to thank Prof. Hong Gao and Prof. Jianzhong Li for their support in this work, and the anonymous reviewers for their valuable comments, which greatly improved this paper.

Author information

Corresponding author

Correspondence to Hiba Abu Ahmad.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Abu Ahmad, H., Wang, H. An effective weighted rule-based method for entity resolution. Distrib Parallel Databases 36, 593–612 (2018). https://doi.org/10.1007/s10619-018-7240-6

  • DOI: https://doi.org/10.1007/s10619-018-7240-6
