Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering

  • Conference paper
Web-Age Information Management (WAIM 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8485)

Abstract

Entity resolution is widely used in data mining applications to find similar records. However, the increasing scale and complexity of data limit its performance. In this paper, we propose a novel entity resolution framework that clusters large-scale data using a distributed entity resolution method. We model the clustering problem as finding the connected subgraphs of a similarity graph over the records. First, our approach finds pairs of records whose similarity is above a given threshold using the appjoin algorithm, which extends the ppjoin algorithm and is executed on the MapReduce framework. Then, we propose a cache-based algorithm that clusters entities from the similar pairs based on the Disjoint Set algorithm and is also designed for the MapReduce framework. Experimental results on a real dataset show that our algorithms are more efficient than previous algorithms for entity resolution and clustering.
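
The two-stage idea in the abstract (a threshold similarity join followed by disjoint-set clustering of the resulting pairs) can be illustrated with a small single-machine sketch. This is not the authors' appjoin or their cache-based MapReduce algorithm: the join below is a brute-force Jaccard comparison over whitespace tokens, and the names `jaccard`, `similar_pairs`, `DisjointSet`, and `cluster`, as well as the sample records and the threshold, are assumptions made purely for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(records, threshold):
    """Brute-force stand-in for the similarity join (NOT appjoin/ppjoin).

    Returns all id pairs whose token-set Jaccard similarity >= threshold.
    """
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}
    return [
        (r, s)
        for r, s in combinations(sorted(tokens), 2)
        if jaccard(tokens[r], tokens[s]) >= threshold
    ]


class DisjointSet:
    """Union-find structure with path compression."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        # Locate the root, then point every node on the path at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster(records, threshold):
    """Group records into connected components of the similarity graph."""
    ds = DisjointSet(records)
    for r, s in similar_pairs(records, threshold):
        ds.union(r, s)
    groups = {}
    for rid in records:
        groups.setdefault(ds.find(rid), []).append(rid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample records, for illustration only.
records = {
    1: "apple iphone 5 black",
    2: "apple iphone 5 white",
    3: "iphone 5 apple black 16gb",
    4: "samsung galaxy s4",
    5: "samsung galaxy s4 mini",
}
pairs = similar_pairs(records, 0.55)
clusters = cluster(records, 0.55)
```

In this sample, records 2 and 3 do not reach the threshold directly, yet they end up in the same cluster because each is similar to record 1. That transitive grouping is what the connected-subgraph formulation implies.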



Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Nie, T., Lee, Wc., Shen, D., Yu, G., Kou, Y. (2014). Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering. In: Li, F., Li, G., Hwang, Sw., Yao, B., Zhang, Z. (eds) Web-Age Information Management. WAIM 2014. Lecture Notes in Computer Science, vol 8485. Springer, Cham. https://doi.org/10.1007/978-3-319-08010-9_16

  • DOI: https://doi.org/10.1007/978-3-319-08010-9_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08009-3

  • Online ISBN: 978-3-319-08010-9

  • eBook Packages: Computer Science, Computer Science (R0)
