Abstract
Entity resolution is widely used in data mining applications to find similar records. However, the increasing scale and complexity of data limit its performance. In this paper, we propose a novel entity resolution framework that clusters large-scale data using a distributed entity resolution method. We model the clustering problem as finding connected subgraphs of the similarity graph over the records. First, our approach finds pairs of records whose similarity is above a given threshold using the appjoin algorithm, which extends the ppjoin algorithm and is executed on the MapReduce framework. Then, we propose a cache-based algorithm that clusters entities from the similar pairs based on the Disjoint Set algorithm and is also designed for the MapReduce framework. Experimental results on a real dataset show that our algorithms are more efficient than previous algorithms for entity resolution and clustering.
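The two stages described in the abstract (a thresholded similarity join, then clustering of the similar pairs into connected components) can be illustrated with a minimal single-machine sketch. This is not the appjoin algorithm or the cache-based MapReduce algorithm of the paper: the join below is a naive all-pairs comparison without prefix filtering, Jaccard similarity over whitespace tokens is assumed as the similarity measure, and all names (`jaccard`, `similar_pairs`, `DisjointSet`, `cluster`) and the sample records are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(records, threshold):
    """Naive all-pairs join: yield (i, j) whose similarity is >= threshold.

    A stand-in for a filtered join such as ppjoin; no prefix filtering here.
    """
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}
    for i, j in combinations(sorted(tokens), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            yield i, j


class DisjointSet:
    """Union-find structure over a fixed set of record ids."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            # Path halving: point x at its grandparent while walking up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster(records, threshold):
    """Group records into connected components of the similarity graph."""
    ds = DisjointSet(records)
    for i, j in similar_pairs(records, threshold):
        ds.union(i, j)
    groups = {}
    for rid in records:
        groups.setdefault(ds.find(rid), []).append(rid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample data for illustration only.
records = {
    "r1": "john smith new york",
    "r2": "john smith new york city",
    "r3": "j smith new york city",
    "r4": "alice jones boston",
}
print(cluster(records, 0.6))  # prints [['r1', 'r2', 'r3'], ['r4']]
```

In this example r1 and r3 fall below the threshold when compared directly (similarity 0.5), yet they end up in the same cluster because both are similar to r2. That transitive grouping is what modelling clusters as connected subgraphs implies.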
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Nie, T., Lee, Wc., Shen, D., Yu, G., Kou, Y. (2014). Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering. In: Li, F., Li, G., Hwang, Sw., Yao, B., Zhang, Z. (eds) Web-Age Information Management. WAIM 2014. Lecture Notes in Computer Science, vol 8485. Springer, Cham. https://doi.org/10.1007/978-3-319-08010-9_16
DOI: https://doi.org/10.1007/978-3-319-08010-9_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08009-3
Online ISBN: 978-3-319-08010-9
eBook Packages: Computer Science, Computer Science (R0)