Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering

Nie, Tiezheng; Lee, Wang-chien; Shen, Derong; Yu, Ge; Kou, Yue

doi:10.1007/978-3-319-08010-9_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8485)

Included in the following conference series: International Conference on Web-Age Information Management

Abstract

Entity resolution is widely used in data mining applications to find similar records. However, the increasing scale and complexity of data limit its performance. In this paper, we propose a novel entity resolution framework that clusters large-scale data using a distributed entity resolution method. We model the clustering problem as finding connected subgraphs of similar records. First, our approach finds pairs of records whose similarity is above a given threshold using the appjoin algorithm, which extends the ppjoin algorithm and runs on the MapReduce framework. Then, we propose a cache-based algorithm that clusters entities from the similar pairs based on the Disjoint Set algorithm and is also designed for the MapReduce framework. Experimental results on a real dataset show that our algorithms are more efficient than previous algorithms for entity resolution and clustering.
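
The two-stage idea in the abstract (threshold similarity join, then clustering by connected components) can be illustrated on a single machine. The sketch below is not the paper's appjoin or its cache-based MapReduce algorithm: it assumes whitespace-tokenized records, Jaccard similarity, a naive all-pairs comparison in place of prefix filtering, and treats "above the threshold" as greater than or equal. All function and class names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(records, threshold):
    """Naive all-pairs join: yield index pairs (i, j) with similarity >= threshold.

    A stand-in for a filtered join such as ppjoin; no prefix filtering is done.
    """
    token_sets = [set(r.lower().split()) for r in records]
    for i, j in combinations(range(len(records)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            yield i, j


class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            # Path halving: point x at its grandparent while walking up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx  # attach the smaller tree under the larger one
        self.size[rx] += self.size[ry]


def cluster(records, threshold):
    """Group record indices into connected components of the similarity graph."""
    ds = DisjointSet(len(records))
    for i, j in similar_pairs(records, threshold):
        ds.union(i, j)
    groups = {}
    for idx in range(len(records)):
        groups.setdefault(ds.find(idx), []).append(idx)
    return sorted(groups.values())


records = [
    "john smith new york",
    "john smith york",
    "jon smith york",
    "alice wong boston",
]
clusters = cluster(records, 0.45)
```

With these sample records and a threshold of 0.45, records 0 and 2 fall into the same cluster even though their direct similarity (0.4) is below the threshold, because each is similar to record 1. This transitive grouping is what the connected-subgraph model implies.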

Author information

Authors and Affiliations

College of Information Science and Engineering, Northeastern University, Shenyang, 110819, P.R. China
Tiezheng Nie, Derong Shen, Ge Yu & Yue Kou
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
Wang-chien Lee

Editor information

Editors and Affiliations

School of Computing, University of Utah, 50 S. Central Campus Drive, 84112, Salt Lake City, UT, USA
Feifei Li
Department of Computer Science, Tsinghua University, 100084, Beijing, China
Guoliang Li
POSTECH, Republic of Korea
Seung-won Hwang
Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
Bin Yao
Advanced Digital Sciences Center (ADSC), 138632, Singapore, Singapore
Zhenjie Zhang

Cite this paper

Nie, T., Lee, Wc., Shen, D., Yu, G., Kou, Y. (2014). Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering. In: Li, F., Li, G., Hwang, Sw., Yao, B., Zhang, Z. (eds) Web-Age Information Management. WAIM 2014. Lecture Notes in Computer Science, vol 8485. Springer, Cham. https://doi.org/10.1007/978-3-319-08010-9_16

DOI: https://doi.org/10.1007/978-3-319-08010-9_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08009-3
Online ISBN: 978-3-319-08010-9
eBook Packages: Computer Science, Computer Science (R0)
