Abstract
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
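For readers unfamiliar with the problem setting, the following is a minimal single-machine sketch of a similarity join under an edit-distance threshold. It is not the algorithm proposed in this chapter. It only illustrates the pigeonhole idea that partition-based methods generally rely on: if two strings are within edit distance tau, and one of them is split into tau + 1 disjoint segments, at least one segment must occur unchanged as a substring of the other. The function names (`edit_distance`, `even_partition`, `similarity_join`) and the sample strings are hypothetical, chosen for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def even_partition(s: str, k: int) -> list[str]:
    """Split s into k disjoint, contiguous segments of near-equal length.

    Segments may be empty when len(s) < k; an empty segment matches any
    string, so the filter below stays free of false negatives.
    """
    base, extra = divmod(len(s), k)
    segments, pos = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)
        segments.append(s[pos:pos + size])
        pos += size
    return segments


def similarity_join(strings: list[str], tau: int) -> list[tuple[str, str]]:
    """Return all pairs with edit distance <= tau (naive pair enumeration)."""
    results = []
    for r, s in combinations(strings, 2):
        # Length filter: lengths differing by more than tau cannot match.
        if abs(len(r) - len(s)) > tau:
            continue
        # Segment filter (pigeonhole): some segment of r must appear in s.
        if not any(seg in s for seg in even_partition(r, tau + 1)):
            continue
        # Verification: compute the exact distance for surviving candidates.
        if edit_distance(r, s) <= tau:
            results.append((r, s))
    return results


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "database"]
    print(similarity_join(words, tau=1))
```

The sketch still enumerates every pair, so it only shows how the filters prune verification work. Practical partition-based joins avoid the quadratic enumeration by indexing segments, and a distributed variant would additionally need a way to route strings that share a segment to the same worker; neither is shown here.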
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Lin, C., Yu, H., Weng, W., He, X. (2014). Large-Scale Similarity Join with Edit-Distance Constraints. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8422. Springer, Cham. https://doi.org/10.1007/978-3-319-05813-9_22
DOI: https://doi.org/10.1007/978-3-319-05813-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05812-2
Online ISBN: 978-3-319-05813-9
eBook Packages: Computer Science (R0)