FSampleJoin: A Fixed-Sample-Based Method for String Similarity Joins Using MapReduce

Sun, Decai; Wang, Xiaoxia

doi:10.1007/978-3-030-24265-7_22

Conference paper
First Online: 11 July 2019

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 11633)

Abstract

Data integration and data cleaning have received significant attention over the last three decades, and the similarity join is a basic operation in both areas. In this paper, we propose a new fixed-sample-based algorithm, called FSampleJoin, for string similarity joins using MapReduce. The algorithm employs a filter-verify framework. In the filter stage, a fixed-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs. In the verify stage, a secondary filter further eliminates dissimilar string pairs, and the remaining candidate pairs are verified with a length-aware verification method. Experimental results show that our algorithm outperforms state-of-the-art approaches, although the methods perform similarly in the case of edit distance zero.
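
The abstract does not spell out the fixed-sample partition scheme, so the following is only a minimal single-process sketch of the general filter-verify idea, not the FSampleJoin algorithm itself. It assumes a generic even-partition signature: a string split into tau + 1 segments must share at least one segment with any string within edit distance tau. Verification uses a length check followed by a banded dynamic program. The function names (`segments`, `edit_distance_within`, `similarity_self_join`) and the sample data are hypothetical, and no MapReduce distribution is modeled.

```python
from collections import defaultdict


def segments(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal size.

    Returns (start, text) pairs; longer segments are placed last.
    """
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, pos = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)
        out.append((pos, s[pos:pos + size]))
        pos += size
    return out


def edit_distance_within(a, b, tau):
    """Return True if the edit distance between a and b is at most tau."""
    if abs(len(a) - len(b)) > tau:  # length filter
        return False
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        # Cells outside the diagonal band cannot be <= tau; cap them at tau + 1.
        cur = [i] + [tau + 1] * len(b)
        for j in range(max(1, i - tau), min(len(b), i + tau) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        if min(cur) > tau:  # early termination: no cell can recover
            return False
        prev = cur
    return prev[len(b)] <= tau


def similarity_self_join(strings, tau):
    """Return index pairs (i, j), i < j, whose strings are within edit distance tau."""
    order = sorted(range(len(strings)), key=lambda i: len(strings[i]))
    index = defaultdict(list)  # (length, segment number, segment text) -> ids
    layout = {}                # length -> [(segment start, segment length)]
    results = set()
    for sid in order:
        s = strings[sid]
        # Filter: probe segments of already-indexed strings of compatible length.
        candidates = set()
        for length in range(max(0, len(s) - tau), len(s) + 1):
            for k, (p, n) in enumerate(layout.get(length, ())):
                # A surviving segment can shift by at most tau positions.
                for q in range(max(0, p - tau), min(len(s) - n, p + tau) + 1):
                    candidates.update(index.get((length, k, s[q:q + n]), ()))
        # Verify: keep only candidates that pass the thresholded edit distance.
        for cid in candidates:
            if edit_distance_within(strings[cid], s, tau):
                results.add((min(cid, sid), max(cid, sid)))
        # Index the current string's segments for later, longer strings.
        segs = segments(s, tau)
        layout[len(s)] = [(p, len(t)) for p, t in segs]
        for k, (p, t) in enumerate(segs):
            index[(len(s), k, t)].append(sid)
    return results


words = ["vldb", "pvldb", "icde", "sigmod", "sigmmod", "kdd"]
print(sorted(similarity_self_join(words, 1)))  # [(0, 1), (3, 4)]
```

Processing strings in order of increasing length means each string only probes shorter or equal-length strings, so every pair is considered once. In a MapReduce setting, the segment keys would plausibly serve as shuffle keys, but that mapping is an assumption rather than something the abstract states.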

Acknowledgements

This work was supported in part by the Humanity and Social Science Youth Foundation of the Ministry of Education of China under Grant 15YJC870021, the Scientific Research Foundation of the Education Department of Liaoning Province of China under Grant L2015010, the NSFC under Grant 61602056, the Natural Science Foundation of Liaoning Province of China under Grant 20170540015, and the Social Science Foundation of Liaoning Province of China under Grant L18AXW001.

Author information

Authors and Affiliations

College of Information Science and Technology, Bohai University, Jinzhou, 121013, China
Decai Sun & Xiaoxia Wang
Key Laboratory of Big Data in Digital Publishing, SAPPRFT, Jinzhou, China
Decai Sun

Corresponding author

Correspondence to Decai Sun.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Xingming Sun
Nanjing University of Information Science and Technology, Nanjing, China
Zhaoqing Pan
Purdue University, West Lafayette, IN, USA
Elisa Bertino

About this paper

Cite this paper

Sun, D., Wang, X. (2019). FSampleJoin: A Fixed-Sample-Based Method for String Similarity Joins Using MapReduce. In: Sun, X., Pan, Z., Bertino, E. (eds) Artificial Intelligence and Security. ICAIS 2019. Lecture Notes in Computer Science, vol 11633. Springer, Cham. https://doi.org/10.1007/978-3-030-24265-7_22

DOI: https://doi.org/10.1007/978-3-030-24265-7_22
Published: 11 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-24264-0
Online ISBN: 978-3-030-24265-7
eBook Packages: Computer Science, Computer Science (R0)
