FSampleJoin: A Fixed-Sample-Based Method for String Similarity Joins Using MapReduce

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 11633)

Abstract

Data integration and data cleaning have received significant attention over the last three decades, and the similarity join is a basic operation in both areas. In this paper, a new fixed-sample-based algorithm, called FSampleJoin, is proposed to perform string similarity joins using MapReduce. The algorithm employs a filter-verify framework. In the filter stage, a fixed-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs. In the verify stage, a secondary filter is employed to further eliminate dissimilar string pairs, and the remaining candidate pairs are verified with a length-aware verification method. Experimental results show that the algorithm outperforms state-of-the-art approaches, although their performance is similar when the edit distance threshold is zero.
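
The filter-verify idea summarized in the abstract can be illustrated with a minimal single-process sketch. It makes no claim to reproduce FSampleJoin: the abstract does not specify the fixed-sample partition scheme, the secondary filter, or the length-aware verification, so the sketch substitutes a generic pigeonhole signature (split each string into tau + 1 segments, in the style of partition-based joins) and a plain threshold-bounded dynamic program. All function names (`partition`, `edit_distance_within`, `similarity_self_join`) are hypothetical, and no MapReduce runtime is involved.

```python
from collections import defaultdict


def partition(s, k):
    """Split s into k contiguous segments whose lengths differ by at most 1."""
    base, extra = divmod(len(s), k)
    segments, pos = [], 0
    for i in range(k):
        length = base + (1 if i >= k - extra else 0)
        segments.append(s[pos:pos + length])
        pos += length
    return segments


def edit_distance_within(a, b, tau):
    """Return True if the edit distance between a and b is at most tau."""
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > tau:  # no cell in later rows can drop below this minimum
            return False
        prev = cur
    return prev[-1] <= tau


def similarity_self_join(strings, tau):
    """Return sorted index pairs (i, j), i < j, with edit distance <= tau."""
    # Filter stage (assumed stand-in): index every segment as a signature.
    # With tau + 1 segments, at most tau edits leave one segment untouched,
    # so every true pair shares at least one signature.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for seg in partition(s, tau + 1):
            index[seg].add(sid)
    seg_lengths = {len(seg) for seg in index}

    candidates = set()
    for rid, r in enumerate(strings):
        for length in seg_lengths:
            for start in range(len(r) - length + 1):
                for sid in index.get(r[start:start + length], ()):
                    # Length filter: lengths may differ by at most tau.
                    if sid != rid and abs(len(strings[sid]) - len(r)) <= tau:
                        candidates.add((min(rid, sid), max(rid, sid)))

    # Verify stage: keep only candidates that pass the exact check.
    return sorted(
        p for p in candidates
        if edit_distance_within(strings[p[0]], strings[p[1]], tau)
    )


words = ["kitten", "sitten", "sitting", "mitten", "banana"]
print(similarity_self_join(words, 1))  # [(0, 1), (0, 3), (1, 3)]
```

In a MapReduce setting, the signature-emitting loop would plausibly correspond to the map phase and the candidate grouping to the shuffle and reduce phases, but that mapping is an assumption of this sketch rather than something the abstract states.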

Acknowledgements

This work was supported in part by the Humanity and Social Science Youth Foundation of the Ministry of Education of China under Grant 15YJC870021, the Scientific Research Foundation of the Education Department of Liaoning Province of China under Grant L2015010, the NSFC under Grant 61602056, the Natural Science Foundation of Liaoning Province of China under Grant 20170540015, and the Social Science Foundation of Liaoning Province of China under Grant L18AXW001.

Author information

Corresponding author

Correspondence to Decai Sun.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Sun, D., Wang, X. (2019). FSampleJoin: A Fixed-Sample-Based Method for String Similarity Joins Using MapReduce. In: Sun, X., Pan, Z., Bertino, E. (eds) Artificial Intelligence and Security. ICAIS 2019. Lecture Notes in Computer Science, vol 11633. Springer, Cham. https://doi.org/10.1007/978-3-030-24265-7_22

  • DOI: https://doi.org/10.1007/978-3-030-24265-7_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-24264-0

  • Online ISBN: 978-3-030-24265-7

  • eBook Packages: Computer Science, Computer Science (R0)
