Abstract
String similarity join is widely used in many fields, e.g. data cleaning, web search, pattern recognition and DNA sequence matching. In recent years, many similarity join methods have been proposed, such as Pass-Join, Ed-Join and Trie-Join; among these, the edit-distance-based Pass-Join algorithm achieves much better overall performance than the others. However, Pass-Join cannot effectively filter out candidate pairs that are only partially similar. This paper proposes a novel algorithm called GFSF, which introduces two additional filtering steps based on character frequency vectors. In this way, the number of candidate pairs that are only partially similar is greatly reduced, which in turn greatly reduces the total time of the string similarity join. Experimental results show that the overall performance of the proposed method is better than that of Pass-Join.
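The abstract does not spell out the two GFSF filtering steps, so the following is only a minimal sketch of the general idea of a character frequency filter, not the authors' algorithm. It relies on a standard bound: a single edit operation can remove at most one surplus character from either string, so the larger of the two surplus counts is a lower bound on the edit distance. The function names (`frequency_lower_bound`, `edit_distance`, `similar_pairs`) are hypothetical, and the brute-force pair loop is an assumption that stands in for the candidate generation a partition-based method such as Pass-Join would perform.

```python
from collections import Counter


def frequency_lower_bound(s: str, t: str) -> int:
    """Lower bound on the edit distance, computed from character counts only."""
    cs, ct = Counter(s), Counter(t)
    # Counter subtraction keeps only positive counts, i.e. the surplus characters.
    surplus_s = sum((cs - ct).values())  # characters s has in excess of t
    surplus_t = sum((ct - cs).values())  # characters t has in excess of s
    # Each edit operation lowers either surplus by at most one.
    return max(surplus_s, surplus_t)


def edit_distance(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (the expensive verification)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def similar_pairs(strings, tau):
    """Return all pairs with edit distance <= tau (hypothetical brute-force driver)."""
    result = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            # Cheap filter first: prune pairs the bound already rules out.
            if frequency_lower_bound(s, t) > tau:
                continue
            # Only surviving candidates pay for full verification.
            if edit_distance(s, t) <= tau:
                result.append((s, t))
    return result


if __name__ == "__main__":
    print(similar_pairs(["abcd", "abce", "dcba", "wxyz"], tau=1))
```

In this toy input, "wxyz" is pruned by the frequency bound without any edit-distance computation, while "abcd" and "dcba" share exactly the same characters, pass the filter, and are rejected only at verification. A frequency filter is therefore a pruning aid and does not replace verification.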
Supported by the Natural Science Foundation of China (61303004), the National Key Technology Support Program (2015BAH16F00/F01) and the Key Technology Program of Xiamen City (3502Z20151016).
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Lin, Z., Luo, D., Lai, Y. (2016). GFSF: A Novel Similarity Join Method Based on Frequency Vector. In: Cui, B., Zhang, N., Xu, J., Lian, X., Liu, D. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9659. Springer, Cham. https://doi.org/10.1007/978-3-319-39958-4_40
DOI: https://doi.org/10.1007/978-3-319-39958-4_40
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39957-7
Online ISBN: 978-3-319-39958-4
eBook Packages: Computer Science, Computer Science (R0)