Abstract
The top-\(k\) query is an essential operator for data analysis over string collections. When the data are uncertain, however, efficient query processing over large-scale uncertain strings calls for new parallel algorithms. In this paper, we propose MUSK, a MapReduce-based parallel algorithm for answering top-\(k\) queries over large-scale uncertain strings. We use probabilistic \(n\)-grams to generate key-value pairs. To improve performance, we derive a novel lower bound on the expected edit distance, based on a newly defined function called the gram mapping distance, and use it to prune strings. Integrating this bound with TA strengthens filtering in the Map stage and thereby reduces the transmission cost. Comprehensive experiments on both real-world and synthetic datasets show that MUSK outperforms the baseline approach, with speedups of up to 6 times in the best case, and scales well to large datasets.
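To make the query semantics concrete, the following is a minimal, non-parallel sketch of a top-\(k\) query ranked by expected edit distance. It is not the MUSK algorithm: it enumerates every possible world by brute force and uses no \(n\)-grams, lower bound, or MapReduce stage. The character-level uncertainty model (one independent character distribution per position) and the names `edit_distance`, `expected_edit_distance`, and `top_k` are assumptions made for illustration only.

```python
import heapq
from itertools import product


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def expected_edit_distance(query, uncertain):
    """Expected edit distance between a query and an uncertain string.

    Assumed model: `uncertain` is a list with one dict per position,
    mapping each candidate character to its probability; positions are
    independent. Every possible world is enumerated, so the cost grows
    exponentially with the number of uncertain positions.
    """
    total = 0.0
    for world in product(*(pos.items() for pos in uncertain)):
        instance = "".join(ch for ch, _ in world)
        prob = 1.0
        for _, p in world:
            prob *= p
        total += prob * edit_distance(query, instance)
    return total


def top_k(query, collection, k):
    """Return the k (expected distance, id) pairs with the smallest distance."""
    scored = ((expected_edit_distance(query, u), sid)
              for sid, u in collection.items())
    return heapq.nsmallest(k, scored)


# Hypothetical toy collection of three uncertain strings.
collection = {
    "s1": [{"c": 1.0}, {"a": 0.5, "u": 0.5}, {"t": 1.0}],
    "s2": [{"d": 1.0}, {"o": 1.0}, {"g": 1.0}],
    "s3": [{"c": 1.0}, {"a": 1.0}, {"r": 0.8, "t": 0.2}],
}
result = top_k("cat", collection, 2)
```

With the query `"cat"`, the two strings returned are `s1` and `s3`, in that order. The exponential cost of this enumeration is the kind of overhead that a pruning bound is meant to avoid, since a candidate whose lower bound already exceeds the current \(k\)-th best distance never needs exact evaluation.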
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Xu, H., Ding, X., Jin, H., Jiang, W. (2015). Parallel Top-k Query Processing on Uncertain Strings Using MapReduce. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_6
DOI: https://doi.org/10.1007/978-3-319-18123-3_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science, Computer Science (R0)