Parallel Top-k Query Processing on Uncertain Strings Using MapReduce

Xu, Hui; Ding, Xiaofeng; Jin, Hai; Jiang, Wenbin

doi:10.1007/978-3-319-18123-3_6


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9050)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications


Abstract

Top-\(k\) querying is an essential operator for data analysis over string collections. When the data are uncertain, however, efficient query processing over large-scale uncertain strings calls for new parallel algorithms. In this paper, we propose MUSK, a MapReduce-based parallel algorithm for answering top-\(k\) queries over large-scale uncertain strings. We use probabilistic \(n\)-grams to generate key-value pairs. To improve performance, we derive a novel lower bound on the expected edit distance, based on a newly defined function called gram mapping distance, and use it to prune strings. Integrating this bound with the threshold algorithm (TA) strengthens filtering in the Map stage and thereby reduces the transmission cost. Comprehensive experiments on both real-world and synthetic datasets show that MUSK outperforms the baseline approach, with speedups of up to 6 times in the best case, and indicate good scalability on large datasets.
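To make the ranking criterion concrete, the following is a minimal sketch of top-\(k\) selection by expected edit distance. It is not the MUSK algorithm: it assumes a character-level uncertainty model (each position holds an independent distribution over characters), enumerates every possible world by brute force, and uses none of the pruning, \(n\)-gram, or MapReduce machinery the abstract describes. The names `possible_worlds`, `expected_edit_distance`, and `top_k` are hypothetical and chosen for illustration only.

```python
from itertools import product
import heapq


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def possible_worlds(uncertain):
    """Yield (string, probability) for every instantiation.

    ASSUMPTION: `uncertain` is a list of positions, each a dict mapping
    a character to its probability, with positions independent.
    """
    for combo in product(*(pos.items() for pos in uncertain)):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield text, prob


def expected_edit_distance(query, uncertain):
    """Probability-weighted average edit distance over all possible worlds."""
    return sum(p * edit_distance(query, s) for s, p in possible_worlds(uncertain))


def top_k(query, collection, k):
    """Return the k (score, name) pairs with the smallest expected edit distance."""
    scored = ((expected_edit_distance(query, u), name)
              for name, u in collection.items())
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    collection = {
        "s1": [{"c": 1.0}, {"a": 0.5, "u": 0.5}, {"t": 1.0}],  # "cat" or "cut"
        "s2": [{"d": 1.0}, {"o": 1.0}, {"g": 1.0}],            # always "dog"
    }
    print(top_k("cat", collection, 1))
```

The enumeration is exponential in the number of uncertain positions, which is why a cheap lower bound that rules out strings before their exact score is computed matters at scale.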



Author information

Authors and Affiliations

Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Hui Xu, Xiaofeng Ding, Hai Jin & Wenbin Jiang


Corresponding author

Correspondence to Xiaofeng Ding.

Editor information

Editors and Affiliations

Universität München, München, Germany
Matthias Renz
University of Southern California, Los Angeles, USA
Cyrus Shahabi
University of Queensland, Brisbane, Australia
Xiaofang Zhou
Monash University, Clayton, Australia
Muhammad Aamir Cheema


About this paper

Cite this paper

Xu, H., Ding, X., Jin, H., Jiang, W. (2015). Parallel Top-k Query Processing on Uncertain Strings Using MapReduce. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_6


DOI: https://doi.org/10.1007/978-3-319-18123-3_6
Published: 09 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science, Computer Science (R0)
