Parallel Top-k Query Processing on Uncertain Strings Using MapReduce

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9050)

Abstract

The top-\(k\) query is an essential operator for data analysis over string collections. When the data are uncertain, however, efficient query processing over large-scale uncertain strings calls for new parallel algorithms. In this paper, we propose MUSK, a MapReduce-based parallel algorithm for answering top-\(k\) queries over large-scale uncertain strings. MUSK uses probabilistic \(n\)-grams to generate key-value pairs. To improve performance, we derive a novel lower bound on the expected edit distance, based on a newly defined function called the gram mapping distance, and use it to prune strings. Integrating this bound with the threshold algorithm (TA) strengthens filtering in the Map stage and thereby reduces the transmission cost. Comprehensive experiments on both real-world and synthetic datasets show that MUSK outperforms the baseline approach, with speedups of up to 6 times in the best case, and indicate good scalability on large datasets.
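The abstract names the ranking criterion (expected edit distance) but gives no definitions, so the following is only a minimal brute-force sketch of what a top-\(k\) query over uncertain strings could compute. It assumes a character-level uncertainty model with independent positions, which the abstract does not confirm. All function names and the toy data are hypothetical. It is not MUSK: there are no probabilistic \(n\)-grams, no gram mapping distance bound, no TA-style pruning, and no MapReduce.

```python
import heapq
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def expected_edit_distance(uncertain, query: str) -> float:
    """Expectation of the edit distance over all possible worlds.

    ASSUMPTION: `uncertain` is a list of positions, each a dict mapping a
    character to its probability, with positions independent of each other.
    The enumeration is exponential in the number of uncertain positions.
    """
    total = 0.0
    for world in product(*(pos.items() for pos in uncertain)):
        prob = 1.0
        chars = []
        for ch, p in world:
            prob *= p
            chars.append(ch)
        total += prob * edit_distance("".join(chars), query)
    return total


def top_k(collection, query: str, k: int):
    """Return the k (expected distance, id) pairs with the smallest distance."""
    scored = ((expected_edit_distance(s, query), sid)
              for sid, s in collection.items())
    return heapq.nsmallest(k, scored)


# Hypothetical toy collection: three uncertain strings of length 3.
collection = {
    "s1": [{"c": 1.0}, {"a": 0.5, "u": 0.5}, {"t": 1.0}],
    "s2": [{"d": 1.0}, {"o": 1.0}, {"g": 1.0}],
    "s3": [{"c": 0.9, "b": 0.1}, {"a": 1.0}, {"t": 1.0}],
}
result = top_k(collection, "cat", 2)  # s3 ranks first, then s1
```

Because the enumeration grows exponentially with the number of uncertain positions, a baseline of this kind is only practical for tiny inputs, which is the cost that lower-bound pruning of the sort the abstract describes is meant to avoid.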

Author information

Corresponding author

Correspondence to Xiaofeng Ding.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Xu, H., Ding, X., Jin, H., Jiang, W. (2015). Parallel Top-k Query Processing on Uncertain Strings Using MapReduce. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_6

  • DOI: https://doi.org/10.1007/978-3-319-18123-3_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18122-6

  • Online ISBN: 978-3-319-18123-3

  • eBook Packages: Computer Science, Computer Science (R0)
