Big-SeqDB-Gen: A Formal and Scalable Approach for Parallel Generation of Big Synthetic Sequence Databases

  • Conference paper
Performance Evaluation and Benchmarking: Traditional to Big Data to Internet of Things (TPCTC 2015)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9508)

Abstract

The recognition that data has great economic value, together with significant hardware advances in low-cost data storage, high-speed networks and high-performance parallel computing, fosters new research directions on large-scale knowledge discovery from big sequence databases. Many applications involve sequence databases, such as customer shopping sequences, web clickstreams, and biological sequences. All of these applications face the big data problem. There is no doubt that fast mining of billions of sequences is a challenge. However, because big data sets are unavailable, it is not possible to assess knowledge discovery algorithms over big sequence databases. Because of privacy and security concerns, companies do not disclose their data. On the other hand, existing synthetic sequence generators are not up to the big data challenge.

In this paper, we first propose a formal and scalable approach for Parallel Generation of Big Synthetic Sequence Databases. Based on Whitney numbers, the underlying Parallel Sequence Generator (i) creates billions of distinct sequences in parallel and (ii) ensures that injected sequential patterns satisfy user-specified sequence characteristics. Second, we report a scalability and scale-out performance study of the Parallel Sequence Generator, for various sequence database sizes and various numbers of Sequence Generators in a shared-nothing cluster of nodes.
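The abstract does not spell out the Whitney-number construction, so the following is only a minimal sketch of the general idea behind point (i): independent generators can emit distinct sequences without coordinating if each one owns a disjoint slice of a global index space and decodes every index through a bijection. The mixed-radix decoding, the fixed sequence length, and the names `index_to_sequence`, `generator_range`, and `generate` are all assumptions made for illustration, not the paper's method.

```python
from typing import Iterator, Tuple


def index_to_sequence(index: int, alphabet_size: int, length: int) -> Tuple[int, ...]:
    """Decode an index into a sequence of item ids (hypothetical encoding).

    Writing the index in base `alphabet_size` is a bijection between
    [0, alphabet_size**length) and item sequences of the given length,
    so distinct indices always yield distinct sequences.
    """
    if not 0 <= index < alphabet_size ** length:
        raise ValueError("index out of range for this alphabet and length")
    items = []
    for _ in range(length):
        index, item = divmod(index, alphabet_size)
        items.append(item)
    return tuple(reversed(items))


def generator_range(total: int, workers: int, worker_id: int) -> range:
    """Return the contiguous, disjoint slice of [0, total) owned by one worker."""
    base, extra = divmod(total, workers)
    start = worker_id * base + min(worker_id, extra)
    stop = start + base + (1 if worker_id < extra else 0)
    return range(start, stop)


def generate(total: int, workers: int, worker_id: int,
             alphabet_size: int, length: int) -> Iterator[Tuple[int, ...]]:
    """Yield this worker's share of sequences; no shared state is needed."""
    for index in generator_range(total, workers, worker_id):
        yield index_to_sequence(index, alphabet_size, length)


if __name__ == "__main__":
    # Three simulated shared-nothing generators producing 10 sequences in total.
    for worker_id in range(3):
        print(worker_id, list(generate(10, 3, worker_id, alphabet_size=4, length=3)))
```

Because the slices are disjoint and the decoding is injective, the union of all workers' outputs contains no duplicates. This sketch does not cover point (ii), the injection of sequential patterns with user-specified characteristics.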

Notes

  1. URL: http://www.research.ibm.com/labs/almaden/index.shtml#assocSynData does not point to the benchmark homepage.

Acknowledgements

We acknowledge with thanks a VLDB travel fellowship.

Author information

Corresponding author

Correspondence to Rim Moussa.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Moussa, R. (2016). Big-SeqDB-Gen: A Formal and Scalable Approach for Parallel Generation of Big Synthetic Sequence Databases. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking: Traditional to Big Data to Internet of Things. TPCTC 2015. Lecture Notes in Computer Science, vol 9508. Springer, Cham. https://doi.org/10.1007/978-3-319-31409-9_5

  • DOI: https://doi.org/10.1007/978-3-319-31409-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31408-2

  • Online ISBN: 978-3-319-31409-9

  • eBook Packages: Computer Science, Computer Science (R0)