Methods for the Efficient Discovery of Large Item-Indexable Sequential Patterns

Henriques, Rui; Antunes, Cláudia; Madeira, Sara C.

doi:10.1007/978-3-319-08407-7_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8399)

Included in the following conference series:

International Workshop on New Frontiers in Mining Complex Patterns

Abstract

An increasingly relevant set of tasks, such as the discovery of biclusters with order-preserving properties, can be mapped to a sequential pattern mining problem on data with item-indexable properties. An item-indexable database, typically observed in biomedical domains, does not allow item repetitions per sequence and is commonly dense. Although multiple methods have been proposed for the efficient discovery of sequential patterns, their performance rapidly degrades over item-indexable databases. The target tasks for these databases benefit from lengthy patterns and tolerate local mismatches. However, existing methods that rely on noise relaxations to increase the typically short length of sequential patterns scale poorly, aggravating an already critical efficiency problem. In this work, we first propose a new sequential pattern mining method, IndexSpan, which mines sequential patterns over item-indexable databases with heightened efficiency. Second, we propose a pattern-merging procedure, MergeIndexBic, to efficiently discover lengthy noise-tolerant sequential patterns. The superior performance of IndexSpan and MergeIndexBic against competitive alternatives is demonstrated on both synthetic and real datasets.
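The item-indexable property described in the abstract (no item repeats within a sequence) can be illustrated with a small sketch. It is not the authors' IndexSpan algorithm; it only shows why the property is convenient: each item has at most one position per sequence, so a position lookup table makes the support check for a candidate pattern linear in the pattern length. The function names `build_index`, `supports`, and `support`, the toy data, and the simplification that each sequence is a list of single items (no co-occurring itemsets) are assumptions made for this illustration.

```python
from typing import Dict, List, Sequence


def build_index(db: List[Sequence[str]]) -> List[Dict[str, int]]:
    """Map every item to its single position in each sequence.

    Raises ValueError if an item repeats within a sequence, since the
    database would then not be item-indexable.
    """
    index = []
    for seq in db:
        positions: Dict[str, int] = {}
        for pos, item in enumerate(seq):
            if item in positions:
                raise ValueError(f"item {item!r} repeats; not item-indexable")
            positions[item] = pos
        index.append(positions)
    return index


def supports(positions: Dict[str, int], pattern: Sequence[str]) -> bool:
    """Check whether one indexed sequence contains the pattern in order."""
    last = -1
    for item in pattern:
        pos = positions.get(item)
        # The item must exist and appear strictly after the previous one.
        if pos is None or pos <= last:
            return False
        last = pos
    return True


def support(index: List[Dict[str, int]], pattern: Sequence[str]) -> int:
    """Count the sequences that contain the pattern as a subsequence."""
    return sum(supports(positions, pattern) for positions in index)


# Hypothetical toy database: three sequences over the items a, b, c, d.
toy_db = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]
toy_index = build_index(toy_db)
```

With this index, `support(toy_index, ["a", "c", "d"])` counts how many of the toy sequences place a before c before d. In a general sequence database, where items may repeat, a single position per item would not be enough and the check would need to scan or track several occurrences.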

Notes

1. Detailed description of tasks available at http://web.ist.utl.pt/rmch/software/indexspan.
2. Software and datasets available at http://web.ist.utl.pt/rmch/software/indexspan/.
3. Implementation from SPMF: http://www.philippe-fournier-viger.com/spmf/.
4. Implementation from SPMF: http://www.philippe-fournier-viger.com/spmf/.
5. http://www.upo.es/eps/bigs/datasets.html and http://www.bioinf.jku.at/software/fabia/gene_expression.html

References

1. Agrawal, R., Srikant, R.: Mining sequential patterns. In: ICDE, pp. 3–14. IEEE CS, Washington (1995)
2. Antunes, C., Oliveira, A.L.: Mining patterns using relaxations of user defined constraints. In: Knowledge Discovery in Inductive Databases (2004)
3. Ayres, J., Flannick, J., Gehrke, J., Yiu, T.: Sequential pattern mining using a bitmap representation. In: KDD, pp. 429–435. ACM, New York (2002)
4. Bayardo, R.J.: Efficiently mining long patterns from databases. SIGMOD Rec. 27(2), 85–93 (1998)
5. Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z.: Discovering local structure in gene expression data: the order-preserving submatrix problem. In: RECOMB, pp. 49–57. ACM, New York (2002)
6. Cheng, H., Yu, P.S., Han, J.: Approximate frequent itemset mining in the presence of random noise. In: Maimon, O., Rokach, L. (eds.) Soft Computing for Knowledge Discovery and Data Mining, pp. 363–389. Springer, New York (2008)
7. Chiu, D.Y., Wu, Y.H., Chen, A.L.P.: An efficient algorithm for mining frequent sequences by a new strategy without support counting. In: ICDE, p. 375. IEEE CS, Washington (2004)
8. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov. 15(1), 55–86 (2007)
9. Han, J., Yang, Q., Kim, E.: Plan mining by divide-and-conquer. In: ACM SIGMOD IW on Research Issues in DMKD (1999)
10. Henriques, R., Madeira, S., Antunes, C.: IndexSpan: efficient discovery of item-indexable sequential patterns. In: ECML/PKDD IW on New Frontiers in Mining Complex Patterns (2013)
11. Kumar, P., Krishna, P., Raju, S.: Pattern Discovery Using Sequence Data Mining: Applications and Studies. IGI Global, Hershey (2011)
12. Lin, D.-I., Kedem, Z.M.: Pincer search: a new algorithm for discovering the maximum frequent set. In: Schek, H.-J., Saltor, F., Ramos, I., Alonso, G. (eds.) EDBT 1998. LNCS, vol. 1377, pp. 105–119. Springer, Heidelberg (1998)
13. Liu, J., Wang, W.: Op-cluster: clustering by tendency in high dimensional space. In: ICDM, p. 187. IEEE CS, Washington (2003)
14. Liu, J., Yang, J., Wang, W.: Biclustering in gene expression data by tendency. In: IEEE Computational Systems Bioinformatics Conference, pp. 182–193. IEEE (2004)
15. Mabroukeh, N.R., Ezeife, C.I.: A taxonomy of sequential pattern mining algorithms. ACM Comput. Surv. 43(1), 3:1–3:41 (2010)
16. Martin, D., Brun, C., Remy, E., Mouren, P., Thieffry, D., Jacq, B.: Gotoolbox: functional analysis of gene datasets based on gene ontology. Genome Biol. 5(12), 101 (2004)
17. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C.: Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans. Knowl. Data Eng. 16(11), 1424–1440 (2004)
18. Pei, J., Han, J., Mortazavi-Asl, B., Zhu, H.: Mining access patterns efficiently from web logs. In: Terano, T., Liu, H., Chen, A.L.P. (eds.) PAKDD 2000. LNCS, vol. 1805, pp. 396–407. Springer, Heidelberg (2000)
19. Raedt, L.D., Guns, T., Nijssen, S.: Constraint programming for data mining and machine learning. In: AAAI. AAAI Press (2010)
20. Salvemini, E., Fumarola, F., Malerba, D., Han, J.: FAST sequence mining based on sparse Id-lists. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds.) ISMIS 2011. LNCS, vol. 6804, pp. 316–325. Springer, Heidelberg (2011)
21. Serin, A., Vingron, M.: Debi: discovering differentially expressed biclusters using a frequent itemset approach. Algorithms Mol. Biol. 6, 1–12 (2011)
22. Srikant, R., Agrawal, R.: Mining sequential patterns: generalizations and performance improvements. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057, pp. 3–17. Springer, Heidelberg (1996)
23. Wei, Y.-q., Liu, D., Duan, L.-s.: Distributed PrefixSpan algorithm based on MapReduce. In: Information Technology in Medicine and Education, vol. 2, pp. 901–904 (2012)
24. Yan, X., Han, J., Afshar, R.: CloSpan: mining closed sequential patterns in large datasets. In: SDM, pp. 166–177 (2003)
25. Yang, J., Wang, W., Yu, P.S., Han, J.: Mining long sequential patterns in a noisy environment. In: SIGMOD, pp. 406–417. ACM, New York (2002)
26. Yang, Z., Wang, Y., Kitsuregawa, M.: LAPIN: effective sequential pattern mining algorithms by last position induction for dense databases. In: Kotagiri, R., Radha Krishna, P., Mohania, M., Nantajeewarawat, E. (eds.) DASFAA 2007. LNCS, vol. 4443, pp. 1020–1023. Springer, Heidelberg (2007)
27. Zaki, M.J.: SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. 42(1–2), 31–60 (2001)
28. Zheng, Y., Zhang, L., Xie, X., Ma, W.Y.: Mining interesting locations and travel sequences from GPS trajectories. In: WWW, pp. 791–800. ACM (2009)
29. Zhu, F., Yan, X., Han, J., Yu, P.S.: Mining Frequent Approximate Sequential Patterns. Chapman & Hall, London (2009)
30. Zhu, F., Yan, X., Han, J., Yu, P., Cheng, H.: Mining colossal frequent patterns by core pattern fusion. In: ICDE, pp. 706–715 (2007)

Acknowledgments

This is an extension of previous work [10] supported by Fundação para a Ciência e Tecnologia under the project D2PM (PTDC/EIA-EIA/110074/2009), project Neuroclinomics (PTDC/EIA-EIA/111239/2009), and PhD grant SFRH/BD/75924/2011.

Author information

Authors and Affiliations

KDBio, Inesc-ID, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
Rui Henriques & Sara C. Madeira
Department of Computer Science and Engineering, IST, Universidade de Lisboa, Lisboa, Portugal
Rui Henriques, Cláudia Antunes & Sara C. Madeira

Corresponding author

Correspondence to Rui Henriques.

Editor information

Editors and Affiliations

Università degli Studi di Bari Aldo Moro, Bari, Italy
Annalisa Appice
Università degli Studi di Bari Aldo Moro, Bari, Italy
Michelangelo Ceci
Università degli Studi di Bari Aldo Moro, Bari, Italy
Corrado Loglisci
ICAR, CNR, Rende, Italy
Giuseppe Manco
Rende, Italy
Elio Masciari
Department of Computer Science, University of North Carolina, Charlotte, North Carolina, USA
Zbigniew W. Ras

About this paper

Cite this paper

Henriques, R., Antunes, C., Madeira, S.C. (2014). Methods for the Efficient Discovery of Large Item-Indexable Sequential Patterns. In: Appice, A., Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z. (eds) New Frontiers in Mining Complex Patterns. NFMCP 2013. Lecture Notes in Computer Science, vol 8399. Springer, Cham. https://doi.org/10.1007/978-3-319-08407-7_7

DOI: https://doi.org/10.1007/978-3-319-08407-7_7
Published: 06 July 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08406-0
Online ISBN: 978-3-319-08407-7
eBook Packages: Computer Science, Computer Science (R0)
