A New Algorithm for Fast Discovery of Maximal Sequential Patterns in a Document Collection

García-Hernández, René Arnulfo; Martínez-Trinidad, José Francisco; Carrasco-Ochoa, Jesús Ariel

doi:10.1007/11671299_53

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3878)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Sequential pattern mining is an important tool for many data mining tasks and has broad applications. However, few efforts have been made to extract this kind of pattern from textual databases. Finding such patterns matters for text mining because they can be extracted independently of the language of the text. They are also human-readable descriptors of the text that preserve the sequential order of the words in the document. However, discovering sequential patterns in a document database has special characteristics that make it intractable for most apriori-like candidate-generation-and-test approaches. Recent studies indicate that the pattern-growth methodology can speed up sequential pattern mining. In this paper we propose a pattern-growth-based algorithm (DIMASP) to discover all the maximal sequential patterns in a document database. Furthermore, DIMASP is incremental and independent of the support threshold. Finally, we compare the performance of DIMASP against the GSP, DELISP, GenPrefixSpan and cSPADE algorithms.
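
To make the notion of a maximal sequential pattern concrete, the following is a minimal brute-force sketch. It is not DIMASP and does not use pattern growth. It assumes, for illustration only, that a pattern is a contiguous sequence of words, that its support is the number of documents containing it, and that a frequent pattern is maximal when it is not contained in any other frequent pattern. The function names (`frequent_ngrams`, `is_contiguous_subsequence`, `maximal_patterns`) and the toy documents are hypothetical.

```python
from collections import defaultdict


def frequent_ngrams(docs, min_support):
    """Map each contiguous word sequence to the set of documents containing it,
    keeping only sequences found in at least `min_support` documents."""
    occurrences = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        # Enumerate every contiguous word sequence (naive, quadratic per document).
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                occurrences[tuple(words[start:end])].add(doc_id)
    return {p: ids for p, ids in occurrences.items() if len(ids) >= min_support}


def is_contiguous_subsequence(short, long):
    """Return True if `short` appears as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_patterns(docs, min_support):
    """Return the frequent patterns that are not contained in another frequent pattern."""
    frequent = frequent_ngrams(docs, min_support)
    maximal = [
        p for p in frequent
        if not any(p != q and is_contiguous_subsequence(p, q) for q in frequent)
    ]
    return sorted(maximal)


# Toy document collection (assumed data, not from the paper).
docs = [
    "the cat sat on the mat",
    "a cat sat on the sofa",
    "the dog sat on the mat",
]
patterns = maximal_patterns(docs, min_support=2)
for pattern in patterns:
    print(" ".join(pattern))
```

With a support threshold of 2, the sketch reports "cat sat on the" and "sat on the mat". Shorter frequent sequences such as "sat on" are dropped because they are contained in a longer frequent one. Enumerating every candidate sequence this way grows quickly with document length, which illustrates why exhaustive candidate generation becomes impractical on real document collections.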

Author information

Authors and Affiliations

National Institute of Astrophysics, Optics and Electronics (INAOE), Puebla, México
René Arnulfo García-Hernández, José Francisco Martínez-Trinidad & Jesús Ariel Carrasco-Ochoa

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

García-Hernández, R.A., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A. (2006). A New Algorithm for Fast Discovery of Maximal Sequential Patterns in a Document Collection. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_53

DOI: https://doi.org/10.1007/11671299_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer Science (R0)
