A text representation model using Sequential Pattern-Growth method

Alias, Suraya; Mohammad, Siti Khaotijah; Hoon, Gan Keng; Ping, Tan Tien

doi:10.1007/s10044-017-0624-9

Short paper
Published: 01 June 2017

Volume 21, pages 233–247, (2018)
Pattern Analysis and Applications

Suraya Alias¹,
Siti Khaotijah Mohammad²,
Gan Keng Hoon² &
…
Tan Tien Ping²

Abstract

Text representation is an essential task that transforms input text into features that can later be used for Text Mining and Information Retrieval tasks. The commonly used text representation models are the Bag-of-Words (BOW) and N-gram models. Nevertheless, two known issues of these models should be investigated: inaccurate semantic representation of text and the high dimensionality caused by word combinations. A pattern-based model named Frequent Adjacent Sequential Pattern (FASP) is introduced to represent text using a set of sequences of adjacent words that are frequently used across the document collection. The purpose of this study is to discover similarities in textual patterns between documents, which can later be converted into a set of rules that describe the main news event. FASP is based on the Pattern-Growth divide-and-conquer strategy; the main difference between FASP and the prior technique lies in the Pattern Generation phase. The approach is tested against the BOW and N-gram text representation models on Malay and English news datasets with different term weightings in the Vector Space Model (VSM). The findings demonstrate that the FASP model performs promisingly in finding similarities between documents, with an average vector size reduction of 34% against BOW and 77% against the N-gram model on the Malay dataset. Results on the English dataset are consistent, indicating that the FASP approach is also language independent.
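
As a rough illustration of the idea described in the abstract, the Python sketch below mines word sequences that are adjacent in the text and that occur in at least a minimum number of documents. It extends each frequent prefix one word at a time, in the spirit of pattern-growth. It is not the authors' FASP algorithm, whose Pattern Generation phase is not detailed here. The function name `frequent_adjacent_sequences`, the parameters `min_support` and `max_len`, the whitespace tokenizer, and the three sample sentences are all assumptions made for this example.

```python
from collections import defaultdict


def frequent_adjacent_sequences(docs, min_support=2, max_len=3):
    """Return {word tuple: document support} for contiguous word sequences
    that occur in at least `min_support` documents (illustrative only)."""
    # Assumption: lowercase whitespace tokenization is good enough for a demo.
    tokenized = [doc.lower().split() for doc in docs]
    patterns = {}

    def grow(prefix, occurrences):
        # occurrences: (doc_id, index of the word right after the prefix)
        if len(prefix) == max_len:
            return
        # Project the occurrences onto the next adjacent word.
        projected = defaultdict(list)
        for doc_id, pos in occurrences:
            if pos < len(tokenized[doc_id]):
                word = tokenized[doc_id][pos]
                projected[word].append((doc_id, pos + 1))
        for word, occs in projected.items():
            # Support counts distinct documents, not raw occurrences.
            support = len({doc_id for doc_id, _ in occs})
            if support >= min_support:
                pattern = prefix + (word,)
                patterns[pattern] = support
                # Only frequent prefixes are extended further.
                grow(pattern, occs)

    # The empty prefix "occurs" before every word position.
    start = [(d, i) for d, toks in enumerate(tokenized) for i in range(len(toks))]
    grow((), start)
    return patterns


# Hypothetical mini-collection of news-like sentences.
docs = [
    "flood hits east coast",
    "flood hits city centre",
    "heavy rain hits east coast",
]
patterns = frequent_adjacent_sequences(docs, min_support=2, max_len=3)
for pattern, support in sorted(patterns.items(), key=lambda kv: (-len(kv[0]), kv[0])):
    print(" ".join(pattern), support)
```

In this toy collection the sequence "hits east coast" is kept because it appears in two documents, while "flood hits east" is dropped because it appears in only one. Each document could then be represented in the VSM by the frequent sequences it contains instead of by every word or every N-gram.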

Acknowledgement

This work is supported by Universiti Sains Malaysia (USM) through the Research University (RU) Grant, project number 1001/PKOMP/811295.

Author information

Authors and Affiliations

Faculty of Computing and Informatics, UMS, 88400, Kota Kinabalu, Sabah, Malaysia
Suraya Alias
School of Computer Sciences, USM, 11800, Gelugor, Penang, Malaysia
Siti Khaotijah Mohammad, Gan Keng Hoon & Tan Tien Ping

Corresponding author

Correspondence to Suraya Alias.

About this article

Cite this article

Alias, S., Mohammad, S.K., Hoon, G.K. et al. A text representation model using Sequential Pattern-Growth method. Pattern Anal Applic 21, 233–247 (2018). https://doi.org/10.1007/s10044-017-0624-9

Received: 14 September 2015
Accepted: 24 May 2017
Published: 01 June 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s10044-017-0624-9
