Abstract
Log-linear models are widely used in text mining tasks because they can incorporate a large number of possibly correlated features. In text mining, such features are typically generated by taking conjunctions of simpler features, and they are used with log-linear models to estimate robust conditional distributions. To avoid constructing feature conjunctions manually, we propose a new algorithmic framework called F-tree for automatically generating and storing conjunctions of features in text mining tasks. This compact graph-based data structure allows fast one-vs-all matching of features in the feature space, which is crucial for many text mining tasks. Based on this hierarchical data structure, we propose a systematic method for removing redundant features to further reduce memory usage and improve performance. We conduct large-scale experiments on three publicly available datasets and show that this automatic method matches the state-of-the-art performance achieved with manually constructed features.
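To make the idea of storing feature conjunctions in a shared structure and matching all of them against one observation more concrete, here is a minimal Python sketch. It is not the F-tree itself, whose construction the abstract does not describe; an ordinary prefix trie is used as a stand-in. The names ConjunctionTrie, TrieNode, add and match, and the feature strings such as "w=bank", are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Tuple


class TrieNode:
    """One node per atomic feature; conjunctions with a common prefix share nodes."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_conjunction = False  # True if a stored conjunction ends here


class ConjunctionTrie:
    """Illustrative stand-in (assumption): a prefix trie, not the paper's F-tree."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, conjunction: Iterable[str]) -> None:
        # Sort the atoms so each conjunction has exactly one canonical path.
        node = self.root
        for atom in sorted(set(conjunction)):
            node = node.children.setdefault(atom, TrieNode())
        node.is_conjunction = True

    def match(self, observed: Iterable[str]) -> List[Tuple[str, ...]]:
        """Return every stored conjunction whose atoms all occur in `observed`."""
        atoms = sorted(set(observed))
        found: List[Tuple[str, ...]] = []

        def walk(node: TrieNode, start: int, path: List[str]) -> None:
            if node.is_conjunction:
                found.append(tuple(path))
            # Only follow edges labelled with observed atoms, in sorted order.
            for i in range(start, len(atoms)):
                child = node.children.get(atoms[i])
                if child is not None:
                    walk(child, i + 1, path + [atoms[i]])

        walk(self.root, 0, [])
        return found


trie = ConjunctionTrie()
trie.add(["w=bank", "pos=NN"])
trie.add(["w=bank"])
trie.add(["w=river", "pos=NN"])  # shares the "pos=NN" node with the first entry

# One observation is matched against all stored conjunctions in a single walk.
print(trie.match(["w=bank", "pos=NN", "cap=no"]))
```

In this sketch, sorting the atoms gives each conjunction a single path, so conjunctions with a common prefix are stored once, and a single traversal retrieves every stored conjunction that is satisfied by the observed feature set.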
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chan, SK., Lam, W. (2009). An Efficient Method for Generating, Storing and Matching Features for Text Mining. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_11
DOI: https://doi.org/10.1007/978-3-642-01307-2_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science (R0)