Skip to main content

An Efficient Method for Generating, Storing and Matching Features for Text Mining

  • Conference paper
Book cover Advances in Knowledge Discovery and Data Mining (PAKDD 2009)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 5476))

Included in the following conference series:

  • 3136 Accesses

Abstract

Log-linear models have been widely used in text mining tasks because it can incorporate a large number of possibly correlated features. In text mining, these possibly correlated features are generated by conjunction of features. They are usually used with log-linear models to estimate robust conditional distributions. To avoid manual construction of conjunction of features, we propose a new algorithmic framework called F-tree for automatically generating and storing conjunctions of features in text mining tasks. This compact graph-based data structure allows fast one-vs-all matching of features in the feature space which is crucial for many text mining tasks. Based on this hierarchical data structure, we propose a systematic method for removing redundant features to further reduce memory usage and improve performance. We do large-scale experiments on three publicly-available datasets and show that this automatic method can get state-of-the-art performance achieved by manual construction of features.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 129.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Settles, B.: Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), pp. 104–107 (2004)

    Google Scholar 

  2. Kudo, T.: CRF++: Yet another CRF toolkit (2006), http://chasen.org/~taku/software/CRF++

  3. Phan, X.H., Nguyen, L.M., Nguyen, C.T.: Flexcrfs: Flexible conditional random field toolkit (2005), http://flexcrfs.sourceforge.net

  4. McCallum, A.K.: Mallet: A machine learning for language toolkit (2002), http://mallet.cs.umass.edu

  5. Sha, F., Pereira, F.: Shallow parsing with conditional random fields. In: NAACL 2003: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pp. 134–141 (2003)

    Google Scholar 

  6. McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Morristown, NJ, USA, pp. 188–191. Association for Computational Linguistics (2003)

    Google Scholar 

  7. Vapnik, V.N.: The nature of statistical learning theory. Springer, New York (1995)

    Book  MATH  Google Scholar 

  8. Xu, L.: A trend on regularization and model selection in statistical learning: A bayesian ying yang learning perspective. Challenges for Computational Intelligence, 365–406 (2007)

    Google Scholar 

  9. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. 18th International Conf. on Machine Learning, pp. 282–289 (2001)

    Google Scholar 

  10. Kudo, T., Matsumoto, Y.: Chunking with support vector machines. In: NAACL 2001: Second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies 2001, Morristown, NJ, USA, pp. 1–8. Association for Computational Linguistics (2001)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chan, SK., Lam, W. (2009). An Efficient Method for Generating, Storing and Matching Features for Text Mining. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science(), vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-01307-2_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01306-5

  • Online ISBN: 978-3-642-01307-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics