Abstract
The categorization and novelty mining of chronologically ordered documents is an important data mining problem. This paper addresses the entire process of Chinese novelty mining, from preprocessing and categorization to the detection of novel information, a process that has rarely been studied. First, we discuss and compare preprocessing techniques for detecting novel Chinese text. Next, we compare categorization and novelty mining performance on English and Chinese sentences, and discuss novelty mining performance when it is based on the retrieval results. We also propose new novelty mining evaluation measures: Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity, the last of which measures how sensitive the novelty mining system is to incorrectly classified sentences. The results indicate that sentence-level novelty mining performance in Chinese is similar to that in English if the sentences are perfectly categorized. With the new evaluation measures, we can more fairly assess how the performance of novelty mining is influenced by the retrieval results.
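To make the setting concrete, the following is a minimal sketch of sentence-level novelty mining over a chronologically ordered stream. It is an illustration under stated assumptions, not the method of this paper: it assumes the sentences are already tokenized (for Chinese, already word-segmented), uses a simple cosine-similarity threshold against all earlier sentences, and scores the output with the standard precision, recall, and F score over the novel class. The abstract does not give the definitions of Novelty-Precision, Novelty-Recall, Novelty-F Score, or Sensitivity, so they are not reproduced here. The function names and the threshold value of 0.5 are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_novel(sentences, threshold=0.5):
    """Flag each sentence as novel if it is not too similar to any earlier one.

    `sentences` is a chronologically ordered list of token lists.
    The threshold is an assumed illustrative value, not taken from the paper.
    """
    history = []
    flags = []
    for tokens in sentences:
        vec = Counter(tokens)
        max_sim = max((cosine(vec, h) for h in history), default=0.0)
        flags.append(max_sim < threshold)
        history.append(vec)
    return flags


def precision_recall_f(predicted, actual):
    """Standard precision, recall, and F score over the novel class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy stream: the second sentence repeats the first, the third is new.
stream = [["stock", "rises"], ["stock", "rises"], ["new", "factory", "opens"]]
flags = detect_novel(stream)
```

In this sketch, a sentence that was wrongly retrieved as relevant still enters the history and is still scored, which is the kind of interaction between retrieval and novelty mining that the proposed measures are intended to assess.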
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tsai, F.S., Zhang, Y. (2011). Chinese Categorization and Novelty Mining. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20847-8_24
DOI: https://doi.org/10.1007/978-3-642-20847-8_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20846-1
Online ISBN: 978-3-642-20847-8
eBook Packages: Computer Science, Computer Science (R0)