Chinese Categorization and Novelty Mining

Tsai, Flora S.; Zhang, Yi

doi:10.1007/978-3-642-20847-8_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6635)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

The categorization and novelty mining of chronologically ordered documents is an important data mining problem. This paper covers the entire process of Chinese novelty mining, which has rarely been studied, from preprocessing and categorization to the actual detection of novel information. First, preprocessing techniques for detecting novel Chinese text are discussed and compared. Next, we compare categorization and novelty mining performance on English and Chinese sentences, and also discuss novelty mining performance based on the retrieval results. Moreover, we propose new novelty mining evaluation measures: Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity, the last of which measures how sensitive the novelty mining system is to incorrectly classified sentences. The results indicate that Chinese novelty mining at the sentence level performs similarly to English when the sentences are perfectly categorized. With the new evaluation measures, we can more fairly assess how the performance of novelty mining is influenced by the retrieval results.
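
To make the setting concrete, the following is a minimal sketch of sentence-level novelty detection over chronologically ordered sentences. It is an illustration under stated assumptions, not the method of the paper: the cosine-similarity test, the threshold of 0.5, and the function names `cosine`, `detect_novel`, and `precision_recall_f` are all hypothetical choices. The abstract does not define Novelty-Precision, Novelty-Recall, Novelty-F Score, or Sensitivity, so the sketch scores the output with ordinary precision, recall, and F score on the "novel" class instead. The input is assumed to be already segmented into words, which for Chinese text requires a separate word segmentation step.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_novel(sentences, threshold=0.5):
    """Flag each tokenized sentence as novel (True) or not (False).

    Sentences are processed in chronological order. A sentence counts as
    novel when its highest similarity to any earlier sentence is below
    the threshold (an assumed value, not taken from the paper).
    """
    history = []
    flags = []
    for tokens in sentences:
        vec = Counter(tokens)
        max_sim = max((cosine(vec, h) for h in history), default=0.0)
        flags.append(max_sim < threshold)
        history.append(vec)
    return flags


def precision_recall_f(predicted, actual):
    """Ordinary precision, recall, and F score for the 'novel' class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    pred_pos = sum(predicted)
    act_pos = sum(actual)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / act_pos if act_pos else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Pre-segmented example sentences (illustrative data only).
sentences = [
    ["股市", "今天", "上涨"],
    ["股市", "今天", "上涨"],  # repeats the first sentence
    ["央行", "宣布", "降息"],
]
flags = detect_novel(sentences)  # [True, False, True]
```

Because each sentence is compared only against earlier ones, the order of the input matters: the same sentence is novel on its first appearance and redundant afterwards. A sentence that was wrongly retrieved as relevant would still enter `history` here, which is the kind of interaction between categorization and novelty mining that the abstract says its new measures are meant to assess.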

Author information

Authors and Affiliations

School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore
Flora S. Tsai & Yi Zhang

Editor information

Editors and Affiliations

Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, 518055, Shenzhen, China
Joshua Zhexue Huang
Faculty of Engineering and Information Technology, Center for Quantum Computation and Intelligent Systems, Data Sciences and Knowledge Discovery Lab, University of Technology Sydney, 2007, Sydney, NSW, Australia
Longbing Cao
Department of Computer Science and Engineering, University of Minnesota, 55455, Minneapolis, MN, USA
Jaideep Srivastava

Cite this paper

Tsai, F.S., Zhang, Y. (2011). Chinese Categorization and Novelty Mining. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20847-8_24

DOI: https://doi.org/10.1007/978-3-642-20847-8_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20846-1
Online ISBN: 978-3-642-20847-8
eBook Packages: Computer Science, Computer Science (R0)