Abstract
Novelty detection aims to identify novel information in an incoming stream of documents. In this paper, we propose a new framework for document-level novelty detection using document-to-sentence (D2S) annotations and discuss the applicability of this method. D2S first segments a document into sentences, determines the novelty of each sentence, and then computes the document-level novelty score based on a fixed threshold. Experimental results on APWSJ data show that D2S outperforms standard document-level novelty detection in terms of redundancy-precision (RP) and redundancy-recall (RR). We applied D2S to the document-level data from the TREC 2004 and TREC 2003 Novelty Tracks and found that D2S is useful for detecting novel information in data with a high percentage of novel documents. However, D2S shows a strong capability to detect redundant information regardless of the percentage of novel documents. D2S has been successfully integrated into a real-world novelty detection system.
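The three D2S steps described above (sentence segmentation, sentence-level novelty decisions, and a thresholded document-level score) can be illustrated with a minimal sketch. The abstract does not specify the segmenter, the similarity metric, or how sentence decisions are aggregated, so everything concrete here is an assumption made for illustration: a naive regex splitter, cosine similarity over bag-of-words counts, a document score defined as the fraction of novel sentences, and the hypothetical names `split_sentences`, `d2s_score`, `process_stream`, `sentence_threshold`, and `document_threshold`.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    # Lowercased term counts; a stand-in for whatever representation is used.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def d2s_score(document, history, sentence_threshold=0.5):
    """Fraction of sentences in `document` judged novel against `history`.

    A sentence counts as novel when its maximum similarity to any
    previously seen sentence is below `sentence_threshold` (assumption).
    """
    sentences = split_sentences(document)
    if not sentences:
        return 0.0
    novel = 0
    for sentence in sentences:
        vec = bag_of_words(sentence)
        max_sim = max((cosine(vec, h) for h in history), default=0.0)
        if max_sim < sentence_threshold:
            novel += 1
    return novel / len(sentences)


def process_stream(documents, sentence_threshold=0.5, document_threshold=0.5):
    # Label each incoming document as novel (True) or redundant (False),
    # then add its sentences to the history of seen sentences.
    history = []
    labels = []
    for doc in documents:
        score = d2s_score(doc, history, sentence_threshold)
        labels.append(score >= document_threshold)
        history.extend(bag_of_words(s) for s in split_sentences(doc))
    return labels


stream = [
    "The cat sat on the mat. Dogs bark loudly.",
    "The cat sat on the mat. Dogs bark loudly.",
    "Stocks fell sharply today. Investors are worried.",
]
print(process_stream(stream))
```

In this sketch each sentence is compared only against sentences from earlier documents, not against earlier sentences of the same document; whether the framework in the paper does the same is not stated in the abstract. Both thresholds are fixed in advance, which mirrors the fixed-threshold wording above but not necessarily the values or the aggregation rule used in the experiments.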
Cite this article
Tsai, F.S., Zhang, Y. D2S: Document-to-sentence framework for novelty detection. Knowl Inf Syst 29, 419–433 (2011). https://doi.org/10.1007/s10115-010-0372-2
DOI: https://doi.org/10.1007/s10115-010-0372-2