D2S: Document-to-sentence framework for novelty detection

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Novelty detection aims to identify novel information in an incoming stream of documents. In this paper, we propose a new framework for document-level novelty detection using document-to-sentence (D2S) annotations and discuss the applicability of this method. D2S first segments a document into sentences, determines the novelty of each sentence, and then computes the document-level novelty score based on a fixed threshold. Experimental results on APWSJ data show that D2S outperforms standard document-level novelty detection in terms of redundancy-precision (RP) and redundancy-recall (RR). We also applied D2S to the document-level data from the TREC 2003 and TREC 2004 Novelty Tracks and found that D2S is useful for detecting novel information in data with a high percentage of novel documents. Its capability to detect redundant information, however, remains strong regardless of the percentage of novel documents. D2S has been successfully integrated into a real-world novelty detection system.
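
The three D2S steps named in the abstract (segment, score each sentence, aggregate against a fixed threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the regex sentence splitter, the Jaccard-based sentence novelty score, the rule that a document is novel when its fraction of novel sentences reaches a threshold, and all function names and default thresholds are assumptions chosen only to make the pipeline concrete.

```python
import re
from typing import List, Set


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokens(sentence: str) -> Set[str]:
    """Lowercased alphanumeric word set of a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def sentence_novelty(sentence: str, history: List[Set[str]]) -> float:
    """Hypothetical score: 1 minus the highest Jaccard similarity to any past sentence."""
    current = tokens(sentence)
    if not current:
        return 0.0  # an empty sentence carries no new information
    best = 0.0
    for past in history:
        union = current | past
        if union:
            best = max(best, len(current & past) / len(union))
    return 1.0 - best


def d2s_is_novel(
    document: str,
    history: List[Set[str]],
    sentence_threshold: float = 0.5,  # assumed value
    document_threshold: float = 0.5,  # assumed value
) -> bool:
    """Segment, score sentences, then compare the novel-sentence fraction
    against a fixed document-level threshold (assumed aggregation rule)."""
    sentences = split_sentences(document)
    if not sentences:
        return False
    novel = sum(
        1 for s in sentences if sentence_novelty(s, history) >= sentence_threshold
    )
    return novel / len(sentences) >= document_threshold


# Usage: history holds token sets of sentences already seen in the stream.
history = [tokens("The cat sat on the mat.")]
repeated = d2s_is_novel("The cat sat on the mat. The cat sat on the mat.", history)
fresh = d2s_is_novel("Stocks fell sharply today. Oil prices rose.", history)
```

In a streaming setting, the sentences of each processed document would be appended to `history` before the next document is scored; how and when to do that is left out of this sketch.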

Corresponding author

Correspondence to Flora S. Tsai.

Cite this article

Tsai, F.S., Zhang, Y. D2S: Document-to-sentence framework for novelty detection. Knowl Inf Syst 29, 419–433 (2011). https://doi.org/10.1007/s10115-010-0372-2

  • DOI: https://doi.org/10.1007/s10115-010-0372-2
