
Efficient Online Novelty Detection in News Streams

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8180)

Abstract

Novelty detection in text streams is a challenging task that arises in many scenarios, ranging from email threads to RSS news feeds on a cell phone. An efficient novelty detection algorithm can save the user a great deal of time when accessing interesting information. Most recent research on detecting novel documents in text streams uses either geometric distances or distributional similarities. The former typically perform better but are slower, because each incoming document must be compared with all previously seen ones. In this paper, we propose a new novelty detection algorithm based on the Inverse Document Frequency (IDF) scoring function. Computing novelty from IDF avoids similarity comparisons with previous documents in the text stream and therefore leads to faster execution times. At the same time, our proposed approach outperforms several commonly used baselines when applied to a real-world dataset of news articles.
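The abstract does not give the exact scoring function, so the following is only a minimal sketch of the general idea, under stated assumptions: novelty is taken to be the mean smoothed IDF of a document's distinct terms, and document frequencies are updated incrementally as the stream is read. The class name `IdfNoveltyScorer`, the whitespace tokenizer, and the smoothing `log((N + 1) / (df + 1))` are all hypothetical choices made for illustration, not the authors' method.

```python
import math
from collections import Counter


class IdfNoveltyScorer:
    """Hypothetical streaming scorer: mean smoothed IDF of a document's distinct terms."""

    def __init__(self):
        self.doc_count = 0         # number of documents seen so far (N)
        self.doc_freq = Counter()  # term -> number of seen documents containing it (df)

    @staticmethod
    def _terms(text):
        # Assumption: naive lowercase whitespace tokenization.
        return set(text.lower().split())

    def score(self, text):
        """Score a document against the statistics gathered so far."""
        terms = self._terms(text)
        if not terms:
            return 0.0
        n = self.doc_count
        # Smoothed IDF; a term never seen before (df = 0) gets the largest value.
        total = sum(math.log((n + 1) / (self.doc_freq[t] + 1)) for t in terms)
        return total / len(terms)

    def update(self, text):
        """Fold a document into the running document-frequency counts."""
        self.doc_count += 1
        self.doc_freq.update(self._terms(text))


scorer = IdfNoveltyScorer()
for doc in [
    "stocks fall on market fears",
    "market fears grow as stocks fall",
    "stocks fall again on market fears",
]:
    scorer.update(doc)

repeat_score = scorer.score("stocks fall on market fears")
novel_score = scorer.score("earthquake strikes coastal city")
print(repeat_score < novel_score)
```

In this sketch, scoring a document touches only its own terms and the running counts, so the cost per document does not grow with the number of documents already seen. This is the property the abstract contrasts with pairwise similarity comparisons. Note that with this smoothing the very first document scores 0, since no statistics exist yet; a real system would need a warm-up period or a background corpus.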




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Karkali, M., Rousseau, F., Ntoulas, A., Vazirgiannis, M. (2013). Efficient Online Novelty Detection in News Streams. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_5


  • DOI: https://doi.org/10.1007/978-3-642-41230-1_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41229-5

  • Online ISBN: 978-3-642-41230-1

  • eBook Packages: Computer Science, Computer Science (R0)
