Detecting Vital Documents Using Negative Relevance Feedback in Distributed Realtime Computation Framework

  • Conference paper
Computational Linguistics (PACLING 2015)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 593))

Abstract

Existing knowledge bases, including Wikipedia, are typically written and maintained by groups of volunteer editors. Meanwhile, web documents are being published in ever larger numbers, partly due to the popularization of online news and social media. Some of these web documents, called “vital documents”, contain novel information that should be taken into account when updating knowledge base articles. However, it is virtually impossible for the editors to manually monitor all the relevant web documents. As a result, there is a considerable time lag between the publication of a web document and the corresponding edit to the knowledge base. This paper proposes a framework for realtime detection of web documents containing novel information in massive document streams. The framework consists of a two-step filter using statistical language models. Further, the framework is implemented on Apache Storm, a distributed and fault-tolerant realtime computation system, in order to process the sheer volume of web documents. The validity of the proposed framework is demonstrated on a publicly available web document data set, the TREC KBA Stream Corpus.
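The abstract does not spell out the two filtering steps, so the following is only a minimal single-process sketch of how a two-step language-model filter with negative feedback could look. It is not the authors' implementation and does not involve Apache Storm. Everything in it is an assumption made for illustration: the function names, the Dirichlet-smoothed unigram models, the smoothing weight `mu`, the probability floor for unseen words, and the use of already-known text about the entity as the negative example. Step 1 keeps documents that are likely to be about the entity; step 2 keeps only those that are poorly explained by what is already known.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real preprocessing)."""
    return text.lower().split()


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities for a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_log_prob(word, counts, length, background, mu):
    """Log-probability of a word under a Dirichlet-smoothed unigram model."""
    p_bg = background.get(word, 1e-6)  # assumed floor for unseen words
    return math.log((counts[word] + mu * p_bg) / (length + mu))


def relevance(entity_tokens, doc_tokens, background, mu=10.0):
    """Step 1 score: average log-likelihood of the entity query given the document."""
    counts = Counter(doc_tokens)
    total = sum(
        smoothed_log_prob(w, counts, len(doc_tokens), background, mu)
        for w in entity_tokens
    )
    return total / max(len(entity_tokens), 1)


def novelty(doc_tokens, known_tokens, background, mu=10.0):
    """Step 2 score: average negative log-probability of the document under a
    model of already-known text (the negative example); higher means more novel."""
    counts = Counter(known_tokens)
    total = sum(
        smoothed_log_prob(w, counts, len(known_tokens), background, mu)
        for w in doc_tokens
    )
    return -total / max(len(doc_tokens), 1)


def is_vital(doc, entity, known, background, rel_min, nov_min):
    """Apply the relevance filter first, then the novelty filter."""
    doc_tokens = tokenize(doc)
    if not doc_tokens:
        return False
    if relevance(tokenize(entity), doc_tokens, background) < rel_min:
        return False  # rejected by step 1: not about the entity
    return novelty(doc_tokens, tokenize(known), background) >= nov_min


# Toy data, invented purely for illustration.
ENTITY = "ada lovelace"
KNOWN = "ada lovelace wrote notes on the analytical engine"
DOC_NEW = "ada lovelace receives a posthumous award for computing"
DOC_OLD = "ada lovelace wrote notes on the analytical engine"
DOC_OFF = "the weather today is sunny and warm"
BACKGROUND = unigram_model(tokenize(" ".join([KNOWN, DOC_NEW, DOC_OLD, DOC_OFF])))
```

In this sketch the two thresholds `rel_min` and `nov_min` are free parameters that would have to be tuned on labeled data. In a streaming deployment each step could be hosted in its own processing stage, but how the paper maps its filters onto Storm components is not stated in the abstract.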

Notes

  1. http://en.wikipedia.org/wiki/List_of_Wikipedias.

  2. http://storm.apache.org/.

  3. http://trec-kba.org/trec-kba-2014.

  4. http://googleresearch.blogspot.jp/2006/08/all-our-n-gram-are-belong-to-you.html.

  5. http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html.

  6. More precisely, it was also allowed to do hourly batch processing.

  7. http://en.wikipedia.org/wiki/Wikipedia:Redirect.

  8. http://s3.amazonaws.com/aws-publicdatasets/trec/kba/enwiki-20120104/index.html.

Acknowledgments

The authors would like to thank Sayaka Kitaguchi at Kobe University for processing the KBA corpus. This work is partially supported by JSPS KAKENHI Grant Numbers 25330363 and MEXT, Japan.

Author information

Corresponding author

Correspondence to Shun Kawahara.

Copyright information

© 2016 Springer Science+Business Media Singapore

About this paper

Cite this paper

Kawahara, S., Seki, K., Uehara, K. (2016). Detecting Vital Documents Using Negative Relevance Feedback in Distributed Realtime Computation Framework. In: Hasida, K., Purwarianti, A. (eds) Computational Linguistics. PACLING 2015. Communications in Computer and Information Science, vol 593. Springer, Singapore. https://doi.org/10.1007/978-981-10-0515-2_14

  • DOI: https://doi.org/10.1007/978-981-10-0515-2_14

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-0514-5

  • Online ISBN: 978-981-10-0515-2

  • eBook Packages: Computer Science, Computer Science (R0)
