Detecting Vital Documents Using Negative Relevance Feedback in Distributed Realtime Computation Framework

Kawahara, Shun; Seki, Kazuhiro; Uehara, Kuniaki

doi:10.1007/978-981-10-0515-2_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 593)

Included in the following conference series:

Conference of the Pacific Association for Computational Linguistics

Abstract

Existing knowledge bases, including Wikipedia, are typically written and maintained by groups of volunteer editors. Meanwhile, numerous web documents are being published, partly owing to the popularization of online news and social media. Some of these web documents, called “vital documents”, contain novel information that should be taken into account when updating knowledge base articles. However, it is virtually impossible for editors to manually monitor all the relevant web documents. As a result, there is a considerable time lag between the publication of a web document and the corresponding edit to the knowledge base. This paper proposes a framework for the realtime detection of web documents containing novel information in massive document streams. The framework consists of a two-step filter based on statistical language models. Further, the framework is implemented on Apache Storm, a distributed and fault-tolerant realtime computation system, in order to process the sheer volume of web documents. The validity of the proposed framework is demonstrated on a publicly available web document data set, the TREC KBA Stream Corpus.
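
The abstract does not specify how the two-step filter is built, so the following is only a minimal sketch of one way such a cascade could look, not the authors' implementation. It assumes unigram language models with Jelinek-Mercer smoothing, a first step that compares a relevance model against a background model, and a second step that compares a model of known vital documents against a model of known non-vital documents (used here as negative examples). All function names, parameters, thresholds, and the toy data are hypothetical, and the Apache Storm deployment is not covered.

```python
import math
from collections import Counter


def train_unigram(docs):
    """Maximum-likelihood unigram model from a list of token lists."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def avg_log_prob(tokens, model, background, lam=0.7, floor=1e-6):
    """Mean per-token log-probability under a Jelinek-Mercer smoothed model.

    lam and floor are illustrative values, not taken from the paper.
    """
    total = 0.0
    for tok in tokens:
        p = lam * model.get(tok, 0.0) + (1.0 - lam) * background.get(tok, floor)
        total += math.log(p)
    return total / len(tokens)


def two_step_filter(tokens, relevant_lm, vital_lm, negative_lm, background_lm,
                    rel_threshold=0.0, vital_threshold=0.0):
    """Label a tokenized document as 'irrelevant', 'non-vital', or 'vital'."""
    if not tokens:
        return "irrelevant"
    # Step 1: keep the document only if the relevance model explains it
    # better than the background model does.
    rel = (avg_log_prob(tokens, relevant_lm, background_lm)
           - avg_log_prob(tokens, background_lm, background_lm))
    if rel <= rel_threshold:
        return "irrelevant"
    # Step 2: among relevant documents, prefer those that look more like
    # known vital documents than like known non-vital (negative) ones.
    vital = (avg_log_prob(tokens, vital_lm, background_lm)
             - avg_log_prob(tokens, negative_lm, background_lm))
    return "vital" if vital > vital_threshold else "non-vital"


# Toy training data (hypothetical): documents about a made-up entity "acme".
vital_docs = [["acme", "ceo", "resigns", "board"],
              ["acme", "acquires", "rival", "startup"]]
nonvital_docs = [["acme", "stock", "ticker", "quote"],
                 ["acme", "stock", "price", "quote"]]
other_docs = [["weather", "rain", "sunny", "forecast"]]

background_lm = train_unigram(vital_docs + nonvital_docs + other_docs)
relevant_lm = train_unigram(vital_docs + nonvital_docs)
vital_lm = train_unigram(vital_docs)
negative_lm = train_unigram(nonvital_docs)

for doc in (["acme", "ceo", "resigns"],
            ["acme", "stock", "quote"],
            ["weather", "rain", "forecast"]):
    label = two_step_filter(doc, relevant_lm, vital_lm, negative_lm, background_lm)
    print(" ".join(doc), "->", label)
```

Arranging the filter as a cascade means the cheaper relevance test discards most of the stream before the vital versus non-vital comparison runs, which is one plausible reason to split the decision into two steps when the input volume is large.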

Notes

1. http://en.wikipedia.org/wiki/List_of_Wikipedias.
2. http://storm.apache.org/.
3. http://trec-kba.org/trec-kba-2014.
4. http://googleresearch.blogspot.jp/2006/08/all-our-n-gram-are-belong-to-you.html.
5. http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html.
6. More precisely, hourly batch processing was also allowed.
7. http://en.wikipedia.org/wiki/Wikipedia:Redirect.
8. http://s3.amazonaws.com/aws-publicdatasets/trec/kba/enwiki-20120104/index.html.

Acknowledgments

The authors would like to thank Sayaka Kitaguchi at Kobe University for processing the KBA corpus. This work is partially supported by JSPS KAKENHI Grant Numbers 25330363 and MEXT, Japan.

Author information

Authors and Affiliations

Graduate Schools of System Informatics, Kobe University, Kobe, Japan
Shun Kawahara & Kuniaki Uehara
Faculty of Intelligence and Informatics, Konan University, Kobe, Japan
Kazuhiro Seki

Corresponding author

Correspondence to Shun Kawahara.

Editor information

Editors and Affiliations

Graduate School of Information Science, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
Kôiti Hasida
School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, Indonesia
Ayu Purwarianti

Cite this paper

Kawahara, S., Seki, K., Uehara, K. (2016). Detecting Vital Documents Using Negative Relevance Feedback in Distributed Realtime Computation Framework. In: Hasida, K., Purwarianti, A. (eds) Computational Linguistics. PACLING 2015. Communications in Computer and Information Science, vol 593. Springer, Singapore. https://doi.org/10.1007/978-981-10-0515-2_14

DOI: https://doi.org/10.1007/978-981-10-0515-2_14
Published: 20 February 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-0514-5
Online ISBN: 978-981-10-0515-2
eBook Packages: Computer Science, Computer Science (R0)
