Abstract
Knowledge bases such as Wikipedia are typically written and maintained by a group of volunteer editors. Meanwhile, numerous web documents are being published, partly due to the popularization of online news and social media. Some of these web documents, called “vital documents”, contain novel information that should be taken into account when updating knowledge base articles. However, it is virtually impossible for the editors to manually monitor all the relevant web documents. As a result, there is a considerable time lag between the publication of a web document and the corresponding edit to the knowledge base. This paper proposes a framework for realtime detection of web documents containing novel information in massive document streams. The framework consists of a two-step filter using statistical language models. Further, the framework is implemented on Apache Storm, a distributed and fault-tolerant realtime computation system, in order to process the sheer volume of web documents. The validity of the proposed framework is demonstrated on a publicly available web document data set, the TREC KBA Stream Corpus.
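To make the idea of a two-step filter based on statistical language models more concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the models, so the sketch assumes Dirichlet-smoothed unigram models, a first step that compares an entity model against a general model, and a second step that compares a model of vital documents against a model of non-vital documents (one possible reading of negative feedback). All function names, parameters, and the value of `mu` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer; a stand-in for the real preprocessing (assumption).
    return text.lower().split()


def count_terms(texts):
    # Build unigram counts from a list of strings.
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def log_likelihood(tokens, model, collection, mu=10.0):
    """Dirichlet-smoothed unigram log-likelihood of tokens under model."""
    model_total = sum(model.values())
    coll_total = sum(collection.values())
    vocab_size = len(collection)
    score = 0.0
    for tok in tokens:
        # Add-one smoothing keeps the collection probability nonzero
        # for words that never occur in the collection.
        p_coll = (collection.get(tok, 0) + 1) / (coll_total + vocab_size + 1)
        score += math.log((model.get(tok, 0) + mu * p_coll) / (model_total + mu))
    return score


def classify(doc, entity_profile, general_docs, vital_docs, non_vital_docs, mu=10.0):
    """Hypothetical two-step filter: relevance first, then vitality."""
    tokens = tokenize(doc)
    # Collection statistics used only for smoothing.
    collection = count_terms(general_docs + [entity_profile] + vital_docs + non_vital_docs)

    # Step 1: keep the document only if the entity model explains it
    # better than a model of unrelated, general text.
    entity_lm = count_terms([entity_profile])
    general_lm = count_terms(general_docs)
    if log_likelihood(tokens, entity_lm, collection, mu) <= \
            log_likelihood(tokens, general_lm, collection, mu):
        return "irrelevant"

    # Step 2: among relevant documents, compare a model of known vital
    # documents with a model of known non-vital documents (negative examples).
    vital_lm = count_terms(vital_docs)
    non_vital_lm = count_terms(non_vital_docs)
    if log_likelihood(tokens, vital_lm, collection, mu) > \
            log_likelihood(tokens, non_vital_lm, collection, mu):
        return "vital"
    return "non-vital"
```

In a streaming deployment, each step would plausibly run as a separate processing stage so that the cheap relevance check discards most documents before the vitality comparison; that mapping is likewise an assumption, not something stated in the abstract.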
Notes
- 6. More precisely, it was also allowed to do hourly batch processing.
Acknowledgments
The authors would like to thank Sayaka Kitaguchi at Kobe University for processing the KBA corpus. This work is partially supported by JSPS KAKENHI Grant Number 25330363 and by MEXT, Japan.
Copyright information
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Kawahara, S., Seki, K., Uehara, K. (2016). Detecting Vital Documents Using Negative Relevance Feedback in Distributed Realtime Computation Framework. In: Hasida, K., Purwarianti, A. (eds) Computational Linguistics. PACLING 2015. Communications in Computer and Information Science, vol 593. Springer, Singapore. https://doi.org/10.1007/978-981-10-0515-2_14
DOI: https://doi.org/10.1007/978-981-10-0515-2_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-0514-5
Online ISBN: 978-981-10-0515-2
eBook Packages: Computer Science, Computer Science (R0)