Abstract
We present an approach that uses interactive topic graph extraction to explore Web content. A user issues the initial information request online, in the form of a query topic description. The system then constructs a topic graph from N Web snippets returned by a standard search engine. We treat the extraction of a topic graph as a specific empirical collocation extraction task in which collocations are extracted between chunks. Our measure of association strength is based on the pointwise mutual information between chunk pairs and explicitly takes their distance into account. Users can analyze the topic graph further and request additional background information by selecting interesting nodes and pairs of nodes. This background information includes explicit relationships extracted from Wikipedia, relationships automatically extracted from additional Web content, and conceptual information about the topic in the form of semantically oriented clusters of descriptive phrases. The information is presented to the users, who can investigate the identified information nuggets to refine their search. An initial user evaluation shows that our approach is especially helpful for finding new, interesting information on topics about which the user has only a vague idea or no idea at all.
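The abstract names the association measure but does not give its formula. The following is only a minimal sketch of one way a distance-aware pointwise mutual information between chunk pairs could be computed. The 1/distance weighting, the function name `distance_weighted_pmi`, and the toy snippets are assumptions made for illustration; they are not taken from the chapter.

```python
import math
from collections import Counter
from itertools import combinations


def distance_weighted_pmi(snippets):
    """Score chunk pairs with a PMI variant that favours nearby chunks.

    `snippets` is a list of snippets, each a list of chunk strings in
    their original order. ASSUMPTION: a co-occurrence at distance d
    contributes a weight of 1/d; the chapter's actual weighting is not
    stated in the abstract.
    """
    chunk_count = Counter()
    pair_weight = Counter()
    for chunks in snippets:
        chunk_count.update(chunks)
        for (i, a), (j, b) in combinations(enumerate(chunks), 2):
            if a == b:
                continue  # a chunk is not paired with itself
            # j > i always holds here, so the distance is positive.
            pair_weight[tuple(sorted((a, b)))] += 1.0 / (j - i)

    n_chunks = sum(chunk_count.values())
    n_pairs = sum(pair_weight.values())
    scores = {}
    for (a, b), weight in pair_weight.items():
        p_ab = weight / n_pairs
        p_a = chunk_count[a] / n_chunks
        p_b = chunk_count[b] / n_chunks
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Hypothetical, already-chunked snippets for a query topic.
snippets = [
    ["jim clark", "formula one", "lotus"],
    ["jim clark", "lotus"],
    ["formula one", "season"],
]
scores = distance_weighted_pmi(snippets)
```

In such a scheme, the highest-scoring pairs would become the edges of the topic graph, with the chunks as nodes.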
Notes
- 1.
Actually, both languages are only supported in the i–GNSSMM mode. In the case of the i–MILREX mode, we currently only support the English Wikipedia.
- 2.
Consult, for example, the Web page http://nlp.uned.es/weps/ for more information about the problem space.
- 3.
The screenshot shows relations retrieved from Wikipedia infoboxes only. The component for detecting missing relationships is not yet integrated into the running system.
- 4.
For the remainder of the paper N = 1000. We are using Bing (http://www.bing.com/) for Web search.
- 5.
Concerning the English PoS tags, “word/PoS” expressions that match the following regular expression are considered extended noun tags: “/(N(N|P))|/VB(N|G)|/IN|/DT”. The English verbs are those whose PoS tags start with VB. We use the tag sets from the Penn treebank (English) and the Negra treebank (German).
- 6.
Currently, the main purpose of recognizing verb chunks is to improve proper recognition of noun groups. The verb chunks are ignored when building the topic graph.
- 7.
In fact, we used the polynomials of the Taylor series for ln(1 + x). Note also that k is restricted by the number of chunks in a snippet.
- 8.
For “Jim Clark”, for example, Wikipedia’s infoboxes do not provide information for the relations birthplace, place_of_death, or cause_of_death.
- 9.
The classification of NP chunks to argument types like times and dates is currently done by using simple regular expressions.
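The tag filter described in note 5 can be sketched as follows. This is an illustrative transcription only: it assumes the pattern is meant to match at the start of the PoS tag, so that tags such as NNS or NNP also qualify. The helper names and the sample sentence are hypothetical.

```python
import re

# Pattern from note 5, written with ASCII "|" as the alternation operator.
EXTENDED_NOUN = re.compile(r"/(N(N|P))|/VB(N|G)|/IN|/DT")


def split_token(token):
    """Split a "word/PoS" expression at its last slash."""
    word, _, tag = token.rpartition("/")
    return word, tag


def is_extended_noun(token):
    # ASSUMPTION: the pattern is applied to the tag part only and may
    # match a prefix of the tag.
    _, tag = split_token(token)
    return EXTENDED_NOUN.match("/" + tag) is not None


def is_verb(token):
    # Note 5: English verbs are those whose PoS tag starts with VB.
    _, tag = split_token(token)
    return tag.startswith("VB")


# Hypothetical Penn-treebank-style tagged input.
tagged = "The/DT fastest/JJS driver/NN of/IN Scotland/NNP won/VBD".split()
extended = [t for t in tagged if is_extended_noun(t)]
verbs = [t for t in tagged if is_verb(t)]
```

Under this reading, participles tagged VBN or VBG satisfy both predicates, so a chunker built on these tests would need to decide which one takes precedence.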
Acknowledgements
The presented work was partially supported by grants from the German Federal Ministry of Economics and Technology (BMWi) to the Theseus project (FKZ: 01MQ07016).
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Neumann, G., Schmeier, S. (2013). Interactive Topic Graph Extraction and Exploration of Web Content. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_7
DOI: https://doi.org/10.1007/978-3-642-28569-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28568-4
Online ISBN: 978-3-642-28569-1
eBook Packages: Computer Science (R0)