Organizing the Web's Information Explosion to Discover Unknown Unknowns

Torisawa, Kentaro; De Saeger, Stijn; Kazama, Jun’ichi; Sumida, Asuka; Noguchi, Daisuke; Kakizawa, Yasunori; Murata, Masaki; Kuroda, Kow; Yamada, Ichiro

doi:10.1007/s00354-009-0087-7

Published: 14 August 2010

Volume 28, pages 217–236, (2010)

New Generation Computing

Kentaro Torisawa¹,
Stijn De Saeger¹,
Jun’ichi Kazama¹,
Asuka Sumida¹,²,
Daisuke Noguchi¹,³,
Yasunori Kakizawa¹,
Masaki Murata¹,
Kow Kuroda¹ &
Ichiro Yamada¹

Abstract

This paper introduces the TORISHIKI-KAI project, which aims to construct a million-word-scale semantic network from the Web using state-of-the-art knowledge acquisition methods. The resulting network can be browsed as a Web search directory, and we show that the directory is useful for finding “unknown unknowns”: in the infamous words of D.H. Rumsfeld, things “we don't know we don't know.” Because we typically have no way to look for information we don't even know is missing, a crucial characteristic of unknown unknowns is that they are very difficult to discover through keyword-based Web search. Examples of the unknown unknowns we have found include unexpected troubles associated with commercial products, surprising new combinations of ingredients in new recipes, and unexpected tools or methods for committing suicide. We expect such information to be useful for risk management, innovation support, and the detection of harmful information on the Web.

Author information

Authors and Affiliations

Language Infrastructure Group, MASTAR Project, National Institute of Information and Communications Technology (NICT), 4-2-1, Hikaridai, Seikacho, Kyoto, 619-0288, Japan
Kentaro Torisawa, Stijn De Saeger, Jun’ichi Kazama, Asuka Sumida, Daisuke Noguchi, Yasunori Kakizawa, Masaki Murata, Kow Kuroda & Ichiro Yamada
Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi-shi, Ishikawa, 923-1211, Japan
Asuka Sumida
NEC BIGLOBE Ltd., 11-1, Ohsaki 1-chome, Shinagawa-ku, Tokyo, 141-0032, Japan
Daisuke Noguchi

Corresponding author

Correspondence to Kentaro Torisawa.

Additional information

Based on “TORISHIKI-KAI, An Autogenerated Web Search Directory,” by Torisawa, K., De Saeger, S., Kakizawa, Y., Kazama, J., Murata, M., Noguchi, D., Sumida, A., which appeared in the proceedings of the Second International Symposium on Universal Communication (ISUC 2008). © 2008 IEEE.

About this article

Cite this article

Torisawa, K., De Saeger, S., Kazama, J. et al. Organizing the Web's Information Explosion to Discover Unknown Unknowns. New Gener. Comput. 28, 217–236 (2010). https://doi.org/10.1007/s00354-009-0087-7

Received: 09 June 2009
Revised: 08 September 2009
Published: 14 August 2010
Issue Date: July 2010
DOI: https://doi.org/10.1007/s00354-009-0087-7
