Organizing the Web's Information Explosion to Discover Unknown Unknowns

New Generation Computing

Abstract

This paper introduces the TORISHIKI-KAI project, which aims to construct a million-word-scale semantic network from the Web using state-of-the-art knowledge acquisition methods. The resulting network can be browsed as a Web search directory, and we show that the directory is useful for finding “unknown unknowns”: in the infamous words of D.H. Rumsfeld, things “we don't know we don't know.” Because we typically have no way to look for information we don't even know is missing, a crucial characteristic of unknown unknowns is that they are very difficult to discover through keyword-based Web search. Examples of the unknown unknowns we have found include unexpected troubles associated with commercial products, surprising combinations of ingredients in new recipes, and unexpected tools or methods for committing suicide. We expect such information to be useful for risk management, innovation support, and the detection of harmful information on the Web.

Author information

Corresponding author

Correspondence to Kentaro Torisawa.

Additional information

Based on “TORISHIKI-KAI, An Autogenerated Web Search Directory,” by Torisawa, K., De Saeger, S., Kakizawa, Y., Kazama, J., Murata, M., Noguchi, D., Sumida, A., which appeared in the proceedings of the Second International Symposium on Universal Communication (ISUC 2008). © 2008 IEEE.

About this article

Cite this article

Torisawa, K., De Saeger, S., Kazama, J. et al. Organizing the Web's Information Explosion to Discover Unknown Unknowns. New Gener. Comput. 28, 217–236 (2010). https://doi.org/10.1007/s00354-009-0087-7

  • DOI: https://doi.org/10.1007/s00354-009-0087-7
