Abstract
We propose algorithms that improve the accuracy of hazardous Web page detection by correcting the errors of typical keyword-based algorithms, using the dependency relations between hazardous keywords and their neighboring segments. Most text-based filtering systems ignore the context in which hazardous keywords appear. Our algorithms automatically obtain segment pairs that stand in a dependency relation and appear to characterize hazardous documents. We also propose a practical approach to expanding these segment pairs with a thesaurus. Experiments on a large number of Web pages show that our algorithms increase the detection F value by 7.3% compared with conventional algorithms.
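The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes documents have already been parsed into (modifier, head) segment pairs by some dependency analyzer, and the names `abstract_pair`, `is_hazardous`, the toy thesaurus, and the example pairs are all hypothetical. A page is flagged only when a hazardous keyword occurs in a dependency pair that, after thesaurus abstraction, matches a known hazardous pair.

```python
from typing import Dict, List, Set, Tuple

# A dependency pair: (modifier segment, head segment). Hypothetical representation.
Pair = Tuple[str, str]


def abstract_pair(pair: Pair, thesaurus: Dict[str, str]) -> Pair:
    """Replace each segment by its thesaurus concept, if one is listed."""
    return (thesaurus.get(pair[0], pair[0]), thesaurus.get(pair[1], pair[1]))


def is_hazardous(
    doc_pairs: List[Pair],
    keywords: Set[str],
    hazardous_pairs: Set[Pair],
    thesaurus: Dict[str, str],
) -> bool:
    """Flag a document only if a keyword appears in a hazardous dependency pair."""
    for pair in doc_pairs:
        # Keyword-based candidate check: skip pairs without any hazardous keyword.
        if not any(segment in keywords for segment in pair):
            continue
        # Context check: the abstracted pair must be a known hazardous pair.
        if abstract_pair(pair, thesaurus) in hazardous_pairs:
            return True
    return False


# Toy data (assumed for illustration only).
keywords = {"drug"}
thesaurus = {"buy": "TRADE", "sell": "TRADE"}
hazardous_pairs = {("drug", "TRADE")}

selling_page = [("drug", "sell"), ("price", "cheap")]
prevention_page = [("drug", "prevent"), ("abuse", "prevent")]

print(is_hazardous(selling_page, keywords, hazardous_pairs, thesaurus))
print(is_hazardous(prevention_page, keywords, hazardous_pairs, thesaurus))
```

In this toy setup both pages contain the keyword, so a pure keyword filter would flag both; the dependency check separates them, and the thesaurus entry lets one stored pair cover both "buy" and "sell".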
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ikeda, K., Yanagihara, T., Hattori, G., Matsumoto, K., Takisima, Y. (2010). Hazardous Document Detection Based on Dependency Relations and Thesaurus. In: Li, J. (eds) AI 2010: Advances in Artificial Intelligence. AI 2010. Lecture Notes in Computer Science, vol 6464. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17432-2_46
DOI: https://doi.org/10.1007/978-3-642-17432-2_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17431-5
Online ISBN: 978-3-642-17432-2
eBook Packages: Computer Science, Computer Science (R0)