Abstract
This paper presents a novel method for mining suspicious websites from the World Wide Web using state-of-the-art pattern mining and machine learning methods. In this document, the term “suspicious website” means any website that contains known or suspected violations of law. Although we present our evaluation on illegal online organ trading, the method described in this paper is generic and can be used to detect any specific kind of website. We use an iterative setting in which each iteration unearths both normal and suspicious websites. These newly detected websites are added to our training examples and used in subsequent iterations. The first iteration uses user-supplied seed normal and suspicious websites. We show that accuracy increases in the initial iterations but decreases as the number of iterations grows further. This is due to the bias caused by adding a large number of normal websites and to the noise that automatic addition introduces into the training examples.
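The iterative setting described in the abstract can be illustrated with a minimal self-training sketch. This is not the authors' implementation: the abstract does not specify the classifier or features, so a simple per-label token count stands in for the pattern mining and machine learning components, and all names (`train`, `score`, `bootstrap`, `min_margin`) and the example texts are hypothetical.

```python
from collections import Counter

def train(examples):
    """Count token frequencies per label from (text, label) pairs."""
    counts = {"normal": Counter(), "suspicious": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def score(counts, text):
    """Return (label, margin): the label with more token hits and the gap."""
    tokens = text.lower().split()
    s = sum(counts["suspicious"][t] for t in tokens)
    n = sum(counts["normal"][t] for t in tokens)
    label = "suspicious" if s > n else "normal"
    return label, abs(s - n)

def bootstrap(seeds, unlabeled, iterations=3, min_margin=2):
    """Self-training loop: each iteration labels confident pages and
    adds them to the training set used by the next iteration."""
    training = list(seeds)   # user-supplied seed examples
    pool = list(unlabeled)   # pages not yet labeled
    for _ in range(iterations):
        model = train(training)  # model is fixed within an iteration
        remaining = []
        for text in pool:
            label, margin = score(model, text)
            if margin >= min_margin:
                # Automatically added label; it is never verified,
                # so wrong labels accumulate as noise.
                training.append((text, label))
            else:
                remaining.append(text)
        if len(remaining) == len(pool):
            break  # nothing new was added; stop early
        pool = remaining
    return training, pool

seeds = [
    ("buy kidney cheap donor payment", "suspicious"),
    ("hospital news weather sports", "normal"),
]
unlabeled = [
    "kidney for sale payment",
    "sports news today",
    "liver donor sale",
    "unrelated page",
]
training, pool = bootstrap(seeds, unlabeled)
```

In this toy run, "liver donor sale" is only labeled in the second iteration, after "kidney for sale payment" has been added to the training set. Because added labels are never checked, the sketch also shows where the noise mentioned in the abstract can enter the training examples.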
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jung Pandey, S., Manandhar, S., Kleszcz, A. (2013). Using Sub-sequence Patterns for Detecting Organ Trafficking Websites. In: Dziech, A., Czyżewski, A. (eds) Multimedia Communications, Services and Security. MCSS 2013. Communications in Computer and Information Science, vol 368. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38559-9_15
DOI: https://doi.org/10.1007/978-3-642-38559-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38558-2
Online ISBN: 978-3-642-38559-9
eBook Packages: Computer Science (R0)