Abstract
Web pages often embed scripts for a variety of purposes, including advertising and dynamic interaction. Understanding embedded scripts and their purpose can help to interpret a web page or provide crucial information about it. We have developed a functionality-based categorization of JavaScript, the most widely used web page scripting language, and we treat the task of understanding embedded scripts as a text categorization problem. We show how traditional information retrieval methods can be augmented with features distilled from domain knowledge of JavaScript and from software analysis to improve classification performance. We perform experiments on the standard WT10G web page corpus and show that our techniques eliminate over 50% of the errors made by a standard text classification baseline.
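To make the idea of augmenting text features with program-derived features concrete, the following is a minimal sketch and not the method used in the paper. It combines bag-of-words counts over a script's identifiers with a few structural counts. The pattern set `STRUCTURAL_PATTERNS`, the feature names, and the function `extract_features` are all hypothetical; the abstract does not say which features were actually used.

```python
import re
from collections import Counter

# Hypothetical structural cues, chosen only for illustration.
# The abstract does not list the program analysis features used in the paper.
STRUCTURAL_PATTERNS = {
    "n_function_defs": r"\bfunction\b",
    "n_document_write": r"\bdocument\.write\s*\(",
    "n_window_open": r"\bwindow\.open\s*\(",
    "n_event_handlers": r"\bon[a-z]+\s*=",
}

IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def extract_features(script: str) -> dict:
    """Return one feature dictionary for a single embedded script.

    Keys starting with "tok=" are plain text-categorization features
    (lowercased identifier counts). The remaining keys are the assumed
    structural counts defined in STRUCTURAL_PATTERNS.
    """
    tokens = [t.lower() for t in IDENTIFIER.findall(script)]
    features = {"tok=" + t: float(c) for t, c in Counter(tokens).items()}
    for name, pattern in STRUCTURAL_PATTERNS.items():
        features[name] = float(len(re.findall(pattern, script)))
    return features


if __name__ == "__main__":
    sample = 'function popup() { window.open("ad.html"); } document.write("hi");'
    for key, value in sorted(extract_features(sample).items()):
        print(key, value)
```

A dictionary of this shape could be passed to any standard supervised learner. Keeping both feature families in one vector is what allows a comparison against a text-only baseline.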
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lu, W., Kan, M.-Y. (2005). Supervised Categorization of JavaScript™ Using Program Analysis Features. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_13
DOI: https://doi.org/10.1007/11562382_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science, Computer Science (R0)