Abstract
Software analysis techniques, and in particular software “design recovery”, have been highly successful at both technical and business-level semantic markup of large-scale software systems written in a wide variety of programming languages. In particular, they have proven efficient and scalable in assisting the resolution of the “year 2000” problem for billions of lines of legacy source code. In this work we describe a first experiment in applying the same technical solutions and tools that have proven so successful in software markup to the more general problem of semantic markup of text documents. In this early report we describe our adaptation of the software analysis techniques, propose a general domain-independent architecture for semantic markup using them, and demonstrate its feasibility in a limited but realistic domain of application by comparison with both raw and tool-assisted human semantic markers.
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kiyavitskaya, N., Zeni, N., Cordy, J.R., Mich, L., Mylopoulos, J. (2005). Applying Software Analysis Technology to Lightweight Semantic Markup of Document Text. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds) Pattern Recognition and Data Mining. ICAPR 2005. Lecture Notes in Computer Science, vol 3686. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551188_65
DOI: https://doi.org/10.1007/11551188_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28757-5
Online ISBN: 978-3-540-28758-2
eBook Packages: Computer Science, Computer Science (R0)