From Legacy Documents to XML: A Conversion Framework

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2005)

Abstract

We present an integrated framework for converting documents from legacy formats to XML. We describe the LegDoC project, which aims to automate the conversion of layout annotations in layout-oriented formats such as PDF, PS and HTML into semantic-oriented annotations. A toolkit of components covers complementary techniques, including logical document analysis and semantic annotation with machine learning methods. We use a real conversion project as a running example to illustrate the techniques implemented in the project.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chanod, JP. et al. (2005). From Legacy Documents to XML: A Conversion Framework. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2005. Lecture Notes in Computer Science, vol 3652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551362_9

  • DOI: https://doi.org/10.1007/11551362_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28767-4

  • Online ISBN: 978-3-540-31931-3

  • eBook Packages: Computer Science (R0)
