Transforming the arχiv to XML

  • Conference paper
Intelligent Computer Mathematics (CICM 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5144)

Abstract

We describe an experiment in transforming large collections of documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using LaTeXML, a LaTeX to XML converter which is currently under development.

The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (thousands of) classes and packages used in the arXiv collection. For this we have developed a distributed build system that repeatedly runs LaTeXML over the arXiv collection, collects statistics about, e.g., the most sorely missing LaTeXML bindings, and clusters common error events. This creates valuable feedback for both the developers of the LaTeXML package and binding implementers. We have now processed the complete arXiv collection of more than 400,000 documents from 1993 until 2006 (one run is a processor-year-size undertaking) and have continuously improved our success rate to more than 56% (i.e. over 56% of the documents have been converted by LaTeXML without a detected error and are available as XHTML+MathML documents).
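The statistics step described above can be illustrated with a minimal Python sketch. It is not the arXMLiv build system: the log line format (`MISSING_BINDING <name>` and `ERROR <message>`), the function name `summarize`, and the rule that a document counts as a success when its log has no `ERROR` line are all assumptions made for illustration, and none of them are taken from LaTeXML's actual output.

```python
import re
from collections import Counter

# Hypothetical log line formats, assumed for illustration only;
# they are NOT LaTeXML's real message format.
MISSING_BINDING = re.compile(r"^MISSING_BINDING\s+(\S+)$")
ERROR = re.compile(r"^ERROR\s+(.*)$")


def summarize(logs):
    """Aggregate conversion logs given as a dict: document id -> log text.

    Returns (missing, error_clusters, success_rate) where
    - missing counts, per binding, the number of documents that lack it,
    - error_clusters counts error messages after masking digits,
    - success_rate is the fraction of documents with no ERROR line.
    """
    missing = Counter()
    error_clusters = Counter()
    clean = 0
    for doc_id, text in logs.items():
        had_error = False
        missing_here = set()  # count each binding once per document
        for raw in text.splitlines():
            line = raw.strip()
            m = MISSING_BINDING.match(line)
            if m:
                missing_here.add(m.group(1))
                continue
            e = ERROR.match(line)
            if e:
                had_error = True
                # Crude clustering: messages that differ only in numbers
                # (e.g. line numbers) fall into the same bucket.
                error_clusters[re.sub(r"\d+", "N", e.group(1))] += 1
        missing.update(missing_here)
        if not had_error:
            clean += 1
    success_rate = clean / len(logs) if logs else 0.0
    return missing, error_clusters, success_rate


if __name__ == "__main__":
    demo = {
        "doc-a": "MISSING_BINDING foo.sty\nERROR undefined macro at line 12",
        "doc-b": "",
        "doc-c": "MISSING_BINDING foo.sty\nERROR undefined macro at line 7",
    }
    missing, clusters, rate = summarize(demo)
    print(missing.most_common(5))   # bindings to implement first
    print(clusters.most_common(5))  # most frequent error clusters
    print(f"success rate: {rate:.1%}")
```

Counting each missing binding once per document ranks bindings by how many documents they block, which is one plausible reading of "most sorely missing"; other weightings are equally possible.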

Editor information

Serge Autexier, John Campbell, Julio Rubio, Volker Sorge, Masakazu Suzuki, Freek Wiedijk

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Stamerjohanns, H., Kohlhase, M. (2008). Transforming the arχiv to XML. In: Autexier, S., Campbell, J., Rubio, J., Sorge, V., Suzuki, M., Wiedijk, F. (eds) Intelligent Computer Mathematics. CICM 2008. Lecture Notes in Computer Science, vol 5144. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85110-3_46

  • DOI: https://doi.org/10.1007/978-3-540-85110-3_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85109-7

  • Online ISBN: 978-3-540-85110-3

  • eBook Packages: Computer Science, Computer Science (R0)
