Abstract
We describe an experiment in transforming large collections of documents into more machine-understandable representations. Concretely, we are translating the collection of scientific publications in the Cornell e-Print Archive (arXiv) using LaTeXML, a LaTeX to XML converter that is currently under development.
The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (thousands of) classes and packages used in the arXiv collection. For this we have developed a distributed build system that repeatedly runs LaTeXML over the arXiv collection, collects statistics about, for example, the most sorely missing LaTeXML bindings, and clusters common error events. This gives valuable feedback both to the developers of the LaTeXML package and to binding implementers. We have now processed the complete arXiv collection of more than 400,000 documents from 1993 until 2006 (one run is a processor-year-size undertaking) and have continuously improved our success rate to more than 56% (i.e., over 56% of the documents have been converted by LaTeXML without a detected error and are available as XHTML+MathML documents).
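The kind of statistics such a build system gathers can be illustrated with a minimal sketch. It is not the arXMLiv implementation: the tab-separated log format, the status names, the sample records, and the function name summarize are all hypothetical, chosen only to show how missing bindings might be ranked and a success rate computed from per-document conversion events.

```python
from collections import Counter

# Hypothetical event log, one event per line: "<document-id>\t<status>\t<detail>".
# This format is an illustrative assumption, not LaTeXML's actual log output.
SAMPLE_LOG = """\
doc-0001\tmissing_binding\tfoo.sty
doc-0002\tsuccess\t
doc-0003\tmissing_binding\tfoo.sty
doc-0003\tmissing_binding\tbar.cls
doc-0004\terror\tundefined macro
doc-0005\tsuccess\t
"""


def summarize(log_text):
    """Rank missing bindings and compute the share of cleanly converted documents."""
    missing = Counter()   # binding name -> number of events that report it missing
    errors = Counter()    # error message -> number of occurrences
    all_docs = set()
    failed_docs = set()

    for line in log_text.splitlines():
        if not line.strip():
            continue
        doc_id, status, detail = line.split("\t", 2)
        all_docs.add(doc_id)
        if status == "missing_binding":
            missing[detail] += 1
            failed_docs.add(doc_id)
        elif status == "error":
            errors[detail] += 1
            failed_docs.add(doc_id)

    # A document counts as a success only if no problem event was logged for it.
    succeeded = len(all_docs) - len(failed_docs)
    rate = succeeded / len(all_docs) if all_docs else 0.0
    return {
        "missing_bindings": missing.most_common(),
        "errors": errors.most_common(),
        "success_rate": rate,
    }


if __name__ == "__main__":
    stats = summarize(SAMPLE_LOG)
    print(stats["missing_bindings"])
    print(f"success rate: {stats['success_rate']:.0%}")
```

Ranking bindings by how often they are reported missing is one plausible way to decide which class or package to support next; the clustering of error events mentioned above would need a further grouping step that this sketch leaves out.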
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stamerjohanns, H., Kohlhase, M. (2008). Transforming the arχiv to XML. In: Autexier, S., Campbell, J., Rubio, J., Sorge, V., Suzuki, M., Wiedijk, F. (eds) Intelligent Computer Mathematics. CICM 2008. Lecture Notes in Computer Science, vol 5144. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85110-3_46
DOI: https://doi.org/10.1007/978-3-540-85110-3_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85109-7
Online ISBN: 978-3-540-85110-3
eBook Packages: Computer Science, Computer Science (R0)